Social Psychological and Personality Science

Gender Differences in Personality Traits and Average Personality States: Using Experience Sampling to Circumvent Bias in Self-Reports

First Published July 2025

Article Information

Issue published:   01  

DOI:10.1177/19485506251347722

Lilly Buck, Larissa Doran, Karolina Kolodziejczak-Krupp, Fabian Gander, Alex Christoph Traut, Maximiliane Uhlich, Alexander Grob, Kai T. Horstmann

University of Siegen, Germany
Humboldt-Universität zu Berlin, Germany
Medical School Berlin, Germany
University of Basel, Switzerland

Kai T. Horstmann, Department of Psychology, University of Siegen, Obergraben 23, 57072 Siegen, Germany. Email: kai.horstmann@uni-siegen.de

Lilly Buck and Larissa Doran share first authorship.

Abstract

Gender differences in personality traits are consistently found in personality research. Common explanations discuss evolutionary causes, social roles and gender norms, and assessment biases. The latter corresponds with findings and models from research on emotional self-reports, which suggest that different judgment processes take place in self-reports depending on the time frame that is being reported on. Gender differences in personality traits could thus stem from people relying on their self-image, identity, and gender norms when answering global trait questionnaires. Results from two German-speaking samples of N = 324 older adults and N = 1,685 younger adults revealed gender differences in global trait scores for emotional stability, agreeableness, and conscientiousness, but nonsignificant and smaller differences in corresponding average state scores. The findings are in line with theory suggesting that average state reports might be less biased, whereas global trait reports might include stereotypes and identity.

Keywords

gender differences, personality states, personality traits, experience sampling, self-report bias

Self-report questionnaires are indispensable, yet not uncontroversial tools in personality psychology and beyond. One area of debate is the consistent observation of gender differences in self-reported personality trait scores (Helgeson, 2015; Löckenhoff et al., 2014). While often seen as reflecting true gender differences in personality, alternative explanations suggest they may be influenced by assessment biases (Helgeson, 2015; Vianello et al., 2013). This aligns with Robinson and Clore’s (2002a, 2002b) evidence that global self-reports in emotion research may be biased by social stereotypes. They argued that such biases are most pronounced in global self-reports, where participants recall their behavior over longer periods of time, but not in self-reports of momentary behavior. Thus, one could hypothesize that gender differences in global self-reports of personality traits might not appear in self-reports of momentary manifestations of personality, i.e., personality states. While gender differences in personality traits are well documented, no studies have directly compared these differences at the trait versus state level, despite their potentially different susceptibility to gender stereotypes.

Personality at the Trait and State Level

Personality traits are broadly defined as stable characteristics (Funder, 2001; Roberts & Yoon, 2022) that relate to a person’s behaviors, thoughts, and feelings (McCrae & Costa, 2008a). Personality is most often described in terms of five core dimensions, the Big Five (Funder, 2001; McCrae & Costa, 2008a): Openness, Conscientiousness, Extraversion, Agreeableness, and Emotional Stability.

Personality traits manifest in daily life as personality states (Horstmann & Ziegler, 2020). Personality traits are defined as a propensity for the expression of certain behaviors, thoughts, feelings, and desires, which should therefore also show in specific situations or at specific events (Fleeson, 2001; Fleeson & Jayawickreme, 2015). Still, individuals do not behave identically across all situations (Horstmann et al., 2018, 2021a), so those momentary expressions of personality traits, that is, personality states, vary intra-individually (Jones et al., 2017; Wilson et al., 2017). For example, a highly agreeable person may occasionally behave aggressively or selfishly, but less so on average compared with the general population. Whole Trait Theory (Fleeson, 2001; Fleeson & Jayawickreme, 2015) posits that personality trait manifestations are distributed around a person’s trait level, with the average score across states reflecting the person’s trait score (Fleeson & Gallagher, 2009; Horstmann et al., 2021a). Consequently, gender differences in traits should also be observable in average state scores.

Gender Differences in Personality

Small but consistent gender differences in self-reported personality traits have been observed across age groups and cultures (Costa et al., 2001; Feingold, 1994; Helgeson, 2015; Lippa et al., 2010; Löckenhoff et al., 2014; Murphy et al., 2021; Soto & John, 2017b; Weisberg et al., 2011). At the domain level, women frequently report lower emotional stability and higher agreeableness, and sometimes higher conscientiousness (Helgeson, 2015). For extraversion and openness, findings are mixed, with gender differences sometimes absent at domain level but detectable at the facet level (Löckenhoff et al., 2014; Weisberg et al., 2011). These facet-level gender differences sometimes diverge and cancel each other out at domain level. For example, women seem to score higher on the extraversion facets “positive emotions” and “warmth,” while men seem to score higher on “assertiveness” (Löckenhoff et al., 2014; Weisberg et al., 2011).

Despite the consistently reported gender differences in personality traits, their explanation remains unclear. The most prominent explanations attribute these differences either to evolutionary and biological causes, or to the (social) environments that men and women experience.

Biological and evolutionary explanations capitalize on evidence for linkages between sex-related hormones and behaviors that relate to personality traits (Helgeson, 2015), suggesting that men and women exhibit different trait levels to maximize reproductive success. For instance, Buss (2007) proposed that women developed communal characteristics for child-rearing, whereas men evolved aggressive and dominant behaviors to acquire resources. Along these lines, personality differences between men and women have been argued to be natural, as they are found across cultures and methodologies (Löckenhoff et al., 2014).

Social-environmental explanations reject biological causes, arguing that gender differences in personality largely arise from social phenomena. Socialization and gender norms are assumed to contribute to shaping a person’s personality by prescribing certain behaviors for certain genders, with these expectations conveyed through the social environment, including parents, teachers, peers, and the media (Helgeson, 2015).

Social constructionism and situational explanations offer an alternative explanation for how social roles shape behavior: Occupying a social role (e.g., being a mother) increases the likelihood of encountering specific situations (e.g., being at the playground; Helgeson, 2015). Thus, men and women may find themselves in different situations due to social and gender roles, which require particular behaviors. Repeatedly enacting these behaviors may lead to long-lasting trait changes (Wrzus, 2021; Wrzus & Roberts, 2017).

Explaining Gender Differences With the Accessibility Model of Emotional Self-Report

The explanations presented above suggest that gender differences in personality trait reports reflect genuine differences in tendencies of men and women to think, feel, or act. However, it is also possible that personality trait self-reports are only in part indicative of manifested personality in daily life, and instead reflect to a certain proportion gender stereotypes (Helgeson, 2015). To examine this, one could compare different measurements of personality traits that are differently affected by gender stereotypes.

For the assessment of self-reported personality traits, participants evaluate how well statements describe their general behavior, thoughts, or feelings (Dunning et al., 2004). Items refer to long periods, such as “I am someone who …,” or “Generally, I am …,” aiming to capture a person’s self-knowledge regarding their personality manifestations in the past. In contrast, personality states are usually assessed by asking participants to rate their momentary thoughts, feelings, or behaviors several times per day. Items refer to a short time frame, for example, “since the last measurement” or “in the last situation” (e.g., Horstmann & Ziegler, 2020). Thus, the difference between trait and state assessments is the long vs. short period over which participants are required to reflect and report on their behavior.

However, one-time retrospective reports, like global self-reports of personality, rely on quick self-evaluations (Paulhus & Vazire, 2007) that may not exclusively reflect past behaviors, thoughts, or feelings. Instead, they can also capture variance unrelated to the manifestation of targeted personality traits in daily life (Dunning et al., 2004). This variance has been termed “identity” (Connelly et al., 2021; McAbee & Connelly, 2016). Assuming that global self-reports are influenced by gender stereotypes, but momentary self-reports are not or less so, gender differences—as commonly reported in the literature—would be expected to occur more in global self-reports than in momentary self-reports.

In their accessibility model of emotional self-report, Robinson and Clore (2002a, 2002b) propose that self-reports of emotions over extended periods rely on semantic memory (general knowledge) rather than episodic memory (experiential knowledge from specific events). As the reference period lengthens and the cognitive demand of aggregating information from numerous specific experiences increases, individuals shift from using episodic to semantic memory. For single events or short periods, individuals use episodic memory to recall emotions, but rely on semantic memory for evaluating emotions over longer periods. This semantic memory includes situational beliefs (about emotions in certain situations such as on Monday mornings or while on vacation) and identity-related beliefs, which encompass beliefs about one’s own emotions but also social stereotypes, such as the gender stereotype that women are more emotional than men. Robinson and Clore (2002a) conclude that if global self-reports assess a person’s general self-knowledge, they are also influenced by internalized gender stereotypes.

However, for gender stereotypes to affect global self-reports, individuals must hold gender-specific personality trait stereotypes. Indeed, stereotypical women are rated higher than stereotypical men on openness, agreeableness, and conscientiousness, and lower on emotional stability, consistent with actual gender differences in self- and informant reports (Löckenhoff et al., 2014). The accessibility model of emotional self-report (Robinson & Clore, 2002a, 2002b), combined with these empirical findings on gender stereotypes (Löckenhoff et al., 2014), implies that gender differences will appear in global personality trait self-reports. However, personality state reports assessed repeatedly and referring to shorter periods (Horstmann & Ziegler, 2020) should be less biased, resulting in smaller expected gender differences, and any remaining differences more likely reflecting genuine variations in behaviors, thoughts, and feelings. Consistent with previous research, we hypothesize that gender differences in personality trait scores will occur for conscientiousness, agreeableness and emotional stability, but not for openness and extraversion. For personality states, we hypothesize smaller or no gender differences. Table 1 outlines the specific hypotheses.

Table 1. Gender Differences in Big Five Trait Scores: Hypotheses and Results for Studies 1 and 2

Note. Hypotheses for Study 2 were pre-registered, whereas those for Study 1 were not. T_M and T_F denote the average trait scores for men and women, respectively, on each Big Five dimension, while S_M and S_F represent means of the average state scores. ΔT_MF indicates the mean difference in trait scores between men and women, ΔS_MF indicates the mean differences in average state scores. Hypotheses supported by our analyses are bolded. A checkmark (✓) signifies hypothesis support, a cross (×) indicates lack of support, and a tilde (~) denotes inconclusive results. In Study 1, we analyzed manifest means, while Study 2 analyses included both latent and manifest means. For Study 2, boldface and checkmarks refer to analyses using latent means; results using manifest means led to the same conclusions for all dimensions except for emotional stability, for which the third hypothesis was rejected. Hypotheses were added during revision. State openness was assessed only in Study 2.

Study 1

Method

Participants

The N = 324 participants were members of older heterosexual couples drawn from the German Socio-Economic Panel (SOEP; Richter & Schupp, 2015; Wagner et al., 2007). Between 2016 and 2018, 174 couples were recruited. Six couples withdrew, data from two couples and two individuals were lost, and six participants were excluded due to missing data. The final sample (N = 324), aged 56 to 88 years (M = 72.41, SD = 5.82), comprised 50% (n = 162) women. Data can be obtained from the SOEP; all analysis scripts and results are available via the OSF (https://osf.io/qdgek/).

Procedure

Interviewers provided participants with tablets and instructions at home. Over five of the seven ESM days, participants completed personality state assessments up to six times daily (median: 34.5 out of 35 possible assessments per person; see frequency distributions in the Online Supplemental Materials [OSM]). Mean item scores across all measurements were aggregated to obtain average state scores for each Big Five dimension. Participants furthermore completed questionnaires on demographics and personality traits. Compensation was up to €100 (~$114), depending on the number of completed assessments. For a more detailed description of the study procedure, see Kolodziejczak et al. (2022).

Measures

Gender was self-reported as “man” or “woman.”

The German version of the Big Five Inventory (BFI-42; Lang et al., 2001) was used to assess self-reported Big Five personality traits with seven to ten items per domain on a 7-point rating scale (1 = strongly disagree to 7 = strongly agree).

Self-reported personality states on the Big Five domains conscientiousness, extraversion, agreeableness, and emotional stability were assessed using single items. Each item asked, “How would you describe your own behavior since the last questionnaire?” and was rated on 0% to 100% sliders with domain-specific anchors (e.g., agreeableness: “critical, combative” to “understanding, warmhearted”). State Openness was not assessed. Participants completed other instruments not used in this study.

Statistical Analyses

Analyses were not pre-registered and used R (v4.4.1; R Core Team, 2024). Average personality state scores were calculated by averaging all state assessments within each person across the assessment period. Significance level was α = .05 for all analyses.
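The aggregation step described above can be illustrated with a minimal sketch. It is written in Python rather than the R used for the reported analyses, and the function name `average_state_scores`, the record layout, and the example ratings are hypothetical and not taken from the study materials.

```python
from collections import defaultdict
from statistics import mean


def average_state_scores(records):
    """Average all state ratings within each person and domain.

    records: iterable of (person_id, domain, rating) tuples, where rating
    is None for a missed assessment (hypothetical layout).
    Returns {person_id: {domain: average state score}}.
    """
    ratings = defaultdict(lambda: defaultdict(list))
    for person_id, domain, rating in records:
        if rating is None:  # missed assessments do not enter the average
            continue
        ratings[person_id][domain].append(rating)
    return {
        person_id: {domain: mean(values) for domain, values in domains.items()}
        for person_id, domains in ratings.items()
    }


# Hypothetical slider ratings (0 to 100) for two participants.
records = [
    ("p1", "agreeableness", 70),
    ("p1", "agreeableness", 80),
    ("p1", "agreeableness", None),
    ("p2", "agreeableness", 40),
    ("p2", "agreeableness", 60),
]
scores = average_state_scores(records)
print(scores)
```

Each person contributes one average per domain regardless of how many assessments they completed, so participants with few assessments yield less reliable averages.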

To test for gender differences, t-tests were conducted for each trait and average state. One-sided t-tests were used for agreeableness and conscientiousness (higher scores expected for women) and emotional stability (higher scores expected for men). Two-sided t-tests were used for extraversion and openness, where no gender differences were assumed. Levene’s tests were conducted to check for variance homogeneity (see Table S1). If variances were unequal, Welch’s t-tests were used; otherwise, independent-samples t-tests were applied. Cohen’s d effect sizes with 95% confidence intervals were estimated.
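As a rough illustration of the quantities involved, the following sketch computes Cohen's d (pooled standard deviation) and Welch's t statistic with its approximate degrees of freedom. It is a simplified Python illustration, not the authors' R code: the function names and the scores are hypothetical, and Levene's test and p values are omitted because the Python standard library provides no F or t distribution.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(men, women):
    """Cohen's d with pooled SD; positive values mean higher scores for men."""
    n1, n2 = len(men), len(women)
    pooled_var = ((n1 - 1) * variance(men) + (n2 - 1) * variance(women)) / (n1 + n2 - 2)
    return (mean(men) - mean(women)) / sqrt(pooled_var)


def welch_t(men, women):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(men), len(women)
    v1, v2 = variance(men) / n1, variance(women) / n2
    t = (mean(men) - mean(women)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical trait scores on a 1 to 7 scale.
men = [4.0, 5.0, 6.0, 5.0]
women = [5.0, 6.0, 7.0, 6.0]
d = cohens_d(men, women)
t, df = welch_t(men, women)
print(round(d, 2), round(t, 2), round(df, 1))
```

With equal variances and equal group sizes, as in this toy example, Welch's degrees of freedom coincide with those of the independent-samples t-test (n1 + n2 − 2).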

To test for absence of gender differences, we assessed whether t-tests indicated significant or nonsignificant differences and whether effect sizes were equivalent to zero using equivalence tests (Lakens, 2017; Lakens et al., 2018), with the smallest effect size of interest (SESOI) set at |d| = .10. This SESOI was selected as gender differences in personality traits below d = .10 have often been deemed practically meaningless, while gender differences of around d = .10 and above have been interpreted as “small,” but existing (Hirnstein et al., 2023; Hofmann et al., 2025; Hyde, 2005; Mehl et al., 2007; Tidwell et al., 2024). However, results from equivalence tests should be interpreted cautiously, as the SESOI was set after the first analyses. We furthermore compared differences in traits with differences in states: Point-biserial correlation coefficients were calculated between gender and each trait and state. Hittner's tests (Hittner et al., 2003) were then employed to compare correlations of gender and traits with correlations of gender and states.
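The logic of the equivalence test can be sketched with two one-sided tests (TOST) on the standardized effect size. This is a simplified illustration under stated assumptions: it uses a large-sample normal approximation for the sampling distribution of d rather than the t-based procedure described by Lakens (2017), and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def se_of_d(d, n1, n2):
    """Large-sample standard error of Cohen's d (approximation)."""
    return sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))


def tost_equivalence(d, n1, n2, sesoi=0.10, alpha=0.05):
    """Two one-sided tests against the bounds -sesoi and +sesoi.

    Returns the TOST p value (the larger of the two one-sided p values)
    and whether equivalence to zero can be concluded at the given alpha.
    """
    se = se_of_d(d, n1, n2)
    norm = NormalDist()
    p_lower = 1 - norm.cdf((d + sesoi) / se)  # H0: true d <= -sesoi
    p_upper = norm.cdf((d - sesoi) / se)      # H0: true d >= +sesoi
    p = max(p_lower, p_upper)
    return p, p < alpha


# Even an observed difference of exactly zero is inconclusive with
# 162 participants per group (the group sizes of Study 1).
p, equivalent = tost_equivalence(0.0, 162, 162)
print(round(p, 3), equivalent)
```

Under this approximation, a SESOI of |d| = .10 is too narrow for groups of 162 to ever yield a significant equivalence test, which illustrates why such tests can remain inconclusive when power is low.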

Results

Personality trait and average state score descriptives are shown in Table 2 (upper half) for the total sample and in Table 3 (upper half), separately by gender. Correlation coefficients are shown in Table 4 (below the diagonal). Trait score intercorrelations are similar to those reported by Lang et al. (2001). Average state score intercorrelations (ranging from r = .43 to r = .84) are as high as or higher than correlations between corresponding trait and state scores (r = .39 for conscientiousness, r = .37 for extraversion, r = .34 for agreeableness, and r = .42 for emotional stability), which is not unusual in experience sampling data (e.g., Abrahams et al., 2021; Horstmann et al., 2021b).

Table 2. Descriptive Statistics for Global Trait and Average State Scores

Note. Study 1: N = 324. Study 2: N = 1685. ω = McDonald’s ω reliability estimates (men/women). In Study 1, state openness was not assessed. The theoretical score range for trait scores in Study 1 was 1 to 7, and for state scores 0 to 100. In Study 2, the theoretical score range for trait scores was 1 to 5, and for state scores 1 to 7.

Table 3. Descriptives for Global Trait and Average State Scores by Gender

Note. Study 1: Nwomen = 162, Nmen = 162. Study 2: Nwomen = 1207, Nmen = 478. In Study 1, state openness was not assessed. The theoretical score range for trait scores in Study 1 was 1 to 7, and for state scores, 0 to 100. In Study 2, the theoretical score range for trait scores was 1 to 5, and for state scores 1 to 7.

Table 4. Bivariate Correlations With 95% Confidence Intervals for Trait and Average State Scores

Note. O = openness, C = conscientiousness, E = extraversion, A = agreeableness, ES = emotional stability. Correlations below the diagonal are derived from Study 1, while those above the diagonal are from Study 2. Values in square brackets indicate the 95% confidence interval for each correlation. Associations between trait scores and average state scores of the same Big Five domain are bolded. *p < .05. **p < .01.

Results from t-tests and effect sizes of gender differences are reported in Table 5 (upper half) and displayed in Figure 1 (left column). As hypothesized, women scored higher than men on trait conscientiousness (d = −0.26 [−0.48, −0.04]) and trait agreeableness (d = −0.33 [−0.55, −0.11]), while men scored higher on trait emotional stability (d = 0.47 [0.25, 0.69]). No other gender differences were significant.
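The reported confidence intervals can be checked with a short calculation. The sketch below uses a common large-sample approximation for the standard error of Cohen's d with 162 men and 162 women; the text does not state which interval method the authors used, so this is an assumption, and the function name is hypothetical. Under this approximation, the three intervals reported above are reproduced to two decimals.

```python
from math import sqrt
from statistics import NormalDist


def d_confidence_interval(d, n1, n2, level=0.95):
    """Approximate confidence interval for Cohen's d (normal approximation)."""
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return d - z * se, d + z * se


# Effect sizes reported for Study 1 (162 men, 162 women).
for label, d in [("conscientiousness", -0.26),
                 ("agreeableness", -0.33),
                 ("emotional stability", 0.47)]:
    lower, upper = d_confidence_interval(d, 162, 162)
    print(f"{label}: d = {d:.2f} [{lower:.2f}, {upper:.2f}]")
```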

Table 5. Gender Differences in Trait and State Score Means

Note. Study 1: N = 324. Study 2: N = 1685. O = openness, C = conscientiousness, E = extraversion, A = agreeableness, ES = emotional stability. Values in square brackets indicate 95% confidence intervals. For Study 1, d represents Cohen’s d for the mean-level difference between men and women, with positive values indicating higher scores for men. Results from t-tests for mean differences are shown in the columns t(df) and p, where the direction of the t-tests was determined by the respective hypotheses for each Big Five dimension. For Study 2, d represents Hancock’s d for the difference in latent means between men and women in a multi-group CFA, with positive values indicating higher scores for men. p values for Study 2 refer to the significance of men’s latent means, with women’s latent means fixed to zero. In Study 1, openness was not assessed at state level. *p < .05. **p < .01. ***p < .001.

Figure 1. Comparison of Self-Reported Personality Traits and Average Personality States by Gender

Note. Openness was not assessed as a personality state in Study 1. The figure shows density distributions and boxplots for trait and average state scores in Study 1, and for trait and average state factor scores derived from the final multi-group CFA models in Study 2. Values in brackets display 95% confidence intervals of Cohen’s d (Study 1) and Hancock’s d (Study 2).

Equivalence tests for the absence of gender differences in state measures were inconclusive due to insufficient power (see OSM). However, as hypothesized, absolute point-biserial correlations of gender with trait scores were larger than correlations of gender with state scores for conscientiousness, agreeableness, and emotional stability (see Table 6, upper half).
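The comparison of the gender-trait correlation with the gender-state correlation can be sketched as follows. The formula is our reading of Hittner et al.'s (2003) modification of Dunn and Clark's z, which uses a backtransformed average Fisher z, and should be checked against the original source before use. The function and argument names are hypothetical. In the example, the trait-state correlation of .42 is the value reported for emotional stability, the gender-trait correlation of .23 is roughly the point-biserial equivalent of d = 0.47 with equal group sizes, and the gender-state correlation of .05 is purely illustrative.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def hittner_z(r_gt, r_gs, r_ts, n):
    """Compare two dependent correlations that share one variable (gender).

    r_gt: correlation of gender with the trait score
    r_gs: correlation of gender with the average state score
    r_ts: correlation between trait and average state score
    n:    sample size
    Returns the z statistic and a two-sided p value.
    """
    z_gt, z_gs = atanh(r_gt), atanh(r_gs)   # Fisher z transformation
    r_bar = tanh((z_gt + z_gs) / 2)         # backtransformed average z
    r_bar_sq = r_bar ** 2
    # Covariance term of the two z-transformed correlations.
    cov = (r_ts * (1 - 2 * r_bar_sq)
           - 0.5 * r_bar_sq * (1 - 2 * r_bar_sq - r_ts ** 2)) / (1 - r_bar_sq) ** 2
    z = (z_gt - z_gs) * sqrt(n - 3) / sqrt(2 - 2 * cov)
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


z, p = hittner_z(r_gt=0.23, r_gs=0.05, r_ts=0.42, n=324)
print(round(z, 2), round(p, 4))
```

For directional hypotheses such as T > S, a one-sided p value would be used instead of the two-sided value returned here.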

Table 6. Hittner’s Tests and Correlations between Gender and Big Five Personality Traits / States

Note. Study 1: N = 324. Study 2: N = 1685. O = openness, C = conscientiousness, E = extraversion, A = agreeableness, ES = emotional stability. T = trait, S = state, G = gender. For both studies, r represents the point-biserial correlation with gender (coded as 1 = female, 2 = male), calculated using manifest means in Study 1 and factor scores in Study 2. Direction indicators (T > S, T < S, or T ≠ S) show whether correlations of trait scores with gender were hypothesized to be stronger than, weaker than, or different (without directional specification) from absolute correlations of state scores with gender in Hittner’s tests. Negative Z values indicate that correlations of trait scores with gender are more negative than correlations of state scores with gender. In Study 1, openness was not assessed at state level. *p < .05. **p < .01. ***p < .001.

Discussion Study 1

Study 1 showed the expected gender differences in global self-reported personality traits, while gender differences in average self-reported states were not significant. Although equivalence tests were inconclusive about average states’ equivalence to zero, effect sizes of gender differences were consistently smaller in state scores compared with trait scores across all domains where trait differences were observed.

However, this study had several limitations: First, the sample consisted of older couples, many retired, with potentially similar daily lives, reducing variability in experiences between partners. If shared daily situations influence personality states (Horstmann et al., 2021b), pre-existing personality differences may no longer manifest in everyday life. Second, trait and state measures differed substantially: Traits were assessed using multi-item scales on a 7-point Likert-type scale, while states were measured with single items on a 0 to 100 scale. These differences may have resulted in differences regarding both reliability and content of the instruments, potentially affecting differences in gender effects. Furthermore, the high intercorrelations among state scores might question their validity. Finally, the study was not pre-registered. To address these concerns, we conducted a pre-registered Study 2.

Study 2

Method

Participants

Data for Study 2 were obtained from the first wave of the longitudinal PINIE study conducted at the University of Basel (Grob et al., 2025). A total of 2,344 participants started the study in 2022 and provided informed consent. We used data provided by Gander et al. (2025), which excluded participants who did not complete the personality trait baseline questionnaire or completed surveys faster than a third of median completion time. We additionally excluded 40 participants who did not identify as either women or men or did not specify their gender. The resulting sample consisted of 1,685 participants (71.6% women) with a median of 12 out of 15 possible state assessments per person (see frequency distributions in the OSM). Participants were aged 18 to 40.62 years (M = 26.54, SD = 5.66), residing predominantly in Germany (58.2%), Switzerland (34.2%), and Austria (6.5%). Data, analysis scripts, and results are available via the OSF (https://osf.io/qdgek/).

Procedure

The online data collection procedure has been documented more extensively elsewhere (Grob et al., 2025). Participants completed a baseline trait questionnaire, followed by a 3-day ESM phase with five notifications daily at random times within predefined time windows. Participants received either monetary compensation (~$10) or course credit.

Measures

Gender was assessed by asking “What is your gender?” (1 = no response, 2 = female, 3 = male, 4 = other).

Personality traits were assessed using the 30-item short version of the Big Five Inventory 2 (BFI-2-S; Rammstedt et al., 2018; Soto & John, 2017a), which assesses the Big Five dimensions openness, conscientiousness, extraversion, agreeableness, and emotional stability.

Personality states were assessed using the Five-Factor Model Personality States Inventory (FFM-PSI; Gander et al., 2025), which consists of 15 bipolar adjective pairs on which participants rated their behavior and feelings on a 7-point bipolar rating scale (e.g., carefree vs. nervous). Average state scores were calculated by averaging item scores per participant across all measurements and aggregating them for each Big Five dimension. Participants completed additional measures not used in this study.

Statistical Analyses

Analyses for Study 2 were pre-registered (https://osf.io/zm46j). All deviations from pre-registration are listed in Table S2. Analyses were conducted in R (v4.4.1). Descriptive statistics and reliability estimates for trait scores and aggregated state scores were calculated as in Study 1.

Measurement Invariance

As a prerequisite for latent mean comparisons, we assessed measurement invariance for trait and average state scores across genders using multiple-group confirmatory factor analysis. These analyses are described in Supplement B.

Testing for Gender Differences

Gender differences in personality traits and states were evaluated by comparing latent mean scores across genders for each Big Five dimension. Latent mean comparisons provide more precise estimates of group differences by accounting for measurement error. The female model’s mean was fixed to zero, while the male participants’ mean was freely estimated. Hancock’s d effect sizes (Hancock, 2001) for latent mean differences, interpretable similarly to Cohen’s d effect sizes for manifest means, were estimated and tested for significance in the expected direction. To test for absence of gender differences, we assessed whether the effect sizes were equivalent to zero using equivalence tests as in Study 1, with the SESOI set at |d| = 0.10. As in Study 1, point-biserial correlation coefficients and Hittner's tests (Hittner et al., 2003) were used to compare gender differences, utilizing correlations of trait and state factor scores with gender, instead of manifest scores. In addition to the latent mean comparisons, all manifest analyses from Study 1 were also conducted for comparability.

Results

Descriptive Statistics

Descriptive statistics for trait and average state scores are presented in Table 2 (lower half), with gender-specific descriptives provided in Table 3 (lower half). Both trait and state scores showed approximately normal distributions, with a broad range of scores on all scales. Correlations between corresponding trait and state measures were moderate (r = .38 to .52), with the highest correlations for emotional stability and conscientiousness (r = .52 for both) and the lowest for openness (r = .38). State measures showed higher intercorrelations (r = .45 to .66) compared with trait measures (r = .01 to .35).

Measurement Invariance

All trait measurement models achieved acceptable fit after implementing theoretically justified modifications. All state measurement models showed significant variance of the latent variable and significant loadings. Most trait and state models achieved partial scalar invariance, with some exceptions. Detailed results with specific modifications are available in the OSM.

Testing for Gender Differences

Latent mean comparisons are displayed in Table 5, lower half. Analyses revealed significant gender differences in trait scores for openness, conscientiousness, agreeableness, and emotional stability, with women scoring higher on openness, conscientiousness, and agreeableness, and men scoring higher on emotional stability. For average state scores, gender differences were smaller and mostly nonsignificant, except for emotional stability and openness. Notably, the direction of the gender difference for openness states was opposite to that observed in trait scores. Apart from emotional stability, where state scores were not equivalent to zero, equivalence tests for absence of gender differences in state measures were inconclusive due to insufficient power (see OSM). However, for those Big Five dimensions that showed trait differences, absolute point-biserial correlations between gender and state factor scores were consistently smaller than those between gender and trait factor scores (see Table 6, lower half).

Additional Manifest Analyses

Manifest analyses using observed scale scores replicated the pattern of generally larger and more often significant gender differences in trait versus average state scores. However, results regarding specific dimensions slightly differed from latent analyses (see OSM for detailed results).

General Discussion

As expected, gender differences similar to those often observed in personality research were found in the current study for global self-reported personality trait scores, but less so for corresponding aggregated self-reported state scores. This finding was consistent across both older (Study 1) and younger (Study 2) participants. The results from Study 2, which aimed to support the results from Study 1 and address its limitations, provided a consistent and conclusive picture (see Table 1). Specifically, gender differences in global self-reported traits were found in the hypothesized direction for conscientiousness, agreeableness, and emotional stability. For the corresponding average state scores, gender differences were smaller. For extraversion, no differences in trait or average state scores were expected. Although the mean differences did not differ significantly from zero, the equivalence tests lacked power to test whether the effect sizes were smaller than the SESOI of d = .10. Finally, for openness, we expected no differences in traits and average states. However, gender differences in openness trait scores were in the opposite direction to those in average state scores, though small.

Most previous explanations of gender differences in global self-reports of personality traits suggest these differences represent variations in personality dispositions and corresponding behaviors, thoughts, and feelings (Buss, 2007; Eagly & Wood, 1991, 1999; Helgeson, 2015). This research offers an alternative explanation, attributing gender differences in self-reports to the assessment process, as proposed by Robinson and Clore’s (2002a, 2002b) accessibility model of emotional self-report.

Explaining Gender Differences in Personality

The idea that stereotypes may bias personality self-reports has been suggested before, but with different assumed processes. For instance, Vianello et al. (2013) employed implicit measures of Big Five traits and found substantially smaller gender differences than in explicit self-reports. However, this approach presupposes a judgment process in which a person’s initial self-assessment is subsequently adjusted to fit gender stereotypes, which should lead to gender differences in self-reported personality states as well. In contrast, our results align with the accessibility model of emotional self-report. One cautious interpretation of these findings is that differences in personality trait scores may be caused by recall bias affecting global self-reports. That is, gender differences in self-reported personality traits may not arise from an adjustment of initial self-assessments to fit social gender norms, but may instead result from a response process that accesses semantic self-knowledge. According to Robinson and Clore (2002a), semantic self-knowledge is in turn influenced by gender stereotypes and norms. Because self-reports of personality states are assumed to rely on episodic rather than semantic memory, given the shorter timeframe reported on, answers are derived from remembered behavior, thoughts, and feelings rather than from self-knowledge and identity beliefs. This explanation aligns with our finding that gender differences in personality were smaller when measured via aggregated state scores than when measured as global, self-reported traits.

Theoretical and Practical Implications

Global self-reports of personality traits have proven useful for various purposes, including the prediction of relevant life outcomes (Roberts et al., 2007; Soto, 2019), and are, in contrast to assessments of personality states (Horstmann & Ziegler, 2020), very well understood from a psychometric perspective. One could argue that potential biases in global personality trait self-reports are not problematic per se, given their predictive validity. However, personality trait scores may achieve high predictive validity for the wrong reasons: If a stereotype is present in both a person’s self-report and the predicted outcome (for example, a supervisor’s report), predictive validity is inflated. As with the difference between self-reports and informant reports, the problem is that no ground truth exists (Paulhus & Vazire, 2007; Stachowski & Kulas, 2021). Strong theoretical arguments are needed to solve this problem. For example, is it reasonable that men and women differ by d = .47 on emotional stability (Study 1)? If personality trait assessments represent interindividual differences that manifest in daily life, this difference should be visible in daily life. Alternatively, one could assume that no gender difference in the trait exists and that men and women are equally emotionally stable. Whatever the theoretical standpoint, it should inform expectations of measurement instruments (e.g., Cook et al., 2015). Consequently, it must be ensured that the outcome used to examine the predictive validity of personality assessments is not itself influenced by biases. Average state scores may provide such an additional criterion for future test score validations.
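To make the question about d = .47 more concrete, a standardized difference can be translated into two more intuitive quantities. The sketch below is a minimal illustration under the assumption of two normal distributions with equal variances, which the studies do not claim. It computes the probability that a randomly drawn member of the higher-scoring group outscores a randomly drawn member of the other group, and the proportion of overlap between the two distributions. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def probability_of_superiority(d):
    """Chance that a random member of the higher-scoring group
    outscores a random member of the other group (equal-variance normals)."""
    return STD_NORMAL.cdf(d / sqrt(2))


def overlap_coefficient(d):
    """Proportion of area shared by the two group distributions."""
    return 2 * STD_NORMAL.cdf(-abs(d) / 2)


d = 0.47  # trait-level difference in emotional stability reported for Study 1
print(f"probability of superiority: {probability_of_superiority(d):.2f}")
print(f"distribution overlap:       {overlap_coefficient(d):.2f}")
```

Under these assumptions, d = .47 corresponds to a probability of superiority of roughly .63 and an overlap of roughly .81. Whether a difference of this size should be visible in daily behavior is exactly the kind of theoretical expectation that would need to be specified in advance.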

Importantly, measurement tools should not be designed to hide actual group differences. For example, building on suggestions by Bäckström and Björklund (2013), choosing less gender-stereotypical wording in questionnaire items could be a useful alternative to excluding items outright, because it retains items that reflect actual group differences and does not narrow the construct.

Finally, biases influencing a person’s self-report can of course be examined in their own right. For example, it might be interesting to examine how these biases affect how a person navigates their life (e.g., chooses a profession), where these biases originate, or whether they lead to some form of self-fulfilling prophecy. Yet, especially in these situations, it is important to separate the bias (or any variance unrelated to the trait) from the trait variance (McAbee & Connelly, 2016). Average personality state scores may be a suitable variable for this purpose. Taken together, the results of the current study align with calls not to equate personality self-reports with personality traits (McCrae & Costa, 2008b; Rauthmann, 2024) and to consider the pros and cons of different approaches to personality assessment.

Limitations

One central limitation affecting both studies concerns the state assessment: Average state scores correlated highly with one another (r = .43 to r = .84 in Study 1; r = .45 to r = .66 in Study 2), and these correlations were as high as or higher than the correlations between corresponding average state and trait scores (r = .34 to r = .42 in Study 1; r = .38 to r = .52 in Study 2). Although these correlations are similar to what has been observed in the literature before (e.g., Augustine & Larsen, 2012; Fleeson & Gallagher, 2009; Matz & Harari, 2021), they may be seen as indicating low discriminant validity of the average state scores. Discarding our findings because of potentially limited state assessment validity would require rethinking current approaches to personality state assessment (e.g., Abrahams et al., 2021; Horstmann et al., 2021b; Sherman et al., 2015). Moreover, the findings align with our theoretical framework about distinct memory processes in state versus trait judgments.

A second limitation concerns the assessment of gender: Participants not identifying as men or women were either not recruited (Study 1) or excluded from analyses because this group was too small (Study 2). Including these participants, or assessing gender identity continuously, might improve understanding of how gender stereotypes relate to personality self-reports.

Future Research Directions

Under the assumption that differences in global self-reports of personality traits are based on stereotypes, measures should be developed that do not tap into such stereotypes and are not identity-related. In addition, findings on gender differences should be re-examined with respect to observable behavior in real-world settings. This also highlights the importance of assessing interindividual differences with other methods that provide less biased scores. For example, Mehl and colleagues used recordings from electronically activated recorders (Mehl et al., 2007; Tidwell et al., 2024) to investigate whether women talk more than men. They found that, on average, women do not speak more than men (Mehl et al., 2007) or do so only to a very small extent (Tidwell et al., 2024). Importantly, the authors also reported that women tend to describe themselves as much more talkative than men, compared with the observed differences in words spoken. This research aligns with our findings and substantiates that gender differences in global self-reports are likely inflated.

A second, potentially more critical evaluation of the current research would be that personality state assessments do not capture the manifestations of the examined personality traits. If this were the case, future research should continue to develop measures of personality states that also reflect mean differences in the average state score computed across time. That is, personality state and trait items should cover the same aspects of the respective construct (Horstmann & Ziegler, 2020), including theoretically plausible group differences.

Conclusion

This study proposed and examined an alternative explanation for frequently found gender differences in self-reported personality trait scores, building on the accessibility model of emotional self-report (Robinson & Clore, 2002a). Results align with the proposition that people use different judgment processes and access different information, associated with episodic or semantic memory, when answering personality items, depending on the timeframe to be reported on. Gender differences in conscientiousness, agreeableness, and emotional stability were observed in self-reported trait scores, but less so in corresponding aggregated state scores. These results align with the accessibility model, suggesting that observed gender differences in personality trait scores may be partly caused, and inflated, by recall bias in global self-reports.

Supplemental Material

sj-pdf-1-spp-10.1177_19485506251347722 – Supplemental material for Gender Differences in Personality Traits and Average Personality States: Using Experience Sampling to Circumvent Bias in Self-Reports by Lilly Buck, Larissa Doran, Karolina Kolodziejczak-Krupp, Fabian Gander, Alex Christoph Traut, Maximiliane Uhlich, Alexander Grob and Kai T. Horstmann in Social Psychological and Personality Science

sj-pdf-2-spp-10.1177_19485506251347722 – Supplemental material for Gender Differences in Personality Traits and Average Personality States: Using Experience Sampling to Circumvent Bias in Self-Reports by Lilly Buck, Larissa Doran, Karolina Kolodziejczak-Krupp, Fabian Gander, Alex Christoph Traut, Maximiliane Uhlich, Alexander Grob and Kai T. Horstmann in Social Psychological and Personality Science

Notes

    1. The terms “gender” and “sex” are used in accordance with the current APA definitions and guidelines: “Sex refers to the biological status of being male, female, or intersex, whereas gender implies the psychological, behavioral, social, and cultural aspects of gender (i.e., masculinity, femininity, nonbinary, nonconforming, or other gender).” See https://dictionary.apa.org/gender.

References

  • Abrahams L., Rauthmann J. F., De Fruyt F. (2021). Person-situation dynamics in educational contexts: A self- and other-rated experience sampling study of teachers’ states, traits, and situations. European Journal of Personality, 35(4), 598–622. https://doi.org/10.1177/08902070211005621
  • Augustine A. A., Larsen R. J. (2012). Is a trait really the mean of states? Journal of Individual Differences, 33(3), 131–137. https://doi.org/10.1027/1614-0001/a000083
  • Bäckström M., Björklund F. (2013). Social desirability in personality inventories: Symptoms, diagnosis and prescribed cure. Scandinavian Journal of Psychology, 54(2), 152–159. https://doi.org/10.1111/sjop.12015
  • Buss D. M. (2007). Evolutionary psychology: The new science of the mind (3rd ed.). Allyn & Bacon.
  • Connelly B. S., McAbee S. T., Oh I.-S., Jung Y., Jung C. W. (2021). A multi-rater perspective on personality and performance: An empirical examination of the Trait-Reputation-Identity Model. Journal of Applied Psychology. https://doi.org/10.1037/apl0000732
  • Cook D. A., Brydges R., Ginsburg S., Hatala R. (2015). A contemporary approach to validity arguments: A practical guide to Kane’s framework. Medical Education, 49(6), 560–575. https://doi.org/10.1111/medu.12678
  • Costa P. T., Terracciano A., McCrae R. R. (2001). Gender differences in personality traits across cultures: Robust and surprising findings. Journal of Personality and Social Psychology, 81(2), 322–331. https://doi.org/10.1037/0022-3514.81.2.322
  • Dunning D., Heath C., Suls J. M. (2004). Flawed self-assessment. Psychological Science in the Public Interest, 5(3), 69–106. https://doi.org/10.1111/j.1529-1006.2004.00018.x
  • Eagly A. H., Wood W. (1991). Explaining sex differences in social behavior: A meta-analytic perspective. Personality and Social Psychology Bulletin, 17(3), 306–315. https://doi.org/10.1177/0146167291173011
  • Eagly A. H., Wood W. (1999). The origins of sex differences in human behavior: Evolved dispositions versus social roles. American Psychologist, 54(6), 408–423. https://doi.org/10.1037/0003-066X.54.6.408
  • Feingold A. (1994). Gender differences in personality: A meta-analysis. Psychological Bulletin, 116(3), 429–456. https://doi.org/10.1037/0033-2909.116.3.429
  • Fleeson W. (2001). Toward a structure- and process-integrated view of personality: Traits as density distributions of states. Journal of Personality and Social Psychology, 80(6), 1011–1027. https://doi.org/10.1037/0022-3514.80.6.1011
  • Fleeson W., Gallagher P. (2009). The implications of Big Five standing for the distribution of trait manifestation in behavior: Fifteen experience-sampling studies and a meta-analysis. Journal of Personality and Social Psychology, 97(6), 1097–1114. https://doi.org/10.1037/a0016786
  • Fleeson W., Jayawickreme E. (2015). Whole Trait Theory. Journal of Research in Personality, 56, 82–92. https://doi.org/10.1016/j.jrp.2014.10.009
  • Funder D. C. (2001). Personality. Annual Review of Psychology, 52(1), 197–221. https://doi.org/10.1146/annurev.psych.52.1.197
  • Gander F., Traut A. C., Uhlich M., Horstmann K. T., Steppan M., Ziegler M., Grob A. (2025). Assessing personality states–the development and validation of a Five-Factor Model personality states inventory. PsyArXiv. https://doi.org/10.31234/osf.io/xydmt
  • Grob A., Weidmann R., Wünsche J., Burriss R. P., Bühler J. L., Traut A. C., Uhlich M., Arnold K., Gander F. (2025). Personality change in the light of relationship transitions in partnered and single individuals: Study protocol of a prospective measurement burst design. Research Square. https://doi.org/10.21203/rs.3.rs-6303113/v1
  • Hancock G. R. (2001). Effect size, power, and sample size determination for structured means modeling and mimic approaches to between-groups hypothesis testing of means on a single latent construct. Psychometrika, 66(3), 373–388. https://doi.org/10.1007/BF02294440
  • Helgeson V. S. (2015). Gender and personality. In Mikulincer M., Shaver P. R., Cooper M. L., Larsen R. J. (Eds.), APA handbook of personality and social psychology (Vol. 4). Personality processes and individual differences (pp. 515–534). American Psychological Association. https://doi.org/10.1037/14343-023
  • Hirnstein M., Stuebs J., Moè A., Hausmann M. (2023). Sex/gender differences in verbal fluency and verbal-episodic memory: A meta-analysis. Perspectives on Psychological Science, 18(1), 67–90. https://doi.org/10.1177/17456916221082116
  • Hittner J. B., May K., Silver N. C. (2003). A Monte Carlo evaluation of tests for comparing dependent correlations. The Journal of General Psychology, 130(2), 149–168. https://doi.org/10.1080/00221300309601282
  • Hofmann R., Rozgonjuk D., Soto C. J., Ostendorf F., Mõttus R. (2025). There are a million ways to be a woman and a million ways to be a man: Gender differences across personality nuances and nations. Journal of Research in Personality, 115, Article e104582. https://doi.org/10.1016/j.jrp.2025.104582
  • Horstmann K. T., Rauthmann J. F., Sherman R. A. (2018). Measurement of situational influences. In Zeigler-Hill V., Shackelford T. K. (Eds.), The SAGE handbook of personality and individual differences (pp. 465–484). Sage.
  • Horstmann K. T., Rauthmann J. F., Sherman R. A., Ziegler M. (2021a). Distinguishing simple and residual consistencies in functionally equivalent and non-equivalent situations: Evidence from experimental and observational longitudinal data. European Journal of Personality, 35(6), 833–860. https://doi.org/10.1177/08902070211014029
  • Horstmann K. T., Rauthmann J. F., Sherman R. A., Ziegler M. (2021b). Unveiling an exclusive link: Predicting behavior with personality, situation perception, and affect in a preregistered experience sampling study. Journal of Personality and Social Psychology, 120(5), 1317–1343. https://doi.org/10.1037/pspp0000357
  • Horstmann K. T., Ziegler M. (2020). Assessing personality states: What to consider when constructing personality state measures. European Journal of Personality, 34(6), 1037–1059. https://doi.org/10.1002/per.2266
  • Hyde J. S. (2005). The gender similarities hypothesis. American Psychologist, 60(6), 581–592. https://doi.org/10.1037/0003-066x.60.6.581
  • Jones A. B., Brown N. A., Serfass D. G., Sherman R. A. (2017). Personality and density distributions of behavior, emotions, and situations. Journal of Research in Personality, 69, 225–236. https://doi.org/10.1016/j.jrp.2016.10.006
  • Kolodziejczak K., Drewelies J., Pauly T., Ram N., Hoppmann C., Gerstorf D. (2022). Physical intimacy in older couples’ everyday lives: Its frequency and links with affect and salivary cortisol. The Journals of Gerontology: Series B, 77(8), 1416–1430. https://doi.org/10.1093/geronb/gbac037
  • Lakens D. (2017). Equivalence tests: A practical primer for t tests, correlations, and meta-analyses. Social Psychological and Personality Science, 8(4), 355–362. https://doi.org/10.1177/1948550617697177
  • Lakens D., Scheel A. M., Isager P. M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259–269. https://doi.org/10.1177/2515245918770963
  • Lang F. R., Lüdtke O., Asendorpf J. B. (2001). Testgüte und psychometrische Äquivalenz der deutschen Version des Big Five Inventory (BFI) bei jungen, mittelalten und alten Erwachsenen [Validity and psychometric equivalence of the German version of the Big Five Inventory in young, middle-aged and old adults]. Diagnostica, 47(3), 111–121. https://doi.org/10.1026//0012-1924.47.3.111
  • Lippa R. A. (2010). Sex differences in personality traits and gender-related occupational preferences across 53 nations: Testing evolutionary and social-environmental theories. Archives of Sexual Behavior, 39(3), 619–636. https://doi.org/10.1007/s10508-008-9380-7
  • Löckenhoff C. E., Chan W., McCrae R. R., De Fruyt F., Jussim L., De Bolle M., Costa P. T., Sutin A. R., Realo A., Allik J., Nakazato K., Shimonaka Y., Hřebíčková M., Graf S., Yik M., Ficková E., Brunner-Sciarra M., Leibovich de Figueroa N., Schmidt V., … Terracciano A. (2014). Gender stereotypes of personality: Universal and accurate? Journal of Cross-Cultural Psychology, 45(5), 675–694. https://doi.org/10.1177/0022022113520075
  • Matz S. C., Harari G. M. (2021). Personality–place transactions: Mapping the relationships between Big Five personality traits, states, and daily places. Journal of Personality and Social Psychology, 120(5), 1367–1385. https://doi.org/10.1037/pspp0000297
  • McAbee S. T., Connelly B. S. (2016). A multi-rater framework for studying personality: The Trait-Reputation-Identity Model. Psychological Review, 123(5), 569–591. https://doi.org/10.1037/rev0000035
  • McCrae R. R., Costa P. T. (2008a). Empirical and theoretical status of the Five-Factor Model of personality traits. In Boyle G. J., Matthews G., Saklofske D. H. (Eds.), The Sage handbook of personality theory and assessment: Volume 1—personality theories and models (pp. 273–294). Sage. https://doi.org/10.4135/9781849200462.n13
  • McCrae R. R., Costa P. T. (2008b). The Five-Factor Theory of personality. In John O. P., Robins R. W., Pervin L. A. (Eds.), Handbook of personality: Theory and research (3rd ed., pp. 159–181). The Guilford Press.
  • Mehl M. R., Vazire S., Ramirez-Esparza N., Slatcher R. B., Pennebaker J. W. (2007). Are women really more talkative than men? Science, 317(5834), 82. https://doi.org/10.1126/science.1139940
  • Murphy S. A., Fisher P. A., Robie C. (2021). International comparison of gender differences in the Five-Factor Model of personality: An investigation across 105 countries. Journal of Research in Personality, 90, 104047. https://doi.org/10.1016/j.jrp.2020.104047
  • Paulhus D. L., Vazire S. (2007). The self-report method. In Robins R. W., Fraley R. C., Krueger R. F. (Eds.), Handbook of research methods in personality (pp. 224–239). Guilford Press.
  • Rammstedt B., Danner D., Soto C. J., John O. P. (2018). Validation of the short and extra-short forms of the Big Five Inventory-2 (BFI-2) and their German adaptations. European Journal of Psychological Assessment, 36(1), 149–161. https://doi.org/10.1027/1015-5759/a000481
  • Rauthmann J. F. (2024). Personality is (so much) more than just self-reported Big Five traits. European Journal of Personality, 38(6), 863–866. https://doi.org/10.1177/08902070231221853
  • R Core Team. (2024). R: A language and environment for statistical computing [Manual]. R Foundation for Statistical Computing. https://www.R-project.org/
  • Richter D., Schupp J. (2015). The SOEP Innovation Sample (SOEP IS). Journal of Contextual Economics–Schmollers Jahrbuch, 135(3), 389–399. https://doi.org/10.3790/schm.135.3.389
  • Roberts B. W., Kuncel N. R., Shiner R., Caspi A., Goldberg L. R. (2007). The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes. Perspectives on Psychological Science, 2(4), 313–345. https://doi.org/10.1111/j.1745-6916.2007.00047.x
  • Roberts B. W., Yoon H. J. (2022). Personality psychology. Annual Review of Psychology, 73(1), 489–516. https://doi.org/10.1146/annurev-psych-020821-114927
  • Robinson M. D., Clore G. L. (2002a). Belief and feeling: Evidence for an accessibility model of emotional self-report. Psychological Bulletin, 128(6), 934–960. https://doi.org/10.1037/0033-2909.128.6.934
  • Robinson M. D., Clore G. L. (2002b). Episodic and semantic knowledge in emotional self-report: Evidence for two judgment processes. Journal of Personality and Social Psychology, 83(1), 198–215. https://doi.org/10.1037/0022-3514.83.1.198
  • Sherman R. A., Rauthmann J. F., Brown N. A., Serfass D. G., Jones A. B. (2015). The independent effects of personality and situations on real-time expressions of behavior and emotion. Journal of Personality and Social Psychology, 109(5), 872–888. https://doi.org/10.1037/pspp0000036
  • Soto C. J. (2019). How replicable are links between personality traits and consequential life outcomes? The life outcomes of personality replication project. Psychological Science, 30(5), 711–727. https://doi.org/10.1177/0956797619831612
  • Soto C. J., John O. P. (2017a). Short and extra-short forms of the Big Five Inventory–2: The BFI-2-S and BFI-2-XS. Journal of Research in Personality, 68, 69–81. https://doi.org/10.1016/j.jrp.2017.02.004
  • Soto C. J., John O. P. (2017b). The next Big Five Inventory (BFI-2): Developing and assessing a hierarchical model with 15 facets to enhance bandwidth, fidelity, and predictive power. Journal of Personality and Social Psychology, 113(1), 117–143. https://doi.org/10.1037/pspp0000096
  • Stachowski A. A., Kulas J. T. (2021). The persnickety pervasiveness of rating enhancement in personality assessment. European Journal of Psychological Assessment, 37(4), 300–312. https://doi.org/10.1027/1015-5759/a000610
  • Tidwell C. A., Danvers A. F., Pfeifer V. A., Abel D. B., Alisic E., Beer A., Bierstetel S. J., Bollich-Ziegler K. L., Bruni M., Calabrese W. R., Chiarello C., Demiray B., Dimidjian S., Fingerman K. L., Haas M., Kaplan D. M., Kim Y. K., Knezevic G., Lazarevic L. B., Mehl M. R. (2025). Are women really (not) more talkative than men? A registered report of binary gender similarities/differences in daily word use. Journal of Personality and Social Psychology, 128(2), 367–391. https://doi.org/10.1037/pspp0000534
  • Vianello M., Schnabel K., Sriram N., Nosek B. (2013). Gender differences in implicit and explicit personality traits. Personality and Individual Differences, 55(8), 994–999. https://doi.org/10.1016/j.paid.2013.08.008
  • Wagner G. G., Frick J. R., Schupp J. (2007). The German Socio-Economic Panel study (SOEP)—evolution, scope and enhancements. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.1028709
  • Weisberg Y. J., DeYoung C. G., Hirsh J. B. (2011). Gender differences in personality across the ten aspects of the Big Five. Frontiers in Psychology, 2, Article e178. https://doi.org/10.3389/fpsyg.2011.00178
  • Wilson R. E., Thompson R. J., Vazire S. (2017). Are fluctuations in personality states more than fluctuations in affect? Journal of Research in Personality, 69, 110–123. https://doi.org/10.1016/j.jrp.2016.06.006
  • Wrzus C. (2021). Processes of personality development: An update of the TESSERA framework. In Rauthmann J. F. (Ed.), The handbook of personality dynamics and processes (pp. 101–123). Academic Press. https://doi.org/10.1016/B978-0-12-813995-0.00005-4
  • Wrzus C., Roberts B. W. (2017). Processes of personality development in adulthood: The TESSERA framework. Personality and Social Psychology Review, 21(3), 253–277. https://doi.org/10.1177/1088868316652279