Abstract
Scholars and institutions commonly use impact factors to evaluate the quality of empirical research. However, a number of findings published in journals with high impact factors have failed to replicate, suggesting that impact alone may not be an accurate indicator of quality. Fraley and Vazire (2014) proposed an alternative index, the N-pact factor, which indexes the median sample size of published studies, providing a narrow but relevant indicator of research quality. In the present research, we expand on the original report by examining the N-pact factor of social/personality-psychology journals between 2011 and 2019, incorporating additional journals and accounting for study design (i.e., between persons, repeated measures, and mixed). There was substantial variation in the sample sizes used in studies published in different journals. Journals that emphasized personality processes and individual differences had larger N-pact factors than journals that emphasized social-psychological processes. Moreover, N-pact factors were largely independent of traditional markers of impact. Although the majority of journals in 2011 published studies that were not well powered to detect an effect of ρ = .20, this situation had improved considerably by 2019. In 2019, eight of the nine journals we sampled published studies that were, on average, powered at 80% or higher to detect such an effect. After decades of unheeded warnings from methodologists about the dangers of small-sample designs, the field of social/personality psychology has begun to use larger samples. We hope the N-pact factor will be supplemented by other indices that can be used as alternatives to the impact factor, further improving the evaluation of research.
Imagine that you are an early-career researcher who has just completed a major project. You think the results are exciting and that they have the potential to advance the field in a number of ways. You would like to submit your manuscript to a journal that has a reputation for publishing the highest caliber research in your field. How would you know which journals are well regarded for publishing high-quality research?
Traditionally, scholars have answered this question by referencing the journal impact factor—a measure of how frequently articles in a given journal are cited. But as critics of impact factors have noted, citation rates per se may not reflect anything informative about the quality of empirical research (Peters, 2017). As recent years have shown, many well-known findings published in reputable journals have been difficult to replicate (e.g., Cheung et al., 2016; Ebersole et al., 2016; Eerland et al., 2016; Hagger et al., 2016; O’Donnell et al., 2018; Open Science Collaboration, 2015; Wagenmakers et al., 2016). Although journals with high impact factors publish articles that are, by definition, widely cited and influential on average, impact alone does not guarantee that the research is of high quality or that it is positioned to provide a strong foundation for cumulative knowledge.
For many purposes, research quality should be evaluated at the level of the article or the researcher rather than the journal. However, as our opening anecdote illustrates, sometimes it is useful to know the typical quality of articles published in a journal. In these situations, it would be helpful to have a way of indexing journal quality that is based on the strength of the research published rather than the citation rate of articles in those journals alone. Fraley and Vazire (2014) proposed one way of doing so, called the N-pact factor (NF). The NF is based on the median sample size of the studies a journal publishes. As Fraley and Vazire emphasized in 2014, the NF is not meant to be the only alternative to the impact factor. Quality is multidimensional, and the NF captures only one dimension of quality, albeit a fundamental one.

One reason the sample size of studies is important for research quality is that sample size is one of the key factors in the statistical power of studies. Statistical power is defined as the probability of detecting an effect of interest when that effect actually exists. Historically, statistical power has been overlooked in the design and evaluation of empirical research (Button et al., 2013; Cohen, 1988, 1990, 1992; Maxwell, 2004; Szucs & Ioannidis, 2017). But in recent years, it has become recognized as an essential component of sound research design and of the evaluation of evidentiary support. Statistical power is relevant for judging the quality of empirical research literatures because compared with lower-powered studies, studies that are high powered are more likely to (a) detect valid effects, (b) buffer the literature against false positives, and (c) produce findings that other researchers can replicate.

Fraley and Vazire (2014) used the NF to rank six major journals in social/personality psychology using studies published between 2006 and 2010. They reported that there was considerable variation across journals in the sample sizes of the studies they published and that, on average, the typical study did not have a sufficient sample size to be well powered to detect the kinds of effect sizes that are often reported in social/personality psychology.
The purpose of this article is to update and extend the original NF report in three innovative ways. First, we examine the N-pact of social/personality journals from 2011 to 2019. Doing so allows us to examine the extent to which NFs have changed over time, including before and after the onset of the “crisis of confidence” in psychological science (Vazire, 2018). In addition to examining a more recent set of articles than the original report, we also expand our coverage to add three journals not included in previous N-pact rankings (i.e., European Journal of Social Psychology [EJSP], European Journal of Personality [EJP], and Social Psychological and Personality Science [SPPS]). Finally, and perhaps most importantly, we distinguish studies that are based on between-persons designs and studies that are based on other designs (i.e., mixed and repeated-measures designs). Doing so allows us to provide a more appropriate assessment of journals by evaluating them with respect to commensurate research designs.
Why Are Sample Sizes and Statistical Power Important for Evaluating the Quality of Research?
There are many factors to consider when evaluating the quality of individual studies or of the journals that publish them. For example, high-quality research involves taking sampling and generalizability seriously, ruling out salient alternative explanations for findings, and using recognized and valid ways of manipulating and/or measuring variables.
In this article, we focus on sample size and statistical power. We do so for at least two reasons. First, and most notably, one of the key ingredients in statistical power—the sample size or N—can be coded objectively. Although scholars can disagree over whether a study rules out obvious confounds, whether the sampling frame is appropriate, or whether the research questions or hypotheses have been framed in a constructive manner, it is not debatable whether researchers used a sample size of 30 or 300 to test their key hypotheses. Second, a lack of statistical power is one of the most salient—and perhaps easiest to fix—causes of replication failures in psychological science. Historically, many studies in social/personality psychology have not been well powered to detect the kinds of effects that are typical in the field (Fraley & Vazire, 2014). Assuming that those effects are real, the implication of conducting underpowered studies is that future studies with similar or even slightly larger sample sizes are unlikely to consistently replicate published findings. When the effect is not real (i.e., the null hypothesis is true), small sample sizes, in combination with other common research practices, can inflate the prevalence of false positives and lead to replication failures. We elaborate on these ideas below, calling attention specifically to the ways in which statistical power is needed for the discovery and confirmation of new findings and for minimizing the false-positive rate in research literatures. 1 We wish to be explicit that sample size and power are not the only features that researchers should attend to when evaluating research quality, and we encourage others to develop complementary indices that track these other features.
Higher-powered studies are more likely to detect effects that truly exist
“Power” refers to the ability to detect effects that truly exist in the population. Although there are a number of ways to increase the power of a study (e.g., using precise measurements, using more powerful experimental manipulations, changing the α threshold of the statistical test), one of the most obvious ways of doing so is to increase the number of people or cases sampled. When sample size increases, the probability of correctly detecting an effect also increases.
Increasing power is desirable for many reasons. From a scientific perspective, researchers are likely to discover new phenomena that truly exist only if they have designs that are powerful enough to do so—especially when corrections for multiple comparisons are in place or when α levels are reduced to minimize Type I errors. However, previous reviews have indicated that the power of studies in psychology tends to be low—close to 50% (e.g., Szucs & Ioannidis, 2017). 2 This implies that a typical study in psychology has only a 50–50 chance of detecting an effect even when that effect truly exists.
There are substantial costs to using underpowered designs. First, in exploratory studies, for example, researchers are likely to overlook or misestimate associations that might be both real and of theoretical importance. In confirmatory research, in which a specific hypothesis is being tested, researchers may not reach the correct conclusions, thereby creating ambiguity about the state of the theory or leading to unnecessary searches for factors that might explain why the effect emerges in some studies but not in others (Schmidt, 1996).
Second, when underpowered studies do yield significant results, the published effect sizes tend to be inflated, which biases meta-analyses (Crutzen & Peters, 2017; Nuijten et al., 2015; Turner et al., 2013). Indeed, the fact that published effects tend to diminish when subjected to further tests has received much attention (Schooler, 2011). This phenomenon, dubbed the “decline effect,” can be explained by the combination of selection for significance (i.e., publication bias) and regression to the mean, and this problem is exacerbated when the published studies are underpowered. Thus, even in the absence of concerns about researcher degrees of freedom and p-hacking, underpowered studies alone are capable of creating a replication crisis in psychology.
There are also costs from a human-factors perspective. When early-career researchers use underpowered studies to evaluate or confirm novel ideas, they are essentially gambling—and they are betting against the odds or using questionable research practices to increase their odds. This leads to a culture in which researchers feel “lucky” when their studies work or are rewarded for using questionable research practices (see Bakker et al., 2012). This, in turn, can lead to desperation and alienation or temptations to p-hack when the studies do not work. 3 Given the emphasis on publication and significant findings in the culture of psychological science, the cumulative consequence is that the selection and promotion of early-career researchers is left to chance or willingness to engage in questionable research practices rather than other factors that might otherwise be valued.
Journals that publish underpowered studies publish a greater proportion of false positives
A common misconception is that statistical power is irrelevant for evaluating research studies that produced significant results. According to this perspective, the Type I error rate is controlled exclusively by the α threshold (e.g., 5%). Thus, one might conclude that power does not affect the false-positive rate of studies, which is assumed to be held constant at α (e.g., 5%). However, this misconception is based on a confusion between two kinds of false-positive rates: (a) the Type I error rate (i.e., the proportion of false positives among all studies for which the null is true), which is equal to α when the rules of null hypothesis significance testing (NHST) are followed, and (b) the false-discovery rate (FDR), which is the proportion of false positives among all significant results (the complement of the FDR, 1 − FDR, is the positive predictive value of significant results, which is the probability that a significant result reflects a true effect).
Methodologists have demonstrated that the FDR in a literature is not determined solely by the α level; it is also determined by statistical power (see Ioannidis, 2005). Specifically, when the power of a typical study is low, a greater proportion of significant findings represent false positives because the number of true positives in the pool is smaller when power is low compared with when it is high (see Fig. 1; for an interactive demonstration of these ideas, see Schönbrodt, 2014). Given that researchers tend to submit only reports based on significant findings and that journals tend to publish only significant results (Fanelli, 2012), underpowered studies likely lead to an overrepresentation of false-positive findings. Journals that publish higher-powered studies are less likely to publish false positives in the long run.
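Under the idealized assumptions just described (each study tests a single prespecified hypothesis, and the result is reported regardless of outcome), this relationship can be written out explicitly; we add the formula here only as a convenience for readers: FDR = [P(H0) × α] / [P(H0) × α + (1 − P(H0)) × power], where P(H0) is the proportion of studies in the pool for which the null hypothesis is true. Holding α and P(H0) constant, the FDR necessarily shrinks as power increases.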

Fig. 1. Illustration of how false-discovery rates vary as a function of statistical power. The top row illustrates a situation in which half of the studies are testing a null hypothesis that is true (i.e., no effect) and half of the studies are testing a null hypothesis that is false. All of the studies have a sample size that would lead to a 20% chance to detect the effect if the null is false (i.e., 20% power). The top-left panel illustrates the expected number of significant findings in each set; the top-right panel illustrates the relative distribution of true positives and false positives among the subset of published significant findings. The bottom row illustrates the same process in a situation in which the studies are powered to have an 80% chance to detect the effect if the null is false. In this case, the overall false-discovery rate is much lower (.05) than it is in the previous example (.17).
Method
The purpose of this article is to rank journals in social/personality psychology with respect to their sample size—one of the ingredients in the statistical power of studies. We limit our investigation to journals in social/personality psychology for several reasons. First, much of the research we conduct is in social/personality psychology, and thus, we are well positioned to evaluate it as “insiders.” In addition, many of the replication issues that have been debated over the past few years have their origins in social/personality psychology, and these debates have largely played out during the years we focus on in this investigation (i.e., 2011–2019). That is not to say that other areas are immune to the so-called replication crisis, but that replication issues were widely discussed and debated in social/personality psychology during this time period. And, importantly, there is no consensus on the extent to which these problems exist (see Motyl et al., 2017). Thus, ranking journals in this field may be particularly instructive for uncovering the extent to which potential problems with research design and planning exist in social/personality psychology. For applications of the NF to related fields, see Martin and Martin (2021), Kossmeier et al. (2019), Reardon et al. (2019), and Schweizer and Furley (2016). We report how we determined our sample size, all data exclusions, all manipulations, and all measures in the study.
We operationalize the NF of a journal as the median sample size of the studies published in that journal within a given time frame. We focus on the median sample size rather than the mean because there is a lower limit to sample sizes (i.e., they cannot be less than 1) but not an upper bound; as a result, sample sizes tend to be positively skewed.
Although we have focused much of our discussion up to this point on statistical power rather than sample size per se, we focus on sample size in particular for constructing the NF for a number of reasons. Most importantly, sample size (or N) is an intuitive metric for both scholarly and lay audiences. Although lay audiences may not appreciate the mechanics of statistical power and NHST, they can appreciate the concept of sample size and are likely to have an intuition that the larger the sample size, the more precise statistical estimates will be. Second, because N is a key component in statistical-power calculations, quantifying sample sizes provides us with a straightforward way to estimate the power of the studies under certain assumptions. That is, assuming everything else is the same across studies (e.g., the population effect size, reliability of measures), statistical power and sample size are a perfect monotonic function of one another.
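To make the computation concrete, the sketch below (ours, for illustration, with hypothetical coded sample sizes rather than values from our data set) shows how a single very large sample inflates the mean but leaves the median (and therefore the NF) largely unaffected:

```python
import numpy as np

# Hypothetical coded between-persons sample sizes for one journal-year.
# The single very large (e.g., online) sample creates the positive skew
# discussed above.
coded_ns = np.array([45, 60, 72, 85, 98, 110, 140, 210, 2500])

print(np.median(coded_ns))  # 98.0  -> the NF for this hypothetical set
print(np.mean(coded_ns))    # ~368.9 -> pulled upward by the skewed tail
```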
There are three potential criticisms of our focus on sample size that we address up front. First, although it is true that higher-N studies are more powerful for a given effect size, some researchers are studying phenomena that correspond to larger population effect sizes than others. Thus, even if the studies published by a journal tend to use smaller sample sizes, the studies might be well powered because those studies are targeting larger effect sizes. One reason we have limited our investigation to journals in social and personality psychology is that doing so helps to constrain the range of effect sizes that may be of interest. Although there may be differences from one research lab to the next in terms of the size of the effects they are studying, at the level of the journals themselves (where the research across labs is published), there should not be sizable differences in the average effect sizes between journals. That is, there is no obvious reason to expect the EJSP, for example, to be publishing studies that target population effect sizes that are systematically larger or smaller than those published in the Journal of Personality and Social Psychology (JPSP). Moreover, the most salient substantive difference among the journals we have sampled is whether they emphasize social or personality psychology, and as previous studies have shown, the estimates of effect sizes across these domains tend to be comparable (e.g., Funder & Ozer, 1983; Lovakov & Agadullina, 2017; Richard et al., 2003). In short, there is little evidence to date that suggests the population effect sizes of interest systematically vary across subdisciplines or journals in social/personality psychology. We should be clear that although we believe it is appropriate to compare the NF of different journals in a subfield (e.g., social/personality psychology), we do not believe it is appropriate to compare the NF of different journals across fields (e.g., neuroscience vs. social psychology) for this very reason. That is, there may be systematic differences in the effect sizes studied by researchers in different fields of psychology.
Second, some journals may be more inclined than others to publish studies on the basis of within-persons or repeated-measures designs. Because statistical power in within-persons designs is also a function of the number of trials and the covariance among repeated measures, it is potentially misleading to use sample size alone to index the power of studies in such journals. To address this concern, we explicitly coded each study for whether it used a between-persons, within-persons, or mixed design. The N-pact rankings on which we focus are derived from estimates that consider only studies with between-persons designs (we report sample-size information for studies using other designs as well but exclude such studies from our rankings to keep those rankings focused on between-persons designs for which sample size is a clear marker of power).
Finally, we should be explicit that statistical power is a function not only of sample size but also other factors. For example, power is also determined by the α level selected for the test. Alpha, however, does not typically vary from study to study. Most researchers treat p values less than .05 as statistically significant, and if they adjust α at all, they adjust it downward (e.g., .01). Because adjusting α downward decreases statistical power, the power calculations we report here have the potential to be overestimates of the actual power of studies. Power is also determined by the population effect size being studied, which, in itself, is also determined by the reliability of the measurements and the size of any manipulations used. Rather than trying to make assumptions about what the population effect size is in each study, we use our aggregate NF information to instead ask what the power is to detect various effects that might be of interest to researchers. For example, we can ask what the power of a typical study published by a journal may be to detect an effect of ρ = .20, assuming an α of .05 and using the typical sample size of studies published in that journal.
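To make this kind of calculation concrete, the following sketch (ours, for illustration; the sample sizes shown are hypothetical, and exact values depend on the approximation used) estimates the two-tailed power to detect a correlation of ρ = .20 at α = .05 for a given N using the Fisher r-to-z approximation:

```python
import numpy as np
from scipy.stats import norm

def power_to_detect_r(n, rho=0.20, alpha=0.05):
    """Approximate two-tailed power to detect a population correlation of
    size rho with n participants, via the Fisher r-to-z approximation."""
    z_rho = np.arctanh(rho)            # population effect on the z metric
    se = 1.0 / np.sqrt(n - 3)          # approximate standard error of z
    z_crit = norm.ppf(1 - alpha / 2)   # two-tailed critical value
    return norm.sf(z_crit - z_rho / se) + norm.cdf(-z_crit - z_rho / se)

# Hypothetical median sample sizes:
print(round(power_to_detect_r(100), 2))  # ~0.51
print(round(power_to_detect_r(250), 2))  # ~0.89
```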
Selection of studies
We examined studies published from 2011 to 2019, inclusive, in nine of the major journals in social and personality psychology: EJSP, EJP, Journal of Experimental Social Psychology (JESP), Journal of Personality (JP), JPSP, Journal of Research in Personality (JRP), Personality and Social Psychology Bulletin (PSPB), Psychological Science (PS), and SPPS. These include the same six journals originally examined by Fraley and Vazire (2014), but we expanded the set to include EJP, EJSP, and SPPS. For each of the 9 years, 20% of the articles from each journal were randomly selected to be coded, and all studies in a selected article were coded. We excluded articles that were editorials, commentaries, meta-analyses, or based on simulation work. We decided not to include meta-analyses simply because we were interested in the kinds of samples that researchers use when they are designing studies. But in meta-analysis, those kinds of decisions are not up to the meta-analyst and are, instead, decided by the researchers conducting the primary research. Overall, 1,812 articles were selected, yielding a total of 4,540 studies/samples.
As disclosures of potential conflicts of interest, we should note that one of the authors (S. Vazire) was an associate editor at JRP (2012–2015), JPSP:Personality Processes and Individual Differences (PPID) section (2013–2014), and SPPS (2013–2015) and editor in chief at SPPS (2015–2019) during the period from which the data were coded and thus had an influence on some editorial decisions at those journals. Another author (R. Chris Fraley) was an associate editor at PSPB during the period from which the Fraley and Vazire (2014) data were coded (2006–2008) and was an associate editor at JPSP:PPID (2017–2020).
Coding sample size
We used the same coding system used by Fraley and Vazire (2014). Specifically, we recorded the sample size separately for each study reported in an article. 4 In situations in which the initial sample size differed from the analytic sample size (e.g., because of failures of participants to follow instructions, debriefing or outlier exclusions, attrition), we coded the initial or intended sample size rather than the final sample size. In most cases, the coding of sample size is straightforward. But in some situations, there were multiple ways to code the sample size used in a study. For example, in behavior-genetic studies using twin samples, we used the number of twin dyads as the sample size of interest rather than the number of people. In longitudinal studies, we coded the sample size at the first assessment. In dyadic studies or studies of groups and teams, we focused on whatever unit of analysis the authors focused on (e.g., individuals or groups).
Two coders independently coded the sample size of each study in the randomly selected articles, and those codes were averaged to create a single sample size for each study. In cases in which the coders differed by more than N = 30, a third coder reviewed the study and resolved the discrepancy. That resolution, rather than the average of multiple coders’ values, was treated as the sample size for the study.
Coding research designs
We also coded each study for whether the research design and key analyses emphasized between-persons comparisons or within-persons comparisons or were mixed (i.e., observations varied within and between persons or the analyses separately emphasized between- and within-persons comparisons). Prototypical between-persons designs included correlational studies, simple and factorial between-persons experiments, and designs based on experimental or naturalistic group comparisons. A prototypical within-persons design involved repeated measures, typically—but not by definition—in experimental settings, in which a person rated stimuli or responded to trials that varied along one or more dimensions and none of the variables were measured or manipulated at the between-persons level. A prototypical mixed design was one in which one factor was manipulated or naturally varied between persons (e.g., experimental condition, sex, age) and another factor was manipulated or varied naturally within a person (e.g., two targets for which the order was counterbalanced, multiple observations nested within persons).
The coding criteria were initially created and refined by having two coders (R. Chris Fraley and J. Y. Chong) code and discuss a common set of studies. Many design classifications were straightforward, but to help resolve less obvious designs, we imposed some additional criteria: In longitudinal studies, if the comparisons were largely focused on between-persons comparisons (e.g., test-retest correlations, cross-lagged path estimates, autoregressive estimates), we classified the study as a between-persons design. If, instead, multilevel models were used that modeled multiple observations of a person over time and treated the data as nested within persons, we classified the design as mixed. Nesting per se, however, was not used to classify a study as mixed or within, given that some cases of nesting (e.g., students nested within classrooms) are used to manage interdependence among observations but still involve between-persons rather than within-persons comparisons. Disagreements were resolved through discussion when one of the coders was R. Chris Fraley or by R. Chris Fraley when that author was not one of the coders. Overall, 3,168 (70%) studies were classified as using between-persons designs, 975 (21%) were classified as using mixed designs, and 397 (9%) were coded as using within-persons designs.
Additional coding
We also coded whether each sampled article published in PS was a social/personality article. In most cases, this was clear-cut. In cases in which there was ambiguity (e.g., cases in which the issues were relevant to social psychology but the questions were framed with respect to neuroscientific models rather than social-psychological ones), we often relied on whether the key authors were identified with social/personality groups in their departments and whether the majority of references were to articles published in social/personality journals. Unless stated otherwise, the results reported below for PS are focused only on articles classified as involving social/personality psychology (denoted as PS:S). In addition, among JPSP articles, we coded the section in which each article was published: Attitudes and Social Cognition (ASC), Interpersonal Relations and Group Processes (IRGP), or PPID. These codes were used in some of our auxiliary analyses.
Results
What is the N-pact for journals in social/personality psychology?
To evaluate which journals tended to have the greatest N-pact, we computed the NF using studies published in each journal between 2011 and 2019. As defined above, the NF of a journal is the median sample size of studies published in a given time frame. Table 1 reports the NFs for each journal by year and an aggregate NF (the median NF for the journal across 2011–2019). The overall ranking of the journals in Table 1 is with respect to the aggregate NF, and that ranking is used to organize the presentation of information in the tables that follow. We included only coded studies (n = 2,908) that used between-persons designs for this analysis, and for PS, we focused only on studies classified as relevant to social/personality psychology.
Table 1. N-Pact Factors of Social/Personality Journals in Years 2011 to 2019
Note: N-pact factors (NFs) are computed as the median sample size of between-persons studies sampled from a journal for the year in question. We coded all studies from 20% of randomly sampled articles published in each journal for each year. The number of studies coded for each cell is reported in parentheses. All numbers are rounded. Journals are listed in descending order of their aggregate NF, computed as the median of the yearly NFs. JRP = Journal of Research in Personality; JP = Journal of Personality; EJP = European Journal of Personality; SPPS = Social Psychological and Personality Science; PSPB = Personality and Social Psychology Bulletin; JPSP = Journal of Personality and Social Psychology; EJSP = European Journal of Social Psychology; PS:S = Psychological Science (social/personality); JESP = Journal of Experimental Social Psychology.
As shown in Table 1, there was considerable variation in the sample sizes used in studies published in different social/personality-psychology journals. The typical between-persons study published in JRP in 2011, for example, had a sample size of 238. The typical study using a between-persons design published in JESP in 2011, in contrast, had a sample size that was roughly 40% as large (94).
We also examined separately the three sections of JPSP. In JPSP, studies published in PPID tended to have larger sample sizes (218, across all years) than studies published in ASC (155) or IRGP (135). In general, it appears that journals that emphasize personality processes and individual differences had larger NFs than journals that emphasize social-psychological processes. The journals that publish research from social and personality psychology (e.g., PSPB) were in between these extremes. Of course, journals that emphasize social-psychological processes may have other strengths relative to journals that emphasize personality processes and individual differences that are not captured by the NF (e.g., better justification for causal inferences).
Although our rankings are based only on studies that used between-persons designs, it is worth considering how design matters. In PS, non-social/personality studies using between-persons designs tended to have larger sample sizes (Mdn = 136) than those that used mixed or within-persons designs (Mdn = 25). The same pattern held among the PS articles coded as social/personality, which had larger samples overall than the non-social/personality articles: Between-persons studies tended to use larger samples (Mdn = 200) than mixed or repeated-measures studies (Mdn = 125.5). The rank-ordering of journals (across all years) was similar whether we considered only between-persons designs (the focus of our primary analyses) or all designs (r = .88).
What is the estimated median power of studies published in social/personality psychology?
“Statistical power” refers to the probability of correctly rejecting the null hypothesis when that hypothesis is false. Thus, when the null hypothesis is false, a study is more likely to lead to a significant result—and therefore a correct conclusion—when it has higher power. Statistical power is a function of three factors: (a) the α threshold for the test, (b) the population effect size (taking into account the reliability of the measures used to detect the effect or the strength of the manipulation used), and (c) the sample size. Because the first of these factors is typically the same from one article to the next (α = .05), we computed the estimated median power by examining the median sample size (using Ns coded on the basis of between-subjects designs only) and its power to detect a typical published effect size in social/personality psychology: a population effect size of ρ = .20 (or Cohen’s d = 0.41; see Richard et al., 2003). Note, however, that given what is known about publication bias, this is likely an overestimate of the typical size of true effects, so using this value leads to generous estimates of power. (We use this specific value simply to focus the discussion in a helpful way. Readers interested in other population effect sizes should be able to perform the power calculations of interest with the NF data reported here.)
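(For readers who want to verify the conversion, it follows the standard formula d = 2r / √(1 − r²): 2 × .20 / √(1 − .04) ≈ 0.41. The power values in Table 2 can then be reproduced to a close approximation from each journal's NF using the kind of calculation sketched in the Method section.)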
Table 2 summarizes these calculations. Only two journals tended to publish studies with 80% or greater estimated median power to detect associations of ρ = .20 in 2011: JRP and JP. The remaining journals tended to publish studies with substantially lower estimated median power levels. In 2011, six of the journals (i.e., EJP, PSPB, PS:S, JPSP, EJSP, and JESP) tended to publish studies that had roughly 50% estimated power to detect the typical effect size in social/personality psychology. Thus, even if we assume the null hypothesis was false in the studies published in these journals and that true effects are as large as published effect size estimates, the studies published by these journals should have been just as likely to lead to correct findings as would flipping a coin. Of course, most published studies do report significant results, which suggests that there is something missing from our analysis of statistical power (see Fanelli, 2012). We return to this point in the Discussion.
Table 2. The Estimated Statistical Power of Studies Published in Social/Personality Journals in Years 2011 to 2019 to Detect an Average Published Effect Size (ρ = .20)
Note: Any value above 99% was rounded down to 99. JRP = Journal of Research in Personality; JP = Journal of Personality; EJP = European Journal of Personality; SPPS = Social Psychological and Personality Science; PSPB = Personality and Social Psychology Bulletin; JPSP = Journal of Personality and Social Psychology; EJSP = European Journal of Social Psychology; PS:S = Psychological Science (social/personality); JESP = Journal of Experimental Social Psychology.
Note that the estimated median power of the typical study published in these journals was much higher by 2019. In fact, all but one of the journals we sampled (EJSP) had 80% or higher power to detect a typical effect size by 2019. This indicates that the sample sizes used in the work being published in the field increased considerably over the time period studied.
What are the estimated FDRs for journals in social/personality psychology?
Recall that most journals in psychology publish articles in which the key findings are statistically significant (Fanelli, 2012). Some proportion of these “significant findings” are correct hits (the null hypothesis is false and should be rejected), and some proportion of these findings are false positives (the null hypothesis is actually true and should not have been rejected). As explained previously, when power is low, a greater proportion of significant results will be false positives (i.e., the FDR will be higher, and the positive predictive value of the findings will be lower).
Table 3 reports the estimated FDRs in the journals we sampled under the assumption that there are no researcher degrees of freedom (i.e., if we assume all researchers tested one prespecified hypothesis and obtained one p value, which they reported regardless of the study outcome). Although these conditions are idealized, estimates made under them provide an important baseline: a minimum false-positive rate. For each journal, we used the estimated power in 2011 and 2019 to detect an effect of ρ = .20 (see Table 2; we focused only on these 2 years to keep the presentation more compact). Although there is no agreed-on standard for an acceptably low FDR, one interesting comparison is whether the FDR is similar to the assumed Type I error rate (i.e., α). There is no statistical reason to expect the two to be similar, but many researchers mistakenly treat α as interchangeable with the FDR, so α is probably a reasonable stand-in for the FDR that researchers assume exists in the literature.
Table 3. Estimated False-Discovery Rates of Social/Personality Journals in 2011 and 2019, Assuming No p-Hacking or Questionable Research Practices
Note: The false-discovery rate (FDR) is the proportion of significant findings that are false positives. See the introduction for an explanation of this statistic. We calculate the FDR under the assumption that the null hypothesis is true 50% of the time and 80% of the time using sample-size information for the years 2011 and 2019. JRP = Journal of Research in Personality; JP = Journal of Personality; EJP = European Journal of Personality; SPPS = Social Psychological and Personality Science; PSPB = Personality and Social Psychology Bulletin; JPSP = Journal of Personality and Social Psychology; EJSP = European Journal of Social Psychology; PS:S = Psychological Science (social/personality); JESP = Journal of Experimental Social Psychology.
We estimated FDRs across two conditions. First, we calculated these values under the assumption that the null hypothesis is just as likely to be true as it is to be false, P(H0) = .50. This is designed to represent a situation in which uncertainty is at its maximum (i.e., the research hypothesis is just as plausible as the null) and thus the empirical data are maximally informative for adjudicating among them. Under this condition, the estimated FDR was close to the α (.05) in 2011 for some of the journals (i.e., JRP, JP) but was much higher among journals that tended to publish studies with smaller sample sizes. In fact, the estimated FDR was almost twice as high as the nominal α rate in EJSP, JPSP, PS:S, and JESP in 2011.
We next examined a situation in which the base rate of true null hypotheses is relatively high, P(H0) = .80. This condition is designed to simulate a situation in which researchers test risky hypotheses (which they might do, for example, if they are incentivized to produce groundbreaking, transformative work; see Wilson & Wixted, 2018), and therefore the research hypotheses tend to be incorrect in most of the studies conducted. In this situation, the expected FDR is alarmingly high across all journals in 2011. For example, if we assume that the base rate of true null hypotheses is 80%, using the typical power of studies published in EJSP and JESP in 2011 leads to an estimated FDR as high as 29%—almost 6 times as large as the nominal α rate.
Recall that the estimated power of studies published in 2019 was higher than that of studies published in 2011. This has consequences for the estimated FDR, too. In 2019, the estimated FDR tends to be .05 to .06 across all journals in a scenario in which the null hypothesis is true half the time. In riskier research situations in which the null is true 80% of the time, the FDR tends to be much higher than α (range = .17–.21), but, importantly, rates this high are to be expected under those conditions. That is, when research hypotheses are risky, many “discoveries” are going to be false discoveries even when studies are well powered. Indeed, even in a perfectly powered situation, the FDR would still be .17. Several journals are at that minimum (i.e., JRP, EJP, SPPS, JPSP), but a few still have FDRs above it (e.g., EJSP).
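The arithmetic behind these estimates is simple enough to sketch in a few lines; the snippet below is illustrative (the power values plugged in are the approximate ones discussed in the text, not exact entries from Table 2):

```python
def fdr(power, alpha=0.05, p_h0=0.50):
    """Expected false-discovery rate among significant results, assuming a
    single prespecified test per study and no questionable research practices."""
    false_pos = p_h0 * alpha          # null true, yet significant
    true_pos = (1 - p_h0) * power     # null false and correctly detected
    return false_pos / (false_pos + true_pos)

print(round(fdr(power=0.50, p_h0=0.80), 2))  # ~0.29: underpowered, risky hypotheses
print(round(fdr(power=0.80, p_h0=0.50), 2))  # ~0.06: well powered, even odds
print(round(fdr(power=1.00, p_h0=0.80), 2))  # ~0.17: the floor when hypotheses are risky
```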
Note that these calculations across the 2 years represent a minimal bound on the false-positive rates of the journals. They do not consider the way the FDRs can also be inflated by researcher degrees of freedom, such as the ways in which outliers are filtered, the possibility that the hypothesis of interest was chosen after seeing which outcomes produced significant effects, and optional stopping rules on data collection (see Simmons et al., 2011). Each of these practices has the potential to inflate further the false-positive rate and therefore the FDR. Indeed, if researchers do anything other than conduct a single statistical test and report only that p value (or correctly adjust the p value for the number of tests planned and conducted), the proportion of false positives when the null is true (i.e., the α rate) is inflated, and therefore so is the FDR. However, even in the absence of these considerations, the use of low-powered studies alone is capable of inflating the rate of false positives among significant findings in ways that are not always appreciated. Given the relatively low estimated power of the studies published in some of the journals we surveyed in 2011 and what is known about the prevalence of researcher degrees of freedom (Agnoli et al., 2017; Banks et al., 2016; John et al., 2012), the FDR of significant findings published in those journals during that time could be alarmingly high.
Has N-pact changed over time?
Figure 2 illustrates trends in median sample sizes for each journal across the years. For a point of comparison, we have included the numbers from 2006 to 2010 reported by Fraley and Vazire (2014). Because Fraley and Vazire did not differentiate studies on the basis of their research designs (e.g., between vs. within), for comparability, Figure 2 illustrates the sample sizes without regard to this distinction for the 2011 through 2019 data. That is, Figure 2 reports the median sample sizes regardless of the design of the study.

Fig. 2. N-pact factors (NFs) over time across social/personality journals. The reported NFs are based on studies using any design, thereby allowing the Fraley and Vazire (2014) estimates for 2006 to 2010 to be included and compared with our NF calculations. JRP = Journal of Research in Personality; JP = Journal of Personality; EJP = European Journal of Personality; SPPS = Social Psychological and Personality Science; PSPB = Personality and Social Psychology Bulletin; JPSP = Journal of Personality and Social Psychology; EJSP = European Journal of Social Psychology; PS:S = Psychological Science (social/personality); JESP = Journal of Experimental Social Psychology.
Figure 3 summarizes the same kind of information, but expressed as the estimated statistical power, given the journal NF, to detect a population effect of ρ = .20. Under the assumptions described previously, journals published increasingly powerful research over time, with a noteworthy uptick around 2016 to 2019. By 2019, all journals but one were publishing studies that tended to be powered 80% or higher to detect a typical published effect size in social/personality psychology.

Fig. 3. Estimated statistical power to detect a population association (ρ) of .20 over time, based on N-pact factors, across social/personality journals. JRP = Journal of Research in Personality; JP = Journal of Personality; EJP = European Journal of Personality; SPPS = Social Psychological and Personality Science; PSPB = Personality and Social Psychology Bulletin; JPSP = Journal of Personality and Social Psychology; EJSP = European Journal of Social Psychology; PS:S = Psychological Science (social/personality); JESP = Journal of Experimental Social Psychology.
The stability of the relative rank-ordering of the various journals over time was high: Journals that tended to publish studies using larger sample sizes in 2011, for example, were also likely to be publishing studies using larger sample sizes in 2019. The average test-retest stability coefficient (across all intervals from 2011 to 2019) was r = .65. This was true also for the six journals that were coded for all years between 2006 and 2019 (average test-retest correlation = .61).
What is the relationship between journal impact factors and journal NFs?
Finally, we examined the relationship between journals’ NFs for the years 2011 through 2019 and their impact factors from 2019 (the 2019 impact factors were the most recent at the time of data collection [2020] given that impact factors are based on data from the 2 years preceding the year in question). Figure 4 reveals that the journals with the highest impact factors are not necessarily the journals publishing the highest quality research as indexed by the NF. In fact, the correlation is essentially zero (r = .0006).

Fig. 4. The N-pact factor (NF) of journals plotted against their impact factor (2019). NF is based on the aggregate NFs from 2011 to 2019. JRP = Journal of Research in Personality; JP = Journal of Personality; EJP = European Journal of Personality; SPPS = Social Psychological and Personality Science; PSPB = Personality and Social Psychology Bulletin; JPSP = Journal of Personality and Social Psychology; EJSP = European Journal of Social Psychology; PS:S = Psychological Science (social/personality); JESP = Journal of Experimental Social Psychology.
Discussion
Are psychology journals publishing studies based on small sample sizes? The so-called replicability crisis in psychology (or “credibility revolution,” a term coined by Angrist & Pischke, 2010, and applied to psychology by Vazire, 2018) has sparked renewed interest in the problem of small-N studies and the heightened risk of false positives they create (Schimmack, 2012). However, little is known about the actual sample sizes used in published studies, how they vary across journals, and how they have changed over time. The aim of this study was to provide data on these metascientific questions.
In our study, we examined the median sample size of social/personality studies published in nine journals from 2011 to 2019. Our results show that there is considerable variation across journals in the sample sizes of the between-persons studies they publish and thus the power of the studies they publish. In 2011, for example, the estimated median power to detect an effect size of ρ = .20 in published between-persons studies ranged from 40% (PS:S: NF = 72.5) to 88% (JP: NF = 239). In addition to this variation, the overall power of published studies in social/personality psychology tended to be lower than what is desirable. Indeed, only two journals (JP and JRP) published studies in 2011 that, on average, were powered at levels of 80% or higher.
Underpowered studies can be problematic. They should, in principle, have a low chance of detecting effects that actually exist. In reality, the underpowered studies that make it into the published literature almost always find significant effects. Because underpowered studies cannot consistently produce significant effects without researcher degrees of freedom, many of these effects are likely false positives or, at a minimum, inflated. Thus, relying on underpowered research to understand human behavior is bound to provide a blurry and inaccurate portrait of the phenomena of interest. Most importantly, the FDR among those reports can greatly exceed the conventionally assumed 5% value when those studies are based on low-powered designs. Indeed, our calculations, summarized in Table 3, suggest that before even considering p-hacking and researcher degrees of freedom, the FDR could easily have been as high as 25% in social/personality journals in 2011. When factoring in p-hacking, it is very plausible that the FDR was in fact much higher.
Although the estimated power of studies to detect an effect of ρ = .20 or higher was lower in 2011 than what is desirable, our analyses indicate that things were different between 2016 and 2019. Given the increasing sample sizes in published research, by 2019, all but one of the journals we examined published studies that, on average, were powered .80 or higher to detect an effect of ρ = .20 or higher. At the risk of seeming hyperbolic, these data suggest that the research culture in social/personality psychology has undergone a monumental shift in research practices over the past few years. This change to larger sample sizes from 2011 to 2019 is noteworthy given the previous resistance to change in response to similar calls for greater attention to power and sample size. In 1989, Sedlmeier and Gigerenzer published an article titled “Do Studies of Statistical Power Have an Effect on the Power of Studies?” They conducted a follow-up study to Cohen’s 1962 study reporting low levels of power for studies published in the Journal of Abnormal and Social Psychology (which was later split into two journals, the Journal of Abnormal Psychology and JPSP) in 1960. Sedlmeier and Gigerenzer surveyed one volume of the Journal of Abnormal Psychology published in 1984 and concluded that 24 years later, Cohen’s plea for greater power had had no effect. Power was, if anything, lower in 1984 than in 1960, and there was almost no attention to power in the text of the articles. Rossi (1990) conducted a similar analysis of statistical power in three clinical and social-psychology journals in 1982 and also found results similar to the levels reported by Cohen in 1962. More recently, Singleton Thorn (2020) conducted a meta-analysis of power studies in psychology, with most meta-analyses sampling studies from 1970 to 2000, and found no change over time. In light of the persistence of underpowered research in psychology, the change we observed in the estimated power to detect an effect of ρ = .20 or higher from 2011 to 2019 is striking. The estimates we have reported here indicate that after decades of criticism, commentary, and cautionary tales, the practices used in the field have finally changed in visible and constructive ways.
It is difficult to say what accounts for this positive change. It seems possible that the credibility revolution has played some role in these shifts. Discussions about replication, statistical power, and the need for high-quality, reproducible research have been salient for a decade. These have likely affected the ways in which researchers design studies and the ways reviewers evaluate them. Moreover, these conversations have led to changes in editorial policies. For example, the editorial policies at PSPB now require that authors explain how “sample size was determined” and provide “all information required to reproduce the power analysis” if one is reported (“Manuscript Submission Guidelines,” 2022). It is also possible that the awareness of such issues combined with an increased use of online samples and archival data sets may have contributed to these changes. For example, Sassenberg and Ditrich (2019) found that the number of studies using online samples was higher in 2016 and 2018 than it was in 2009 and 2011. Moreover, studies that used larger samples were also more likely to use online samples. It is also possible, as one reviewer of this article suggested, that researchers are increasingly performing a priori power analyses and, as a result, using larger sample sizes (although for evidence to the contrary, see Bakker et al., 2020). Or, relatedly, perhaps researchers are deliberately studying smaller effect sizes in recent years and are appropriately powering their studies to do so. This study is not positioned to determine exactly what has led to the changes we have documented here. Regardless of how these changes came about, we welcome them and hope they represent an enduring rather than a transient shift in the ways in which social/personality psychologists design research moving forward. We also hope they will not be limited to research conducted online.
This change is also important because of the implications it has for the incentives to engage in questionable research practices, such as not disclosing flexibility in data collection and analysis (i.e., p-hacking). When sample sizes are small and studies are underpowered, even researchers whose hypotheses are true have a strong incentive to p-hack in a world in which journals prefer to publish exciting discoveries (i.e., significant results). With larger sample sizes, researchers testing true hypotheses have less incentive to p-hack—they have a much greater chance of obtaining evidence for their research hypothesis without p-hacking than they would with smaller samples. Of course, more metascientific research on the rigor and robustness of published findings is needed to more directly test whether p-hacking has decreased overall or the quality of research has increased over time.
Another pattern in our results is that journals that publish personality research tend to have larger median sample sizes and therefore higher NFs than journals that publish social-psychological research or a combination of social and personality research. Although this was noted in the original N-pact work by Fraley and Vazire (2014), the present work allows us to rule out the possibility that this difference is due to differential uses of between- and within-persons designs. Specifically, we found that this difference exists when considering only studies that used between-persons designs.
Why might these subfield differences exist? One possibility is that personality psychologists have a longer history of thinking about sample sizes, effect sizes, and statistical power (Atherton et al., 2021). Although this is obviously speculative, our impression is that the long-standing use of common effect-size metrics in personality psychology, such as correlations, has allowed researchers to better understand what kinds of effects are to be expected and to better appreciate the kinds of sample sizes that are needed to detect them reliably.
Another explanation is that there may be something about personality psychology that makes large samples easier to obtain (e.g., greater use of convenience samples, inexpensive methods/designs), which would also allow personality researchers to more easily increase their sample sizes in response to changing norms. However, there is little evidence that social and personality psychology differ in their reliance on convenience samples, intensive methods, or complex designs, although the specific methods/designs used vary between the two subfields. For example, social-psychological research uses confederates more than personality research, whereas personality research uses longitudinal designs more than social-psychological research (Vazire et al., 2017).
Finally, another possible explanation for the difference between social and personality journals is that published social-psychology articles tend to have more studies per article than do published personality psychology articles. For example, JESP had an average of 3.20 studies per article, whereas JRP had an average of 1.43 studies per article. If labs are treating “total N” per article as a finite resource, that resource has to be divided in multiple ways to build a multistudy paper. Nonetheless, our data suggest that the total sample sizes tend to be similar, if not larger, in personality articles. For example, the total sample sizes in articles were comparable for JESP (325) and JRP (340).
We also found that the NFs for the journals we sampled were not positively correlated with their impact factors, although, of course, the sample size for this analysis (nine journals) is very small. This is noteworthy because impact factors have been the dominant way in which status and prestige have been assigned to journals and the articles that they publish. If journals with higher status are not actually producing higher quality research as indexed by the statistical power of the studies they publish, then one should begin to question the emphasis that has been placed on impact factors for judging, ranking, and assigning status to journals.
We should be clear that we are not suggesting that traditional citation metrics are not of value. There is utility in considering the citation rates of articles and the journals that publish them. What we are proposing, on the basis of these data, is that the information contained in citation rates is distinctive from that captured by NFs. There is no need to abandon impact factors as measures of impact, but our results contribute to the growing evidence that they are not valid measures of the methodological quality of the work published in those journals (McKiernan et al., 2019). If one wishes to consider the ability of the work published by journals to detect and precisely estimate real effects of interest and to minimize the FDR in the field, our findings show that impact factors cannot be used for this purpose. This may be surprising to some readers because it would seem that higher-status journals—publishing research with higher citation and impact rates—would be publishing higher-quality work (i.e., work that is designed to provide valuable data on the questions of interest). However, the research published in high-impact-factor journals is neither more nor less likely to be based on high-powered designs.
We would never argue that the NF is enough or even that it captures the most important dimensions of quality, but it captures one dimension of quality that is clearly not captured by the impact factor. In the meantime, scholars would be well served by indices such as the NF, which allow them to evaluate journals on criteria that are known to capture at least some aspects of research quality, and these evaluations can help guide their choices about which journals to support (e.g., through submissions, reviews, editorial service, membership in professional organizations). Having said that, we also suggest that the use of journal-level metrics is not an ideal substitute for evaluating the quality of research at the level of the scholar or the article itself. When considering a promotion case, for example, it would be much more valuable to read and evaluate the actual empirical work of the scientist in question than to rely on one or more metrics for the journals in which the scientist publishes.
Limitations
There are several limitations of our work. First, our conclusions need to be circumscribed because of the small and relatively narrow set of journals we examined. We believe we have sampled most of what are considered the top journals in social and personality psychology, but we cannot generalize to journals in other subdisciplines or journals outside this group. Another constraint on our conclusions is the time period we examined. Because of the intensive nature of coding journal articles, we have completed coding only through 2019. The credibility revolution in social and personality psychology has continued strong in the years since then, and we suspect that there could be even more rapid change since 2019. We plan to continue coding journal articles to update the NF rankings over time. Third, it bears repeating that although sample size is an important factor in understanding statistical power, it is only one such factor. Researchers can also increase the power of their research by using more reliable measurements, using more powerful manipulations in experimental work, or simply studying phenomena that are easier to detect.
A final limitation of this work is that our calculations concerning statistical power are based on a specific effect size: a population correlation of .20. We selected that value because previous metascientific research suggests that it is typical of effect sizes for published research in social and personality psychology. However, it is possible that the average effect sizes being studied in social and personality psychology have changed over time. For example, if recent studies have been focused on smaller effect sizes, it is possible that despite using larger samples on average, the statistical power of published research has not, in fact, increased over time. This is a possibility that cannot be ruled out with the present data. Thus, although we have concluded that the statistical power of published studies to detect effects of a specific size (i.e., ρ = .20) has increased over time, we should be clear that this does not necessarily imply that the statistical power of published research itself has increased over time.
To more fully evaluate whether statistical power itself has increased in recent years, it would be necessary to know the actual population effect sizes that are being studied. On the surface, this could be estimated by quantifying the effect sizes reported in published research. But because researchers are now using larger samples, it could be the case that the average published effect size has decreased not because researchers are targeting and planning for smaller effect sizes, but because their larger-sample research is better positioned to detect smaller effect sizes. The ideal way to evaluate whether targeted effect sizes have gotten smaller would be to obtain registrations of researchers’ a priori expected effects and examine whether those expectations have changed over time (although even this could be due to other changes, such as more realistic expectations over time, rather than changes in the true magnitude of the effects being studied). Given the recency of registration and preregistration in psychological science, this may be a metascientific question that cannot be resolved for older published research. But it may be a question that can be addressed for ongoing research in the near future.
Closing
In his widely read 1990 article in which he reflected on statistical practices in the behavioral sciences, Jacob Cohen wrote:

I did what came to be called a meta-analysis of the articles in the 1960 volume of the Journal of Abnormal and Social Psychology (Cohen, 1962). I found [that] the median power to detect a medium effect was .46—a rather abysmal result. Of course, investigators could not have known how underpowered their research was, as their training had not prepared them to know anything about power, let alone how to use it in research planning. One might think that after 1969, when I published my power handbook that made power analysis as easy as falling off a log, the concepts and methods of power analysis would be taken to the hearts of null hypothesis testers. So one might think. (Stay tuned.) (p. 1308)
For a long time, things did not change. The average estimated power reported by Fraley and Vazire (2014) was close to the value Cohen originally reported in 1962, and Singleton Thorn (2020) detected no change from 1970 to 2000. But Cohen (1990, p. 1131) also observed that “these things take time” and assured himself by reflecting on the fact that it took two World Wars before Student’s t test was regularly included in statistics books. Our work suggests that after more than half a century, behavioral scientists working in social/personality psychology have finally begun to consider such matters. In fact, we may be entering an era in which one can be reasonably assured that articles published in the major journals are well powered to detect typical effect sizes. Or so one may hope. (Stay tuned.)
Footnotes
Transparency
Action Editor: Robert L. Goldstone
Editor: Robert L. Goldstone
Author Contribution(s)
