Abstract
In a series of influential articles, Spence and Stanley have discussed to what degree researchers can expect a published effect to replicate in a replication study. They argue that expectations are often too high because sampling variability and measurement error are not fully taken into account. They conclude that (a) the failure of a single study to replicate a published effect might be less serious than often assumed, (b) the replication crisis might only exist for those who hold unduly high expectations about replications, and (c) researchers should temper their expectations about replication studies. However, these claims are based on a highly unusual and far too pessimistic approach used in their initial work on this topic. Later, Spence and Stanley promoted the well-established prediction intervals, for which their recent article in this journal provides an instructive tutorial. We use these prediction intervals to demonstrate that their previous claims about replications were far too pessimistic and need to be updated. We conclude that the failure of a single study to replicate a published effect should indeed be taken seriously (given, of course, that the replication study is well designed and has sufficient statistical power), and we warn that overly pessimistic expectations about replications can be just as detrimental to science as overly optimistic ones. This is crucial because overly pessimistic expectations can make it prohibitively difficult for researchers to demonstrate evidence against original results.
Replication studies provide the core means to evaluate the robustness of findings in all areas of science. To evaluate the success of replication studies, it is important to consider what can be expected from a replication study. Spence and Stanley (2016, 2024; Stanley & Spence, 2014) have argued that expectations are often too high because sampling variability and measurement error are not appropriately considered. For example, they argued that “the data from one study should not be interpreted as replicating or failing to replicate the results of another study” (Stanley & Spence, 2014, p. 316). Consequently, they called for tempered expectations and suggested that the replication crisis might exist only for researchers with unduly high expectations for replications (Stanley & Spence, 2014, p. 316).
Stanley and Spence’s (2014) conclusions have strongly shaped researchers’ expectations regarding replications, as evidenced by recent articles citing them as a key reference on this topic (e.g., Fiedler & Trafimow, 2024; Lishner, 2024; Margoni, 2022; Vohs, 2015; Winne, 2017). For example, Vohs (2015, p. e87) used Stanley and Spence (2014) as one reason to question the validity of failed replication attempts. Likewise, Fiedler and Trafimow (2024) argued that “Monte-Carlo simulations by Stanley and Spence (2014) have clarified that the goal of replicating every finding is unwarranted. Replications of even completely valid findings are widely dispersed around the true effect size” (p. 2582).
However, we show that the method used to arrive at these conclusions provides a much too pessimistic view of what to expect from a replication. For brevity, we dub this method “SRI” (for “Stanley & Spence, 2014, replication interval”). We make our case by contrasting SRIs with prediction intervals (PIs). PIs are a well-established frequentist method to estimate the outcome of future observations (Estes, 1997; Geisser, 1993) and are now also endorsed by Spence and Stanley (2024), thereby ensuring that our arguments are based on the same assumptions they are willing to make (for details on the relationship between SRIs and PIs, see Appendix A1).
Before making our argument, we stress that it is important to adopt neither an overly optimistic nor an overly pessimistic view of the possibilities of replications. Researchers trying to replicate an original study often face an uphill battle, especially when their study fails to replicate the original result. In such cases, researchers are typically (and often rightfully) required to use much larger sample sizes, to run multiple variants and control conditions of the original study, and to face at least one reviewer who is an author of the original study (and who is potentially biased). If, in addition to these normal (and, in part, unavoidable) hurdles, there were also an overly pessimistic view of what to expect from a single replication study, failures to replicate would not be taken seriously enough, and it would be even harder to correct the scientific record.
Instructive Example
Consider a simple example: Stanley and Spence (2014, p. 315) described a situation in which the original study reported a correlation of rorig = +.30 and then argued that scientists can expect a replication to result in values as low as rrep = −.47 (with sample size N = 40 and reliability ryy = .70). That is, if researchers of a replication study found this substantial negative correlation, they would argue that the discrepancy could still be fully accounted for by sampling variability and measurement error alone, such that the underlying true effects might still be identical. In contrast, the lower limit of the 95% PI is much tighter (rrep = −.11), thereby providing a much more realistic view of what one can expect from a replication study (as long as both studies measured the same underlying true effect). This huge discrepancy demonstrates that SRIs are much more pessimistic than the well-established PIs. The disagreement between SRIs and PIs also leads to different interpretations of specific results. For example, if a replication study attempted to replicate the measured correlation of rorig = +.30 and found a negative correlation of rrep = −.30, this would be considered a failure to replicate based on the 95% PI but not based on the SRI.
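To make the PI side of this example concrete, the following minimal Python sketch computes a 95% PI with the common Fisher z construction. This is our approximation, not necessarily the exact procedure behind the numbers above: It ignores measurement error and therefore yields a lower limit of roughly −.15 rather than the −.11 reported in the text.

```python
import math
from statistics import NormalDist

def pi_correlation(r_orig, n_orig, n_rep, level=0.95):
    """Prediction interval for a replication's correlation, via Fisher z
    (a common construction; a sketch, not Spence and Stanley's exact code)."""
    z = math.atanh(r_orig)                              # Fisher z of original r
    se = math.sqrt(1 / (n_orig - 3) + 1 / (n_rep - 3))  # both studies vary
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)    # ~1.96 for 95%
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

print(pi_correlation(0.30, 40, 40))  # roughly (-0.15, +0.64)
```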
Why are SRIs so pessimistic regarding replications? The main reason is that SRIs are based on double worst-case scenarios (Fig. 1a). We explain this step by step for the left side of the SRI: Assuming again a measured correlation in the original study of rorig = +.30, Stanley and Spence (2014, p. 315) reasoned that if the true correlation was rtrue = −.10, then the upper end of the 99% range of possible outcomes would just include rorig = +.30. Therefore, they considered rtrue = −.10 a “possible underlying realit[y]” (p. 313) and used it to determine the lower limit of their SRI. This is the first worst-case scenario: The original study is assumed to have measured at the uppermost extreme of the 99% range of the assumed underlying reality. Next, they reasoned that if the true correlation was indeed rtrue = −.10, then the 99% range of possible outcomes comprises values as low as rrep = −.47 at the lower end. Therefore, they used rrep = −.47 as the lower limit of their SRI. This is the second worst-case scenario: The replication study is assumed to measure at the lowermost extreme of the 99% range. Analogous reasoning applied to the right side of the SRI yields its upper limit. In our simulations below, we show that these double worst-case scenarios lead to extremely wide SRIs and that such double worst cases are highly unlikely and will rarely, if ever, occur in practice.
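The double worst-case construction can be sketched in a few lines. Again, this is our Fisher z approximation that ignores measurement error, so it reproduces Stanley and Spence's limits only approximately (about −.11 for the boundary reality and −.49 for the lower SRI limit, vs. their reported −.10 and −.47):

```python
import math
from statistics import NormalDist

def sri(r_orig, n, coverage=0.99):
    """Approximate SRI via Fisher z (a sketch of the double worst case).
    Worst case 1: the most extreme 'underlying realities' whose 99% range of
    outcomes just includes r_orig.  Worst case 2: the outer 99% limits of the
    outcome ranges under those realities."""
    crit = NormalDist().inv_cdf(1 - (1 - coverage) / 2)  # ~2.576 for 99%
    half = crit / math.sqrt(n - 3)    # half-width of the 99% range in z units
    z_orig = math.atanh(r_orig)
    z_low = z_orig - half             # reality whose upper 99% limit is r_orig
    z_high = z_orig + half            # reality whose lower 99% limit is r_orig
    return math.tanh(z_low - half), math.tanh(z_high + half)

print(sri(0.30, 40))  # roughly (-0.49, +0.82); the article reports -0.47
```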

Figure 1. Prediction interval (PI) and Stanley and Spence (2014) replication interval (SRI) for the instructive example. (a) Construction principle of the SRI: For the left side of the SRI, the true correlation is considered for which the original measurement is just included at the upper end of the 99% range of possible outcomes (Worst-Case Scenario 1). Then, the lower end of this 99% range yields the lower SRI limit (Worst-Case Scenario 2). The right side is constructed analogously. Thus, SRIs assume a double worst-case scenario for each side. Note that correlation distributions are more skewed the farther away from zero the underlying true values are. (b) SRI (red) and 95% PI (green) for the example by Stanley and Spence (p. 315): The SRI is much wider than the 95% PI. Note that Stanley and Spence considered only underlying realities rtrue for which the original measurement rorig = +.30 lies within the 99% range of possible outcomes (cf. panel a).
Before describing our simulations, we need to explain two details (Fig. 1b). First, Stanley and Spence (2014) performed only limited simulations, using a true correlation of rtrue = +.30 with N = 40 and ryy = .70 (the parameters of their instructive example); for comparability, we used the same parameters. Second, because of the limited reliability of ryy = .70, measured correlations are attenuated: They scatter not around rtrue = +.30 but around the attenuated value of approximately r = +.25 (cf. Fig. 2a).
Simulations
We simulated a situation in which a true correlation of rtrue = +.30 existed and two studies were performed, each with a sample size of N = 40 and a measurement reliability of ryy = .70. We then determined the SRIs and 95% PIs for each simulation; Figure 2 depicts the results.

Figure 2. Prediction intervals (PIs) and Stanley and Spence (2014) replication intervals (SRIs) for 1,000 original studies and replications. (a) Simulated correlation coefficients of 1,000 original studies (N = 40) and replication studies (Nrep = 40) for a true correlation of rtrue = +.30 (solid vertical line). Because we used a reliability of ryy = .70, the attenuated correlation (dashed vertical line) is shifted from r = +.30 to r = +.25. (b) For the same simulations, we depict SRIs (red) and 95% PIs (green). Differences between replication study and original study are depicted as black dots. The bar on the left indicates the smaller of the two intervals that still includes the replication study. The red SRIs (Stanley & Spence, 2014) are much wider than the green 95% PIs (Spence & Stanley, 2016, 2024) and do not exclude even one of the 1,000 replications. Results are sorted from bottom to top by the difference between replication study and original study.
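The simulation just described can be reproduced with a short script. The sketch below makes one modeling assumption of our own: Measurement error is added to one of the two variables such that its reliability is ryy = .70, which attenuates the observed correlation from +.30 to about +.25 (matching Fig. 2a). It then checks the coverage of the Fisher z 95% PI:

```python
import numpy as np

rng = np.random.default_rng(1)
r_true, n, r_yy, n_sim = 0.30, 40, 0.70, 1000

def simulate_r():
    """One study: correlate a perfectly measured x with a noisy y whose
    reliability is r_yy, so the observed r is attenuated to ~.25."""
    x, y = rng.multivariate_normal([0, 0], [[1, r_true], [r_true, 1]], size=n).T
    y_obs = y + rng.normal(0, np.sqrt((1 - r_yy) / r_yy), size=n)
    return np.corrcoef(x, y_obs)[0, 1]

pairs = np.array([(simulate_r(), simulate_r()) for _ in range(n_sim)])

# Fraction of replications inside the 95% PI around their original study
# (Fisher z construction, as in the sketch above).
z_diff = np.arctanh(pairs[:, 1]) - np.arctanh(pairs[:, 0])
inside = np.abs(z_diff) <= 1.96 * np.sqrt(2 / (n - 3))
print(f"inside 95% PI: {inside.mean():.3f}")  # ~0.94-0.95
```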
Before interpreting these results, we first note that researchers using confidence intervals always face a well-known trade-off (PIs are confidence intervals for the difference between original study and replication study, centered around the original study’s value, with the variability of the data estimated from the original study): Researchers can either achieve a high confidence level at the cost of tolerating wide PIs or achieve narrow PIs at the cost of tolerating a low confidence level. The traditional solution to this trade-off is to choose a confidence level of 95%, as now also advocated by Spence and Stanley (2024).
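In formulas, one common Fisher z construction (a sketch consistent with this description, not necessarily the exact variant used by Spence and Stanley) writes z = artanh(r) and uses the approximate normality of the difference between the two z values:

$$
z_{\mathrm{rep}} - z_{\mathrm{orig}} \sim \mathcal{N}\!\left(0,\; \tfrac{1}{N_{\mathrm{orig}}-3} + \tfrac{1}{N_{\mathrm{rep}}-3}\right),
\qquad
\mathrm{PI}_{95\%} = \tanh\!\left(z_{\mathrm{orig}} \pm 1.96\,\sqrt{\tfrac{1}{N_{\mathrm{orig}}-3} + \tfrac{1}{N_{\mathrm{rep}}-3}}\right).
$$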
As expected, Figure 2 shows that in roughly 95% (940 out of 1,000) of simulated studies, the difference between the original study and the replication lies within the 95% PI (Fig. 2, green), as already shown by Spence and Stanley (2016). However, the SRIs (Fig. 2, red) are much wider, and none of the 1,000 studies fall outside them. In fact, only after drastically increasing the number of simulations could we determine that SRIs correspond roughly to 99.97% PIs: Only 0.03% of simulated replications fall outside the SRIs (3,212 out of 10,000,000 simulations; see Appendix A4). That is, SRIs resolve the above-described trade-off in a highly unusual and very conservative way, resulting in extremely wide predicted ranges for the difference between original study and replication. SRIs therefore allow for huge differences between original study and replication, much larger than one would expect using standard methods. This makes it prohibitively difficult for researchers to demonstrate evidence against original results.
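The correspondence to roughly 99.97% PIs can also be checked without simulations under the Fisher z approximation (our back-of-the-envelope check, which ignores measurement error and the skew of correlation distributions): In z units, the SRI half-width is twice the 99% half-range of a single study, whereas the PI half-width scales with the standard error of the difference between two studies, and the sample size cancels from the ratio:

```python
from statistics import NormalDist

# SRI half-width in z units: 2 * 2.576 / sqrt(n - 3).
# PI half-width in z units:  crit * sqrt(2 / (n - 3)).
# Setting them equal, n cancels: crit = 2 * 2.576 / sqrt(2) ~ 3.64.
crit_99 = NormalDist().inv_cdf(0.995)        # ~2.576
equiv = 2 * crit_99 / 2 ** 0.5               # ~3.64
level = 2 * NormalDist().cdf(equiv) - 1
print(f"SRIs correspond to roughly a {level:.2%} PI")  # ~99.97%
```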
Conclusions
When it comes to replications, things are not as bleak as suggested by Stanley and Spence (2014). Given that this study has strongly shaped researchers' expectations, it is necessary to update those expectations to a more realistic level: It is reasonable to expect a replication to produce results roughly similar to those of the original study, assuming the replication measured the same underlying effect, was well conducted, and had sufficient statistical power. A conventional range of plausible values is given by the 95% PI, and one can reasonably consider it a failure to replicate when a replication attempt falls outside this range (as now also advocated by Spence & Stanley, 2016, 2024). Only if researchers wanted to brace for catastrophic worst-case scenarios would they need to allow for the extremely wide ranges given by SRIs, which correspond to approximately 99.97% PIs. But if researchers really wanted to follow such a policy, they would also need to drastically raise the threshold for accepting results in original studies, so that producing spurious results is not easier than refuting them.
Nevertheless, we agree with Spence and Stanley (2024) that more focus should be put on comparing effect sizes (either raw or standardized; Baguley, 2009; Morris, 2020) and not solely on the binary decision of whether an effect exists. Combining effect sizes in meta-analyses can reduce problems arising from underpowered original studies (as nicely discussed by Spence & Stanley, 2016, p. 18) because such studies are given less weight than well-powered replications. Although we have argued that replication attempts should not be dismissed too quickly, replication studies should also have sufficient power (ideally more than the original study) to avoid cluttering the literature. This way, the literature can converge toward appropriate effect-size estimates and better science at large.
