Abstract
In a series of influential articles, Spence and Stanley have discussed to what degree researchers can expect a published effect to replicate in a replication study. They argue that expectations are often too high because sampling variability and measurement error are not fully taken into account. They conclude that (a) the failure of a single study to replicate a published effect might be less serious than often assumed, (b) the replication crisis might only exist for those who hold unduly high expectations about replications, and (c) researchers should temper their expectations about replication studies. However, these claims are based on a highly unusual and far too pessimistic approach used in their initial work on this topic. Later, Spence and Stanley promoted the well-established prediction intervals, for which their recent article in this journal provides an instructive tutorial. We use these prediction intervals to demonstrate that their previous claims about replications were far too pessimistic and need to be updated. We conclude that the failure of a single study to replicate a published effect should indeed be taken seriously (given, of course, that the replication study is well designed and has sufficient statistical power), and we warn that overly pessimistic expectations about replications can be just as detrimental to science as overly optimistic ones. This is crucial because overly pessimistic expectations can make it prohibitively difficult for researchers to demonstrate evidence against original results.
Replication studies provide the core means to evaluate the robustness of findings in all areas of science. To evaluate the success of replication studies, it is important to consider what can be expected from a replication study. Spence and Stanley (2016, 2024; Stanley & Spence, 2014) have argued that expectations are often too high because sampling variability and measurement error are not appropriately considered. For example, they argued that “the data from one study should not be interpreted as replicating or failing to replicate the results of another study” (Stanley & Spence, 2014, p. 316). Consequently, they called for tempered expectations and suggested that the replication crisis might exist only for researchers with unduly high expectations for replications (Stanley & Spence, 2014, p. 316).
Stanley and Spence’s (2014) conclusions have strongly shaped researchers’ expectations regarding replications, as evidenced by recent articles citing them as a key reference on this topic (e.g., Fiedler & Trafimow, 2024; Lishner, 2024; Margoni, 2022; Vohs, 2015; Winne, 2017). For example, Vohs (2015, p. e87) used Stanley and Spence (2014) as one reason to question the validity of failed replication attempts. Likewise, Fiedler and Trafimow (2024) argued that “Monte-Carlo simulations by Stanley and Spence (2014) have clarified that the goal of replicating every finding is unwarranted. Replications of even completely valid findings are widely dispersed around the true effect size” (p. 2582).
However, we show that the method used to arrive at these conclusions provides a much too pessimistic view of what to expect from a replication. For brevity, we dub this method “SRI” (for “Stanley & Spence, 2014, replication interval”). We make our case by contrasting SRIs with prediction intervals (PIs). PIs are a well-established frequentist method to estimate the outcome of future observations (Estes, 1997; Geisser, 1993) and are now also endorsed by Spence and Stanley (2024), thereby ensuring that our arguments are based on the same assumptions they are willing to make (for details on the relationship between SRIs and PIs, see Appendix A1).
Before making our argument, we stress that it is important to adopt neither an overly optimistic nor an overly pessimistic view of the possibilities of replications. Researchers trying to replicate an original study often face an uphill battle, especially when their study fails to replicate the original result. In such cases, researchers are typically (and often rightfully) required to use much larger sample sizes, to run multiple variants and control conditions of the original study, and to face at least one reviewer who is an author of the original study (and who is potentially biased). If, in addition to these normal (and, in part, unavoidable) hurdles, there were also an overly pessimistic view of what to expect from a single replication study, failures to replicate would not be taken seriously enough, and it would be even harder to correct the scientific record.
Instructive Example
Consider a simple example: Stanley and Spence (2014, p. 315) described a situation in which the original study reported a correlation of rorig = +.30 and then argued that scientists can expect a replication to result in values as low as rrep = −.47 (with sample size N = 40 and reliability ryy = .70). That is, if researchers of a replication study found this substantial negative correlation, they would argue that the discrepancy could still be fully accounted for by sampling variability and measurement error alone, such that the underlying true effects might still be identical. In contrast, the lower limit of the 95% PI is much tighter (rrep = −.11), thereby providing a much more realistic view of what one can expect from a replication study (as long as both studies measured the same underlying true effect). This huge discrepancy demonstrates that SRIs are much more pessimistic than the well-established PIs. The disagreement between SRIs and PIs also leads to different interpretations of specific results. For example, if a replication study attempted to replicate the measured correlation of rorig = +.30 and found a negative correlation of rrep = −.30, this would be considered a failure to replicate based on the 95% PI but not based on the SRI.
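To make the PI side of this example concrete, the following minimal Python sketch computes a 95% PI with the common Fisher z construction. This is our approximation, not necessarily the exact procedure behind the numbers above: It ignores measurement error and therefore yields a lower limit of roughly −.15 rather than the −.11 reported in the text.

```python
import math
from statistics import NormalDist

def pi_correlation(r_orig, n_orig, n_rep, level=0.95):
    """Prediction interval for a replication's correlation, via Fisher z
    (a common construction; a sketch, not Spence and Stanley's exact code)."""
    z = math.atanh(r_orig)                              # Fisher z of original r
    se = math.sqrt(1 / (n_orig - 3) + 1 / (n_rep - 3))  # both studies vary
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)    # ~1.96 for 95%
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

print(pi_correlation(0.30, 40, 40))  # roughly (-0.15, +0.64)
```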
Why are SRIs so pessimistic regarding replications? The main reason is that SRIs are based on double worst-case scenarios (Fig. 1a). We explain this step by step for the left side of the SRI: Assuming again a measured correlation in the original study of rorig = +.30, Stanley and Spence (2014, p. 315) reasoned that if the true correlation was rtrue = −.10, then the upper end of the 99% range of possible outcomes would just include rorig = +.30. Therefore, they considered rtrue = −.10 a “possible underlying realit[y]” (p. 313) and used it to determine the lower limit of their SRI. This is the first worst-case scenario: The original study is assumed to have measured at the uppermost extreme of the 99% range of the assumed underlying reality. Next, they reasoned that if the true correlation was indeed rtrue = −.10, then the 99% range of possible outcomes comprises values as low as rrep = −.47 at the lower end. Therefore, they used rrep = −.47 as the lower limit of their SRI. This is the second worst-case scenario: The replication study is assumed to measure at the lowermost extreme of the 99% range. Analogous reasoning applied to the right side of the SRI yields its upper limit. In our simulations below, we show that these double worst-case scenarios lead to extremely wide SRIs and that such double worst cases are highly unlikely and will rarely, if ever, occur in practice.
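The double worst-case construction can be sketched in a few lines. Again, this is our Fisher z approximation that ignores measurement error, so it reproduces Stanley and Spence's limits only approximately (about −.11 for the boundary reality and −.49 for the lower SRI limit, vs. their reported −.10 and −.47):

```python
import math
from statistics import NormalDist

def sri(r_orig, n, coverage=0.99):
    """Approximate SRI via Fisher z (a sketch of the double worst case).
    Worst case 1: the most extreme 'underlying realities' whose 99% range of
    outcomes just includes r_orig.  Worst case 2: the outer 99% limits of the
    outcome ranges under those realities."""
    crit = NormalDist().inv_cdf(1 - (1 - coverage) / 2)  # ~2.576 for 99%
    half = crit / math.sqrt(n - 3)    # half-width of the 99% range in z units
    z_orig = math.atanh(r_orig)
    z_low = z_orig - half             # reality whose upper 99% limit is r_orig
    z_high = z_orig + half            # reality whose lower 99% limit is r_orig
    return math.tanh(z_low - half), math.tanh(z_high + half)

print(sri(0.30, 40))  # roughly (-0.49, +0.82); the article reports -0.47
```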

Figure 1. Prediction interval (PI) and Stanley and Spence (2014) replication interval (SRI) for the instructive example. (a) Construction principle of the SRI: For the left side of the SRI, the true correlation is considered for which the original measurement is just included at the upper end of the 99% range of possible outcomes (Worst-Case Scenario 1). Then, the lower end of this 99% range yields the lower SRI limit (Worst-Case Scenario 2). The right side is constructed analogously. Thus, SRIs assume a double worst-case scenario for each side. Note that correlation distributions are more skewed the farther away from zero the underlying true values are. (b) SRI (red) and 95% PI (green) for the example by Stanley and Spence (p. 315): The SRI is much wider than the 95% PI. Note that Stanley and Spence considered only underlying realities rtrue for which the original measurement rorig = +.30 lies within the 99% range of possible outcomes (cf. panel a).
Before describing our simulations, we need to explain two details (Fig. 1b). First, Stanley and Spence (2014) performed only limited simulations, using a true correlation of rtrue = +.30 with N = 40 and ryy = .70 (the parameters of their instructive example); for comparability, we used the same parameters. Second, because of the limited reliability of ryy = .70, measured correlations are attenuated: They scatter not around rtrue = +.30 but around the attenuated value of approximately r = +.25 (cf. Fig. 2a).
Simulations
We simulated a situation in which a true correlation of rtrue = +.30 existed and two studies were performed, each with a sample size of N = 40 and a measurement reliability of ryy = .70. We then determined the SRIs and 95% PIs for each simulation; Figure 2 depicts the results.

Figure 2. Prediction intervals (PIs) and Stanley and Spence (2014) replication intervals (SRIs) for 1,000 original studies and replications. (a) Simulated correlation coefficients of 1,000 original studies (N = 40) and replication studies (Nrep = 40) for a true correlation of rtrue = +.30 (solid vertical line). Because we used a reliability of ryy = .70, the attenuated correlation (dashed vertical line) is shifted from r = +.30 to r = +.25. (b) For the same simulations, we depict SRIs (red) and 95% PIs (green). Differences between replication study and original study are depicted as black dots. The bar on the left indicates the smaller of the two intervals that still includes the replication study. The red SRIs (Stanley & Spence, 2014) are much wider than the green 95% PIs (Spence & Stanley, 2016, 2024) and do not exclude even one of the 1,000 replications. Results are sorted from bottom to top by the difference between replication study and original study.
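The simulation just described can be reproduced with a short script. The sketch below makes one modeling assumption of our own: Measurement error is added to one of the two variables such that its reliability is ryy = .70, which attenuates the observed correlation from +.30 to about +.25 (matching Fig. 2a). It then checks the coverage of the Fisher z 95% PI:

```python
import numpy as np

rng = np.random.default_rng(1)
r_true, n, r_yy, n_sim = 0.30, 40, 0.70, 1000

def simulate_r():
    """One study: correlate a perfectly measured x with a noisy y whose
    reliability is r_yy, so the observed r is attenuated to ~.25."""
    x, y = rng.multivariate_normal([0, 0], [[1, r_true], [r_true, 1]], size=n).T
    y_obs = y + rng.normal(0, np.sqrt((1 - r_yy) / r_yy), size=n)
    return np.corrcoef(x, y_obs)[0, 1]

pairs = np.array([(simulate_r(), simulate_r()) for _ in range(n_sim)])

# Fraction of replications inside the 95% PI around their original study
# (Fisher z construction, as in the sketch above).
z_diff = np.arctanh(pairs[:, 1]) - np.arctanh(pairs[:, 0])
inside = np.abs(z_diff) <= 1.96 * np.sqrt(2 / (n - 3))
print(f"inside 95% PI: {inside.mean():.3f}")  # ~0.94-0.95
```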
Before interpreting these results, we first note that researchers using confidence intervals always face a well-known trade-off (PIs are confidence intervals for the difference between original study and replication study, centered around the original study’s value, with the variability of the data estimated from the original study): Researchers can either achieve a high confidence level at the cost of tolerating wide PIs or achieve narrow PIs at the cost of tolerating a low confidence level. The traditional solution to this trade-off is to choose a confidence level of 95%, as now also advocated by Spence and Stanley (2024).
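In formulas, one common Fisher z construction (a sketch consistent with this description, not necessarily the exact variant used by Spence and Stanley) writes z = artanh(r) and uses the approximate normality of the difference between the two z values:

$$
z_{\mathrm{rep}} - z_{\mathrm{orig}} \sim \mathcal{N}\!\left(0,\; \tfrac{1}{N_{\mathrm{orig}}-3} + \tfrac{1}{N_{\mathrm{rep}}-3}\right),
\qquad
\mathrm{PI}_{95\%} = \tanh\!\left(z_{\mathrm{orig}} \pm 1.96\,\sqrt{\tfrac{1}{N_{\mathrm{orig}}-3} + \tfrac{1}{N_{\mathrm{rep}}-3}}\right).
$$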
As expected, Figure 2 shows that in roughly 95% (940 out of 1,000) of simulated studies, the difference between the original study and the replication lies within the 95% PI (Fig. 2, green), as already shown by Spence and Stanley (2016). However, the SRIs (Fig. 2, red) are much wider, and none of the 1,000 studies fall outside them. In fact, only after drastically increasing the number of simulations could we determine that SRIs correspond roughly to 99.97% PIs: Only 0.03% of simulated replications fall outside the SRIs (3,212 out of 10,000,000 simulations; see Appendix A4). That is, SRIs resolve the above-described trade-off in a highly unusual and very conservative way, resulting in extremely wide predicted ranges for the difference between original study and replication. SRIs therefore allow for huge differences between original study and replication, much larger than one would expect using standard methods. This makes it prohibitively difficult for researchers to demonstrate evidence against original results.
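The correspondence to roughly 99.97% PIs can also be checked without simulations under the Fisher z approximation (our back-of-the-envelope check, which ignores measurement error and the skew of correlation distributions): In z units, the SRI half-width is twice the 99% half-range of a single study, whereas the PI half-width scales with the standard error of the difference between two studies, and the sample size cancels from the ratio:

```python
from statistics import NormalDist

# SRI half-width in z units: 2 * 2.576 / sqrt(n - 3).
# PI half-width in z units:  crit * sqrt(2 / (n - 3)).
# Setting them equal, n cancels: crit = 2 * 2.576 / sqrt(2) ~ 3.64.
crit_99 = NormalDist().inv_cdf(0.995)        # ~2.576
equiv = 2 * crit_99 / 2 ** 0.5               # ~3.64
level = 2 * NormalDist().cdf(equiv) - 1
print(f"SRIs correspond to roughly a {level:.2%} PI")  # ~99.97%
```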
Conclusions
When it comes to replications, things are not as bleak as suggested by Stanley and Spence (2014). Given that this study has strongly shaped researchers' expectations, it is necessary to update those expectations to a more realistic level: It is reasonable to expect a replication to produce results roughly similar to those of the original study, assuming the replication measured the same underlying effect, was well conducted, and had sufficient statistical power. A conventional range of plausible values is given by the 95% PI, and one can reasonably consider it a failure to replicate when a replication attempt falls outside this range (as now also advocated by Spence & Stanley, 2016, 2024). Only if researchers wanted to brace for catastrophic worst-case scenarios would they need to allow for the extremely wide ranges given by SRIs, which correspond to approximately 99.97% PIs. But if researchers really wanted to follow such a policy, they would also need to drastically raise the threshold for accepting results in original studies, so that producing spurious results is not easier than refuting them.
Nevertheless, we agree with Spence and Stanley (2024) that more focus should be put on comparing effect sizes (either raw or standardized; Baguley, 2009; Morris, 2020) and not solely on the binary decision of whether an effect exists. Combining effect sizes in meta-analyses can reduce problems arising from underpowered original studies (as nicely discussed by Spence & Stanley, 2016, p. 18) because such studies are given less weight than well-powered replications. Although we have argued that replication attempts should not be dismissed too quickly, replication studies should also have sufficient power (ideally more than the original study) to avoid cluttering the literature. This way, the literature can converge toward appropriate effect-size estimates and better science at large.
