Abstract
Over the last decade, replication research in the psychological sciences has become more visible. One way that replication research can be conducted is to compare the results of the replication study with the original study to look for consistency, that is to say, to evaluate whether the original study is “replicable.” Unfortunately, many popular and readily accessible methods for ascertaining replicability, such as comparing significance levels across studies or eyeballing confidence intervals, are generally ill suited to the task of comparing results across studies. To address this issue, we present the prediction interval as a statistic that is effective for determining whether a replication study is inconsistent with the original study. We review the statistical rationale for prediction intervals, demonstrate hand calculations, and provide a walkthrough using an R package for obtaining prediction intervals for means, d values, and correlations. To aid the effective adoption of prediction intervals, we provide guidance on the correct interpretation of results when using prediction intervals in replication research.
Spurred by what is generally referred to as the replication crisis in psychology, replication research has become more prominent in psychological sciences. During the past decade or so, several large-scale replication initiatives have been conducted to test the replicability of important research findings (e.g., Klein et al., 2014; Open Science Collaboration, 2012, 2015). To put it simply, these initiatives have produced disappointing results with respect to the replicability of psychological research. For some, the widespread increase in the attempt to replicate scientific results is a positive step toward improving the credibility of scientific knowledge (Munafò et al., 2017; Vazire, 2018; Vazire et al., 2022).
Replications can be conducted to assess the credibility of the original finding by comparing it with the replication (e.g., Open Science Collaboration, 2012, 2015). The logic of replications can appear deceptively simple: Conduct an original study followed by a replication and then assess if the result of the replication is consistent with the original study. However, comparing results across studies to ascertain their agreement presents researchers with statistical considerations that are not present when testing hypotheses within individual studies. Specifically, what criterion should be used to determine if results are consistent or inconsistent? In other words, what can be used to determine if the study “successfully replicated” or “failed to replicate”?
When considering the replication question, it is important to distinguish between “replication” and “reproducibility.” Unfortunately, these two terms are often used interchangeably. Replication research is generally understood to involve rerunning studies and collecting and analyzing new data (Peng et al., 2006). In contrast, reproducibility generally refers to the ability of a researcher to generate the same results of a study from the same raw data (Goodman et al., 2016; Gundersen, 2021; Patil et al., 2019; Plesser, 2018).
When it comes to replications, some of the popular replication initiatives in psychology have explicitly acknowledged that “There is no single standard for evaluating replication success” (Open Science Collaboration, 2015, p. 943) and have employed multiple criteria to evaluate correspondence across studies (Open Science Collaboration, 2012, 2015). However, in practice, replication success is often determined by examining whether the replication study found a statistically significant result in the same direction as the original study (Anderson & Maxwell, 2016). Some authors have highlighted the limitations of relying on statistical significance to evaluate replication success and have proposed statistical alternatives to examining statistical significance (e.g., Anderson & Maxwell, 2016; Maxwell et al., 2015; Spence & Stanley, 2016). At the same time, some researchers have proposed frameworks for evaluating replication consistency by comparing two studies’ results using confidence intervals (CIs; e.g., LeBel et al., 2019), whereas others have proposed Bayesian alternatives (e.g., Verhagen & Wagenmakers, 2014).
The computation of prediction intervals has been proposed and used as a method to assess the inconsistency of results between an original study and a replication (e.g., Patil et al., 2016; Spence & Stanley, 2016). Although lesser known than other approaches, prediction intervals are useful in that they provide researchers with a statistical method for determining a range of results that might reasonably occur in a replication because of sampling error. In the current article, we outline how prediction intervals can be computed and interpreted in the context of replications. We show how the results of a replication study can be statistically classified as inconsistent with the original study. Practical examples for prediction intervals, formulas, and R code are presented herein.
Replications: Assessing the Role of Sampling Error
A number of methods have been used to determine if a study has replicated. A common approach to evaluating replications has been to compare the significance of the original study with that of the replication study (cf. LeBel et al., 2019; Maxwell et al., 2015). This approach typically involves assessing whether the p value is less than .05 (or whether the CI fails to overlap with zero). Next, the direction of the effect is assessed to determine if it is in the same direction as the original study. If there is consistency in direction and significance, the replication is deemed successful. If, however, there is inconsistency, the replication is deemed a failure. Aside from the well-documented limitations of relying solely on significance testing to interpret results (Kline, 2013), Cumming (2008) illustrated that this is a flawed approach because p values fluctuate considerably across replication attempts as a result of sampling error, making them a poor criterion for evaluating replications. Moreover, the difference between significant and nonsignificant p values is not necessarily statistically significant (Gelman & Stern, 2006). This means that a significant and a nonsignificant result may not be significantly different from one another. Consequently, this consistency-in-significance-testing approach is, at best, an attempt to look for superficial correspondence between studies.
Another approach for trying to understand the role of sampling error when making a substantive conclusion across different studies is the use of meta-analysis (see Borenstein et al., 2021; Schmidt & Hunter, 2014). Meta-analysis is typically used to average a large number of study results, estimate the likely variability because of sampling error, and evaluate moderators in light of this variability. Nonetheless, some researchers have attempted to use meta-analysis with just two studies: an original study and a replication study. In this context, a meta-analytic mean based on just the original study and the replication study is calculated (for an example, see Open Science Collaboration, 2015). This meta-analytic mean is then tested to see if it is different from zero. Although this approach could potentially be considered an improvement over using p values, it has been discounted because it is assumed that the effect-size estimate from the first study is inflated because of publication bias (Open Science Collaboration, 2015). Most importantly, the result of a meta-analysis with just two studies is considerably less robust than a meta-analysis with a large number of studies.
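To make this two-study approach concrete, the following sketch uses the metafor package with hypothetical effect sizes to compute a fixed-effect meta-analytic mean of an original study and a single replication and test it against zero; it is intended only as an illustration of the procedure described above, not as a reconstruction of any particular analysis.

library(metafor)
# Hypothetical d values and sample sizes for an original study and a single replication
d  <- c(0.60, 0.15)
n1 <- c(30, 120)
n2 <- c(30, 120)
vi <- (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))  # large-sample sampling variance of d
# Fixed-effect meta-analytic mean of just the two studies, tested against zero
rma(yi = d, vi = vi, method = "FE")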
A third approach that is an attempt to understand the role of sampling error in a replication interpretation is to use CIs (e.g., Gilbert et al., 2016). If interpreted correctly, CIs can offer inferential information that is not available with p values (Belia et al., 2005; Cumming & Finch, 2005). CIs are interpreted as an interval constructed around a sample statistic that will capture the population parameter with a specified probability in an imagined infinitely large set of repeated studies. For example, imagine a scenario in which 20 studies are conducted to estimate a population mean. For each of the 20 studies, a 95% CI can be constructed, and each of the 20 intervals will likely be centered around a different mean and have a different width. However, 19 of 20 (i.e., 95%) of the different 95% CIs are expected, on average, to capture the population parameter (i.e., the population mean).
The use of CIs for interpreting replications, however, can be problematic (Cumming et al., 2004; Hoekstra et al., 2014). Specifically, CIs are sometimes incorrectly interpreted as representing a capture percentage for the next study result (see Cumming et al., 2004; “confidence-level misconception”). As noted above, however, CIs are designed to capture population parameters, not subsequent sample statistics from a replication. The capture rate of CIs for subsequent sample statistics departs substantially from 95% and varies greatly across sample-size scenarios (e.g., Spence & Stanley, 2016). Consequently, CIs are not an appropriate tool for trying to understand the role of sampling error when interpreting a replication study.
Thus, a variety of approaches have been used to evaluate replications. We believe that the framework proposed by Patil and colleagues (2019) provides a useful lens for thinking about approaches to evaluating replications. Within this framework, a successful replication occurs when a new data set is collected in a second study and the results of the second data set are consistent with the results of the first data set. Like Patil et al. (2016), we view consistency through the lens of sampling error. The results of a second study can be expected to differ from the results of a first study because of random sampling error. An inconsistency occurs when the results of the second study differ from the results of the first study by more than random sampling error. When this occurs, there is a failure to replicate. In this tutorial, we focus on explaining how to use and interpret prediction intervals when evaluating single replications. Prediction intervals provide a range of results that can be expected, because of random sampling error, for a replication study before it is conducted.
Prediction Intervals
Prediction intervals are a useful statistic for evaluating the results of replication studies. Specifically, prediction intervals provide researchers with a statistical method for determining if the original study and replication study are inconsistent with each other (Cumming, 2008; Patil et al., 2016; Spence & Stanley, 2016). Prediction intervals assess between-studies consistency by considering the variability in both the original study and replication study caused by sampling error and reconciling this variability into a single statistic. By considering sampling error in both studies, researchers have a yardstick for determining whether the results are more different from each other than would be expected because of random sampling error. Ideally, a prediction interval is calculated after the original study is conducted but before the replication is conducted. Indeed, a key feature of prediction intervals is that they can be used to provide a frame of reference for interpreting a replication result before the collection of replication data.
Prediction Intervals for Means
In the following sections, we outline how prediction intervals can be computed for means, d values, and correlations. We begin with an illustration of how to compute a prediction interval for means because this is the simplest case. To illustrate how a prediction interval can be computed for means, we use a hypothetical scenario.
Imagine that a student, Jane, is interested in estimating the number of hours undergraduate students typically sleep at her very large university of 100,000 people. In this example, students at the university are the population. Jane randomly samples 50 people from this population and calculates a sample mean of 7.21 hr of sleep per night. A second student, Richard, plans to conduct a replication of Jane’s study.
At this point, a key question for Richard is what range of results for his replication study would be consistent with Jane’s original study. Said another way, how much would one reasonably expect a replication sample mean to differ from the original sample mean (7.21) because of random sampling? Specifying this interval (i.e., a prediction interval) before collecting the data for the second sample (i.e., the replication study) will help to interpret the findings across both studies.
To obtain a prediction interval for the second mean (i.e., the mean in the replication study), it is important to realize that the researchers are mathematically modeling the difference between the two means, that is, the difference between the original study mean and the replication study mean ($M_{orig} - M_{rep}$).
When the original study and the replication study have the same sample size ($n_{orig} = n_{rep} = n$), the standard error of the difference between the two means can be written as shown in Equation 1: $SE_{M_{orig}-M_{rep}} = SD_{orig}\sqrt{2/n}$ (1).
Spence and Stanley (2016) provided a review of this approach and illustrated how the formula could be rearranged to account for different sample sizes in the original and replication studies. Spence and Stanley noted that this formula is simply an application of the well-known rule for the variance of a difference between independent variables: $\sigma^2_{M_{orig}-M_{rep}} = \sigma^2_{M_{orig}} + \sigma^2_{M_{rep}} = \dfrac{\sigma^2}{n_{orig}} + \dfrac{\sigma^2}{n_{rep}}$ (2).
In Equation 2, the first term represents the sampling variance of the original study mean—consistent with the central-limit theorem. The second term represents the sampling variance of the replication mean. Although the replication study has not yet been conducted, it is assumed to be from the same population as the original mean; consequently, the variance estimate from the original is used as an estimate of the population variance in the second term. Calculations, however, use the standard deviation (i.e., standard error) version presented in Equation 3: $SE_{M_{orig}-M_{rep}} = \sqrt{\dfrac{SD_{orig}^2}{n_{orig}} + \dfrac{SD_{orig}^2}{n_{rep}}}$ (3).
To calculate a prediction interval, see Equation 4. For the difference between two means, we include the mean that was observed in the original study and a two-tailed critical t value with degrees of freedom based on the original study ($df = n_{orig} - 1$): $95\%\,PI = M_{orig} \pm t_{crit(df = n_{orig}-1)} \times SE_{M_{orig}-M_{rep}}$ (4).
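To make Equations 3 and 4 concrete, the following base-R sketch implements them directly; it is our own minimal implementation rather than the predictionInterval package, and the SD value in the example call is assumed for illustration (it is consistent with the interval reported later in this example).

# Prediction interval for a replication mean (Equations 3 and 4)
pi_mean <- function(M, SD, n, rep.n, prob.level = 0.95) {
  se_diff <- SD * sqrt(1 / n + 1 / rep.n)               # Equation 3
  t_crit  <- qt(1 - (1 - prob.level) / 2, df = n - 1)   # two-tailed critical t, df = n - 1
  c(lower = M - t_crit * se_diff, upper = M + t_crit * se_diff)  # Equation 4
}

# Jane's original study (M = 7.21, n = 50) and Richard's planned replication (rep.n = 70);
# the SD of 2.2 is an assumed value used only for this illustration
pi_mean(M = 7.21, SD = 2.2, n = 50, rep.n = 70)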
Richard, who is familiar with prediction intervals, has not yet conducted his replication study, but he does know the sample size he plans to use. He plans to collect data from 70 people (i.e., $n_{rep} = 70$).
Consequently, substituting the original study values and the planned replication sample size into Equation 4 yields a 95% prediction interval of [6.39, 8.03]. This 95% prediction interval provides the range of means that can be expected in a replication from the same population because of sampling error alone. With this prediction interval now calculated, Richard proceeds to collect his data.
Once Richard collects his data, he calculates a sample mean. If the 95% prediction interval does not capture this sample mean, the replication result differs from the original result by more than would be expected because of random sampling error alone, and the replication can be classified as inconsistent with the original study (i.e., a failed replication).
On the other hand, if the 95% prediction interval does capture his sample mean, the replication result is not inconsistent with the original study: The two means differ by no more than would be expected because of random sampling error. As we explain below, however, this outcome should not be interpreted as proof of a successful replication.
Application: Mean Difference Prediction Interval
Prediction intervals are well suited for determining if a replication result is inconsistent with the original result. However, caution should be exercised before identifying that a result that falls within the interval is consistent with the original study result. We detail the reasons for this with a new example.
Consider a study that assessed the mean number of alcoholic drinks consumed by engineering-college students during orientation week. The published study does not specify how the sample was collected but does report the sample mean, standard deviation, and sample size. Entering these values, along with the planned sample size for a replication, into the predictionInterval R package yields a 95% prediction interval of [8.73, 11.47].
If the population parameters (i.e., the population mean and standard deviation) are identical for the original study and the replication study, 95% of replication sample means can be expected to fall within this interval.
As previously noted, we stress caution when interpreting a replication result that falls inside the prediction interval. Imagine a situation in which a researcher conducts an original study and then creates a 95% prediction interval around that original study mean. Now imagine that the replication mean falls inside the 95% prediction interval. We consider two scenarios that can underlie this situation. We emphasize that one would need to be omniscient to know which scenario they are in as an applied researcher. In the first scenario, the replication mean is sampled from the same population as the original mean. As a result, the original mean and sample mean differ only because of sampling error. In this first scenario, interpreting the replication mean falling within the prediction interval as a “replication success” would lead to a correct conclusion. The same population mean underlies both sample means. In the second scenario, the replication mean is sampled from a different population (with a different population mean). As a result, the original mean and the replication mean differ for two reasons: (a) the difference in population means and (b) the random sampling error. Because the population means differ, this is conceptually a replication failure. Moreover, it is possible that even though the two population means are different, the replication mean could fall inside the prediction interval constructed around the original study mean. In this second scenario, interpreting the replication mean falling within the prediction interval as a “replication success” would lead to an incorrect conclusion. Consequently, when a replication mean falls within the prediction interval, it could reflect two scenarios. Therefore, prediction intervals should not be used to “confirm” a successful replication because of the inherent uncertainty created by random sampling.
Prediction intervals provide statistical criteria to determine a replication failure but not a replication success—because of the way sampling error operates. When a replication result falls within a prediction interval, there is no evidence of statistical difference, but it is not conclusive; additional replications are needed. This conclusion asymmetry is similar to the logic found in null hypothesis significance testing, in which it would be misguided to conclude there is no effect when a p value is nonsignificant (e.g., Wasserstein et al., 2019). When a replication result falls outside a prediction interval, it indicates only that the replication result is unlikely if the two studies were samples from the same population. Indeed, replication results will fall outside a 95% prediction interval 5% of the time because of random sampling error. Consequently, we note that a result falling outside the interval does not constitute “proof of failure.”
Prediction Interval for Effect Sizes
In addition to means, prediction intervals can also be computed for standardized mean differences (i.e., d values) and correlations (Spence & Stanley, 2016). Computing prediction intervals for these statistics is slightly different because of asymmetry in the sampling distributions of nonzero effect sizes. To address this nonnormality, a different procedure is used to calculate the prediction interval in these cases; however, the underlying interpretation of the interval remains unchanged.
Standardized mean differences
To illustrate how to use prediction intervals with standardized mean differences (i.e., d values), we focus on an ego-depletion study by Sripada et al. (2014). We do so because this study was the focus of a large-scale replication study by Hagger et al. (2016). The design in the original and replication studies was a one-way experimental design with an experimental ego-depletion condition and a control condition. The dependent variable was reaction-time variability.
The original study reported by Sripada et al. (2014) found a difference between the ego-depletion condition and the control condition of d = 0.68, with cell sizes of n1 = 23 and n2 = 24.
If a replication of Sripada et al. (2014) was conducted, what range of results can be expected because of sampling error? Below, we review two scenarios: Scenario 1, in which the prediction interval captures the replication result, and Scenario 2, in which the prediction interval does not capture the replication result. We stress, as we walk through evaluating this preexisting data, that prediction intervals are intended to be calculated after the original study but before the collection of data for the replication study.
Prediction interval captures replication
To begin, we examine one of the many replications of Sripada et al. (2014) contained in Hagger et al. (2016). Specifically, we revisit the replication by Hagger, Chatzisarantis, and Zwienenberg. These authors conducted a replication with cell sizes of n1 = 46 and n2 = 55.
Given the intended replication cell sizes, what range of results can be expected in this particular replication because of sampling error alone? To answer this question, a 95% prediction interval can be constructed. This interval indicates where 95% of d values in a replication are expected to fall because of sampling error alone—assuming identical population parameters.
Because nonzero d values have sampling distributions that follow noncentral t distributions and noncentral t distributions are asymmetrical, we cannot use the same approach we used with means. The asymmetry of sampling distributions increases as effect sizes increase (i.e., the further the d values are away from zero). As a result, using the same technique we used for means will result in increasingly inaccurate estimates as effect sizes increase.
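A quick numerical illustration of this asymmetry follows; the degrees of freedom and noncentrality parameter below are arbitrary values chosen only for demonstration.

# Quantiles of a noncentral t distribution are not symmetric around the noncentrality parameter
ncp <- 3    # arbitrary noncentrality parameter (larger values correspond to larger effects)
df  <- 45   # arbitrary degrees of freedom
q <- qt(c(0.025, 0.975), df = df, ncp = ncp)
q - ncp     # the lower and upper distances from ncp differ, unlike the symmetric central case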
To compute prediction intervals for d values, we can follow an approach outlined by Spence and Stanley (2016) that takes the asymmetry of sampling distributions into account. Spence and Stanley applied a technique developed by Zou (2007). Below are the three steps in this approach.
Step 1: obtain original study d value CI
In the original study, d = 0.68, with cell sizes of n1 = 23 and n2 = 24. The first step is to compute a 95% CI for this original-study d value, which is [0.09, 1.27]. (A code sketch covering all three steps appears at the end of Step 3.)
We use the notation $LL_{orig}$ and $UL_{orig}$ to refer to the lower and upper limits of this original-study CI, retaining several decimal places for these values in the calculations that follow.
Step 2: obtain replication study imaginary d value CI
True to form, a 95% CI is computed once more, but this time using the original-study d value (0.68) with the replication study’s cell sizes (n1 = 46, n2 = 55). This “imaginary” CI reflects the precision the replication study can be expected to have if the original d value is accurate.
We use the notation $LL_{rep}$ and $UL_{rep}$ to refer to the lower and upper limits of this replication-study CI.
Step 3: obtain the d-value prediction interval
The d-value prediction interval is obtained by combining the original-study CI (Step 1) and the replication-study CI (Step 2) using the approach described by Zou (2007), as shown in Equations 5 and 6: $LL_{PI} = d_{orig} - \sqrt{(d_{orig} - LL_{orig})^2 + (UL_{rep} - d_{orig})^2}$ (5) and $UL_{PI} = d_{orig} + \sqrt{(UL_{orig} - d_{orig})^2 + (d_{orig} - LL_{rep})^2}$ (6).
In Equations 5 and 6, $d_{orig}$ is the d value from the original study, $LL_{orig}$ and $UL_{orig}$ are the limits of the original-study CI obtained in Step 1, $LL_{rep}$ and $UL_{rep}$ are the limits of the replication-study CI obtained in Step 2, and $LL_{PI}$ and $UL_{PI}$ are the lower and upper limits of the prediction interval.
Substituting the values obtained in Steps 1 and 2 into Equations 5 and 6 yields a prediction-interval lower limit of approximately –0.04 and an upper limit of approximately 1.39.
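The R code for these calculations is not reproduced in this excerpt. The sketch below chains the three steps together; it uses the ci.smd() function from the MBESS package for Steps 1 and 2 (the function and the names of its returned elements reflect our reading of the MBESS documentation and should be checked against the installed version) and Equations 5 and 6 for Step 3. The result should match the interval reported in the text up to rounding.

library(MBESS)
d.orig <- 0.68
# Step 1: 95% CI for the original-study d value (n1 = 23, n2 = 24)
ci.orig <- ci.smd(smd = d.orig, n.1 = 23, n.2 = 24, conf.level = 0.95)
# Step 2: "imaginary" 95% CI using the original d value with the replication cell sizes
ci.rep <- ci.smd(smd = d.orig, n.1 = 46, n.2 = 55, conf.level = 0.95)
# Step 3: combine the two CIs with Equations 5 and 6 (Zou, 2007);
# element names below assume the Lower/Upper.Conf.Limit.smd convention
ll.pi <- d.orig - sqrt((d.orig - ci.orig$Lower.Conf.Limit.smd)^2 +
                       (ci.rep$Upper.Conf.Limit.smd - d.orig)^2)
ul.pi <- d.orig + sqrt((ci.orig$Upper.Conf.Limit.smd - d.orig)^2 +
                       (d.orig - ci.rep$Lower.Conf.Limit.smd)^2)
round(c(ll.pi, ul.pi), 2)  # approximately -0.04 and 1.39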
As a result, the 95% prediction interval for Sripada et al. (2014) with the sample size used by Hagger et al. (2016) is [–0.04, 1.39]. Pretending that we are Hagger et al., now that the prediction interval is calculated, we would proceed to collect our data.
Following data collection, Hagger et al. (2016) found a small d value that did not differ significantly from zero. Although this result is considerably smaller than the original study’s d of 0.68, it falls within the 95% prediction interval of [–0.04, 1.39]; therefore, the replication result is not inconsistent with the original result.
This result may be surprising given that the replication study found no statistical difference between conditions. The wide interval is useful in highlighting the lack of information and imprecision contained in small-sample studies. Because of the small sample sizes, the large difference observed between studies is still within what can be expected because of random sampling error.
Again, we stress that when a replication result falls within the prediction interval, one cannot conclude that the replication was a success. One can conclude only that the replication did not fail, given what would be expected because of sampling error. This is because, as outlined in the mean prediction interval section above, there are multiple reasons replication results can fall within the prediction interval. Prediction intervals are useful for indicating a failed replication. When a replication falls within the prediction interval, there is no evidence of a statistical difference—additional replications are still needed.
Obtaining a d-value prediction interval is also possible in a single step with the predictionInterval R package, which produces the summary shown below for this example.
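A minimal sketch of such a call is shown next; the pi.d() function name and its arguments reflect our understanding of the predictionInterval package and should be checked against the package documentation.

library(predictionInterval)
# Original-study d value and cell sizes, plus the planned replication cell sizes
pi.d(d = 0.68, n1 = 23, n2 = 24, rep.n1 = 46, rep.n2 = 55, prob.level = 0.95)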
For the original study, d = 0.68, N1 = 23, N2 = 24, 95% CI = [0.09, 1.27]. For the replication study, N1 = 46, N2 = 55. The 95% prediction interval is [–0.04, 1.39].
Prediction interval does not capture replication
Consider the second scenario, one in which the replication result is inconsistent with the original study. Recall that in the original Sripada et al. (2014) study, the ego-depletion and control conditions differed by d = 0.68. For this scenario, imagine that a hypothetical research team, Evans et al., plans a replication with cell sizes of n1 = 40 and n2 = 49.
To calculate the prediction interval, recall that Sripada et al.’s (2014) original study found a d value of 0.68 with cell sizes of n1 = 23 and n2 = 24. Combining these values with the planned replication cell sizes produces the following summary:
For the original study, d = 0.68, N1 = 23, N2 = 24, 95% CI = [0.09, 1.27]. For the replication study, N1 = 40, N2 = 49. The 95% prediction interval is [–0.05, 1.41].
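Under the same assumptions as the earlier sketch, the corresponding package call for this scenario would be the following.

library(predictionInterval)
pi.d(d = 0.68, n1 = 23, n2 = 24, rep.n1 = 40, rep.n2 = 49, prob.level = 0.95)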
Hypothetically, Evans et al., having calculated a 95% prediction interval of [–0.05, 1.41], proceeded to data collection. Following data collection, Evans et al. found a d value that fell outside this interval. Because the replication result differs from the original result by more than would be expected from random sampling error alone, it would be classified as inconsistent with the original study, that is, a failed replication.
Correlation
In this section, we walk through an example of how a prediction interval can be calculated in the context of replications in which correlations (i.e., r values) are the statistic of interest. For this example, we draw on results from two university sites that took part in the multisite replication project reported by Chartier et al. (2020).
To begin, we consider the replication conducted at the University of Oregon as the “original” study (effect size reported in Figure 1 and sample size reported in Table 1 of Chartier et al., 2020). Because we argue that prediction intervals should be calculated before the replication is conducted, we pretend as if a replication study has not yet been conducted and we wish to determine the range of results that may be expected in a replication. To do so, we calculate a prediction interval using the technique presented by Spence and Stanley (2016). Because the steps are the same as the three-step process outlined in detail for the d value above, we do not repeat them here; instead, we use the predictionInterval R package to obtain the interval directly.
The “original” University of Oregon study reported a correlation and a sample size (see Figure 1 and Table 1 of Chartier et al., 2020), and these two values serve as the inputs for the prediction interval.
Using the predictionInterval package, the interval is obtained with a single function call in which r is the original study’s correlation, n is the original study’s sample size, rep.n is the expected sample size of the replication, and prob.level specifies a 95% PI.
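The call itself is not reproduced in this excerpt. A minimal sketch is shown below; the pi.r() function name reflects our understanding of the predictionInterval package, and the values supplied for r and n are placeholders because the University of Oregon correlation and sample size are not restated here (the interval of [–.21, .31] reported next is obtained only with the actual values).

library(predictionInterval)
# Placeholder inputs: substitute the correlation and sample size reported for
# the University of Oregon sample in Chartier et al. (2020)
pi.r(r = 0.05, n = 100, rep.n = 81, prob.level = 0.95)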
With the actual values from the University of Oregon sample, the output indicates that a correlation between –.21 and .31 can be expected in a replication using a sample size of n = 81. Now we can pretend that Ashland University goes ahead and conducts its replication and finds a correlation of .02 (Figure 1 of Chartier et al., 2020, p. 336). This correlation falls within the prediction interval and can therefore be interpreted as being not inconsistent with the “original” study, which is to say it falls within a range that can be expected because of sampling error, given the sample sizes of the original study and replication study. We note that a web interface for the R code is also available: https://replication.shinyapps.io/correlation/.
Discussion
Summary
With the rise of replication research, it is important for researchers to have a way to effectively and objectively assess the results of replication studies in relation to the original study. Commonly used approaches to assessing replications (e.g., p values, CIs) have important limitations. In the current article, we present and review the prediction interval as a method for assessing if the results across two studies are statistically inconsistent with one another. The criterion for assessing inconsistency is based on the extent to which a difference can be expected because of sampling error. We covered how prediction intervals can be computed for means, d values, and correlations.
Comparisons with other statistical approaches
Equivalence testing
Researchers familiar with equivalence testing (Lakens, 2017) may wonder how this approach compares with the use of prediction intervals. With prediction intervals, information from an original study is used to create an interval around a sample statistic that will capture a subsequent sample statistic with a specified probability. The width of the interval is based entirely on the sampling error for the distribution of differences for a particular sample statistic. In contrast, with equivalence testing, researchers typically specify an interval around zero that they believe corresponds to a null effect. The bounds of this interval are typically based on the smallest effect size of interest (SESOI)—see Lakens et al. (2018). Researchers can determine the SESOI through a wide variety of techniques, ranging from subjective to objective. The equivalence test itself is effectively two one-sided t tests used to determine if the population effect size falls in the interval defined by the researcher. Typically, equivalence testing is used to reject the presence of a small effect within a single study.
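To make the contrast concrete, here is a minimal base-R sketch of the two one-sided tests logic, using hypothetical summary statistics and an assumed raw-score SESOI; it is only an illustration and is not a substitute for dedicated tools such as the TOSTER package (Lakens, 2017).

# Hypothetical two-group summary statistics
m1 <- 5.2; m2 <- 4.9        # group means
sd1 <- 1.1; sd2 <- 1.2      # group standard deviations
n1 <- 60; n2 <- 60          # group sample sizes
sesoi <- 0.5                # smallest effect size of interest, in raw units (assumed)

se   <- sqrt(sd1^2 / n1 + sd2^2 / n2)  # standard error of the mean difference
df   <- n1 + n2 - 2                    # simple pooled degrees of freedom
diff <- m1 - m2

# Two one-sided tests against the equivalence bounds -sesoi and +sesoi
p_lower <- 1 - pt((diff - (-sesoi)) / se, df)  # H0: true difference <= -sesoi
p_upper <- pt((diff - sesoi) / se, df)         # H0: true difference >= +sesoi
max(p_lower, p_upper)  # equivalence is concluded if this value is below alpha (e.g., .05)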
Meta-analysis
The prediction-interval approach is most appropriate in a two-study scenario, that is, when a prediction interval is constructed after an original study and before a replication and is then used to assess the replication result. When there are already a large number of studies in a research domain, we suggest the “Did it replicate?” question is less relevant. In this circumstance, we suggest moving from a replication mindset to a knowledge-aggregation mindset, that is, a meta-analytic mindset (see Borenstein et al., 2021; Schmidt & Hunter, 2014). There are different types of meta-analysis, and arguably the most useful of these is the psychometric meta-analysis approach (Schmidt & Hunter, 2014), which allows a researcher to generate an estimate of the population-level effect size while simultaneously correcting for study artifacts such as reliability, range restriction, and scale coarseness, among others. Meta-analysis allows for not only the aggregation of the effect sizes but also the calculation of the variance of effect sizes once the variability because of sampling error and other artifacts has been removed. Researchers can also explore the extent to which variation in study results is moderated by study attributes. We note, however, that if there is publication bias in a research domain, this bias will be encoded in the meta-analysis. Consequently, it is important to conduct additional analyses that explore the potential influence of publication bias on meta-analytic estimates (for a review, see McShane et al., 2016). Indeed, the APA Journal Article Reporting Standards (Appelbaum et al., 2018) requires the use of bias-detection techniques. Meta-analysis is extremely effective for knowledge generation when researchers can analyze a large number of studies in which publication bias is not a factor—such as with the Many Labs projects. Our previously stated concerns about meta-analysis in replications pertain only to situations in which meta-analysis is used with a very small number of studies.
CIs
Some researchers might be tempted to use CIs to capture the results of other studies. This approach is incorrect because CIs are designed to capture population parameters, not sample statistics from other studies. Indeed, using a 95% CI to capture sample statistics is problematic because the capture rate can depart substantially from 95% and vary considerably across situations (depending on the relative sample sizes of the two studies). For example, Spence and Stanley (2016) conducted simulations that revealed that when sample sizes are unequal, the rate at which a 95% CI around a d value captures the replication d value can fall well below 95%, with the exact rate depending on which of the two studies has the larger sample.
The problems associated with using CIs to capture sample statistics are most easily demonstrated in the context of sample means—although the same issues apply with d values and correlations. Recall from Equations 3 and 4 that, when the original and replication sample sizes are equal ($n_{orig} = n_{rep} = n$), the prediction interval is $M_{orig} \pm t_{crit} \times SD_{orig}\sqrt{1/n + 1/n}$, which simplifies to $M_{orig} \pm t_{crit} \times \sqrt{2} \times SD_{orig}/\sqrt{n}$.
Contrast the resulting formula above with the usual way of writing a 95% CI, $M_{orig} \pm t_{crit} \times SD_{orig}/\sqrt{n}$. Notice the only difference is the $\sqrt{2}$ multiplier on the error term.
In the case of study means (with equal sample sizes), a 95% CI has an error term that is too small to capture sample statistics at a 95% rate. The error term would need to be multiplied by $\sqrt{2}$ (approximately 1.41) to do so.
In this equal-sample-size scenario, it is also possible to determine what the capture rate of a 95% CI would be for sample statistics. Said another way, if we were to view a 95% CI (for means with equal sample sizes) as a prediction interval, what would the prediction interval capture percentage be? To determine this value, we can algebraically convert the CI into a prediction interval. We do this in a scenario in which the original and replication sample sizes are equal.
We begin by noting that for a 95% CI, we use a critical value, $t_{crit}$, of approximately 1.96: $CI_{95\%} = M_{orig} \pm t_{crit} \times SD_{orig}/\sqrt{n}$.
A 95% CI can be converted to a prediction interval by adding a 1.0 multiplier in the form of $\sqrt{2}/\sqrt{2}$: $CI_{95\%} = M_{orig} \pm t_{crit} \times (\sqrt{2}/\sqrt{2}) \times SD_{orig}/\sqrt{n}$.
We redistribute one $\sqrt{2}$ into the error term, which turns it into the prediction-interval standard error ($SE_{M_{orig}-M_{rep}} = \sqrt{2} \times SD_{orig}/\sqrt{n}$), and divide the critical value by the other: $CI_{95\%} = M_{orig} \pm (t_{crit}/\sqrt{2}) \times SE_{M_{orig}-M_{rep}}$.
Now that we have the CI in prediction-interval format, we need to investigate the coverage associated with this smaller critical value; in these calculations, $t_{crit}/\sqrt{2} = 1.387585$.
We note that 1.387585 corresponds to the critical value at which approximately 83.4% of the sampling distribution of the difference between means falls within ±1.387585 prediction-interval standard errors.
Therefore: $CI_{95\%} = M_{orig} \pm 1.387585 \times SE_{M_{orig}-M_{rep}}$.
Moreover, creating an interval using ±1.387585 prediction-interval standard errors produces an interval that captures the replication mean approximately 83.4% of the time.
Consequently, in this scenario, a 95% CI is equivalent to an 83.4% prediction interval.
Thus, a 95% CI functions as an 83.4% prediction interval—when working with means in which the original and replication sample sizes are identical. We stress that the 83.4% figure applies only to this specific scenario; with unequal sample sizes, or with d values and correlations, the capture rate of a 95% CI differs (Spence & Stanley, 2016).
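The 83.4% figure can be verified numerically. Below is a minimal sketch (our own illustration, assuming normally distributed scores and equal sample sizes) that computes the coverage analytically with the large-sample 1.96 critical value and then checks it with a small simulation.

# Analytic check: a 95% CI treated as a prediction interval (equal n, large samples)
2 * pnorm(1.96 / sqrt(2)) - 1   # approximately 0.834

# Simulation check with n = 50 per study
set.seed(1)
n <- 50
captured <- replicate(10000, {
  original    <- rnorm(n, mean = 0, sd = 1)
  replication <- rnorm(n, mean = 0, sd = 1)
  ci <- mean(original) + c(-1, 1) * qt(0.975, df = n - 1) * sd(original) / sqrt(n)
  mean(replication) >= ci[1] && mean(replication) <= ci[2]
})
mean(captured)                   # close to .834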
Interval-width considerations
Prediction intervals are effective because they consider the sampling error associated with both the original and replication studies and reconcile this information into a single statistic. Correspondingly, the width of a prediction interval is directly influenced by the sample sizes of both studies. To the extent that both studies have adequate sample sizes, the respective CIs for each study will be narrow and result in a narrow prediction interval. Researchers will likely find a narrow prediction interval desirable when attempting to assess a replication result.
Prediction intervals can be calculated before or after a replication study is conducted. We believe that prediction intervals have the greatest utility if calculated before collecting data for the replication study as an aid for planning (and possible preregistration). A researcher can use an initial planned replication sample size in the prediction-interval calculation and then assess if the resulting interval is undesirably wide (a personal judgment). If the resulting interval is judged to be wide, the researcher can recalculate the prediction interval using a larger sample size for the replication study. In many cases, simply increasing the replication sample size will decrease the width of the prediction interval. In some cases, however, the prediction interval may not decrease in width when the replication sample size is increased—indicating that the wide interval is the result of small sample size for the original study. This situation is problematic because the original study sample size cannot be increased after the fact.
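As a small numerical illustration of this point (all values hypothetical), the width of a mean prediction interval shrinks as the planned replication sample size grows, but only down to a floor set by the original study's sample size.

# Width of a 95% mean prediction interval (Equation 4) as the planned replication n increases;
# hypothetical original study with SD = 2.2 and n = 50
SD <- 2.2
n  <- 50
pi_width <- function(rep.n) 2 * qt(0.975, df = n - 1) * SD * sqrt(1 / n + 1 / rep.n)
sapply(c(70, 200, 1000, 10000), pi_width)
# The widths approach, but never fall below, 2 * qt(0.975, n - 1) * SD * sqrt(1 / n) (about 1.25)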
We believe that when a prediction interval is undesirably wide because of the sample size of the original study, a replication mindset may not be appropriate. Others might argue that this indicates a problem with prediction intervals; however, we suggest this situation merely communicates the consequences of a small-sample-size choice for the original study. Unfortunately, small-sample-size studies are so common in psychology that some believe much of psychological research is statistically unfalsifiable (Morey & Lakens, 2016). We suggest that a prediction interval that is extraordinarily wide only because of the sample size of the original study is an indication that the findings of the first study may be unfalsifiable. Indeed, a wide interval could indicate that validating an initial study is not worthwhile given the error associated with it. In this situation, it might be most appropriate to discard the conclusions from the original study and conduct a new, well-designed study instead. That is, the new study is not viewed as a replication but, rather, as the first meaningful source of information on the original question. Of course, the original study could still be a useful input to a future meta-analysis but likely has little informational value on its own.
We expect most researchers to construct 95% prediction intervals because of the ubiquitous use of 5% as the desired Type I error rate when significance testing. We note, however, that it is possible to construct prediction intervals at other probability levels. A discussion of the associated issues is beyond the scope of this tutorial; however, we encourage interested readers to see the logic outlined in Maier and Lakens (2022).
Final comments
For individuals interested in more reading on prediction intervals in the context of replications, the following sources are recommended: Anderson and Maxwell (2016), Spence and Stanley (2016), and Patil et al. (2016). With respect to prediction intervals more generally, Cumming (2008) and Cumming and Fidler (2009) contain discussions of prediction intervals as they pertain to means. In the context of multiple regression, Cumming and Calin-Jageman (2016) discussed the placement of prediction intervals around individual criterion (Y) scores. And there has been some psychology research that has examined how prediction intervals can help people interpret uncertainty in forecasts (e.g., Savelli & Joslyn, 2013).
We remind readers that prediction intervals can be calculated using the free predictionInterval R package. In addition, prediction intervals can also be obtained via web applications. For the mean application, see https://replication.shinyapps.io/mean/. For the correlation application, see https://replication.shinyapps.io/correlation/. For the d-value application, see https://replication.shinyapps.io/dvalue/.
Footnotes
Transparency
Action Editor: Katie Corker
Editor: David A. Sbarra
Author Contributions
Both authors contributed equally.
