Abstract
Two-way interaction effects in linear regression occur when the relation between two variables changes depending on the level of a third. Despite their frequent use, interactions are notoriously difficult to estimate accurately and test for statistical significance because of small effect sizes and low reliability. In this study, we used Monte Carlo simulations to establish stability thresholds for two-way interactions between continuous variables across combinations of reliability (0.7–1.0), main effect size (0.1–0.5), collinearity (0.1–0.5), and interaction effect size (0.05–0.2). Stability was defined as the consistency of estimated effect sizes across repeated samples of the same size from the same population and operationalized using modified definitions of the corridor of stability and point of stability from Schönbrodt and Perugini. Results show that the stability of interaction estimates is primarily determined by sample size and predictor reliability. The case representing a realistic psychology field study, in which researchers have limited control over variables, stabilized at n = 3,800, requiring 72% statistical power. At n ≤ 100, 11% to 45% of the estimates were incorrectly signed (i.e., negative when the true effect was positive). Most psychology studies enroll far fewer than 500 participants, and our results indicate many published interactions may be unstable. Analyses involving highly reliable predictors, such as group assignment in experimental designs, may stabilize at lower sample sizes because they attenuate the expected effect size less than variables with more measurement error. Researchers are encouraged to avoid routine tests of two-way interactions unless sample size and reliability are adequate and hypotheses are specified a priori.
Interaction or moderation effects in linear regression occur when the relation between two variables changes depending on the level of a third variable. Interaction analyses are commonplace in the social sciences, in which researchers often seek to understand how the influence of one variable may change in different contexts or conditions. Interaction analyses are used to test a wide range of questions, including gene–environment (Plomin et al., 1977), person–situation (Lewin et al., 1936), aptitude–treatment (Cronbach, 1957; Cronbach & Snow, 1981), and society–individual (Blumer, 1986) interactions, among many others. A significant interaction indicates that the relationship between the predictors and the outcome variable cannot be fully explained by the main effects alone because the predictors' effects are interdependent in a nonadditive way (Jaccard & Turrisi, 2003). Interactions introduce unique complexities that can make them challenging to detect (MacKinnon, 2011), especially in nonexperimental or naturalistic investigations in which researchers have limited control over variables (i.e., field studies; McClelland & Judd, 1993). Some researchers use 80% power as a stopping rule for detecting interactions, but in practice, many reported interactions are underpowered. Power analyses are often omitted or conducted using conventional benchmarks that assume unrealistically large effect sizes for interactions. As a result, the stability of interaction effects, defined as the consistency of their estimated effect size across repeated samples, is frequently uncertain even when statistical significance is obtained.
In the context of null hypothesis significance testing, "power" refers to the probability of correctly detecting an effect if it exists. There are two primary issues with power in interaction effects: their typically small effect sizes (Aguinis et al., 2005; Vize, Sharpe, et al., 2023) and low reliability (Busemeyer & Jones, 1983). Beyond the normatively small effect sizes of interactions, they can be obscured by restricted ranges of predictors (McClelland & Judd, 1993), the size and collinearity of the main effects (Baranger et al., 2023; Dormann et al., 2013), and unequal cell sizes in categorical moderators (Frazier et al., 2004). The reliability of the interaction term is approximately the product of the reliabilities of its components, which can further attenuate effect size. As a consequence, large sample sizes tend to be necessary to detect interactions (Hyatt et al., 2022; Lakens, 2019; Vize, Baranger, et al., 2023). Even seasoned researchers easily fall into false dichotomies around statistical significance when interpreting interaction effects (McShane & Gal, 2017), yet many studies in the extant literature are not sufficiently powered for interaction analyses.
Publication bias casts doubt on the understanding of published interactions (Ferguson & Brannick, 2012; Sotola & Credé, 2023). The number of significant interaction effects is likely overstated because of the selective reporting of exaggerated effect sizes (Simmons et al., 2011), failure to correct for multiple comparisons (see Pease & Lewis, 2015), and severe underpowering (see Jensen-Campbell et al., 2007; Ode et al., 2008). Recent z-curve analyses of health-psychology journals confirm this pattern, estimating replication rates below 50% and showing that for every significant interaction reported, nearly two nonsignificant results go unreported (Fremling et al., 2025). Only 22% of interaction analyses replicated in a large-scale replication project in psychology (Open Science Collaboration, 2015), and similarly poor replicability exists in other fields (Credé & Sotola, 2024; Greenland, 1993). Some have even urged researchers to "mend it" or "end it" regarding the search for moderators (Murphy & Russell, 2017).
One factor to consider when designing interaction analyses is stability, defined as the consistency of estimated effect sizes across repeated samples of the same size from the same population. Larger samples yield tighter clusters of estimates and thus greater stability. Whereas standard measures of spread (e.g., standard errors, standard deviations, and confidence intervals [CIs]) describe uncertainty for a single analytic result, stability is about the analytic design of the study and reflects how reliably a given design reproduces estimates of effects.
“Power” and “stability” both describe the expected performance of a design rather than outcomes from a single data set, but they capture different features of that performance. Power pertains to the performance of decision rules in hypothesis testing: the probability of rejecting a null hypothesis at a chosen alpha when an interaction truly exists. Stability pertains to estimate reliability: the degree to which identically designed studies yield similar estimates of effects. Unlike power or confidence-interval coverage, stability does not depend on alpha or a specific test; it is purely about the spread of estimates generated by a study design across repeated simulations. Interactions are particularly prone to artifacts (Gelman, 2023; McClelland & Judd, 1993; Murphy & Russell, 2017; Rimpler et al., 2025), and conflating power or confidence-interval coverage with stability can lead researchers to draw theoretical conclusions from effects that are statistically significant yet unlikely to replicate.
These replication failures point to a deeper issue with conventional power analyses. Power is the probability of rejecting a null hypothesis at a chosen alpha when an interaction truly exists, but it is computed from assumed effect sizes. When those assumed effects are exaggerated because of publication bias, selective reporting, or unrealistic benchmarks, a study can be "powered" on paper yet still yield misleading results. In addition, finding that an effect differs from zero is not the same as estimating its magnitude accurately. For science, the critical question is not whether an effect exists but whether its size can be estimated with sufficient consistency and accuracy across samples. This is the question stability answers directly.
A well-known framework for quantifying stability was introduced by Schönbrodt and Perugini (2013) in the context of correlations.
Instead of focusing on the CIs around a single estimate, they used simulations to determine the sample size at which repeated estimates fell within a predefined range of the true value. They simulated a population correlation and drew 100,000 bootstrapped samples across a range of sample sizes (ns = 20–1,000). In each sample, they estimated the correlation between the two variables and traced the correlation values as a function of sample size to form trajectories. The number of trajectories falling within a fixed-width corridor around the population parameter, called the "corridor of stability" (COS), was used to find the point of stability (POS), the smallest sample size at which a given percentage (80%, 90%, or 95%) of trajectories remained within the COS. For example, a correlation of r = .1 stabilized at n = 252 with 80% of estimates in the COS. At this sample size, there is approximately 38% power to detect the effect.
Although Schönbrodt and Perugini (2013) evaluated correlations between perfectly measured observed variables, research evaluating correlations between imperfectly measured latent variables has shown that such correlations require even larger sample sizes to stabilize because of unreliability. Kretzschmar and Gignac (2019) conducted Monte Carlo simulations similar to Schönbrodt and Perugini's and used McDonald's omega to introduce unreliability into latent-variable correlations. Although a latent-variable correlation of a given magnitude targets the same population parameter as an observed correlation, its point of stability increased markedly as reliability decreased.
In this study, we used Monte Carlo simulations to evaluate the stability of interaction effects under varying conditions. We aimed to find the sample sizes, main effect sizes, and predictor collinearity values that produce stable interactions. Our approach differs from that of Schönbrodt and Perugini (2013) in four notable ways. First, we considered more potential conditions by including three effect sizes: that of the interaction, the main effects, and the collinearity of the predictors. Second, our definition of stability accounts for effect size directly through the use of percentages instead of translating effect sizes using Fisher's r to z, adding or subtracting predetermined widths (w = 0.1, 0.15, 0.2), and back-transforming from z to r. Third, we tested a wider range of COS widths to allow us to consider conservative and less conservative thresholds for stability. Finally, we did not trace trajectories across sample sizes from the same population and instead resampled entirely at each new sample size because we had a uniform simulated population to draw from. Given that the mean effect size for interactions is below r = .1 in empirical research (Credé & Sotola, 2024; Freese & Peterson, 2017; Open Science Collaboration, 2015), the goal of this simulation study is to illustrate the conditions necessary for stable interactions. We answer the following research question:
Research Question: How do main effect size, intercorrelation, and reliability affect the sample size required for stable interaction-effect estimates?
Method
In this preregistered study, we used Monte Carlo simulations to evaluate the conditions required for interaction effects to stabilize. Simulations were conducted in R (R Core Team, 2024), and the full reproducible code is available in the supplemental materials (https://osf.io/zmvsf/?view_only=46e3f25d45ea4c83a33ffaef111abef9). The R package InteractionPoweR (Baranger & Castillo, 2025) is available for researchers interested in conducting similar analyses. In this article, the regression model applied is
Y = β0 + β1X1 + β2X2 + β3X1X2 + ε,
where Y is the outcome; X1 and X2 are the predictors; X1X2 is their product, the interaction term with coefficient β3; and ε is a normally distributed error term.
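As a minimal illustration of this data-generating process, one simulated data set can be produced as follows (a sketch with illustrative parameter values, not the full simulation pipeline; the complete code is in the supplemental materials):
# One simulated data set: true predictor scores with collinearity r12,
# observed scores degraded to a target reliability via classical error,
# and an outcome built from the main effects and their product.
set.seed(1)
n   <- 500
b1  <- 0.2; b2 <- 0.2; b3 <- 0.1    # main and interaction effects (illustrative)
r12 <- 0.1                          # collinearity of the predictors
rel <- 0.8                          # reliability of each predictor
X   <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, r12, r12, 1), 2, 2))
y   <- b1 * X[, 1] + b2 * X[, 2] + b3 * X[, 1] * X[, 2] + rnorm(n)  # unit error
obs <- sqrt(rel) * X + sqrt(1 - rel) * matrix(rnorm(2 * n), n, 2)   # add error
d   <- data.frame(x1 = obs[, 1], x2 = obs[, 2], y = y)
coef(lm(y ~ x1 * x2, data = d))["x1:x2"]  # estimate of the attenuated interaction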
Simulation parameters
Reliability
Reliability reflects the degree to which predictor values are affected by random measurement error. We considered four levels of reliability for X1 and X2: 0.7, 0.8, 0.9, and 1.0 (i.e., no measurement error).
Effect size
We evaluated two relevant effect sizes, those of the main effects (β1 and β2) and of the interaction (β3). Both main effects were set to equal values ranging from 0.1 to 0.5, and the interaction effect size ranged from 0.05 to 0.2; the collinearity between the predictors varied from 0.1 to 0.5.
The expected interaction effect size (i.e., the effect observable under imperfect measurement) is smaller than the specified population value because the reliability of the product term is approximately the product of its components' reliabilities, so unreliability compounds for interactions.
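As a back-of-the-envelope illustration of this attenuation (an approximation assuming uncorrelated predictors and a perfectly reliable outcome):
# The product term's reliability is roughly the product of its components'
# reliabilities, so the expected observed effect shrinks accordingly.
b3_true <- 0.10
rel_x1  <- 0.8; rel_x2 <- 0.8
b3_expected <- b3_true * sqrt(rel_x1 * rel_x2)  # 0.10 * 0.80 = 0.08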
Sample sizes
Although the original preregistered approach was to examine sample sizes ranging across a fixed grid, we instead used a search algorithm that adjusted the examined sample sizes for each combination of parameters until the POS was located.
The simulation parameters produced 192 distinct combinations of variables, excluding sample size. At each examined sample size, for each combination of parameters, 10,000 data sets were generated.
Evaluating stability
Stability of the interaction was measured using a modified version of the COS and POS from Schönbrodt and Perugini (2013). Specifically, we used percentages to define how similar effect-size estimates must be to the expected effect to be considered stable. Statistical power was calculated at each examined sample size using the power_interaction_r2() analytic power function from the InteractionPoweR package in R (Baranger et al., 2023). The bias of the model was measured by evaluating how frequently the models recapture the true value of the interaction effect (see the True-Value Recapture Rate section below).
COS
The COS is an interval in which estimates are considered stable if they fall within its bounds. Its width is determined using percentages, which we denote PCOS. For example, a PCOS of 50% places the corridor's bounds 50% below and 50% above the expected effect size.
POS
The POS is the sample size at which a certain percentage of the estimates (one of 80%, 90%, or 95%, which we denote PPOS) falls within the COS.
The following example demonstrates how the COS and POS are calculated and applied in our study. Consider the case in which we select a PCOS of 50% and a PPOS of 80% for an expected (attenuated) interaction effect of 0.08. The COS then spans 0.04 to 0.12, and the POS is the smallest sample size at which at least 80% of the estimates fall within that corridor.
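In code, this stability check reduces to a few lines (a simplified sketch of the logic rather than the full simulation pipeline; the target effect is illustrative):
# Corridor bounds from a percentage width, and the share of simulated
# estimates falling inside it; the POS is the smallest n at which this
# share reaches p_pos.
p_cos <- 0.50; p_pos <- 0.80
b3_expected <- 0.08                                   # attenuated target
cos_bounds  <- b3_expected * c(1 - p_cos, 1 + p_cos)  # [0.04, 0.12]
prop_in_cos <- function(est) mean(est >= cos_bounds[1] & est <= cos_bounds[2])
# is_stable_at_n <- prop_in_cos(estimates_at_n) >= p_pos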
We opted for percentage-based thresholds when quantifying stability to scale according to the magnitude of the effect. Schönbrodt and Perugini (2013) constructed a COS of fixed width by translating the population correlation with Fisher's r to z, adding and subtracting a predetermined width (w = 0.1, 0.15, or 0.2), and back-transforming to r. A fixed width is proportionally far more lenient for small effects than for large ones, whereas a percentage-based corridor imposes the same relative standard at every effect size.
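The difference between the two constructions is easy to see numerically for a true effect of .10:
# Fixed-width bounds follow the Fisher-z construction; the percentage
# corridor is the approach used here.
rho <- 0.10
w   <- 0.10
fixed_cos <- tanh(atanh(rho) + c(-w, w))  # ~[0.000, 0.198]: nearly +/- 100% of rho
pct_cos   <- rho * c(0.5, 1.5)            # [0.05, 0.15]: exactly +/- 50% of rho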
Percentage of Deviation Allowed From the True Effect Size
Note: The italicized rows are the effect sizes employed in the present study. The bold rows represent effect sizes unique to the present study.
In six of 12 cases with the effect sizes we analyzed in our study, the fixed-width approach would have allowed estimates to deviate from the true effect by 100% or more, a criterion we regarded as too lenient.
Multiple values for PCOS (including 10%, 50%, and 75%) and PPOS were evaluated so that both conservative and less conservative thresholds for stability could be compared.
True-value recapture rate
The true-value recapture rate is a measure of bias in the estimates of the interaction coefficients. It is the proportion of estimates falling within bounds centered on the true (unattenuated) interaction effect size (β3).
Power
Power for the interaction term is calculated analytically using the InteractionPoweR (Baranger et al., 2023) package in R, which accounts for attenuation. This differs from the preregistered approach of recording the proportion of significant estimates during simulations; the change was made to improve computational efficiency.
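A call of the following form computes the analytic power (parameter values are illustrative; see the package documentation for the full interface):
# Analytic power for the interaction term, with reliabilities supplied so
# that attenuation is taken into account.
library(InteractionPoweR)
power_interaction_r2(
  N        = 3800,  # sample size
  r.x1.y   = 0.2,   # main effect of x1 on y
  r.x2.y   = 0.2,   # main effect of x2 on y
  r.x1x2.y = 0.1,   # interaction effect
  r.x1.x2  = 0.1,   # collinearity between the predictors
  rel.x1   = 0.8, rel.x2 = 0.8, rel.y = 0.8  # reliabilities
)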
Sign errors
Sign errors for the interaction term were calculated as the proportion of negatively signed interaction effect-size estimates to the total number of estimates. Because our simulated effect sizes (β3) were always positive, any negative estimate is incorrectly signed.
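Both bias metrics are simple summaries over the simulated estimates (a sketch assuming a positive true effect; the recapture half-width is a placeholder for whatever bound is chosen):
sign_error_rate <- function(est) mean(est < 0)  # share of wrongly signed estimates
recapture_rate  <- function(est, b3_true, half_width)
  mean(abs(est - b3_true) <= half_width)        # share near the true value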
Statistical analyses
All simulations and analyses were performed using the parameters and evaluation metrics described above. Information presented about the POS is calculated using data simulated with the search algorithm described above in Sample Sizes. When specific sample sizes were examined (see the Results section), data were generated at those sample sizes using the same procedure rather than the search algorithm.
Results
Stabilization across combinations
The subsequent analyses are reported with PCOS = 50% and PPOS = 80% unless noted otherwise.
As the reliability of the predictors increases, the POS decreases.
Main effect size and collinearity also affect the POS, although their influence is smaller than that of reliability and sample size.
Stability in the average study
A combination of parameters that approximates a field study in psychology (Brysbaert, 2019; Nosek et al., 2022; Open Science Collaboration, 2015) has a point of stability of n = 3,800, at which there is 72% power to detect the expected (attenuated) interaction effect size.
In a nonpreregistered analysis, to better illustrate the POS/COS trade-offs, we also calculated Cohen's f2 for the interaction term at each POS.

Figure 1: Sample size and the point of stability across COS widths.
For example, consider Points 1 through 3 in Figure 1. All illustrate the trade-off between corridor width and the sample size required for stability: narrowing the COS increases the POS.
Power and stability
In a deviation from our preregistration, we chose to more specifically examine statistical power at the POS for all combinations considering various COS widths because we believe readers might find such an examination useful. With a PCOS of 50% and a PPOS of 80%, the field-study combination reached stability at 72% power.
Sign errors and true-value-recapture rates
Sign errors are most frequent at small sample sizes across all combinations (at n ≤ 100, 11% to 45% of estimates were negatively signed despite positive true effects).
Case example
For the simulated data for the combination with median parameters, see Figure 2. The data illustrated are not simulated using the search algorithm described in the Method section because the plot would be incomplete. Instead, data are generated using the same procedure at each examined sample size.

Plot of the estimates for the interaction-term coefficient (β3) at each examined sample size.
Stability is achieved at n = 425 with a COS defined by PCOS = 50%.
Discussion
Despite the theoretical appeal of interaction effects, the results of this study suggest that estimating two-way interactions with linear regression presents major methodological and statistical challenges. An estimate of an effect is considered stable when it is likely to replicate in magnitude and direction across samples. Our simulations identified sample size, predictor reliability, and interaction effect size as the primary determinants of the stability of interaction estimates. When these parameters were evaluated at levels realistic for a field study, the sample size required for a stable estimate of the interaction coefficient was 3,800.
An important takeaway from this study is that, in the field-study case, 72% power to detect the interaction coincided with estimates that were reasonably stable (PCOS = 50%, PPOS = 80%). Power is thus a serviceable proxy for stability when it is computed from realistic, attenuation-adjusted effect sizes.
Although there is no inherent flaw in interaction analyses, interaction effects have unique properties that explain the difficulties researchers encounter when testing for them. It has been repeatedly observed that interaction effects are small, with no evidence from any field that large interactions are common (Credé & Sotola, 2024; Freese & Peterson, 2017; Open Science Collaboration, 2015). Simultaneously, they are uniquely susceptible to attenuation because the reliability of the interaction term is approximately the product of the reliabilities of its component predictors. Methodologically, the sample-size thresholds for stability we have outlined will likely be impractical for most researchers to meet because of constraints on resources and sociological pressures around publishing. The same pressures drive many interaction analyses: if the main effects in a study are not significant, researchers may turn to testing interactions until a significant (and thus publishable; Greenwald, 1975) result is found. In such cases, researchers are typically searching for interactions they are grossly underpowered to detect and that would yield highly unstable estimates. These conditions encourage atheoretical serial testing of interactions, which has led and will continue to lead to inflated family-wise error rates and gross overestimations of population effect sizes (Sotola & Credé, 2023).
Power and stability are not the only issues with interactions. Even if a given sample size provides adequate power and stability for an interaction, there remain deeper conceptual and interpretive challenges. As Rohrer and Arslan (2021) emphasized, there are additional issues to consider: scale dependence, the distinction between moderation of slopes and moderation of correlations, and causal identification. Interactions can change in magnitude (or even reverse direction) depending on the measurement scale, leading to contradictory conclusions. Differences in correlations between groups may be mistaken for slope differences or vice versa, obscuring the true nature of the effect. Finally, significant interaction terms do not imply causal interactions unless both variables are appropriately manipulated or strong assumptions are met. Thus, even precise and stable estimates can be misleading without careful attention to these foundational issues.
Stability and power
Statistical power was identified as a strong proxy for stability, although the two constructs are theoretically and operationally distinct. For the 192 parameter combinations, stability (defined as PCOS = 50% and PPOS = 80%) closely tracked statistical power at the POS.
Point of Stability (n) for Different Reliabilities and Interaction Effect Sizes
Note: Main effect size is 0.2, and the collinearity between the predictors is 0.1. We also calculated POS values for COSs with widths of 10% and 75%. For brevity, they are excluded from this table and are available in the Supplemental Material available online. Negative effect sizes were not evaluated in this study. In response to reviewer feedback, we evaluated one case from this table (see the Supplemental Material).
CIs are often misinterpreted as indicators of stability. A confidence interval reflects within-study uncertainty around the attenuated parameter targeted by the estimator under measurement error. It does not evaluate consistency across replications as the COS and POS do. For example, at n = 100 with ρ = 0.1, the average 95% CI for the interaction term across 10,000 simulations was [–0.11, 0.27] around an attenuated β3 ≈ 0.08, indicating substantial imprecision in a single study. By contrast, the COS and POS are design-level metrics: they quantify the proportion of replicated estimates that fall within a prespecified band around the attenuated population value. CIs summarize the precision of one estimate, whereas the COS and POS summarize the ability of the design to produce stable results. Narrow CIs do not guarantee high stability if the corridor is tighter than the typical sampling variability; conversely, a design can meet a stability criterion even when single-study CIs remain relatively wide.
Recapture rates and sign errors
There is a moderate to high risk of sign errors when the sample size is below n = 100: across combinations, 11% to 45% of estimates at such sample sizes were negatively signed despite positive true effects.
Likewise, the true population value (β3) is recaptured infrequently at small sample sizes, indicating that estimates from such studies are unlikely to approximate the true effect.
Alternatives
Our results accord with Murphy and Russell's (2017) recommendation for researchers to end the search for moderators unless major improvements to their detection can be made. Estimating interaction effects necessitates at least four groups because they involve the examination of multiple levels of two independent variables simultaneously. Additive main effects are often mistaken for interactions (Vize, Baranger, et al., 2023), and one cannot conclude that an interaction exists simply because two variables produce a larger effect jointly than separately. In linear regression, the interaction term models a situation in which variance is explicable beyond the sum of the constituent main effects. Thus, theoretical justification of why two main effects are insufficient to explain the phenomenon is critical when testing for interactions. An atheoretical examination of all possible moderators should always be disclosed as exploratory. Given seven variables, there are 21 possible two-way interactions; at α = .05, at least one is likely to be significant by chance alone.
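The arithmetic behind this caution is straightforward (treating the 21 tests as independent, an idealization):
choose(7, 2)       # 21 candidate two-way interactions among seven variables
1 - (1 - 0.05)^21  # ~0.66 probability that at least one is significant by chance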
Assuming a genuine interaction is hypothesized, we suggest that statistical approaches beyond linear regression may reduce the impact of measurement error and improve the signal-to-noise ratio inherent to small effects. Reliability affects both the point at which estimates of the interaction can be regarded as "stable" and the observable effect size because of attenuation of the interaction term. Latent-variable approaches such as SEM model measurement error explicitly and can therefore limit this attenuation (Hoyle, 2012).
Practical recommendations
Researchers planning interaction analyses with two continuous predictors can take several concrete steps to improve the replicability of their research. First and most important, developing an interaction hypothesis and powering for its analysis using realistic interaction effect sizes (r ≤ .10) rather than Cohen’s benchmarks will help avoid the pitfalls of underpowered, exploratory interaction testing. When feasible, SEM can mitigate some of the reliability-related challenges we have identified. Under an SEM framework, stability may be achieved at sample sizes comparable with our simulated cases that had perfect reliability (Hoyle, 2012), although many of the perfectly reliable cases still require enormous sample sizes for stability. Given the substantial sample-size requirements we identify, researchers unable to achieve at least 72% power for the interaction should consider whether their research questions can be addressed through alternative approaches.
Limitations
This study had several limitations. Any attempt to quantify the stability of an estimate in a regression framework across multiple trials requires an arbitrary selection of cut points to delineate “stable” and “unstable,” which applies to our percentage-based COS and POS, even though these had advantages over their existing definitions. Furthermore, the Monte Carlo simulations relied on synthetic data that may not reflect typical social-science data in which peculiarities such as nonnormality, omitted variables, or selection bias can appear. Measurement error similarly is modeled only through predictor reliability, which is but one potential source of noise in the data among many (Loken & Gelman, 2017; Schmidt et al., 2003). Potential concerns about the applicability of linear models to nonlinear patterns remain beyond the scope of this study. Large sample sizes required for stability may be impractical in empirical research, although alternative methods, such as SEM (Hoyle, 2012), could mitigate some of the identified issues with interaction analyses.
Conclusion
Our results strongly suggest that testing two-way interaction effects in psychology field studies using a linear regression framework is largely untenable. Even under the most favorable conditions, achieving stability requires sample sizes in excess of what is common in empirical research. It follows from the results of this study and corroborating evidence from existing research (Murphy & Russell, 2017; Vize, Baranger, et al., 2023; Vize, Sharpe, et al., 2023) that published interaction effects estimated in a linear-regression framework should be regarded with skepticism. Researchers should test interactions only if they have a specific interaction hypothesis, namely that the relation between two variables changes depending on the level of a third, and the study design is appropriate (sufficiently large N and reliable measures). Alternative methods to regression, such as SEM, may better address challenges related to reliability, power, and sampling variability in interaction analyses (Cole & Preacher, 2014; Hoyle, 2012; Marsh et al., 2004). Neglecting these factors risks portraying negligible interactions as stable and reliable, leading to irreplicable studies that form the backbone of misguided theories built on the artifacts of misaligned incentives and null hypothesis testing.
Transparency
Action Editor: Pamela Davis-Kean
Editor: David A. Sbarra