Abstract

Cognitive dissonance in the induced-compliance paradigm (Croyle & Cooper, 1983; Festinger & Carlsmith, 1959) represents one of the foundational theories and experimental paradigms of social psychology. And yet despite a truly heroic effort, Vaidis et al. (2024) were unable to obtain similar results across a host of partner laboratories based in over a dozen nations that recruited a total of over 4,000 research participants. The original Croyle and Cooper (1983) research has been widely cited and influential and was authored by eminent researchers at prestigious academic institutions. And yet it drew a small sample from a single population and was carried out before it became commonplace to limit researcher degrees of freedom (Simmons et al., 2011) by committing to one’s planned methods and analyses in advance (Van’t Veer & Giner-Sorolla, 2016; Wagenmakers et al., 2012). As with most crowd initiatives, Vaidis et al. assembled a coalition of the willing, including researchers who varied greatly in academic seniority and topic specialization and were based at many different institutions around the world. A familiar dilemma repeats itself, but this time with much greater stakes given the extraordinary impact and importance of the original work. What should one conclude from a systematic failed replication?
The Context-Sensitivity Defense
Initially, the main argument against drawing strong inferences from null results was the likely sensitivity of social judgments and behaviors to hidden moderators, such as cultural and population differences (Bargh, 2012b; Gilbert et al., 2016; Schnall, 2014; Schwarz & Strack, 2014; Stroebe & Strack, 2014; Van Bavel et al., 2016). Effects that are confirmed meta-analytically when aggregating across sites do exhibit statistically substantial heterogeneity across populations, with large effect sizes in some samples and small or even near-zero estimates in others (Klein et al., 2018; Krefeld-Schwalb et al., 2024). However, we believe this variability is at times overstated in that it only rarely involves qualitative differences in effects.
Consider the recent widely discussed evidence of cross-sample heterogeneity in decision biases provided by Krefeld-Schwalb et al. (2024), some of the best evidence yet for sensitivity to context in behavioral research. Closer examination of these results reveals that the default effect and framing replicated significantly in 10 of 11 populations and directionally in 11 of 11 populations, the less-is-better effect replicated significantly and directionally in 11 of 11 populations, and the sunk cost effect replicated significantly in eight of 11 populations and directionally in nine of 11 populations. There were precisely zero statistically significant reversals of any effect in any sample. Although researcher choices in experimental designs and statistical approaches can be hugely impactful (Landy et al., 2000; Silberzahn et al., 2018), there is not as much population variability in real findings as many scholars have explicitly argued or implicitly assumed.
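To give a rough sense of how unlikely such directional consistency would be if effect directions were arbitrary, the counts above can be compared with a simple sign-test benchmark. The sketch below is illustrative only and is not an analysis reported by Krefeld-Schwalb et al. (2024). It assumes, for simplicity, that the 11 populations are independent and that under the null each population is equally likely to show either direction; the helper name prob_at_least is our own.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more 'hits' in n tries."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Directional replication counts reported in the text (out of 11 populations).
directional = {
    "default effect": 11,
    "framing": 11,
    "less-is-better": 11,
    "sunk cost": 9,
}

# Assumption: independent populations, 50/50 direction under the null.
for effect, k in directional.items():
    p_chance = prob_at_least(k, 11)
    print(f"{effect}: {k}/11 directional, chance probability = {p_chance:.4f}")
```

Under these assumptions, 11 of 11 directional replications has a chance probability of 1 in 2,048, and nine or more of 11 has a chance probability of 67 in 2,048 (about .03).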
More importantly, meta-analytic evidence reveals little cross-site heterogeneity in overall replication failures, sharply contradicting the context-sensitivity defense (Olsson-Collentine et al., 2020). Effects that appear to be false positives under the criterion that they fail to produce a significant directional estimate when aggregating across sites (e.g., money or flag priming; Klein et al., 2014) are typically not characterized by significant replications at some sites, near-zero estimates at others, and significant reversals at others. The perspectivist thesis that most findings are massively moderated and thus likely to fully reverse across different populations (“the opposite of a great truth is also true”; McGuire, 1973, 1983) provides the intellectual backdrop for the hidden-moderator rebuttal to failed replications. Although a beautiful intellectual vision, perspectivism is not empirically supported by crowdsourced direct replications.
The Expertise Defense
The accumulated empirical evidence also contradicts the claim that the modest overall replicability rate for published findings from top journals (Klein et al., 2014, 2018; Nosek et al., 2022; Open Science Collaboration, 2015) is attributable to replicator inexpertise (Bargh, 2012a; Baumeister, 2016; Schnall, 2014). Traditional indicators of scientific eminence, such as publication records, do not predict the empirical results replicators obtain (Bench et al., 2017; Landy et al., 2000). In addition, involving original authors as consultants or data collectors does not appreciably affect replication effect sizes (Klein et al., 2022; Schweinsberg et al., 2016) even when they resample the original population (Schweinsberg et al., 2016).
Some highly complex research paradigms (e.g., those involving confederates and hidden cameras) are much more difficult to scale than others, unquestionably limiting the scope of crowd-science initiatives. Some laboratory measures, such as functional MRI, require considerable prior training to deploy successfully. However, reasoning backward from a failed replication to the conclusion that “they must have done it wrong,” as is likely to occur in the case of Vaidis et al. (2024), is defensive, unscientific, and fallacious. If some findings are fragile and require expert hands, then more accomplished scientists by traditional metrics should be more likely to obtain them. But at least among the original effects reexamined thus far, it just is not so.
The currently available evidence suggests that psychological findings are either fairly robust and generalizable across most research teams and participant populations (e.g., representativeness heuristic, defaults, framing, loss aversion; Klein et al., 2014; Krefeld-Schwalb et al., 2024) or consistently are not (e.g., prime to behavior effects, ego depletion, effects of power poses on hormone levels; Cesario et al., 2017; Klein et al., 2014; Lodder et al., 2019; Verschuere et al., 2018; Vohs et al., 2021). The field seems to have produced one set of highly robust findings that hold across most contexts and another collection of dubious findings that fail to emerge, again and again, when research is done under crowd conditions that put the expertise and context-sensitivity arguments to systematic empirical tests.
The Operational-Failure Defense
Because these earlier rebuttals face accumulating empirical counterevidence, the emergent defense against a systematic nonreplication is now that of operational failure (Baumeister et al., 2023; Fiedler et al., 2021). Perhaps the experimental manipulation did not successfully activate or affect the targeted mediating psychological state. If so, the replication may not have provided an informative test of the hypothesized causal relationship between the independent and dependent variables. As Baumeister et al. (2023) wrote, “Operational failures . . . do not constitute falsifications of the hypothesis, because they were unable to provide a test of it” (p. 919).
Vaidis et al. (2024) addressed the operational-failure concern by carefully measuring a key mediating state, specifically, subjectively perceived choice. They found that the perceived voluntariness of writing a counter-attitudinal essay is greater in the high-choice condition but that this does not instigate the attitude change predicted by cognitive-dissonance theory (Croyle & Cooper, 1983; Festinger & Carlsmith, 1959). Perhaps future replications should similarly capture mediating states even when they were never assessed in the original study. Going further, even systematic crowdsourced replications could be collectively discounted by the scholarly community as uninformative if the manipulation does not significantly affect the mediator.
Manipulation checks and mediational measures are inherently valuable to include in both original studies and replications whenever feasible. However, the operational-failure defense underestimates the severity of many skeptics’ concerns about small-sample classic studies. Indeed, there is one major form of metascientific skepticism regarding the original work that is supported, rather than undermined, by evidence of operational failure.
Statistical Skepticism Versus Hypothesis Skepticism
One can distinguish between the “hypothesis skeptic,” who doubts the original theoretical claim (“It seems unlikely to me that perceived choice in engaging in a counter-attitudinal act causes attitude change”), and the “statistical skeptic,” who dismisses implausibly large effect sizes from small underpowered studies as mainly noise rather than signal. Note that the key metascientific articles that instigated the crisis of confidence in science focused principally on statistical and methodological concerns, such as insufficient statistical power, effect-size overestimation, researcher degrees of freedom, and publication bias—issues that generalize across research topics (Fanelli, 2010; Ioannidis, 2005; Simmons et al., 2011). Many metascientists and replicators, ourselves among them, approach the literature from the standpoint of a statistical skeptic rather than a hypothesis skeptic. We see limited informational value in an experimental laboratory investigation with tiny numbers of participants per cell: The reported effects of condition on not only the dependent variable but also any process measures are at high risk of proving spurious.
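The statistical skeptic's concern about tiny cells can be illustrated with a short simulation. The sketch below is our own illustration, not an analysis of any cited study: it assumes a hypothetical true effect of d = 0.2, 10 participants per cell, normally distributed scores, and a literature in which only significant results in the predicted direction are reported. With 10 participants per cell, an observed effect must exceed roughly d = 0.94 to reach significance, so every "published" estimate necessarily overstates the assumed true effect several times over.

```python
import math
import random
import statistics

N_PER_CELL = 10   # hypothetical small-sample design
T_CRIT = 2.101    # two-sided .05 critical t for df = 2 * N_PER_CELL - 2 = 18
TRUE_D = 0.2      # hypothetical small true effect (assumption for illustration)

def cohens_d_and_t(treatment, control):
    """Cohen's d and the equal-n independent-samples t statistic."""
    n = len(treatment)
    pooled_sd = math.sqrt(
        (statistics.variance(treatment) + statistics.variance(control)) / 2
    )
    d = (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd
    return d, d * math.sqrt(n / 2)  # t = d * sqrt(n / 2) when cell sizes are equal

def simulate(n_studies=5000, seed=1):
    """Run many small studies; keep effect sizes only from the significant ones."""
    rng = random.Random(seed)
    significant_ds = []
    for _ in range(n_studies):
        treatment = [rng.gauss(TRUE_D, 1) for _ in range(N_PER_CELL)]
        control = [rng.gauss(0, 1) for _ in range(N_PER_CELL)]
        d, t = cohens_d_and_t(treatment, control)
        if t > T_CRIT:  # significant in the predicted direction
            significant_ds.append(d)
    return len(significant_ds) / n_studies, statistics.mean(significant_ds)

share_significant, mean_significant_d = simulate()
print(f"share of studies reaching significance: {share_significant:.3f}")
print(f"mean d among significant studies: {mean_significant_d:.2f} (true d = {TRUE_D})")
```

The same logic applies to process measures collected in the same small samples: a significant manipulation check or mediator estimate from such a design is subject to the same selection-driven inflation.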
There is therefore no need for a statistical skeptic to show that the ego-depletion manipulation exhausted participants’ mental resources, that fart spray made them feel disgusted, that recalling a time when they felt powerful made them feel powerful, or that the incidental presence of money activated thoughts about materialism. Given the statistical noise associated with the original designs, it is questionable whether these manipulations ever effectively induced their intended states or truly influenced scores on the dependent measures. Therefore, we would not expect either these mediating states or theorized downstream outcomes, such as the ability to resist tempting treats, harsher moral judgments, more agentic behaviors, and greater cheating, to prove robust. A replicator approaching the work from this stance need only repeat the original experimental manipulation and estimate the effect on the dependent variable using a large sample and preregistered analyses. The inclusion of process measures could add value if the aim is to faithfully recreate the original experimental design in its entirety but is not essential, especially if the original study itself featured no manipulation checks or mediational measures.
Another virtue of statistical skepticism, relative to hypothesis skepticism, is epistemological. From the perspective of traditional philosophy of science, it is extremely difficult to disprove scientific claims, especially in the social sciences (Kuhn, 1962; Lakatos, 1970; Lipton, 2008). In principle, an alternative operationalization of the independent or dependent variable could reveal support for the original theory. Thus, the hypothesis skeptic can deepen doubts but never definitively falsify the original theoretical claim. Alternative experimental designs or variations of the induced-compliance paradigm might still demonstrate attitude changes under specific conditions, underscoring the provisional nature of hypothesis skepticism. In contrast, statistical skepticism focuses on the empirical robustness of findings rather than theoretical plausibility. There exist powerful tools capable of showing that a piece of experimental evidence does not provide robust positive support for the stated conclusions. If the original study reports implausibly large effects based on tiny samples (Schimmack, 2012), features p values that only barely clear the significance threshold (Simonsohn et al., 2014; van Aert et al., 2016), and/or the effect systematically fails to emerge in numerous multisite direct replications, the narrow claim that the original work should be largely discounted in Bayesian terms is supported.
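What "largely discounted in Bayesian terms" can mean in practice is sketched below under deliberately simple assumptions: normal likelihoods, a flat prior on the effect, and the standard large-sample approximation to the sampling variance of Cohen's d. Under those assumptions the posterior mean is the inverse-variance-weighted average of the estimates. The two studies are hypothetical stand-ins (a small original with d = 0.80 and 10 participants per cell, and a pooled replication with d = 0.02 and 2,000 per cell), not figures from Croyle and Cooper (1983) or Vaidis et al. (2024).

```python
def d_variance(d, n_per_cell):
    """Approximate sampling variance of Cohen's d for two equal-sized cells."""
    return 2 / n_per_cell + d**2 / (4 * n_per_cell)

# Hypothetical values chosen for illustration only.
original = {"d": 0.80, "n_per_cell": 10}
replication = {"d": 0.02, "n_per_cell": 2000}

# Precision (inverse variance) is the weight each study earns under a
# normal likelihood with a flat prior.
w_original = 1 / d_variance(**original)
w_replication = 1 / d_variance(**replication)

posterior_mean = (
    w_original * original["d"] + w_replication * replication["d"]
) / (w_original + w_replication)
original_share = w_original / (w_original + w_replication)

print(f"posterior mean effect: {posterior_mean:.3f}")
print(f"weight carried by the original study: {original_share:.2%}")
```

With these inputs the original study carries well under 1% of the total weight, so the combined estimate sits almost exactly on the replication estimate. A skeptical prior centered on zero would discount the original further still.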
Conclusion
The hypothesis skeptic suspects the original theory is false; the statistical skeptic suspects the original study captured mostly noise. Providing evidence against a theorized independent variable/dependent variable link requires careful manipulation checks and measures of mediating states, as in Vaidis et al. (2024), and faces the potentially insurmountable epistemological and empirical challenge of proving that something never happens. In contrast, overwhelming the noisy estimates of unreliable original studies with the strong signals provided by superior multisite samples and more rigorous analyses is and should continue to be the primary goal of replication. This approach remains pivotal in advancing the reliability and validity of psychological research, revealing the clear signals of robust phenomena.
Footnotes
Transparency
Action Editor: David A. Sbarra
Editor: David A. Sbarra
Author Contributions
