Noise Versus Signal: What Can One Conclude When a Classic Finding Fails to Replicate?

Abstract

Cognitive dissonance in the induced-compliance paradigm (Croyle & Cooper, 1983; Festinger & Carlsmith, 1959) represents one of the foundational theories and experimental paradigms of social psychology. And yet despite a truly heroic effort, Vaidis et al. (2024) were unable to obtain similar results across a host of partner laboratories based in over a dozen nations that recruited a total of over 4,000 research participants. The original Croyle and Cooper (1983) research has been widely cited and influential and was authored by eminent researchers at prestigious academic institutions. And yet it drew a small sample from a single population and was carried out before it became commonplace to limit researcher degrees of freedom (Simmons et al., 2011) by committing to one’s planned methods and analyses in advance (Van’t Veer & Giner-Sorolla, 2016; Wagenmakers et al., 2012). As with most crowd initiatives, Vaidis et al. assembled a coalition of the willing, including researchers who varied greatly in academic seniority and topic specialization and were based at many different institutions around the world. A familiar dilemma repeats itself, but this time with much greater stakes given the extraordinary impact and importance of the original work. What should one conclude from a systematic failed replication?

The Context-Sensitivity Defense

Initially, the main argument against drawing strong inferences from null results was the likely sensitivity of social judgments and behaviors to hidden moderators, such as cultural and population differences (Bargh, 2012b; Gilbert et al., 2016; Schnall, 2014; Schwarz & Strack, 2014; Stroebe & Strack, 2014; Van Bavel et al., 2016). Effects that are confirmed meta-analytically aggregating across sites do exhibit statistically substantial heterogeneity across populations, with large effect sizes in some samples and small or even near-zero estimates in others (Klein et al., 2018; Krefeld-Schwalb et al., 2024). However, we believe this variability is at times overstated in that it only relatively rarely involves qualitative differences in effects.

Consider the recent widely discussed evidence of cross-sample heterogeneity in decision biases provided by Krefeld-Schwalb et al. (2024), some of the best evidence yet for sensitivity to context in behavioral research. Closer examination of these results reveals that the default effect and framing replicated significantly in 10 of 11 populations and directionally in 11 of 11 populations, the less-is-better effect replicated significantly and directionally in 11 of 11 populations, and the sunk cost effect replicated significantly in eight of 11 populations and directionally in nine of 11 populations. There were precisely zero statistically significant reversals of any effect in any sample. Although researcher choices in experimental designs and statistical approaches can be hugely impactful (Landy et al., 2020; Silberzahn et al., 2018), there is not as much population variability in real findings as many scholars have explicitly argued or implicitly assumed.

More importantly, meta-analytic evidence reveals little cross-site heterogeneity in overall replication failures, sharply contradicting the context-sensitivity defense (Olsson-Collentine et al., 2020). Effects that appear to be false positives by the criterion that they fail to produce a significant directional estimate aggregating across sites (e.g., money or flag priming; Klein et al., 2014) are typically not characterized by significant replications at some sites, near-zero estimates at others, and significant reversals at others. The perspectivist thesis that most findings are massively moderated and thus likely to fully reverse across different populations (“the opposite of a great truth is also true”; McGuire, 1973, 1983) provides the intellectual backdrop for the hidden-moderator rebuttal to failed replications. Although a beautiful intellectual vision, perspectivism is not empirically supported by crowdsourced direct replications.

The Expertise Defense

The accumulated empirical evidence also contradicts the claim that the modest overall replicability rate for published findings from top journals (Klein et al., 2014, 2018; Nosek et al., 2022; Open Science Collaboration, 2015) is attributable to replicator inexpertise (Bargh, 2012a; Baumeister, 2016; Schnall, 2014). Traditional indicators of scientific eminence, such as publication records, do not predict the empirical results replicators obtain (Bench et al., 2017; Landy et al., 2020). In addition, involving original authors as consultants or data collectors does not appreciably affect replication effect sizes (Klein et al., 2022; Schweinsberg et al., 2016) even when they resample the original population (Schweinsberg et al., 2016).

Some highly complex research paradigms (e.g., those involving confederates and hidden cameras) are much more difficult to scale than others, unquestionably limiting the scope of crowd-science initiatives. Some laboratory measures, such as functional MRI, require considerable prior training to deploy successfully. However, reasoning backward from a failed replication to the conclusion that “they must have done it wrong,” as is likely to occur in the case of Vaidis et al. (2024), is defensive, unscientific, and fallacious. If some findings are fragile and require expert hands, then more accomplished scientists by traditional metrics should be more likely to obtain them. But at least among the original effects reexamined thus far, it just is not so.

The currently available evidence suggests that psychological findings are either fairly robust and generalizable across most research teams and participant populations (e.g., representativeness heuristic, defaults, framing, loss aversion; Klein et al., 2014; Krefeld-Schwalb et al., 2024) or consistently are not (e.g., prime-to-behavior effects, ego depletion, effects of power poses on hormone levels; Cesario et al., 2017; Klein et al., 2014; Lodder et al., 2019; Verschuere et al., 2018; Vohs et al., 2021). The field seems to have produced one set of highly robust findings that hold across most contexts and another collection of dubious findings that repeatedly fail to emerge when research is done under crowd conditions that put the expertise and context-sensitivity arguments to systematic empirical tests.

The Operational-Failure Defense

Because these earlier rebuttals face accumulating empirical counterevidence, the emergent defense against a systematic nonreplication is now that of operational failure (Baumeister et al., 2023; Fiedler et al., 2021). Perhaps the experimental manipulation did not successfully activate or affect the targeted mediating psychological state. If so, the replication may not have provided an informative test of the hypothesized causal relationship between the independent and dependent variables. As Baumeister et al. (2023) wrote, “Operational failures . . . do not constitute falsifications of the hypothesis, because they were unable to provide a test of it” (p. 919).

Vaidis et al. (2024) addressed the operational-failure concern by carefully measuring a key mediating state, specifically, subjectively perceived choice. They found that the perceived voluntariness of writing a counter-attitudinal essay is greater in the high-choice condition but that this does not instigate the attitude change predicted by cognitive-dissonance theory (Croyle & Cooper, 1983; Festinger & Carlsmith, 1959). Perhaps future replications should similarly capture mediating states even when they were never assessed in the original study. Going further, even systematic crowdsourced replications could be collectively discounted by the scholarly community as uninformative if the manipulation does not significantly affect the mediator.

Manipulation checks and mediational measures are inherently valuable to include in both original studies and replications whenever feasible. However, the operational-failure defense underestimates the severity of many skeptics’ concerns about small-sample classic studies. Indeed, there is one major form of metascientific skepticism regarding the original work that is supported, rather than undermined, by evidence of operational failure.

Statistical Skepticism Versus Hypothesis Skepticism

One can distinguish between the “hypothesis skeptic,” who doubts the original theoretical claim (“It seems unlikely to me that perceived choice in engaging in a counter-attitudinal act causes attitude change”), and the “statistical skeptic,” who dismisses implausibly large effect sizes from small underpowered studies as mainly noise rather than signal. Note that the key metascientific articles that instigated the crisis of confidence in science focused principally on statistical and methodological concerns, such as insufficient statistical power, effect-size overestimation, researcher degrees of freedom, and publication bias—issues that generalize across research topics (Fanelli, 2010; Ioannidis, 2005; Simmons et al., 2011). Many metascientists and replicators, ourselves among them, approach the literature from the standpoint of a statistical skeptic rather than a hypothesis skeptic. We see limited informational value in an experimental laboratory investigation with tiny numbers of participants per cell: The reported effects of condition on not only the dependent variable but also any process measures are at high risk of proving spurious.
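The statistical skeptic's worry about effect-size overestimation can be illustrated with a short simulation. The sketch below is only an illustration under stated assumptions: the true effect (d = 0.2), the cell size (10 per cell), the number of simulated studies, and the function name `simulate_small_studies` are all hypothetical choices, not values taken from any study discussed here, and a z test with known unit variance stands in for the t test an actual study would use.

```python
import math
import random
import statistics


def simulate_small_studies(true_d=0.2, n_per_cell=10, n_studies=20000, seed=1):
    """Simulate two-cell experiments and summarize the 'significant' ones.

    All defaults are illustrative assumptions, not values from any study.
    Scores are standard normal in the control cell and shifted by true_d in
    the treatment cell, so the mean difference is in Cohen's d units.
    """
    rng = random.Random(seed)
    se = math.sqrt(2.0 / n_per_cell)  # standard error with known SD = 1
    z_crit = statistics.NormalDist().inv_cdf(0.975)  # two-sided alpha = .05
    significant = []
    for _ in range(n_studies):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_cell)]
        treated = [rng.gauss(true_d, 1.0) for _ in range(n_per_cell)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        if diff / se > z_crit:  # significant in the predicted direction
            significant.append(diff)
    # Note: fmean raises StatisticsError if no simulated study is significant.
    return {
        "share_significant": len(significant) / n_studies,
        "mean_significant_effect": statistics.fmean(significant),
        "smallest_significant_effect": min(significant),
    }


summary = simulate_small_studies()
print(summary)
```

Under these assumptions, every estimate that clears the significance bar must exceed roughly 0.88, more than four times the assumed true effect of 0.2. A literature that publishes only the significant small-cell results would therefore report inflated effects even when the underlying effect is real.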

There is therefore no need for a statistical skeptic to show that the ego-depletion manipulation exhausted participants’ mental resources, that fart spray made them feel disgusted, that recalling a time when they felt powerful made them feel powerful, or that the incidental presence of money activated thoughts about materialism. Given the statistical noise associated with the original designs, it is questionable whether these manipulations ever effectively induced their intended states or truly influenced scores on the dependent measures. Therefore, we would not expect either these mediating states or theorized downstream outcomes, such as the ability to resist tempting treats, harsher moral judgments, more agentic behaviors, and greater cheating, to prove robust. A replicator approaching the work from this stance need only repeat the original experimental manipulation and estimate the effect on the dependent variable using a large sample and preregistered analyses. The inclusion of process measures could add value if the aim is to faithfully recreate the original experimental design in its entirety but is not essential, especially if the original study itself featured no manipulation checks or mediational measures.

Another virtue of statistical skepticism, relative to hypothesis skepticism, is epistemological. From the perspective of traditional philosophy of science, it is extremely difficult to disprove scientific claims, especially in the social sciences (Kuhn, 1962; Lakatos, 1970; Lipton, 2008). In principle, an alternative operationalization of the independent or dependent variable could reveal support for the original theory. Thus, the hypothesis skeptic can deepen doubts but never definitively falsify the original theoretical claim. Alternative experimental designs or variations of the induced-compliance paradigm might still demonstrate attitude changes under specific conditions, underscoring the provisional nature of hypothesis skepticism. In contrast, statistical skepticism focuses on the empirical robustness of findings rather than theoretical plausibility. There exist powerful tools capable of showing that a piece of experimental evidence does not provide robust positive support for the stated conclusions. If the original study reports implausibly large effects based on tiny samples (Schimmack, 2012), features p values barely under the significance threshold (Simonsohn et al., 2014; van Aert et al., 2016), and/or the effect systematically fails to emerge in numerous multisite direct replications, the narrow claim that the original work should be largely discounted in Bayesian terms is supported.
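One simple way to read "discounted in Bayesian terms" is precision weighting: under a flat prior and normal likelihoods with known variances, the posterior mean of a common effect is the inverse-variance weighted average of the available estimates. The sketch below makes that arithmetic concrete. It is not necessarily the analysis any cited author has in mind, and the effect sizes, sample sizes, the variance approximation of 2/n per cell, and the function name `precision_weighted_pool` are all hypothetical.

```python
def precision_weighted_pool(estimates):
    """Combine (effect, variance) pairs by inverse-variance weighting.

    Under a flat prior and normal likelihoods with known variances, the
    result equals the posterior mean and variance of the common effect.
    """
    precisions = [1.0 / var for _, var in estimates]
    total = sum(precisions)
    pooled = sum(p * eff for p, (eff, _) in zip(precisions, estimates)) / total
    weights = [p / total for p in precisions]
    return pooled, 1.0 / total, weights


# Hypothetical inputs in Cohen's d units; variance approximated as 2 / n per cell.
original = (0.80, 2.0 / 10)       # small original study, 10 per cell
replication = (0.00, 2.0 / 2000)  # large pooled replication, 2,000 per cell

pooled, pooled_var, weights = precision_weighted_pool([original, replication])
print(f"weight on original: {weights[0]:.4f}")
print(f"pooled estimate:    {pooled:.4f}")
```

With these made-up inputs the original study receives under 1% of the total weight, so the pooled estimate sits almost exactly at the replication value. The point is structural rather than numerical: a tiny-sample estimate carries so little precision that a large multisite sample overwhelms it.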

Conclusion

The hypothesis skeptic suspects the original theory is false; the statistical skeptic suspects the original study captured mostly noise. Providing evidence against a theorized independent variable/dependent variable link requires careful manipulation checks and measures of mediating states, as in Vaidis et al. (2024), and faces the potentially insurmountable epistemological and empirical challenge of proving that something never happens. In contrast, overwhelming the noisy estimates of unreliable original studies with the strong signals provided by superior multisite samples and more rigorous analyses is and should continue to be the primary goal of replication. This approach remains pivotal in advancing the reliability and validity of psychological research, revealing the clear signals of robust phenomena.

Footnotes

Transparency

Action Editor: David A. Sbarra

Editor: David A. Sbarra

Author Contributions

Wilson Cyrus-Lai: Conceptualization; Investigation; Methodology; Writing – original draft; Writing – review & editing.

Warren Tierney: Writing – review & editing.

Eric Luis Uhlmann: Conceptualization; Investigation; Supervision; Writing – original draft; Writing – review & editing.

ORCID iD

Wilson Cyrus-Lai

References

Bargh J. A. (2012a). Nothing in their heads. Psychology Today. https://replicationindex.com/wp-content/uploads/2020/07/bargh-nothingintheirheads.pdf

Bargh J. A. (2012b). Priming effects replicate just fine, thanks. Psychology Today. https://www.psychologytoday.com/blog/the-natural-unconscious/201205/priming-effects-replicate-just-fine-thanks

Baumeister R. F. (2016). Charting the future of social psychology on stormy seas: Winners, losers, and recommendations. Journal of Experimental Social Psychology, 66, 153–158.

Baumeister R. F., Tice D. M., Bushman B. J. (2023). A review of multisite replication projects in social psychology: Is it viable to sustain any confidence in social psychology’s knowledge base? Perspectives on Psychological Science, 18(4), 912–935.

Bench S. W., Rivera G. N., Schlegel R. J., Hicks J. A., Lench H. C. (2017). Does expertise matter in replication? An examination of the Reproducibility Project: Psychology. Journal of Experimental Social Psychology, 68, 181–184.

Cesario, Jonas K. J., Carney D. R. (2017). CRSP special issue on power poses: What was the point and what did we learn? Comprehensive Results in Social Psychology, 2, 1–5.

Croyle R. T., Cooper (1983). Dissonance arousal: Physiological evidence. Journal of Personality and Social Psychology, 45(4), 782–791.

Fanelli (2010). “Positive” results increase down the hierarchy of the sciences. PLOS ONE, 5(4), Article e10068. https://doi.org/10.1371/journal.pone.0010068

Festinger, Carlsmith J. M. (1959). Cognitive consequences of forced compliance. The Journal of Abnormal and Social Psychology, 58(2), 203–210.

Fiedler, McCaughey, Prager (2021). Quo vadis, methodology? The key role of manipulation checks for validity control and quality of science. Perspectives on Psychological Science, 16, 816–826.

Gilbert D. T., King, Pettigrew, Wilson T. D. (2016). Comment on “Estimating the reproducibility of psychological science.” Science, 351, Article 1037. https://doi.org/10.1126/science.aad7243

Ioannidis J. P. (2005). Why most published research findings are false. PLOS Medicine, 2(8), Article e124. https://doi.org/10.1371/journal.pmed.0020124

Klein R. A., Cook C. L., Ebersole C. R., Vitiello, Nosek B. A., Hilgard, Ahn P. H., Brady A. J., Chartier C. R., Christopherson C. D., Clary, Collisson, Crawford J. T., Cromar, Gardiner, Gosnell C. L., Grahe, Hall Ca., Howard, . . . Ratliff K. A. (2022). Many Labs 4: Failure to replicate mortality salience effect with and without original author involvement. Collabra, 8(1), Article 35271. https://doi.org/10.1525/collabra.35271

Klein R. A., Ratliff K. A., Vianello, Adams R. B. Jr., Bahník Š., Bernstein M. J., Bocian, Brandt M. J., Brooks, Brumbaugh C. C., Cemalcilar, Chandler, Cheong, Davis W. E., Devos, Eisner, Frankowska, Furrow, Galliani E. M., . . . Nosek B. A. (2014). Investigating variation in replicability: A “Many Labs” replication project. Social Psychology, 45(3), 142–152.

Klein R. A., Vianello, Hasselman, Adams B. G., Adams R. B. Jr., Alper, Aveyard, Axt J. R., Babalola M. T., Bahník Š., Batra, Berkics, Bernstein M. J., Berry D. R., Bialobrzeska, Binan E. D., Bocian, Brandt M. J., Busching, . . . Nosek B. A. (2018). Many Labs 2: Investigating variation in replicability across sample and setting. Advances in Methods and Practices in Psychological Science, 1(4), 443–490. https://doi.org/10.1177/2515245918810225

Krefeld-Schwalb, Sugerman E. R., Johnson E. J. (2024). Exposing omitted moderators: Explaining why effect sizes differ in the social sciences. Proceedings of the National Academy of Sciences, USA, 121(12), Article e2306281121. https://doi.org/10.1073/pnas.2306281121

Kuhn T. S. (1962). The structure of scientific revolutions. University of Chicago Press.

Lakatos (1970). Falsification and the methodology of scientific research programmes. In Lakatos, Musgrave (Eds.), Criticism and the growth of knowledge (pp. 91–195). Cambridge University Press.

Landy J. F., Jia, Ding I. L., Viganola, Tierney, Dreber, Johannesson, Pfeiffer, Ebersole C. R., Gronau Q. F., van den Bergh, Marsman, Derks, Wagenmakers E. J., Proctor, Bartels D. M., Bauman C. W., Brady W. J., . . . Uhlmann E. L. (2020). Crowdsourcing hypothesis tests: Making transparent how design choices shape research results. Psychological Bulletin, 146(5), 451–479. https://doi.org/10.1037/bul0000220

Lipton (2008). Inference to the best explanation. Routledge.

Lodder, Ong H. H., Grasman R. P. P. P., Wicherts (2019). A comprehensive meta-analysis of money priming. Journal of Experimental Psychology: General, 148(4), 688–712.

McGuire W. J. (1973). The yin and yang of progress in social psychology: Seven koan. Journal of Personality and Social Psychology, 26(3), 446–456.

McGuire W. J. (1983). A contextualist theory of knowledge: Its implications for innovations and reform in psychological research. In Berkowitz (Ed.), Advances in experimental social psychology (Vol. 16, pp. 1–47). Academic Press.

Nosek B. A., Hardwicke T. E., Moshontz, Allard, Corker K. S., Dreber, Fidler, Hilgard, Kline Struhl, Nuijten M. B. (2022). Replicability, robustness, and reproducibility in psychological science. Annual Review of Psychology, 73, 719–748.

Olsson-Collentine, Wicherts J. M., van Assen M. A. L. M. (2020). Heterogeneity in direct replications in psychology and its association with effect size. Psychological Bulletin, 146(10), 922–940.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), Article aac4716. https://doi.org/10.1126/science.aac4716

Schimmack (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551–566.

Schnall (2014). Social media and the crowd-sourcing of social psychology. https://www.psychol.cam.ac.uk/cece/blog

Schwarz, Strack (2014). Does merely going through the same moves make for a “direct” replication? Concepts, contexts, and operationalizations. Social Psychology, 45(4), 305–306.

Schweinsberg, Madan, Vianello, Sommer S. A., Jordan, Tierney, Awtrey, Zhu, Diermeier, Heinze, Srinivasan, Tannenbaum, Bivolaru, Dana, Davis-Stober C. P., Du Plessis, Gronau Q. F., Hafenbrack A. C., Liao E. Y., . . . Uhlmann E. L. (2016). The pipeline project: Pre-publication independent replications of a single laboratory’s research pipeline. Journal of Experimental Social Psychology, 66, 55–67.

Silberzahn, Uhlmann E. L., Martin, Anselmi, Aust, Awtrey, Bahník Š., Bai, Bannard, Bonnier, Carlsson, Cheung, Christensen, Clay, Craig, Dalla Rosa, Dam, Evans M. H., Flores Cervantes, . . . Nosek B. A. (2018). Many analysts, one dataset: Making transparent how variations in analytical choices affect results. Advances in Methods and Practices in Psychological Science, 1, 337–356.

Simmons J. P., Nelson L. D., Simonsohn (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366.

Simonsohn, Nelson L. D., Simmons J. P. (2014). P-curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666–681.

Stroebe, Strack (2014). The alleged crisis and the illusion of exact replication. Perspectives on Psychological Science, 9, 59–71.

Vaidis D. C., Sleegers W. W. A., van Leeuwen, DeMarree K. G., Sætrevik, Ross R. M., Schmidt, Protzko, Morvinski, Ghasemi, Roberts A. J., Stone, Bran, Gourdon-Kanhukamwe, Gunsoy, Moussaoui L. S., Smith A. R., Nugier, Fayant M.-P., . . . Priolo (2024). A multilab replication of the induced-compliance paradigm of cognitive dissonance. Advances in Methods and Practices in Psychological Science, 7(1). https://doi.org/10.1177/25152459231213375

van Aert R. C. M., Wicherts J. M., van Assen (2016). Conducting meta-analyses on p-values: Reservations and recommendations for applying p-uniform and p-curve. Perspectives on Psychological Science, 11(5), 713–729.

Van Bavel J. J., Mende-Siedlecki, Brady W. J., Reinero D. A. (2016). Contextual sensitivity in scientific reproducibility. Proceedings of the National Academy of Sciences, USA, 113, 6454–6459.

Van’t Veer, Giner-Sorolla (2016). Pre-registration in social psychology: A discussion and suggested template. Journal of Experimental Social Psychology, 67, 2–12.

Verschuere, Meijer E. H., Jim, Hoogesteyn, Orthey, McCarthy R. J., Skowronski J. J., Acar O. A., Aczel, Bakos B. E., Barbosa, Baskin, Bègue, Ben-Shakhar, Birt A. R., Blatz, Charman S. D., Claesen, Clay S. L., . . . Yıldız (2018). Registered Replication Report on Mazar, Amir, and Ariely (2008). Advances in Methods and Practices in Psychological Science, 1(3), 299–317.

Vohs K. D., Schmeichel B. J., Lohmann, Gronau Q. F., Finley A. J., Ainsworth S. E., Alquist J. L., Baker M. D., Brizi, Bunyi, Butschek G. J., Campbell, Capaldi, Cau, Chambers, Chatzisarantis N. L. D., Christensen W. J., Clay S. L., Curtis, . . . Albarracín (2021). A multisite preregistered paradigmatic test of the ego-depletion effect. Psychological Science, 32(10), 1566–1581.

Wagenmakers E.-J., Wetzels, Borsboom, van der Maas H. L. J., Kievit R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7(6), 632–638.