Abstract
Rife et al. conducted a multilab replication of Trafimow and Hughes (Experiment 3) to examine whether mortality salience produces higher death-thought accessibility immediately after imagining one's own death or after a time delay. Like Trafimow and Hughes, Rife et al. found that thinking about death without delay produced higher death-thought accessibility than thinking about death with delay or thinking about dental pain. This pattern occurred regardless of whether participants were randomly assigned the original Trafimow and Hughes word-generation death-thought-accessibility measure or a word-fragment death-thought-accessibility measure more commonly used in the literature. However, we argue that regardless of whether multilab replications produce results consistent with the original replicated study, they offer weak insight into the integrity and empirical plausibility of the original study. Instead, multilab replications provide valuable theoretical and methodological information regarding the span of effect created by the procedural elements shared among the set of replication studies. This information, in turn, permits clearer theoretical inference when similar procedural elements are used to investigate theoretically related phenomena. Moreover, multilab replications are often conceived as a defense against false-positive empirical results and theoretical interpretations, but Rife et al.'s results reveal they may be better suited for protecting against false-negative empirical results and theoretical interpretations.
Rife et al. (2025) contributed another set of multilab replication studies to the growing body of such work in psychological science. This time, the target of replication is a study originally conducted by Trafimow and Hughes (2012, Experiment 3) concerning the effect of mortality salience on death-thought accessibility depending on time delay, a process central to contemporary terror-management theory (TMT; Greenberg et al., 1994). In their original study, Trafimow and Hughes used an online procedure in which they experimentally manipulated mortality salience by randomly assigning participants to imagine the experience of physical death or the experience of dental pain. Participants were then randomly assigned to complete a word-generation task either immediately after the mortality-salience manipulation or after first reading and briefly responding to a distracting article about "concept hotels." Degree of death-thought accessibility was assessed by the number of death-related words generated on the task. As shown in Table 1, Trafimow and Hughes found that participants generated more death-related words when they imagined their death and responded immediately to the word-generation task than participants in the three other conditions, t(116) = 2.02, p = .045, d = 0.38. This finding was noteworthy when originally published because it suggested that death-thought accessibility immediately after mortality-salience induction is stronger than after a time delay. However, according to Trafimow and Hughes, this finding is contrary to the death-thought suppression and rebound hypothesis, a theoretical assumption of contemporary TMT.
Table 1
Descriptive Statistics for Death-Thought-Accessibility Measures Aggregated Across Laboratories From Rife et al. (2025)
Note: These results are based on responses from all participants who met methodological inclusion criteria established by Rife et al. (2025; N = 3,415). Standard deviations for means are in parentheses. Counts for percentages are in parentheses. T&H (2012) = Trafimow and Hughes (2012, Experiment 3); WG = word-generation task; WC = word-completion task; WG% = percentage of participants who generated one or more death words; WC% = percentage of participants who made one or more death-word completions.
In their series of multilab replications, Rife et al. (2025) relied on a similar online procedure using a similar mortality-salience manipulation and a similar article in their delay condition. However, participants also were randomly assigned to one of two death-thought-accessibility measures: either the word-generation measure used in the Trafimow and Hughes (2012) study or a word-fragment-completion measure more commonly used in the TMT literature. As shown in Table 1, Rife et al.'s results produced the expected one-versus-three pattern for mean responses on the word-generation measure, t(1,621) = 2.21, p = .028, d = 0.14, although the effect was weaker than in the original study. This effect is not statistically significant if a df correction for unequal variances is applied, t(440.63) = 1.93, p = .054. A similar analysis, not examined in Rife et al., also revealed the expected pattern with a stronger effect for mean responses on the word-completion measure, t(1,790) = 4.44, p < .001, d = 0.25. We suggest that a holistic interpretation of these results and the additional meta-analytic results examined in Rife et al. appears generally consistent with the results pattern of the original Trafimow and Hughes study. Moreover, this holistic interpretation appears inconsistent with the death-thought suppression and rebound hypothesis and instead is consistent with the Trafimow and Hughes claim that death-thought accessibility is stronger immediately after exposure to mortality-salience-inducing stimuli than after time delay.
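The df correction for unequal variances mentioned above (Welch's test with Welch-Satterthwaite degrees of freedom) can be computed directly from summary statistics. The sketch below uses illustrative group statistics (hypothetical numbers, not taken from Rife et al.) to contrast the pooled-variance t test with the Welch correction.

```python
import math

def pooled_t(m1, s1, n1, m2, s2, n2):
    """Student's two-sample t with pooled variance; df = n1 + n2 - 2."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t with Welch-Satterthwaite df for unequal variances."""
    v1, v2 = s1**2 / n1, s2**2 / n2  # squared standard errors per group
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# Hypothetical summary statistics: a small focal group with the larger
# variance versus a large comparison group (the one-versus-three setup).
t_p, df_p = pooled_t(0.55, 1.10, 330, 0.38, 0.80, 1300)
t_w, df_w = welch_t(0.55, 1.10, 330, 0.38, 0.80, 1300)
print(f"pooled: t = {t_p:.2f}, df = {df_p}")
print(f"Welch:  t = {t_w:.2f}, df = {df_w:.1f}")
```

When the smaller group has the larger variance, as in this sketch, Welch's df falls far below n1 + n2 - 2 and the t statistic shrinks, which is the qualitative pattern behind a result turning nonsignificant once the correction is applied.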
How Much Death-Thought Accessibility?
Despite general consistency in the pattern of results between the original Trafimow and Hughes (2012) study and the aggregated multilab studies, we did note an empirical discrepancy with the original study that signals potentially important theoretical implications of Rife et al.'s (2025) multilab studies. Specifically, the average number of word generations in the imagine-death/no-delay condition was low in the Trafimow and Hughes study (M = 0.94, SD = 1.21, 95% confidence interval [CI] = [0.51, 1.37]). However, mean word generations in the parallel condition of the Rife et al. studies were even lower (M = 0.38, SD = 0.85, 95% CI = [0.29, 0.47]). As shown in Table 1, only 23.10% (76/329) of participants in this condition generated one or more death-related words, suggesting a pronounced floor effect on the primary dependent measure. This floor effect suggests a very weak impact of the mortality-salience manipulation on the span of death-thought accessibility and suggests confident generalization of the multilab results only to instances of low death-thought accessibility. The floor-effect issue is lessened for the word-completion measure given that 81.24% (342/421) of participants produced at least one death-related word and that responses showed a more symmetric, unimodal distribution (M = 1.74, SD = 1.22).
Span of Effect Versus Strength of Effect
The impact of the mortality-salience manipulation on producing differences in only low responses on the death-thought-accessibility measures illustrates important considerations when making theoretical inferences and generalizations from the multilab-replication results. Even in the “highest” death-thought-accessibility-inducing condition, the degree of death-thought accessibility of most participants appears to range from low to nonexistent. This weakens generalization of theoretical interpretations to effects on other relevant variables that may require more moderate or high levels of death-thought accessibility. To the extent that other consequences proposed by TMT (e.g., worldview defense) require moderate or high levels of death-thought accessibility, more impactful manipulations of mortality salience than that used by Rife et al. may be required to investigate these consequences.
An analogous example is to consider the effects of a heat manipulation on the experience of warmth. Participants randomly assigned to a condition in which their hands are placed on a high-heat-generating plate may report higher ratings of warmth than participants whose hands are placed on a low-heat-generating plate. The empirical effect of the heat manipulation on ratings of warmth might be quite strong (a strong association between assignment to manipulation condition and ratings of warmth), yet the manipulation may be ineffective in producing differences in the experience of pain if the temperature of the heat-generating plate is not high enough. Studies using the heat manipulation to investigate the experience of pain would find no effect of the heat manipulation on measures of pain, erroneously suggesting exposure to heat is unrelated to the experience of pain because the heat manipulation fails to span heat temperatures that trigger a pain response.
Note that the generalization argument espoused here and illustrated by analogy to the heat manipulation is different from the effect-strength-generalization argument offered by Rife et al. (2025). Instead, we contend that the strength of effect on death-thought accessibility produced by their mortality-salience manipulation is less important to consider theoretically than is the span of death-thought accessibility produced by their mortality-salience manipulation. For instance, if worldview defense requires moderate to high levels of death-thought accessibility, then even an empirically strong effect of the mortality-salience manipulation on death-thought accessibility would be insufficient to spur differences in degree of worldview defense if the affected degree of death-thought accessibility in participants fails to range appreciably into moderate to high death-thought-accessibility "construct values." In contrast, a strong or weak empirical effect of a mortality-salience manipulation on death-thought accessibility would be sufficient to spur differences in degree of worldview defense so long as the impact of the manipulation brought participants' death-thought accessibility into the span of moderate to high construct values.
Not only is the issue of effect span important to consider for making theoretical generalizations from multilab replications, but it also is essential to consider when interpreting "failed" multilab replications. For example, inadequate span of effect was a major explanation for the apparent failure of Vaidis et al. (2024) to successfully replicate a classic induced-compliance dissonance study originally conducted by Croyle and Cooper (1983) in which participants in a low-perceived-choice condition evidenced attitude change after writing counterattitudinal statements. Specifically, failure of Vaidis et al.'s experimental manipulation to situate perceived choice into a low enough span when writing counterattitudinal statements may have resulted in insufficient levels of dissonance arousal to induce attitude change among participants (for span-of-effect critiques of Vaidis et al., 2024, see Harmon-Jones & Harmon-Jones, 2024; Lishner, 2024). Likewise, effect span, rather than effect strength (cf. Rife et al., 2025), may account for other "failed" replications, including a multisite replication that examined the effect of mortality-salience manipulations on worldview defense (e.g., Klein et al., 2022). Unfortunately, the absence of an assessment of death-thought accessibility in many of these studies renders the effect-span explanation speculative.
Replications and Study Effects Are Methodologically Contextualized
The distinction between empirical effect strength and span of effect, and its implications for theoretical generalization and inference, is an important reminder of a larger metamethodological issue inherent in all multilab replications: All studies in psychological science are methodologically contextualized. Responses in psychology studies and the study effects abstracted from them reflect a dynamic interaction among idiosyncratic participant, setting, and operational (measurement) procedural elements specific to the study in which those responses were acquired (see Lishner, 2015). If one or more procedural elements differ between studies, then it is possible that the dynamic interactions among their procedural elements, and the effects abstracted from each study, also differ. This is essentially an extension of ideas presented in Lewin (1951/2004) applied to understanding participant responses in psychology studies (for other methodological-contextualization perspectives, see also Broers, 2021; Kellen et al., 2021; Van Bavel et al., 2016).
Rife et al.'s (2025) multilab studies likely differ from each other in some procedural elements but likely differ from the original Trafimow and Hughes (2012) study even more. For instance, the sharp difference in effect size and responses between the Rife et al. studies and the original study, conducted over a decade earlier, may indicate that more contemporary participants were less engaged in online studies and thus less engaged in the imagination exercise used to manipulate mortality salience. Perhaps the occurrence of the COVID-19 pandemic during the course of the multilab studies diminished the impact of the manipulation. Rife et al.'s claim that one would expect mortality-salience manipulations to be more effective during the pandemic is debatable. Alternatively, one might expect mortality-salience manipulations to be less effective during the pandemic, when participants were likely exposed to more instances of human death and sickness, which forced more psychological management of death-thought accessibility.
Despite the challenges of variability in procedural elements between the multilab studies and the original Trafimow and Hughes (2012) study, so long as they are rigorously conducted, Rife et al. (2025) and other multilab replications provide a major benefit to the field. They offer clearer insight into generalization from the procedural elements that are common across the multilab studies than what is possible when trying to generalize across pairs of replication studies that are limited in number and may vary considerably more in participant, setting, and operational procedural elements. In multilab replication studies, operational elements (e.g., the mortality-salience manipulation, the word-generation measure) are typically invariant. Even participant and setting procedural elements that are not fixed across the multilab studies tend to be less variable and more controlled (more bounded) than what might exist between the multilab studies and the original replicated study (e.g., participant engagement in online studies, presence of a deadly pandemic). On one hand, it is important not to overgeneralize multilab results, for instance, by claiming or implying that the results approximate a "true effect" or that the results raise questions about the integrity or empirical plausibility of the original replicated study. On the other hand, multilab replications permit more confident generalizations and inferences of theoretical associations, or lack thereof, across the range of construct values sampled by the multilab replication studies than what is typically possible with individual study replications. In other words, the theoretical generalizations and inferences generated from rigorously conducted multilab replications are of greater value for building and evaluating theory than for critiquing the veracity of the original replicated study results.
When Many Failed Replications Provide Evidence of a Theoretical Effect
Rife et al.'s (2025) results reveal a fascinating paradox. For the word-generation measure, data from only seven of 21 labs produced a predicted pattern that was statistically significant (this number drops to only one lab if an adjustment for unequal variances is applied). For the word-completion measure, data from only one lab (out of 22) produced a predicted pattern that was statistically significant. Yet when aggregated across all labs, the predicted pattern appears more robust. As individual replication studies, evidence for the Trafimow and Hughes (2012) theoretical claim is scant, as is evidence of any impact of delay on death-thought accessibility. As a collection of replication studies, evidence for the theoretical claim is more compelling in that results based on aggregating data across the replication studies, and on the meta-analytic analyses, reveal higher death-thought accessibility following mortality salience without delay than with delay.
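This paradox has a straightforward statistical basis: labs that are individually underpowered for a small effect can be jointly well powered when their data are pooled. The simulation below is a minimal sketch under hypothetical parameters (a true standardized mean difference of 0.15, 21 labs, 160 participants per condition per lab, normal-approximation z tests); none of these values come from Rife et al.

```python
import math
import random

random.seed(2025)

TRUE_D = 0.15      # hypothetical true standardized mean difference
N_LABS = 21
N_PER_GROUP = 160  # hypothetical per-condition sample size per lab

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def two_sided_p(z):
    return 2.0 * (1.0 - norm_cdf(abs(z)))

# Simulate each lab's observed mean difference (in SD units). With two
# equal groups of size n, the standard error of the difference is sqrt(2/n).
se_lab = math.sqrt(2.0 / N_PER_GROUP)
diffs = [random.gauss(TRUE_D, se_lab) for _ in range(N_LABS)]

lab_ps = [two_sided_p(d / se_lab) for d in diffs]
n_sig_labs = sum(p < 0.05 for p in lab_ps)

# Aggregate: average the independent lab estimates; the standard error
# of the average shrinks by sqrt(N_LABS).
agg_diff = sum(diffs) / N_LABS
agg_se = se_lab / math.sqrt(N_LABS)
agg_p = two_sided_p(agg_diff / agg_se)

print(f"labs individually significant: {n_sig_labs} of {N_LABS}")
print(f"aggregated p = {agg_p:.5f}")
```

Under these assumptions, each lab has only about 25% power, so most labs fail to reach significance individually even though every lab samples the same true effect; the pooled estimate, with its much smaller standard error, detects the effect reliably.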
Rife et al.'s (2025) multilab results deliver an important message: Multilab replications can help researchers detect effects that would otherwise go undetected in individual replication studies. Instead of protecting against empirical and theoretical false positives established by previously published studies, as multilab replications are typically envisioned to do, they may be more beneficial for protecting against theoretical false negatives more broadly. As Fiedler et al. (2012) argued, false-negative empirical results always produce a theoretical false negative for a suitable theoretical claim. However, false-positive empirical results and any accompanying false-positive theoretical claims also always produce a theoretical false negative for a more suitable theoretical claim. The benefit of multisite replication studies is that they provide a large, robust sampling of empirical results for a specific methodological contextualization of fixed procedural elements, which reduces both empirical false-positive and empirical false-negative results. So long as theoretical interpretations account for methodological rigor and the span of effect encompassed by the procedural elements, multilab replication studies should help avoid false-negative theoretical claims. Indeed, one wonders whether the field should more strongly emphasize using multilab replications as an initial approach to investigating new empirical and theoretical claims, which may go undetected in individual studies, rather than as just a secondary, follow-up approach to checking the veracity of empirical and theoretical claims detected in individual studies.
Caveat: Ensure Easy Comparison With the Original Study and Easy Access to Data
As noted in Rife et al.'s (2025) acknowledgments, we detected an error in the R code that generated a systematic error in the results of an earlier accepted-for-publication version of their article. Based on our experience and on consideration of other published multilab-replication reports, we suggest that researchers planning to report multilab results always provide readers with descriptive statistics aggregated across all replication laboratories that permit direct comparison with the descriptive statistics of the original replicated study (presented in our Table 1). This is essential because replication fundamentally involves consideration of empirical consistency/inconsistency as a starting point before drawing theoretical inferences and generalizations. Moreover, descriptive statistics of this sort allow for consideration of effect span and effect strength. Even better are additional reports of those descriptive statistics for each individual laboratory (perhaps as online supplemental materials), which permit insight into replication variability across the bounded procedural elements encompassed in the multilab endeavor. Finally, in the event researchers plan to share their multilab replication data publicly, we encourage them to do so with a downloadable working data set that requires minimal programming code for extraction. Easy-to-access data make it easier to detect data errors.
What Multilab Replications Tell the Field and What They Do Not
Rife et al. (2025) provided a useful example of what multilab replications can and cannot tell the field. Relatively large differences in methodological contextualization between multilab replications and the original replicated study suggest multilab replications may be weaker than commonly assumed for making inferences regarding the integrity or empirical plausibility of the original replicated study. For this reason, multilab replications may prove less effective as defense against published “false-positive” results. Instead, multilab replications may permit stronger inference about the span of construct values across which theoretical associations are likely meaningful for the specific methodological contextualization shared by the multilab replications. This knowledge offers (a) insight into what to expect when relying on a similar methodological contextualization to investigate other related phenomena of theoretical interest, (b) a firmer basis for building and evaluating theory, and (c) improved defense against theoretical false negatives.
Transparency
Action Editor: David A. Sbarra
Editor: David A. Sbarra