Sage Journals: Discover world-class research

Abstract

Team-science projects have become the “gold standard” for assessing the replicability and variability of key findings in psychological science. However, we believe the typical meta-analytic approach in these projects fails to match the wealth of collected data. Instead, we advocate the use of Bayesian hierarchical modeling for team-science projects, potentially extended in a multiverse analysis. We illustrate this full-scale analysis by applying it to the recently published Many Labs 4 project. This project aimed to replicate the mortality-salience effect—that being reminded of one’s own death strengthens the own cultural identity. In a multiverse analysis, we assess the robustness of the results with varying data-inclusion criteria and prior settings. Bayesian model comparison results largely converge to a common conclusion: The data provide evidence against a mortality-salience effect across the majority of our analyses. We issue general recommendations to facilitate full-scale analyses in team-science projects.

Keywords

Bayes factor Bayesian hierarchical modeling replication team science

A salient recent reform in psychological science is the trend toward “team science.” In crowd-sourced collaborative projects, many different sites across the globe jointly collect data to answer questions about replicability and variability of effects (Chartier et al., 2018; Forscher et al., 2023; Uhlmann et al., 2019). These team-science efforts have become the “gold standard” for assessing the robustness of key findings in the psychological literature. Noteworthy examples of such large-scale endeavours are The Reproducibility Project: Psychology (Open Science Collaboration, 2015), Many Labs (Ebersole et al., 2016; Klein et al., 2014, 2018, 2022), ManyBabies (Frank et al., 2017; The ManyBabies Consortium, 2020), the Pipeline Project (Schweinsberg et al., 2016), and the Psychological Science Accelerator (Chen et al., 2018; Jones et al., 2021; Moshontz et al., 2018). These crowd-sourcing data-collection efforts allow researchers to obtain larger samples, and hence increase statistical power, and reach traditionally less studied populations (i.e., non-Western participants; Henrich et al., 2010).

Given the wealth of data that are obtained in these collaborative projects, we believe it is important to fully make use of the available information in the statistical analysis. Unfortunately, the analytic strategies that are often taken in team-science projects may not do justice to the collected data. Although some projects, such as ManyBabies, have conducted sophisticated hierarchical analyses, most of the Many Labs projects and other large-scale team-science projects have used standard meta-analytic approaches. In these standard analyses, the data are summarized per lab or site, and a frequentist meta-analysis is conducted in which either a fixed- or random-effects structure is applied. We refer to this type of analysis with compressed data as a “minimal analysis.” We believe a minimal analysis constitutes a missed opportunity because it both limits analytic possibilities and compromises the informativeness of the data. For instance, in a meta-analysis, one cannot investigate participant-level predictors, and the data are reduced to mean effect size and its standard error per lab, thereby losing information about the primary data. A huge advantage of large-scale team-science projects is that participant-level and sometimes even trial-level data within a person are available, so we believe one should use their full potential.

In the following, we argue for what we call a “full-scale analysis” instead of the minimal analysis in team-science efforts. Specifically, we demonstrate the usefulness of Bayesian hierarchical modeling (also known as multilevel modeling; see also Rouder et al., 2019). We first highlight general advantages of the Bayesian modeling approach and then illustrate our method by applying it to the recently published Many Labs 4 project (Klein et al., 2022).

Many Labs 4 is a large-scale attempt to replicate the mortality-salience effect from terror-management theory (TMT; Greenberg et al., 1994): Reminders of one’s own death strengthen one’s cultural identity. In the classical demonstration of this effect, participants from the United States who were prompted to imagine their own death expressed more pro-American (i.e., in line with their worldview) beliefs than participants who were prompted to imagine watching TV. In addition to the question of replicability, Klein et al. (2022) wanted to assess the impact of involving the original authors in the study design. Therefore, some studies followed a standard protocol that was agreed on by experts in the field (“author-advised”), whereas other studies were designed by the labs conducting them (“in-house”). After data collection from more than 2,000 participants in 21 labs with and without involvement of the original authors, the project could not replicate the original finding of Study 1 of Greenberg et al. (1994) and reported an overall meta-analytic effect size of $g = 0.07,$ 95% $credible interval (CI) = [- 0.03, 0.17]$ . The authors concluded that they found “little evidence that priming mortality salience increased worldview defense compared to a control condition” and that “the present evidence does . . . provide an important challenge for TMT to address” (Klein et al., 2022, p. 10).

Bayesian Hierarchical Modeling

So what should such a full-scale analysis look like for Many Labs 4 or other team-science projects? In the following, we describe four features that we believe a full-scale analysis for team-science projects should include.

First, we believe a Bayesian analysis is preferred over a frequentist analysis because the former allows one to obtain evidence for the null hypothesis and to quantify (posterior) uncertainty (Wagenmakers et al., 2018). Especially in replication studies, the chances of obtaining null results are considerable. We opt for a Bayesian analysis using Bayes-factor model comparison (Jeffreys, 1939; Kass & Raftery, 1995). In short, Bayes factors quantify the relative evidence for a model (e.g., the alternative) over another model (e.g., the null). For an introduction to Bayes-factor model comparison, we refer the reader to Wagenmakers et al. (2018) and Rouder et al. (2018).

The main advantage of Bayesian statistics in light of large-scale replication efforts is that it allows a distinction to be made between evidence for the absence of the effect of interest and the absence of evidence for or against the effect (Keysers et al., 2020). In other words, failure to successfully replicate a key effect could mean that the data are undiagnostic for determining whether the effect is present, or it could mean that the data provide substantial evidence against the presence of the effect. Obviously, this difference is highly consequential for interpreting the results of a study.

The second feature of a full-scale analysis relates to the hierarchical nature of data in team-science projects. That is, instead of a meta-analysis, we advocate the use of a hierarchical model including all primary data with participants nested within labs (Hoogeveen, Haaf, et al., 2022; Rouder et al., 2019). In a hierarchical model, the lowest-level data are nested within their higher-level groups, such as trials nested within participants or participants nested within labs or countries. This structure makes it possible to assess general or overall effects and individual or lab-specific deviations from those overall effects. For a tutorial on Bayesian hierarchical modeling, we refer the reader to Veenman et al. (2022). Additional demonstrations of the Bayesian hierarchical modeling approach for team-science efforts can be found in Hoogeveen, Haaf, et al. (2022), Gervais et al. (2017), Tierney et al. (2021), and Tierney et al. (2022). The hierarchical approach for team-science efforts brings several benefits. First, by capitalizing on the full resolution of the data, no information is lost in the interim aggregation process. For instance, in a meta-analysis, a relatively large standard error for a given lab or site might reflect either a heterogeneous sample or simply a small sample. In a hierarchical model, the source of the (im)precision of the estimate is retained and thus can be interpreted. Second and relatedly, hierarchical shrinkage reduces the influence of outlying labs with small samples, hence automatically weighing the contribution of the different labs toward the global estimate (Efron & Morris, 1977). Third, although study-level predictors may be included in a meta-analysis, the hierarchical model additionally allows for the inclusion of participant-level predictors and/or the assessment of interaction effects. Finally, in the hierarchical approach, one can easily evaluate whether effects meaningfully differ per site/lab (e.g., in terms of WEIRDness [Western, educated, industrialized, rich, and democratic] or cross-cultural robustness; e.g., Hoogeveen, Haaf, et al., 2022).

The third feature of a full-scale analysis concerns the inclusion of theoretical constraint in the statistical analysis (Haaf et al., 2018; Haaf & Rouder, 2017; Rouder et al., 2019). Psychological theories typically constrain behavioral data in the sense that theories dictate ordinal predictions; observed effects are described in the form of “manipulation X causes higher scores on Y or slower responses” or “higher scores on X are associated with lower scores on Y or faster responses.” Given the ordinal nature of the hypotheses, we believe statistical tests should reflect the theoretical predictions about the direction of effects. For instance, we expect participants who imagined their own death to identify more with American culture than participants who imagined watching TV rather than just a difference between conditions.

The hierarchical nature of the data in team-science projects allows for more informative testing of ordinal predictions beyond directional constraint at the aggregate level. Specifically, rather than testing whether on average, participants who imagined their own death identify more with American culture than participants who imagined watching TV, one can also test whether this pattern holds across every lab that is included in the analysis. This latter constitutes a much riskier prediction because the effect now needs to be present in every single lab (e.g., 21 times instead of once). This risky prediction is potentially rewarded in terms of evidence when the data reflect the predicted pattern, boosting the effect’s credibility. Rouder et al. (2019) referred to this “Does every study?” question as a test of qualitative differences because it provides information on whether the effect of interest is qualitatively equal across studies (i.e., in the same direction).

Bayesian modeling methods are particularly well suited to test ordinal constraints at different levels, such as “Is the overall mortality-salience effect positive?” or “Does every lab show a positive mortality-salience effect?” We would therefore advocate to include both versions of these theoretically motivated ordinal constraints in the statistical analysis for team-science projects (for application of these ordinal constraints in meta-analysis and individual cognitive performance, see also Rouder et al., 2019, and Haaf & Rouder, 2017, respectively).

Finally, the fourth feature of a full-scale analysis relates to assessing the robustness of the findings. That is, beyond using Bayesian hierarchical modeling, we believe team-science projects can at least sometimes benefit from conducting a multiverse analysis (Steegen et al., 2016). In a multiverse analysis, the researcher can evaluate different potential constellations of the data (e.g., exclusions, theoretically relevant subgroups), priors, and predictors without committing to one—perhaps arbitrarily—chosen analysis path. As is demonstrated by the Many Labs 4 example below, there are often multiple defensible analytic choices that can be considered. A complete assessment of the robustness of a given effect might thus require many labs and many analyses (Wagenmakers et al., 2022). The multiverse approach not only presents a broader and more complete picture of the results, but it also allows one to explore the consequences of analytic choices. For instance, does including only an ideal subgroup of participants indeed increase the evidence for the presence of the effect? Do exclusions based on manipulation checks affect the evidence? Does the particular operationalization of a construct make a difference? Furthermore, the Bayes-factor model comparison approach allows for a straightforward interpretation of the multiverse results. Although we do not integrate the evidence from the different multiverse paths directly, the nature of Bayes factors as ratios or odds makes them easily comparable across paths. For instance, given equal prior odds and Bayes factors in favor of the effect of interest ranging between, say, 50 to 1 and 200 to 1 across all paths, one can be fairly confident in the presence of the effect. In other words, with those posterior odds (and unit prior odds), one would probably be comfortable betting on the effect irrespective of the chosen analysis path. Moreover, Bayes factors automatically take into account the sample size and reflect the informativeness of the data.

Many Labs 4 Reanalysis

Given the outlined advantages of a full-scale analysis, in the following, we present a Bayesian multiverse reanalysis of the Many Labs 4 data using hierarchical models. Note that we also conducted a Bayesian model-averaged meta-analysis (Gronau et al., 2021), which is reported in the Appendix. The results of the model-averaged meta-analysis are qualitatively comparable with those of the hierarchical modeling reported below.

A brief history

In December 2019, the Many Labs 4 authors posted a preprint of the project on PsyArXiv (Klein et al., 2019). Soon after, a critique of the analysis emerged in which Chatard et al. (2020) pointed out that Klein and colleagues (2019) had not followed their own preregistered analysis. Chatard et al. argued that the preregistration specified a minimum of 40 participants per experimental cell as the threshold for sufficient power of any individual study and therefore determined a total of 80 participants as target sample size for each lab. When reanalyzing the data from the Many Labs 4 project including only studies with 40 participants per condition, Chatard et al. found a significant effect in line with the original results. Intrigued by these divergent reports, we then decided to conduct a Bayesian multiverse analysis. A preprint of this analysis was published in 2020 on PsyArXiv (Chatard et al., 2020). Then in 2022, Klein and colleagues published their final results in Collabra, after which we also revisited the data, resulting in the current article.

Include or exclude?

Which of the different proposed analyses—Klein et al. (2022) or Chatard et al. (2020)—is the correct one? Given the theoretical arguments and (interpretations of) the preregistered plan, there may be several valid answers to this question and several levels of exclusion criteria that ought to be considered to subset the full sample of 2,281 participants across 21 labs.

The full set of exclusion criteria employed by either Klein et al. (2022), Chatard et al. (2020), and ourselves consists of five layers of exclusion settings with each three or two specific choices, resulting in $3 \times 3 \times 2 \times 2 \times 2 = 72$ constellations of exclusion criteria. These five types of exclusions settings relate to participant-level criteria, N-based criteria, protocol criteria, timing-based criteria, and interpretation of participant-level criteria and are shortly discussed below. In the Appendix, we report the full set of these criteria and the rationale for choosing either of those. In short, the “participant-level criteria” refers to theory-based arguments about among whom the mortality-salience effect should occur: According to Greenberg et al. (1994), the effect may be present only among participants who self-identify as White and report to be born in the United States (participant-level Exclusion Criterion 2) or even only participants who are White, American-born, and strongly identify with American culture (a score of 7 or higher on a 9-point Likert scale; participant-level Exclusion Criterion 3). “N-based criteria” refers to the inclusion of labs based on the number of participants recruited per lab. “Protocol criteria” refers to the inclusion of both in-house labs and author-advised labs or only the latter, based on the suggestion that the effect may emerge only in author-advised studies because the mortality-salience effect is highly sensitive to nuances in the study implementation (Greenberg et al., 1994). “Timing-based criteria” addresses Klein et al.’s decision to discard all observations collected by some in-house labs before the preregistration date (February 15, 2017), resulting in the exclusion of 566 participants (25.4%). Although we considered this exclusion wasteful and unnecessary, we added it as another layer in the multiverse analysis for the sake of completeness. The final layer of exclusion settings refers to the way in which the participant-level criteria are applied. That is, Klein et al. and Chatard et al. applied the participant-level exclusion criteria only to the author-advised protocols, which means that for Exclusion Criteria 2 and 3, all participants from the in-house labs were retained. However, because Exclusion Criteria 2 and 3 were specified by Greenberg et al. (1994) as a strict and genuine test of the theory, we believe that it is important to thoroughly apply these criteria to all participants even if this means discarding participants for which this information is unavailable.

Note that some of the criteria are completely overlapping (e.g., only author-advised labs recorded American identity, hence all in-house labs are excluded for the third participant-level exclusion set). As a result, there are 45 instead of 72 unique constellations. Table 1 shows all 45 unique constellations, the resulting number of studies, and total number of included participants (see the Appendix for a table with all 72 constellations).

Table 1.

Exclusion Constellations and Resulting Sample Sizes

Participant-level	N-based	Protocol	Timing-based	Apply P-based	Sample Size	Labs
All	All	All	All	AA only	2,225	21
White & US-born	All	All	All	AA only	1,880	21
US-Identity > 7	All	All	All	AA only	1,699	21
All	N > 60	All	All	AA only	2,067	17
White & US-born	N > 60	All	All	AA only	1,746	17
US-Identity > 7	N > 60	All	All	AA only	1,593	17
All	N > 80	All	All	AA only	1,866	14
White & US-born	N > 80	All	All	AA only	1,545	14
US-Identity > 7	N > 80	All	All	AA only	1,392	14
All	All	AA	All	AA only	798	9
White & US-born	All	AA	All	AA only	453	9
All	N > 80	AA	All	AA only	699	7
White & US-born	N > 80	AA	All	AA only	378	7
US-Identity > 7	N > 80	AA	All	AA only	225	7
All	All	All	After prereg	AA only	1,659	20
White & US-born	All	All	After prereg	AA only	1,314	20
US-Identity > 7	All	All	After prereg	AA only	1,133	20
All	N > 60	All	After prereg	AA only	1,544	17
White & US-born	N > 60	All	After prereg	AA only	1,223	17
US-Identity > 7	N > 60	All	After prereg	AA only	1,070	17
All	N > 80	All	After prereg	AA only	1,343	14
White & US-born	N > 80	All	After prereg	AA only	1,022	14
US-Identity > 7	N > 80	All	After prereg	AA only	869	14
All	All	AA	After prereg	AA only	797	9
White & US-born	All	AA	After prereg	AA only	452	9
US-Identity > 7	All	AA	After prereg	AA only	271	9
All	N > 60	AA	After prereg	AA only	698	7
White & US-born	N > 60	AA	After prereg	AA only	377	7
US-Identity > 7	N > 60	AA	After prereg	AA only	224	7
All	All	All	All	AA and IH	2,211	21
White & US-born	All	All	All	AA and IH	983	16
US-Identity > 7	All	All	All	AA and IH	272	9
All	N > 60	All	All	AA and IH	2,053	17
White & US-born	N > 60	All	All	AA and IH	897	13
All	N > 80	All	All	AA and IH	1,852	14
White & US-born	N > 80	All	All	AA and IH	864	12
All	All	AA	All	AA and IH	799	9
All	N > 60	AA	All	AA and IH	700	7
All	All	All	After prereg	AA and IH	1,650	20
White & US-born	All	All	After prereg	AA and IH	777	15
All	N > 60	All	After prereg	AA and IH	1,535	17
White & US-born	N > 60	All	After prereg	AA and IH	702	13
All	N > 80	All	After prereg	AA and IH	1,334	14
White & US-born	N > 80	All	After prereg	AA and IH	669	12

Note: Orange rows refer to Klein et al.’s key analyses; green rows refer to Chatard et al.’s key analyses; purple rows refer to our currently chosen analyses; AA = author-advised; IH = in-house. “Apply P-based” indicates whether the participant-level exclusion criteria are applied to the author-advised labs only (retaining all in-house participants) or to both author-advised and in-house labs (missing data excluded).

In the following, we report a reanalysis for the three exclusion constellations of the key analyses from Klein et al. (2022, orange rows in Table 1), the three exclusion constellations from Chatard et al. (2020, green rows), and our own choice of exclusion criteria (purple rows). Subsequently, lacking compelling argumentation for or against any of the criteria, we decided to conduct an analysis using the entire set of 45 unique constellations as a multiverse analysis (Steegen et al., 2016).

Disclosures

Preregistration

Our analyses, including prior settings, were preregistered on OSF (osf.io/ae4wx; see also Appendix D). However, we decided to deviate from the preregistration by including more constellations of exclusion criteria. Specifically, we originally planned to use only participant-level Exclusion Criterion 1 and later decided to include all of them. Moreover, two additional exclusion layers became apparent only after the final version of the Many Labs 4 report was published, specifically, those related to the timing-based exclusion criteria and the application of the participant-level criteria to the author-advised only or author-advised and in-house labs. We believe that including these additional paths in the multiverse analysis helps to provide a more complete analysis. We also note that the preregistration includes both the hierarchical analysis and the model-averaged meta-analysis. The latter is reported in Appendix B.

Data and materials

Readers can access the data and the R code to conduct all analyses (including all figures) at github.com/SuzanneHoogeveen/ml4-reanalysis.

Reporting

This study involved an analysis of existing data rather than new data collection.

Ethical approval

No ethical approval was required for this work because we did not collect any data.

Method

For Bayesian hierarchical modeling, we take advantage of the open availability of all collected data from the Many Labs 4 project. The dependent variable is the same across all studies (i.e., identification with American culture, operationalized through relative preference for American vs. non-American authors), and participants are nested in studies, resulting in a hierarchical data structure. We employed a modeling approach similar to the one developed for the embodied-cognition reanalysis by Rouder et al. (2019). That is, we used Bayes-factor model comparison with hierarchical models reflecting different structures of the data, varying in the extent to which they constrain their predictions. We believe this approach satisfies the analytic desiderata for team-science projects outlined before, that is, appropriately accounting for the nested structure of the data without compromising on informativeness, directly testing both the presence of an overall mortality-salience effect and the presence of between-studies heterogeneity, and reflecting theoretical constraints on the direction of the effect.

Concretely, there are four models under consideration. The null model (Model 1) corresponds to the notion that none of the studies show an effect; this model assumes no overall experimental effect or heterogeneity between studies. The common-effect model (Model 2) corresponds to the notion that all studies show the same effect in the expected direction; this model assumes no heterogeneity between studies. The positive-effects model (Model 3) corresponds to the notion that all studies show an effect in the expected direction yet to varying degrees. The unconstrained model (Model 4) refers to the notion that the overall effect and study effects may vary freely (in direction and size). We compute Bayes factors for Models 2, 3, and 4 against Model 1, the null model. Evidence for Model 1 would indicate the absence of a mortality-salience effect across all labs; evidence for Model 2 would indicate that on average, people who contemplate their own death identify more strongly with their culture than people who contemplate watching TV to a similar degree across labs; evidence for Model 3 would indicate that in all of the labs, people who contemplate their own death identify more strongly with their culture than people who contemplate watching TV but to varying degrees across labs; and evidence for Model 4 would indicate that in some labs, people who contemplate their own death identify more strongly with their culture than people who contemplate watching TV, whereas in other labs, people who contemplate watching TV identify more strongly with their culture than people who contemplate their own death.

The Bayesian hierarchical modeling is conducted using the R package BayesFactor (Morey & Rouder, 2018). See Box 1 for a mathematical specification of the model.

Box 1

Hierarchical Model Specifications

The base model for the mortality-salience effect is a mixed linear model. Let $Y_{i j k}$ be the rating for the $i th$ lab, the $j th$ participant, and the $k th$ condition. Then

Y_{i j k} \sim N (α_{i} + x_{k} θ_{i}, σ^{2}),

where $α_{i}$ is the $i th$ lab’s specific overall culture identification rating effect and $θ_{i}$ is the $i th$ lab’s mortality-salience effect. The variable $x_{k} = - 0.5, 0.5$ if $k = 1, 2$ , respectively, with $k = 1$ when condition is “watching TV” and $k = 2$ when condition is “contemplate death.” Here, $θ_{i}$ is the parameter of interest that is varied across models to reflect the different constraints. Specifically, for the null model, we specify $θ_{i} = 0$ . For the common-effect model, $θ_{i} = ν$ , where ν represents the true value for the mortality-salience effect that is constrained to be positive $(ν > 0)$ . For the positive-effects model, $θ_{i}$ comes from a distribution with a mean mortality-salience effect $(μ_{θ})$ and between-studies variability in the size of this effect $(σ_{θ}^{2})$ : $θ_{i} \sim N_{+} (μ_{θ}, σ_{θ}^{2})$ , where the $N_{+}$ represents a normal distribution truncated at below zero to reflect the prediction that $θ_{i} > 0$ . Finally, for the unconstrained model, we let $θ_{i}$ free to vary in size and direction: $θ_{i} \sim N (μ_{θ}, σ_{θ}^{2})$ .

There are two critical prior settings to consider: the scale setting on the overall effect $(μ_{θ})$ and the scale setting on the between-labs heterogeneity $(σ_{θ}^{2})$ . These scales can be roughly interpreted as standardized effect size such as Cohen’s d. The scale on the overall effect corresponds to the expected size of the overall effect. As Rouder et al. (2019) did, we set this scale to 0.4 because we expect a small to medium effect size. The scale of the between-labs variance corresponds to the expected amount of variability in effect size across studies. Again, we kept the value of 0.24 as proposed by Rouder et al. (i.e., 60% of the expected overall effect). See the Appendix for a visualization of the prior distributions for $μ_{θ}$ and $σ_{θ}^{2}$ and the implied priors for two different labs under these settings.

Results

In the following, we first reanalyze the data from the key findings reported by Klein et al. (2022) using our proposed full-scale analysis and then those from Chatard et al. (2020). Finally, we report the analysis of the data based on our own choice of exclusion-criteria constellations.

Bayesian reanalysis of Klein et al.’s (2022) key findings

Figure 1a shows the observed, unstandardized effects and the estimates from the unconstrained hierarchical model for the first participant-level exclusion criterion. This is the main analysis that is the basis for the key claims of the Many Labs 4 project, as reported in the published article (Klein et al., 2022). Klein et al. (2022) included participants whose data were collected after the lead team posted its preregistration and only studies that featured more than 60 observations (before participant-level exclusions). The participant-level exclusion criteria were applied only to author-advised studies, whereas all participants from the in-house studies were retained.

Fig. 1.

Forest plot with Bayesian parameter estimates for the key analyses by Klein et al. (2022) for the three participant-level exclusion sets (applied to author-advised protocol participants only) with data collected after the lead team posted their preregistration and only from studies that featured more than 60 observations. (a) Participant-level Exclusion Set 1. The light orange squares represent unstandardized observed effects for each study with 95% confidence intervals. The dark orange points represent estimated unstandardized effects from the unconstrained model with 95% credible intervals. (b) Participant-level Exclusion Set 2. (c) Participant-level Exclusion Set 3. The estimates are sorted by the size of the observed effects for participant-level Exclusion Set 1 (Fig. 1a).

As is shown in Figure 1, there is considerable hierarchical shrinkage reducing the variability of estimated effects compared with observed effects. Effect-size estimates from the unconstrained model (similar to Cohen’s d) are 0.02, 95% $CI = [- 0.12, 0.16]$ , for participant-level Exclusion Criterion 1; 0.04, 95% $CI = [- 0.11, 0.19]$ , for Exclusion Criterion 2; and 0.05, 95% $CI = [- 0.22, 0.32]$ , for Exclusion Criterion 3. Note that posterior means are close to zero and that all credible intervals cover zero. The estimates are therefore consistent with the absence of an overall effect.

Bayes factors are shown in the first three rows of Table 2. “ ${BF}_{0 f}$ ” refers to the Bayes factor between the null model and the unconstrained model; “ ${BF}_{01}$ ” refers to the Bayes factor between the null model and the common-effect model, in which the overall effect is positive and there is no variability between study effects; and “ ${BF}_{0 +}$ ” refers to the Bayes factor between the null model and the positive-effects model, in which study effects may vary but all are consistently positive. All Bayes factors are in comparison with the preferred model, the null model, indicating evidence that none of the studies show an effect. The second best model is the common-effect model, in which all studies have the same positive effect and the Bayes factor between the null model and the common-effect model is between 5.35 to 1 to 4.21 to 1 in favor of the null model depending on the different participant-level exclusion criteria. If we allowed for variability across effects but maintained that the effect should be present across all studies, we would obtain strong evidence against this hypothesis, with Bayes factors ranging between 629 to 1 and 158 to 1 in favor of the null model over the positive-effects model. In sum, this pattern indicates evidence against an overall mortality-salience effect (null model), and even if there was an effect (common-effect model), there is no evidence for variability of study effects. These results are consistent across the three data sets, and they are in line with the estimation results shown in Figure 1.

Table 2.

Bayes Factors for Key Analyses

			Evidence
Participant-level	N	Labs	${BF}_{0 f}$	${BF}_{01}$	${BF}_{0 +}$	Effect [95% CI]
Klein et al. (2022)
All	1,544	17	12.44	5.35	628.62	0.02 [−0.12, 0.16]
White & US-born	1,223	17	9.35	4.21	204.55	0.04 [−0.11, 0.19]
US-Identity > 7	1,070	17	6.95	4.50	157.52	0.05 [−0.11, 0.21]
Chatard et al. (2020)
All	699	7	14.61	2.16	13.32	0.08 [−0.12, 0.28]
White & US-born	378	7	6.76	0.95	2.61	0.14 [−0.11, 0.39]
US-Identity > 7	225	7	4.47	0.79	1.42	0.18 [−0.12, 0.49]
Current choice
All	2,211	21	35.21	10.33	10,490.12	0.01 [−0.11, 0.12]
White & US-born	983	16	21.46	13.99	2,538.61	−0.04 [−0.20, 0.12]
US-Identity > 7	272	9	8.44	2.77	11.88	0.05 [−0.22, 0.32]

Note: All Bayes factors are reported in favor of the null model.

In summary, the null results are consistent across participant-level exclusion criteria. Even though the evidence against an effect is more pronounced when all participants are included in the analysis, this pattern is easily explained by the resolution of the analysis with increasing numbers of observations: The smaller the number of observations, the less evidence there is in any direction and the wider the estimated posterior distribution of the overall effect.

Bayesian reanalysis of Chatard et al.’s (2020) key findings

We also reanalyzed Chatard et al.’s (2020) findings with a hierarchical-modeling approach. Figure 2 shows study estimates from the unconstrained model for the unstandardized effects. All confidence intervals and credible intervals cover zero.

Fig. 2.

Forest plot with Bayesian parameter estimates for the key analyses by Chatard et al. (2020) for the three participant-level exclusion sets, only studies that featured more than 80 observations, and author-advised labs only. (a) Participant-level Exclusion Set 1. The light green squares represent unstandardized observed effects for each study with 95% confidence intervals. The dark green points represent estimated unstandardized effects from the unconstrained model with 95% credible intervals. (b) Participant-level Exclusion Set 2. (c) Participant-level Exclusion Set 3. The estimates are sorted by the size of the observed effects for participant-level Exclusion Set 1 (i.e., Fig. 2a).

Effect-size estimates from the unconstrained model (similar to Cohen’s d) are 0.08, 95% $CI = [- 0.12, 0.28]$ , for participant-level Exclusion Criterion 1; 0.14, 95% $CI = [- 0.11, 0.39]$ , for participant-level Exclusion Criterion 2; and 0.18, 95% $CI = [- 0.12, 0.49]$ , for participant-level Exclusion Criterion 3. Note that all credible intervals include zero, and even though the posterior mean increases with more conservative exclusion criteria, the width of the credible interval increases as well, implying increasing uncertainty about the effect size. The posterior distribution of the overall effect size is therefore again consistent with the absence of an overall effect.

The pattern of Bayes factors is somewhat less consistent across exclusions than the estimation results. Bayes factors are shown in the middle three rows of Table 2. The pattern of Bayes factors depends on the participant-level exclusion criterion. Under participant-level Exclusion Criterion 1, the preferred model is the null model, and it is weakly preferred over the second-best model, the common-effect model, by a Bayes factor of ${BF}_{01} = 2.16$ . For the other two exclusion criteria, the common-effect model is preferred over the null model, but the Bayes factors are even weaker (1.06 and 1.29 in favor of the common-effect model over the null model). In sum, the Bayes-factors results are in line with the absence of any (consistent) evidence for or against an effect. Across the three participant-level exclusion criteria, there is only weak and inconsistent evidence for or against an overall mortality-salience effect. Here, we advise readers not to overly interpret whether the Bayes factor is 1.5 to 1 for or against the overall effect—none of these Bayes factors are convincing. Instead, all of the analyses in this section point to the conclusion that more data are needed. The exclusion criteria applied here thinned out the data so much—in the final analytic data set, only 10% of the initial data is retained—that no firm conclusion is possible anymore.

Bayesian analysis of our current choice

We also included an analysis of the Many Labs 4 data using our own choice of the exclusion criteria. Following Klein et al. (2022) and Chatard et al. (2020), we looked at all three participant-level exclusion criteria while settling on one particular choice for the other factors that seemed most sensible to us. The goal for this choice was to include the maximum number of participants but still adhere to the recommendations by the original authors ( Greenberg et al., 1994) to give the effect the best chance. Specifically, we included all complete data from all labs and protocols and applied the participant-level exclusions to both author-advised labs and in-house labs, discarding missing values. For Exclusion Criterion 1—completeness of the measures—we did retain participants for labs in which no explicit information on missingness was available as long as they were assigned to an experimental condition and answered both items of the dependent variable.¹ Note that our choice of analysis paths leads to quite variable numbers of participants (between $N = 2, 211$ and $N = 272$ ).

Figure 3 shows the study estimates from the unconstrained model for the unstandardized effects. Again, all confidence and credible intervals include zero. Effect size estimates from the unconstrained model are $0.01$ , 95% $CI = [- 0.11, 0.12]$ , for participant-level Exclusion Criterion 1; $- 0.04$ , 95% $CI = [- 0.20, 0.12]$ , for participant-level Exclusion Criterion 2; and $0.05$ , 95% $CI = [- 0.22, 0.32]$ , for participant-level Exclusion Criterion 3. Again, all credible intervals overlap with zero, and the width of the credible interval increases with fewer observations included in the analysis because less data imply more uncertainty. Note that the absence of estimates for certain labs in Figures 3b and 3c is due to the participant-level exclusion criteria leaving no participants in these particular labs (rather than excluding any labs per se).

Fig. 3.

Forest plot with Bayesian parameter estimates for the key analyses of our choice for the three participant-level exclusion sets (applied to both author-advised and in-house protocol participants), including all participants and all labs. (a) Participant-level Exclusion Set 1. The light purple squares represent unstandardized observed effects for each study with 95% confidence intervals. The dark purple points represent estimated unstandardized effects from the unconstrained model with 95% credible intervals. (b) Participant-level Exclusion Set 2. (c) Participant-level Exclusion Set 3. The estimates are sorted by the size of the observed effects for participant-level Exclusion Set 1 (i.e., Fig. 3a).

The Bayes factors paint a similar picture; the evidence against the presence of the mortality-salience effect is stronger with a larger sample size. For the most inclusive sample, with Exclusion Criterion 1 $(N = 2, 211)$ , the null model outperforms the common-effect model by a factor of $10.33$ . For Exclusion Criterion 2, the estimated effect goes slightly in the direction opposite to the hypothesis, hence the Bayes factor more strongly favors the null model over the common-effect model, ${BF}_{01} = 13.99$ . Finally, for the strictest exclusion criterion, only $272$ observations are retained. As a result, we get a much weaker Bayes factor of $2.77$ in favor of the null model over the common-effect model. In sum, with our chosen set of exclusion criteria, we obtained strong to weak evidence against the mortality-salience effect.

Bayesian multiverse analysis across all exclusion criteria

To assess the robustness of the previously reported results, we conducted a multiverse analysis using the 45 unique data sets from Table 1. We used the same hierarchical model construction as reported above and report here the Bayes factors for the presence of an effect against its absence. The Bayes factors are plotted in Figure 4 (y-axis). Bayes factors in favor of the mortality-salience effect are above the horizontal line, and Bayes factors against the mortality salience effect are below the horizontal line. The BF_effect0 is the weighted average of the evidence for the common-effect versus the null model and the unconstrained model (varying effect) versus the null model. The x-axis refers to the evidence for between-studies heterogeneity in the data. The BF_{heterogeneity0} is calculated by taking the evidence for the unconstrained model versus the common-effect model (i.e., ${BF}_{f 1}$ ). The size of the points reflects the number of participants whose data are included in the analysis, and the color of the points highlights the key analyses described before.

Fig. 4.

Results from the Bayesian multiverse analysis. Bayes factors in favor of a mortality-salience effect are above the horizontal line; Bayes factors against the mortality-salience effect are below the horizontal line. All analyses provide evidence against between-studies heterogeneity as shown by all heterogeneity Bayes factors being smaller than 1 on the x-axis. The color of the points refers to the different key analyses sets, and the size of the points refers to the number of participants the analysis is based on. All but two of analyses provide evidence against the mortality-salience effect. BF_effect0 reflects the model-averaged evidence for common effect and the varying effect versus the null, and BF_{heterogeneity0} reflects the evidence for the varying effect versus the common effect (i.e., ${BF}_{f 1}$ ).

The majority of Bayes factors are in line with the absence of the mortality-salience effect. Because the Bayes factor depends on the sample size, more evidence against morality salience comes from analyses that are based on more data (i.e., larger number of included participants and studies). Only two constellations of exclusion criteria provide evidence for the mortality-salience effect. In addition, none of the analysis paths provide evidence for heterogeneity (all BF_{heterogeneity0} < 1).

In sum, the evidence against the mortality-salience effect appears relatively robust against choices of exclusion criteria. When conducting a large number of analyses on the same data, some of these analyses will almost inevitably lead to evidence in the opposite direction than the overall results. This is especially the case when the data provide relatively weak evidence (Bayes factors less than 5 to 1 against an effect). Bayes factors close to 1 signal a lack of resolution of the data and therefore the absence of evidence for or against an effect. When the number of participants is high and many studies are included, there is convincing evidence against the mortality-salience effect. The two Bayes factors that are weakly in favor of the mortality-salience effect are based on less than half of the original data and prefer the presence of the effect only by factors of 1.5 and 1.7.

Prior sensitivity

In addition to assessing the effects of various data-exclusion decisions, one might also investigate the role of prior choices on inference. Specifically, we looked at the dependence of the Bayes factors on the prior settings for the overall effect and for the between-studies variability. Although some researchers have argued that the influence of the prior on the results should be minimized (e.g., by using uninformative default settings; Aitkin, 1991; Gelman et al., 2004; Kruschke, 2013), we believe the influence of the prior is a meaningful and inherently informative element of Bayesian inference (Rouder et al., 2018; Vanpaemel, 2010). Nevertheless, the extent to which reasonable prior choices affect the results clearly speaks to the robustness of the conclusions.

For the main analysis, we used a scale of 0.4 on the overall effect $(μ_{θ})$ and 0.24 on the between-studies variability $(σ_{θ}^{2})$ , indicating an expected effect size of Cohen’s d around 0.4 and 60% of that effect size for the variability between labs (see top rows for each analysis set in Table 3). To examine the Bayes factors under different prior settings, we systematically both doubled and halved the scales on $μ_{θ}$ and $σ_{θ}^{2}$ , reflecting expectations of a small effect, a medium to large effect, very little between-studies variability, and medium between-studies variability.

Table 3.

Bayes Factors for Key Analyses (Participant-Level Exclusion Set 1) Under Different Prior Settings

Scale on $μ_{θ}$	Scale on $σ_{θ}^{2}$	${BF}_{01}$	${BF}_{0 f}$	${BF}_{0 +}$
Klein et al. (2022)
0.40	0.24	5.41	12.50	791.66
0.20	0.12	2.86	3.92	24.79
0.20	0.48	2.85	27.14	90,214.79
0.80	0.12	10.50	14.78	18.79
0.80	0.48	10.37	101.66	128,792.49
Chatard et al. (2020)
0.40	0.24	2.17	14.97	13.93
0.20	0.12	1.32	4.41	2.27
0.20	0.48	1.30	35.54	180.27
0.80	0.12	4.04	14.00	2.08
0.80	0.48	4.01	109.14	206.85
Current choice
0.40	0.24	10.28	35.75	12,127.13
0.20	0.12	5.28	8.70	166.42
0.20	0.48	5.27	119.11	∞
0.80	0.12	20.44	32.41	166.36
0.80	0.48	20.31	437.36	∞

Note: All Bayes factors are reported in favor of the null model.

Table 3 shows the Bayes factors resulting from crossing these combinations for the key analyses (for participant-level Exclusion Set 1), and Figure 5 shows the evidence across all 45 unique data-exclusion paths for each of the four different prior setting combinations. Most support for the effect is obtained under the expectation of a small effect, and most support for between-studies heterogeneity is obtained under the expectation of little between-studies variability. Nevertheless, across all settings, evidence somewhat in favor of a mortality-salience effect occurs only in 10 of 180 (5.6%) paths. Evidence in favor of heterogeneity across studies is obtained under three of 180 (1.7%) paths. Another observation from these plots is that although the prior setting for the overall effect changes the global strength of the evidence, it does not appear to affect the multiverse paths differentially, given that the dots seem to move upward or downward uniformly. In contrast, the prior setting for the between-studies variability not only affects the overall evidence for heterogeneity, but it also influences the range of the Bayes factors between multiverse paths. Specifically, the prior expectation of little variability reduces the evidence against heterogeneity and makes the Bayes factors more similar across paths, whereas the expectation of much variability not only leads to more evidence against heterogeneity but also enhances the differences between paths. In sum, choices of prior scales can slightly boost or reduce the evidence in favor of the effect. Yet in this case, the effects of reasonable prior choices are rather contained; the null model is still consistently preferred over models with a mortality-salience effect.

Fig. 5.

Results from the Bayesian multiverse analysis under different prior settings for the overall effect and the between-studies variance in the effect. The arrows show the overall trend relative to the main analysis with the primary prior settings.

Conclusion

We conducted a Bayesian reanalysis of the Many Labs 4 project with varying exclusion criteria and prior settings. In a Bayesian multiverse analysis using hierarchical models, we calculated a total of 45 sets of Bayes factors based on different combinations of five layers of data-exclusion criteria derived from the Many Labs 4 preregistration, the comment by Chatard et al. (2020), the published article by Klein et al. (2022), and our own judgments. Forty-three out of 45 Bayes factors provide evidence against an overall mortality-salience effect, ranging between 1.32 to 1 and 16.94 to 1 in favor of the absence of an effect. The remaining two Bayes factors provide only weak evidence for the presence of such an effect (i.e., Bayes factors of 1.45 to 1 and 1.68 to 1). In addition, we find some evidence against heterogeneity of effects across studies. Finally, the pattern of results remains qualitatively equal under different reasonable prior settings for the overall effect and the between-studies variability. In combination, we would argue we conducted a full-scale analysis of the data provided by the Many Labs 4 project, an inspection from various different angles. Even if we do not believe the evidence from this full-scale analysis and assume there is an effect, this effect is so small (between $d = - 0.04$ and $d = 0.18$ ) that it renders the entire field of mortality-salience studies as uninformative: Most of the studies conducted in the past would have been vastly underpowered, would require a very specific subgroup of participants, and would therefore also not be generalizable across a broader range of the population.

Our analyses revealed that the evidence is relatively consistent across different exclusion criteria. For the current analysis, we assumed that all exclusion criteria were equally plausible. With this assumption, we implicitly assigned an equal weight to all analyses. However, we admit that this may not be the case. Chatard et al. (2020) argued that their chosen criteria are superior when considering theoretical arguments and study planning. With their analysis, they implicitly introduced a weighing in which all other exclusion options received a weight of zero. Readers can choose these weights themselves when they consider how to interpret the results reported here.²

There are additional issues with selectively subsetting and reanalyzing data sets. A key danger is that for some subsets, one always finds results opposite of the conclusions from the analysis of the full data set. On the study level, researchers should therefore first ensure that there is evidence for variability of studies that warrants such subsetting (Harrer et al., 2021, Chapter 5). In the current analysis, we found evidence against study heterogeneity. When interpreting the results, we therefore recommend to rely mainly on the estimates from the full data set. In addition, subsetting the data inevitably reduces the resolution to detect an effect. The critics of the Many Labs 4 project (Chatard et al., 2020) based their main conclusions on analyses with smaller sample sizes. Ironically, although Chatard et al. (2020) argued that sample size should be considered when including studies, their exclusion criteria actually reduced the power of the meta-analysis overall. To tackle this issue—and if there was evidence for study heterogeneity—one could include some of the subsetting criteria as dummy-coded predictors in the hierarchical model instead of disregarding the data altogether (e.g., author-advised vs. in-house).

Furthermore, we believe the Many Labs 4 case and its development from preprint to published article highlights an important potential drawback of preregistration. In the final article, the Many Labs 4 lead team decided to discard all observations collected before the preregistration date, resulting in the removal of more than a quarter of the data. As mentioned, we consider this removal of data wasteful and unnecessary. In this case, we believe the fact that data collection was crowdsourced and the added value of retaining 556 perfectly valid observations justify “breaking the strict rules of preregistation” that data collection should start only after the analysis plan has been preregistered. As noted by DeHaven (2017), preregistration is “a plan, not a prison.” So rather than discarding a large portion of the data for the main analyses, we believe a transparent statement on the timing issue would have sufficed in this case. In general, preregistration by definition should not trump common sense and researchers’ judgment.

In summary, the multiverse analysis conducted here shows a certain convergence of results. Even though the degree of evidence varies, models with no effect of mortality salience are mostly preferred over models with an effect of mortality salience. This result highlights the robustness against choices of exclusion criteria. The Bayesian multiverse approach using hierarchical models provides rich results that go much beyond the original analyses by the Many Labs 4 team. In particular, we believe the current approach satisfies the desiderata of a full-scale analysis in team-science projects: (a) providing evidence on a continuous scale from evidence against the crucial effect through inconclusive evidence to evidence in favor of the crucial effect, (b) applying hierarchical modeling to appropriately account for the nested structure of the data, (c) evaluating both the evidence for the experimental effect and the evidence for between-labs heterogeneity, (d) reflecting theoretical constraints on the effect of interest (i.e., ordinal constraints), and (e) evaluating the robustness of the findings by exploring a multitude of relevant analysis paths.

Both Bayes-factor model comparison and Bayesian hierarchical modeling are gaining popularity in psychological science. Recent tutorial articles make these approaches more accessible; for instance, see Wagenmakers et al. (2018) and Rouder et al. (2018) for an introduction to Bayes-factor model comparison and Veenman et al. (2022) and Rouder and Province (2019) for tutorials on Bayesian hierarchical modeling. Finally, the ease and informativeness of Bayesian multiverse analyses show that this approach should be more generally used to analyze team-science projects. The current analyses were conducted in R, and the code is provided at github.com/SuzanneHoogeveen/ml4-reanalysis.

General Recommendations

In sum, we believe the amount of time and effort spent on team-science projects and the resulting wealth of data deserve a full-scale analysis. We believe a Bayesian hierarchical-modeling approach is ideally suited for such an analysis because it allows evidence to be quantified both for and against an effect of interest, and it facilitates the consideration of theoretical constraint in the data. In the following, we highlight four additional general recommendations for team science that facilitate a full-scale analysis.

Our first recommendation is to use all data that are available. Most directly, this means using a hierarchical model with all primary data nested in studies rather than a meta-analysis based on compressed and aggregated data. Furthermore, although participant-level exclusions may be explored (see Point b above), we would advise never to apply study-level exclusions based on sample size. In particular, more data always means more statistical power and more resolution. In addition, hierarchical shrinkage will automatically reduce the influence of outlying labs with relatively few observations by more strongly pulling these observations toward the global estimate.

Our second recommendation is to conduct a multiverse analysis (Steegen et al., 2016) to investigate the evidence across different reasonable exclusion criteria, model choices, or prior settings. As illustrated by the Many Labs 4 project, team-science efforts often involve a range of reasonable options for data-exclusion criteria, prior settings, and perhaps other analytic choices. To get a full picture of the robustness and potential relevant dimensions of the data affecting the outcomes, analysts could explore multiple analytic paths (see also Tierney et al., 2021, 2022). In some cases, it might make sense to apply different weights to different paths of the multiverse, for instance, based on theoretical or methodological grounds.

Our third recommendation is to preregister but remain open to justifiable deviations. Especially in highly complex projects with crowd-sourced data collection and many involved parties, unexpected events and deviations are the norm rather than the exception. At least in our personal experience, none of the team-science projects went exactly as planned, and many required reconsideration of preregistered choices (e.g., Hoogeveen, Haaf, et al., 2022; Hoogeveen, Sarafoglou, et al., 2023; Tierney et al., 2021, 2022). Although full transparency is clearly key in these situations, we believe the quality of the eventual analysis and hence the validity of the conclusions should outweigh the strict adherence to the preregistration. Another option to ensure uncontaminated data analysis would be to use “blinded analysis” (MacCoun & Perlmutter, 2015, 2018), in which analysts perform their analysis on an altered version of the data (e.g., shuffling the dependent variable, adding noise to the data, or switching labels of categorical variables). Only after the analysts are fully satisfied with the analysis, the blind is lifted and the real data are revealed (for more information on analysis blinding, see Dutilh et al., 2021; Sarafoglou et al., 2023).

Our fourth recommendation is to consider collaborating with methodologists on the statistical analysis. Typically, team-science efforts involve relatively extensive and complex data (e.g., hierarchically structured). We believe the time and effort put into data collection and study design also justify spending some additional time, effort, and resources on data-analysis expertise. For the sake of illustration, imagine that each participating lab in the Many Labs 4 project invested 15 min per participant; this comes down to 21 labs spending about 1,589 min on data collection for a total of 556 hr.³ Given this huge investment of time and effort, the overall project quality might benefit from also matching the investment into the analysis, potentially by outsourcing the analysis to methodological and statistical experts. At least in our personal experience, experts are often eager to help out (and get their hands on “real data” for a change). For example, we have been involved in the data analysis for a couple of team-science projects (e.g., Camerer et al., 2018; Tierney et al., 2021, 2022). Having an independent analysis team may also make it easier to justify deviations from the preregistration and to apply differential weights to paths in the multiverse analysis given that either of these decisions can be made independently from the analysts.

The idea of team-science efforts such as the Many Labs projects is that the robustness of empirical phenomena becomes clear when data are collected across several labs. Likewise, the robustness of statistical conclusions becomes clear when data are analyzed using several thoughtfully selected models in a full-scale analysis (Wagenmakers et al., 2022). A complete assessment of robustness and uncertainty therefore requires many labs, many models, perhaps many analysis paths, and ideally, many collaborating experts.

Footnotes

Appendix A

Appendix B

Appendix C

Appendix D

Appendix E

Table E1.

Full Table of Data-Exclusion Constellations

Participant level	N-based	Protocol	Timing-based	Apply P-based	Sample size	Labs
All	All	All	All	AA only	2,225	21
White & U.S.-born	All	All	All	AA only	1,880	21
U.S. identity > 7	All	All	All	AA only	1,699	21
All	N > 60	All	All	AA only	2,067	17
White & U.S.-born	N > 60	All	All	AA only	1,746	17
U.S. identity > 7	N > 60	All	All	AA only	1,593	17
All	N > 80	All	All	AA only	1,866	14
White & U.S.-born	N > 80	All	All	AA only	1,545	14
U.S. identity > 7	N > 80	All	All	AA only	1,392	14
All	All	AA	All	AA only	798	9
White & U.S.-born	All	AA	All	AA only	453	9
U.S. identity > 7	All	AA	All	AA only	272	9
All	N > 60	AA	All	AA only	699	7
White & U.S.-born	N > 60	AA	All	AA only	378	7
U.S. identity > 7	N > 60	AA	All	AA only	225	7
All	N > 80	AA	All	AA only	699	7
White & U.S.-born	N > 80	AA	All	AA only	378	7
U.S. identity > 7	N > 80	AA	All	AA only	225	7
All	All	All	After prereg	AA only	1,659	20
White & U.S.-born	All	All	After prereg	AA only	1,314	20
U.S. identity > 7	All	All	After prereg	AA only	1,133	20
All	N > 60	All	After prereg	AA only	1,544	17
White & U.S.-born	N > 60	All	After prereg	AA only	1,223	17
U.S. identity > 7	N > 60	All	After prereg	AA only	1,070	17
All	N > 80	All	After prereg	AA only	1,343	14
White & U.S.-born	N > 80	All	After prereg	AA only	1,022	14
U.S. identity > 7	N > 80	All	After prereg	AA only	869	14
All	All	AA	After prereg	AA only	797	9
White & U.S.-born	All	AA	After prereg	AA only	452	9
U.S. identity > 7	All	AA	After prereg	AA only	271	9
All	N > 60	AA	After prereg	AA only	698	7
White & U.S.-born	N > 60	AA	After prereg	AA only	377	7
U.S. identity > 7	N > 60	AA	After prereg	AA only	224	7
All	N > 80	AA	After prereg	AA only	698	7
White & U.S.-born	N > 80	AA	After prereg	AA only	377	7
U.S. identity > 7	N > 80	AA	After prereg	AA only	224	7
All	All	All	All	AA and IH	2,211	21
White & U.S.-born	All	All	All	AA and IH	983	16
U.S. identity > 7	All	All	All	AA and IH	272	9
All	N > 60	All	All	AA and IH	2,053	17
White & U.S.-born	N > 60	All	All	AA and IH	897	13
U.S. identity > 7	N > 60	All	All	AA and IH	225	7
All	N > 80	All	All	AA and IH	1,852	14
White & U.S.-born	N > 80	All	All	AA and IH	864	12
U.S. identity > 7	N > 80	All	All	AA and IH	225	7
All	All	AA	All	AA and IH	799	9
White & U.S.-born	All	AA	All	AA and IH	453	9
U.S. identity > 7	All	AA	All	AA and IH	272	9
All	N > 60	AA	All	AA and IH	700	7
White & U.S.-born	N > 60	AA	All	AA and IH	378	7
U.S. identity > 7	N > 60	AA	All	AA and IH	225	7
All	N > 80	AA	All	AA and IH	700	7
White & U.S.-born	N > 80	AA	All	AA and IH	378	7
U.S. identity > 7	N > 80	AA	All	AA and IH	225	7
All	All	All	After prereg	AA and IH	1,650	20
White & U.S.-born	All	All	After prereg	AA and IH	777	15
U.S. identity > 7	All	All	After prereg	AA and IH	271	9
All	N > 60	All	After prereg	AA and IH	1,535	17
White & U.S.-born	N > 60	All	After prereg	AA and IH	702	13
U.S. identity > 7	N > 60	All	After prereg	AA and IH	224	7
All	N > 80	All	After prereg	AA and IH	1,334	14
White & U.S.-born	N > 80	All	After prereg	AA and IH	669	12
U.S. identity > 7	N > 80	All	After prereg	AA and IH	224	7
All	All	AA	After prereg	AA and IH	798	9
White & U.S.-born	All	AA	After prereg	AA and IH	452	9
U.S. identity > 7	All	AA	After prereg	AA and IH	271	9
All	N > 60	AA	After prereg	AA and IH	699	7
White & U.S.-born	N > 60	AA	After prereg	AA and IH	377	7
U.S. identity > 7	N > 60	AA	After prereg	AA and IH	224	7
All	N > 80	AA	After prereg	AA and IH	699	7
White & U.S.-born	N > 80	AA	After prereg	AA and IH	377	7
U.S. identity > 7	N > 80	AA	After prereg	AA and IH	224	7

Note: Orange rows refer to Klein et al.’s (2022) key analyses; green rows refer to Chatard et al.’s (2020) key analyses; purple rows refer to our chosen analyses; gray rows are repeated data sets and not included in the multiverse analysis. “Apply p-based” indicates whether the participant-level exclusion criteria are applied to the author-advised labs only (retaining all in-house participants) or to both author-advised and in-house labs (missing data excluded). AA = author-advised; IH = in-house; prereg = preregistration.

Acknowledgements

For all analyses, we used R (Version 4.1.2; R Core Team, 2021) and the R-packages BayesFactor (Morey & Rouder, 2021), coda (Plummer et al., 2006), cowplot (Wilke, 2020), dplyr (Wickham et al., 2022), ggplot2 (Wickham, 2016), kableExtra (Zhu, 2021), knitr (Xie, 2015), ks (Duong, 2022), LaplacesDemon (Statisticat LLC, 2021), lemon (Edwards, 2020), MASS (Venables & Ripley, 2002), Matrix (Bates et al., 2022), MCMCpack (Martin et al., 2011), metaBMA (Heck et al., 2019), metafor (Viechtbauer, 2010), papaja (Aust & Barth, 2022), Rcpp (Eddelbuettel & Balamuta, 2018; Eddelbuettel & François, 2011), and tinylabels (Barth, 2021).

Transparency

Action Editor: Pamela Davis-Kean

Editor: David A. Sbarra

Author Contribution(s)

Suzanne Hoogeveen: Conceptualization; Formal analysis; Funding acquisition; Methodology; Visualization; Writing – original draft.

Sophie W. Berkhout: Formal analysis; Methodology; Visualization; Writing – review & editing.

Quentin F. Gronau: Funding acquisition; Methodology; Writing – review & editing.

Eric-Jan Wagenmakers: Conceptualization; Funding acquisition; Supervision; Writing – review & editing.

Julia M. Haaf: Conceptualization; Formal analysis; Funding acquisition; Methodology; Supervision; Visualization; Writing – original draft.

ORCID iDs

Suzanne Hoogeveen

Quentin F. Gronau

Eric-Jan Wagenmakers

Julia M. Haaf

Notes

References

Aitkin

(1991). Posterior Bayes factors. Journal of the Royal Statistical Society, Series B: Methodological, 53(1), 111–142.

Aust

Barth

(2022). papaja: Prepare reproducible APA journal articles with R Markdown (R package Version 0.1.1). https://github.com/crsh/papaja

Barth

(2021). tinylabels: Lightweight variable labels (R package Version 0.2.2). https://cran.r-project.org/package=tinylabels

Bates

Maechler

Jagan

(2022). Matrix: Sparse and dense matrix classes and methods (R package Version 1.4-1). https://CRAN.R-project.org/package=Matrix

Camerer

C. F.

Dreber

Holzmeister

Huber

Johannesson

Kirchler

Nave

Nosek

B. A.

Pfeiffer

Altmejd

Buttrick

Chan

Chen

Forsell

Gampa

Heikensten

Hummer

Imai

. . . Wu

(2018). Evaluating replicability of social science experiments in Nature and Science. Nature Human Behaviour, 2, 637–644. https://doi.org/10.1038/s41562-018-0399-z

Chartier

Kline

McCarthy

Nuijten

Dunleavy

D. J.

Ledgerwood

(2018, November 30). The cooperative revolution is making psychological science better. APS Observer, 31(10). https://www.psychologicalscience.org/observer/the-cooperative-revolution-is-making-psychological-science-better

Chatard

Hirschberger

Pyszczynski

(2020). A word of caution about Many Labs 4: If you fail to follow your preregistered plan, you may fail to find a real effect. PsyArXiv. https://doi.org/10.31234/osf.io/ejubn

Chen

S.-C.

Szabelska

Chartier

C. R.

Kekecs

Lynott

Bernabeu

Jones

B. C.

DeBruine

Levitan

Werner

K. M.

Wang

Milyavskaya

Musser

E. D.

Papadatou-Pastou

Coles

N. A.

Janssen

Ozdogru

Storage

Manley

. . . Schmidt

(2018). Investigating object orientation effects across 14 languages. PsyArXiv. https://doi.org/10.31234/osf.io/t2pjv

DeHaven

A. C.

(2017). Preregistration: A plan, not a prison. https://www.cos.io/blog/preregistration-plan-not-prison

10.

Duong

(2022). ks: Kernel smoothing (R package version 1.14.0). https://CRAN.R-project.org/package=ks

11.

Dutilh

Sarafoglou

Wagenmakers

E.-J.

(2021). Flexible yet fair: Blinding analyses in experimental psychology. Synthese, 198(Suppl. 23), 5745–5772. https://doi.org/10.1007/s11229-019-02456-7

12.

Ebersole

Atherton

Belanger

Skulborstad

Allen

Banks

Baranski

Bernstein

M. J.

Bonfiglio

D. B. V.

Boucher

Brown

E. R.

Budiman

N. I.

Cairo

A. H.

Capaldi

C. A.

Chartier

C. R.

Chung

J. M.

Cicero

D. C.

Coleman

J. A.

Conway

J. G.

. . . Nosek

(2016). Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology, 67, 68–82. https://doi.org/10.1016/j.jesp.2015.10.012

13.

Eddelbuettel

Balamuta

J. J.

(2018). Extending R with C++: A brief introduction to Rcpp. The American Statistician, 72(1), 28–36. https://doi.org/10.1080/00031305.2017.1375990

14.

Eddelbuettel

François

(2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8), 1–18. https://doi.org/10.18637/jss.v040.i08

15.

Edwards

S. M.

(2020). lemon: Freshing up your ‘ggplot2’ plots (R package version 0.4.5). https://CRAN.R-project.org/package=lemon

16.

Efron

Morris

(1977). Stein’s paradox in statistics. Scientific American, 236, 119–127. https://doi.org/10.1038/scientificamerican0577-119

17.

Forscher

P. S.

Wagenmakers

E.-J.

Coles

N. A.

Silan

M. A.

Dutra

N. B.

Basnight-Brown

IJzerman

(2023). The benefits, barriers, and risks of big-team science. Perspectives on Psychological Science, 18(3), 607–623. https://doi.org/10.1177/17456916221082970

18.

Frank

M. C.

Bergelson

Bergmann

Cristia

Floccia

Gervain

Hamlin

J. K.

Hannon

E. E.

Kline

Levelt

Lew-Williams

Nazzi

Panneton

Rabagliati

Soderstrom

Sullivan

Waxman

Yurovsky

(2017). A collaborative approach to infant research: Promoting reproducibility, best practices, and theory-building. Infancy, 22(4), 421–435. https://doi.org/10.1111/infa.12182

19.

Gelman

Carlin

J. B.

Stern

H. S.

Rubin

D. B.

(2004). Bayesian data analysis (2nd ed.). Chapman & Hall/CRC.

20.

Gervais

W. M.

Xygalatas

McKay

R. T.

van Elk

Buchtel

E. E.

Aveyard

Schiavone

S. R.

Dar-Nimod

Svedholm-Häkkinen

A. M.

Riekki

Klocová

E. K.

Ramsay

J. E.

Bulbulia

(2017). Global evidence of extreme intuitive moral prejudice against atheists. Nature Human Behaviour, 1(8), Article 0151. https://doi.org/10.1038/s41562-017-0151

21.

Greenberg

Pyszczynski

Solomon

Simon

Breus

(1994). Role of consciousness and accessibility of death-related thoughts in mortality salience effects. Journal of Personality and Social Psychology, 67(4), 627–637. https://doi.org/10.1037/0022-3514.67.4.627

22.

Gronau

Q. F.

Heck

D. W.

Berkhout

S. W.

Haaf

J. M.

Wagenmakers

E.-J.

(2021). A primer on Bayesian model-averaged meta-analysis. Advances in Methods and Practices in Psychological Science, 4(3). https://doi.org/10.1177/25152459211031256

23.

Gronau

Q. F.

Wagenmakers

E.-J.

(2020). Informed Bayesian t-tests. The American Statistician, 74(2), 137–143. https://doi.org/10.1080/00031305.2018.1562983

24.

Gronau

Q. F.

van Erp

Heck

D. W.

Cesario

Jonas

K. J.

Wagenmakers

E.-J.

(2017). A Bayesian model-averaged meta-analysis of the power pose effect with informed and default priors: The case of felt power. Comprehensive Results in Social Psychology, 2, 123–138. https://doi.org/10.1080/23743603.2017.1326760

25.

Haaf

J. M.

Klaassen

Rouder

J. N.

(2018). Capturing ordinal theoretical constraint in Psychological Science. PsyArXiv. https://doi.org/10.31234/osf.io/a4xu9

26.

Haaf

J. M.

Rouder

J. N.

(2017). Developing constraint in Bayesian mixed models. Psychological Methods, 22(4), 779–798. https://doi.org/10.1037/met0000156

27.

Harrer

Cuijpers

Furukawa

Ebert

(2021). Doing meta-analysis with R: A hands-on guide. Chapman and Hall/CRC. https://doi.org/10.1201/9781003107347

28.

Heck

D. W.

Gronau

Q. F.

(2017). metaBMA: Bayesian model averaging for random-and fixed-effects meta-analysis [R Package].

29.

Heck

D. W.

Gronau

Q. F.

Wagenmakers

E.-J.

(2019). metaBMA: Bayesian model averaging for random and fixed effects meta-analysis. https://CRAN.R-project.org/package=metaBMA

30.

Henrich

Heine

S. J.

Norenzayan

(2010). The weirdest people in the world? Behavioral and Brain Sciences, 33(2–3), 61–83. https://doi.org/10.1017/S0140525X0999152X

31.

Hinne

Gronau

Q. F.

van den Bergh

Wagenmakers

E.-J.

(2020). A conceptual introduction to Bayesian model averaging. Advances in Methods and Practices in Psychological Science, 3(2), 200–215. https://doi.org/10.1177/2515245919898657

32.

Hoogeveen

Haaf

J. M.

Bulbulia

J. A.

Ross

R. M.

McKay

Altay

Bendixen

Berniu-nas

Cheshin

Gentili

Georgescu

Gervais

W. M.

Hagel

Kavanagh

Levy

Neely

Qiu

Rabelo

Ramsay

J. E.

. . . van Elk

(2022). The Einstein effect provides global evidence for scientific source credibility effects and the influence of religiosity. Nature Human Behaviour, 6(4), 523–535. https://doi.org/10.1038/s41562-021-01273-8

33.

Hoogeveen

Sarafoglou

Aczel

Aditya

Alayan

Allen

Altay

Alzahawi

Amir

Anthony

F.-V.

Appiah

O. K.

Atkinson

Q. D.

Baimel

Balkaya-Ince

Balsamo

Banker

Bartoš

Becerra

Beffara

. . . Wagenmakers

E.-J.

(2022). A many-analysts approach to the relation between religiosity and well-being. Religion, Brain & Behavior. Advance online publication. https://doi.org/10.31234/osf.io/pbfye

34.

Jeffreys

(1939). Theory of probability. Oxford University Press.

35.

Jones

B. C.

DeBruine

L. M.

Flake

J. K.

Liuzza

M. T.

Antfolk

Arinze

N. C.

Ndukaihe

I. L. G.

Bloxsom

N. G.

Lewis

S. C.

Foroni

Willis

M. L.

Cubillas

C. P.

Vadillo

M. A.

Turiegano

Gilead

Simchon

Saribay

S. A.

Owsley

N. C.

Jang

. . . Coles

N. A.

(2021). To which world regions does the valence–dominance model of social perception apply? Nature Human Behaviour, 5(1), 159–169. https://doi.org/10.1038/s41562-020-01007-2

36.

Kass

R. E.

Raftery

A. E.

(1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795. https://doi.org/10.2307/2291091

37.

Keysers

Gazzola

Wagenmakers

E.-J.

(2020). Using Bayes factor hypothesis testing in neuroscience to establish evidence of absence. Nature Neuroscience, 23(7), 788–799. https://doi.org/10.1038/s41593-020-0660-4

38.

Klein

R. A.

Cook

C. L.

Ebersole

C. R.

Vitiello

C. A.

Nosek

B. A.

Hilgard

Ahn

P. A.

Brady

A. J.

Chartier

C. R.

Christopherson

C. D.

Clay

Collisson

Crawford

Cromar

Gardiner

Gosnell

Grahe

Hall

Howard

. . . Ratliff

K. A.

(2019). Many Labs 4: Failure to replicate mortality salience effect with and without original author involvement. PsyArXiv. https://doi.org/10.31234/osf.io/vef2c

39.

Klein

R. A.

Cook

C. L.

Ebersole

C. R.

Vitiello

C. A.

Nosek

B. A.

Hilgard

Ahn

P. H.

Brady

A. J.

Chartier

C. R.

Christopherson

C. D.

Clay

Collisson

Crawford

J. T.

Cromar

Gardiner

Gosnell

C. L.

Grahe

Hall

Howard

. . . Ratliff

K. A.

(2022). Many labs 4: Failure to replicate mortality salience effect with and without original author involvement. Collabra: Psychology, 8(1), Article 35271. https://doi.org/10.1525/collabra.35271

40.

Klein

R. A.

Ebersole

C. R.

Cook

C. L.

Nosek

B. A.

Ratliff

K. A.

(2019, February 15). ML4 registration draft. https://doi.org/10.17605/OSF.IO/4XX6W

41.

Klein

R. A.

Ratliff

K. A.

Vianello

Adams

R. B.

Jr. Bahnńik

Bernstein

Bocian

Brandt

M. J.

Brooks

Brumbaugh

C. C.

Cemalcilar

Chandler

Cheong

Davis

W. E.

Devos

Eisner

Frankowska

Furrow

Galliani

E. M.

. . . Nosek

(2014). Investigating variation in replicability: A “Many Labs” replication project. Social Psychology, 45, 142–152. https://doi.org/10.1027/1864-9335/a000178

42.

Klein

R. A.

Vianello

Hasselman

Adams

B. G.

Adams

R. B.

Alper

Aveyard

Axt

J. R.

Babalola

M. T.

Bahník

Š.

Batra

Berkics

Bernstein

M. J.

Berry

D. R.

Bialobrzeska

Binan

E. D.

Bocian

Brandt

M. J.

Busching

. . . Nosek

(2018). Many Labs 2: Investigating variation in replicability across sample and setting. Advances in Methods and Practices in Psychological Science, 1, 443–490. https://doi.org/10.1177/2515245918810225

43.

Kruschke

J. K.

(2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573–603. https://doi.org/10.1037/a0029146

44.

Landy

J. F.

Jia

M. L.

Ding

I. L.

Viganola

Tierney

Dreber

Johannesson

Pfeiffer

Ebersole

C. R.

Gronau

Q. F.

van den Bergh

Marsman

Derks

Wagenmakers

E. J.

Proctor

Bartels

D. M.

Bauman

C. W.

Brady

W. J.

. . . Uhlmann

E. L.

(2020). Crowdsourcing hypothesis tests: Making transparent how design choices shape research results. Psychological Bulletin, 146(5), 451–479. https://doi.org/10.1037/bul0000220

45.

MacCoun

Perlmutter

(2015). Hide results to seek the truth: More fields should, like particle physics, adopt blind analysis to thwart bias. Nature, 526, 187–189.

46.

MacCoun

Perlmutter

(2018). Blind analysis as a correction for confirmatory bias in physics and in psychology. In Lilienfeld

S. O.

Waldman

(Eds.), Psychological science under scrutiny: Recent challenges and proposed solutions (pp. 297–322). John Wiley and Sons.

47.

The ManyBabies Consortium. (2020). Quantifying sources of variability in infancy research using the infant-directed-speech preference: Advances in Methods and Practices in Psychological Science, 3(1), 24–52. https://doi.org/10.1177/2515245919900809

48.

Martin

A. D.

Quinn

K. M.

Park

J. H.

(2011). MCMCpack: Markov Chain Monte Carlo in R. Journal of Statistical Software, 42(9), Article 22. https://doi.org/10.18637/jss.v042.i09

49.

Morey

R. D.

Rouder

J. N.

(2018). BayesFactor: Computation of Bayes factors for common designs designs (R package version 0.9.12-4.3). https://CRAN.R-project.org/package=BayesFactor

50.

Morey

R. D.

Rouder

J. N.

(2021). BayesFactor: Computation of Bayes factors for common designs (R package version 0.9.12-4.3). https://CRAN.R-project.org/package=BayesFactor

51.

Moshontz

Campbell

Ebersole

C. R.

IJzerman

Urry

H. L.

Forscher

P. S.

Grahe

J. E.

McCarthy

R. J.

Musser

E. D.

Antfolk

Castille

C. M.

Evans

T. R.

Fiedler

Flake

J. K.

Forero

D. A.

Janssen

S. M. J.

Keene

J. R.

Protzko

Aczel

. . . Chartier

C. R.

(2018). The Psychological Science Accelerator: Advancing psychology through a distributed collaborative network. Advances in Methods and Practices in Psychological Science, 1(4), 501–515. https://doi.org/10.1177/2515245918797607

52.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349, Article aac4716. https://doi.org/10.1126/science.aac4716

53.

Plummer

Best

Cowles

Vines

(2006). CODA: Convergence diagnosis and output analysis for MCMC. R News, 6(1), 7–11. https://journal.r-project.org/archive/

54.

R Core Team. (2021). R: A language and environment for statistical computing. https://www.R-project.org/

55.

Rouder

J. N.

Haaf

J. M.

Aust

(2018). From theories to models to predictions: A Bayesian model comparison approach. Communication Monographs, 85(1), 41–56. https://doi.org/10.1080/03637751.2017.1394581

56.

Rouder

J. N.

Haaf

J. M.

Davis-Stober

C. P.

Hilgard

(2019). Beyond overall effects: A Bayesian approach to finding constraints in meta-analysis. Psychological Methods, 24, 606–621. https://doi.org/10.1037/met0000216

57.

Rouder

J. N.

Province

J. M.

(2019). Bayesian hierarchical models in psychological science: A tutorial. In Spieler

Schumacher

(Eds.), New methods in cognitive psychology (pp. 32–66). Routledge.

58.

Sarafoglou

Hoogeveen

Wagenmakers

E.-J.

(2023). Comparing analysis blinding with preregistration in the many-analysts religion project. Advances in Methods and Practices in Psychological Science, 6(1). https://doi.org/10.1177/25152459221128319

59.

Scheibehenne

Gronau

Q. F.

Jamil

Wagenmakers

E.-J.

(2017). Fixed or random? A resolution through model-averaging. Reply to Carlsson, Schimmack, Williams, and Burkner. Psychological Science, 28, 1698–1701.

60.

Schweinsberg

Madan

Vianello

Sommer

S. A.

Jordan

Tierney

Awtrey

Zhu

L. L.

Diermeier

Heinze

J. E.

Srinivasan

Tannenbaum

Bivolaru

Dana

Davis-Stober

C. P.

du Plessis

Gronau

Q. F.

Hafenbrack

A. C.

Liao

E. Y

Uhlmann

E. L.

(2016). The pipeline project: Pre-publication independent replications of a single laboratory’s research pipeline. Journal of Experimental Social Psychology, 66, 55–67. https://doi.org/10.1016/j.jesp.2015.10.001

61.

Statisticat LLC. (2021). Laplacesdemon: Complete environment for Bayesian inference (R package Version 16.1.6). Bayesian-Inference.com. https://web.archive.org/web/20150206004624/http://www.bayesian-inference.com/software

62.

Steegen

Tuerlinckx

Gelman

Vanpaemel

(2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5), 702–712. https://doi.org/10.1177/1745691616658637

63.

Tierney

Cyrus-Lai

Hoogeveen

Haaf

J. M.

Landy

J. F.

Hardy

J. H.

III Ebersole

C. R.

van den Bergh

Clemente

Pfeiffer

Dreber

Johannesson

Clark

C. J.

Mitchell

Adam

Barasch

Frasca

T. J.

Levine

Livingston

. . . Uhlmann

E. L.

(2022). Who respects an angry woman? A pre-registered re-examination of the relationships between gender, emotion expression, and status conferral [Unpublished manuscript].

64.

Tierney

Hardy

III Ebersole

Viganola

Clemente

Gordon

Hoogeveen

Haaf

Dreber

Johannesson

Pfeiffer

Huang

J. L.

Vaughn

L. A.

DeMarree

Igou

Chapman

Gantman

Vanaman

Wylie

. . . Uhlmann

E. L.

(2021). A creative destruction approach to replication: Implicit work and sex morality across cultures. Journal of Experimental Social Psychology, 93, Article 104060. https://doi.org/10.1016/j.jesp.2020.104060

65.

Uhlmann

E. L.

Ebersole

C. R.

Chartier

C. R.

Errington

T. M.

Kidwell

M. C.

Lai

C. K.

McCarthy

R. J.

Riegelman

Silberzahn

Nosek

B. A.

(2019). Scientific Utopia III: Crowdsourcing science. Perspectives on Psychological Science, 14(5), 711–733. https://doi.org/10.1177/1745691619850561

66.

van Erp

Verhagen

Grasman

R. P. P. P.

Wagenmakers

E.-J

. (2017). Estimates of between-study heterogeneity for 705 meta-analyses reported in Psychological Bulletin from 1990–2013. Journal of Open Psychology Data, 5(1), Article 4. https://doi.org/10.5334/jopd.33

67.

Vanpaemel

(2010). Prior sensitivity in theory testing: An apologia for the Bayes factor. Journal of Mathematical Psychology, 54(6), 491–498. https://doi.org/10.1016/j.jmp.2010.07.003

68.

Veenman

Stefan

Haaf

J. M.

(2022). Bayesian hierarchical modeling: An introduction and reassessment. PsyArXiv. https://doi.org/10.31234/osf.io/pskvx

69.

Venables

W. N.

Ripley

B. D.

(2002). Modern applied statistics with S (4th ed.). Springer. https://www.stats.ox.ac.uk/pub/MASS4/

70.

Viechtbauer

(2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48. https://doi.org/10.18637/jss.v036.i03

71.

Vohs

K. D.

Schmeichel

B. J.

Lohmann

Gronau

Q. F.

Finley

A. J.

Ainsworth

S. E.

Alquist

J. L.

Baker

M. D.

Brizi

Bunyi

Butschek

G. J.

Campbell

Capaldi

Cau

Chambers

Chatzisarantis

N. L. D.

Christensen

W. J.

Clay

S. L.

Curtis

. . . Albarracín

(2021). A multisite preregistered paradigmatic test of the ego-depletion effect. Psychological Science, 32(10), 1566–1581. https://doi.org/10.1177/0956797621989733

72.

Wagenmakers

E.-J.

Marsman

Jamil

Verhagen

Love

Selker

Gronau

Q. F.

Šmíra

Epskamp

Matzke

Rouder

J. N.

Morey

R. D.

(2018). Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications. Psychonomic Bulletin & Review, 25(1), 35–57. https://doi.org/10.3758/s13423-017-1343-3

73.

Wagenmakers

E.-J.

Sarafoglou

Aczel

(2022). One statistical analysis must not rule them all. Nature, 605(7910), 423–425. https://doi.org/10.1038/d41586-022-01332-8

74.

Wickham

(2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag. https://ggplot2.tidyverse.org

75.

Wickham

François

Henry

Müller

(2022). dplyr: A grammar of data manipulation (R package Version 1.0.10). https://CRAN.R-project.org/package=dplyr

76.

Wilke

C. O.

(2020). cowplot: Streamlined plot theme and plot annotations for ‘ggplot2’ (R package Version 1.1.1). https://CRAN.R-project.org/package=cowplot

77.

Xie

(2015). Dynamic documents with R and knitr (2nd ed.). Chapman and Hall/CRC.

78.

Zhu

(2021). kableextra: Construct complex table with ‘kable’ and pipe syntax (R package Version 1.3.4). https://CRAN.R-project.org/package=kableExtra

Improving Statistical Analysis in Team Science: The Case of a Bayesian Multiverse of Many Labs 4

Abstract

Keywords

Bayesian Hierarchical Modeling

Many Labs 4 Reanalysis

A brief history

Include or exclude?

Disclosures

Preregistration

Data and materials

Reporting

Ethical approval

Method

Box 1

Hierarchical Model Specifications

Results

Bayesian reanalysis of Klein et al.’s (2022) key findings

Bayesian reanalysis of Chatard et al.’s (2020) key findings

Bayesian analysis of our current choice

Bayesian multiverse analysis across all exclusion criteria

Prior sensitivity

Conclusion

General Recommendations

Footnotes

Appendix A

Appendix B

Appendix C

Appendix D

Appendix E

Acknowledgements

Transparency

ORCID iDs

Notes

References