Abstract
In psychology, preregistration is the most widely used method to ensure the confirmatory status of analyses. However, the method has disadvantages: Not only is it perceived as effortful and time-consuming, but reasonable deviations from the analysis plan demote the status of the study to exploratory. An alternative to preregistration is analysis blinding, in which researchers develop their analysis on an altered version of the data. In this experimental study, we compare the reported efficiency and convenience of the two methods in the context of the Many-Analysts Religion Project. In this project, 120 teams answered the same research questions on the same data set, either preregistering their analysis (n = 61 teams) or blinding their analysis (n = 59 teams). Teams in both conditions reported working approximately the same number of hours, and the evidence on perceived effort and frustration was inconclusive or moderately favored the null hypothesis. However, teams who used analysis blinding deviated less often, and on fewer aspects, from their analysis plan.
Keywords
The “crisis of confidence” in psychological science (Pashler & Wagenmakers, 2012) inspired a variety of methodological reforms that aim to increase the quality and credibility of confirmatory empirical research. Among these reforms, preregistration is arguably one of the most prominent. Preregistration protects the confirmatory status of the study by restricting the researchers’ degrees of freedom in conducting a study and analyzing the data (e.g., Chambers, 2017; Munafò et al., 2017; Wagenmakers et al., 2012). When preregistering studies, researchers specify in detail the study design, sampling plan, measures, and analysis plan before data collection. By specifying these aspects beforehand, researchers protect themselves against their (subconscious) tendencies to select favorable—that is, statistically significant—results.
Preregistration is effective in the sense that it restricts the researchers’ degrees of freedom. However, this implies that researchers must anticipate all possible peculiarities of the data and define analysis paths for each scenario, which can be perceived as effortful and time-consuming (Nosek & Lindsay, 2018; Sarafoglou, Kovacs, et al., 2022). Indeed, it is rare for researchers to adhere fully to their preregistration plan. Two recent studies compared preregistrations with published manuscripts and found that only a small minority did not contain any deviations from the preregistration: two out of 27 in Claesen et al. (2021) and seven out of 20 in Heirene et al. (2021). More serious still is the dilemma that preregistration does not distinguish between significance seeking and the selection of appropriate methods to analyze the data. Such reasonable deviations include, for instance, removing outliers, transforming skewed data, or accounting for measurement invariance. From our personal experience, such deviations are usually small and do not affect the main conclusions of the study. However, if an analysis is adjusted to properties of the data, then the analysis will be demoted from “confirmatory” to “exploratory,” even when the adjustments were entirely appropriate and independent of any significance test that was entertained. This makes preregistration a challenge for research that includes any sort of nontrivial statistical modeling (e.g., Dutilh et al., 2017).
An alternative to preregistration is analysis blinding (Dutilh et al., 2019; MacCoun, 2020; MacCoun & Perlmutter, 2015, 2018). Just like preregistration, analysis blinding safeguards the confirmatory status of the analysis. However, the analysts do not specify their analysis before data collection. Instead, the analysts develop their analysis plan using a blinded version of the data, that is, a data set in which a collaborator or an independent researcher has removed any potentially biasing information (e.g., potential treatment effects or differences across conditions).
An overview of different blinding techniques for common study designs in experimental psychology is provided in Dutilh et al. (2019). One can create a blinded version of the data, for instance, by equalizing the group means across experimental conditions in factorial designs, by adding random noise to all values of the key outcome measure, or by shuffling the key outcome measures in regression designs. The latter technique, which was used in the present project, involves reordering the dependent-variable columns in the data set while leaving all other columns untouched. The resulting blinded data are therefore complete: The column names are identical, and the data have the same structure as the real data. Note that in contrast to the analysis of simulated data or data from a previously conducted (pilot) study, analysis blinding concerns the actual data from a study.
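As an illustration, the following R sketch shows a minimal version of each of these three blinding techniques on a toy data set; the variable names (group, dv) and the data are ours and not taken from any of the cited projects.

```r
set.seed(1)
d <- data.frame(
  group = rep(c("a", "b"), each = 50),
  dv    = c(rnorm(50, mean = 0), rnorm(50, mean = 0.5))
)

# (1) Factorial designs: equalize the group means across conditions
d$dv_blind1 <- d$dv - ave(d$dv, d$group) + mean(d$dv)

# (2) Add random noise to all values of the key outcome measure
d$dv_blind2 <- d$dv + rnorm(nrow(d), sd = sd(d$dv))

# (3) Regression designs: shuffle the key outcome across rows
d$dv_blind3 <- sample(d$dv)
```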
Thus, the analysts can examine the demographic characteristics of the sample, visualize the distribution of the variables, identify outliers, handle missing cases, or explore the factor structure of relevant measures. The analysts are thus able to create a reproducible analysis script that includes all steps in the analysis pipeline: from preprocessing the data to executing the appropriate statistical analysis. Most importantly, the analysts develop their analytic strategy without being able to determine how their analytic choices affect the significance of the predictors: Because the blinding procedure has destroyed the relationship with the selected outcome variable, analyses performed on the blinded data are uninformative about the results on the real data. After the analysts are satisfied with their analysis plan, they receive access to the real data and execute their script without any changes. To make this process transparent, the analysts may choose to publish their analysis script to a public repository, such as the OSF (Center for Open Science, 2021), before accessing the data.
The benefit of analysis blinding is that it offers the flexibility to explore the data and fit statistical models to their idiosyncrasies yet prevents an analysis that is tailored to the outcomes. In addition, it could save researchers time and effort because the additional step of creating a preregistration document is omitted.
Analysis blinding can be used either as a stand-alone practice for data analysis or as a complement to preregistration. The latter was implemented, for example, in the study by Dutilh et al. (2017). The authors preregistered their analysis but anticipated deviations in the analysis plan because of the complexity of the statistical model and data structure. Analysis blinding allowed the authors to adjust the analysis plan to the specific peculiarities of the collected data while still maintaining its confirmatory status. In the current project, which evaluates the differences between the two experimental conditions, we also deployed both strategies. That is, we preregistered our analysis plan on the OSF before data collection but validated it on a blinded version of the data.
Current Study
In the current study, we assessed the potential benefits of analysis blinding over the preregistration of analysis plans in terms of efficiency and convenience. As part of the Many-Analysts Religion Project (MARP; Hoogeveen, Sarafoglou, Aczel, et al., 2022), we invited teams to answer two research questions on the relationship between religiosity and well-being. Specifically, the teams investigated (a) whether religious people self-report higher well-being and (b) whether the relation between religiosity and self-reported well-being depends on perceived cultural norms of religion. Relevant to this study is that we assigned the teams to one of two conditions: They either preregistered their analysis plan or used analysis blinding.
To complete the project, the teams had to go through two distinct stages. In Stage 1, the teams had to conceptualize, write, and submit their analysis plan. They did so either by submitting a completed preregistration template or by submitting an executable analysis script based on the blinded version of the data. In Stage 2, the teams were granted access to the real data set to execute their planned analysis. After the sign-up and after each stage of the project, the teams completed brief surveys on their experiences with planning and executing the analysis and on their change of beliefs on the two MARP research questions.
Research Question and Hypotheses
Our overarching research question was as follows: Does analysis blinding have benefits over preregistration in terms of workload and convenience? We predicted four benefits of analysis blinding, which led to the following hypotheses:
Hypothesis 1: The total hours worked in planning and executing the analysis are lower for teams in the analysis-blinding condition than for teams in the preregistration condition.
Hypothesis 2: The perceived effort of planning and executing the analysis is lower for teams in the analysis-blinding condition than for teams in the preregistration condition.
Hypothesis 3: The perceived frustration when planning and executing the analysis is lower for teams in the analysis-blinding condition than for teams in the preregistration condition.
Hypothesis 4: Teams in the preregistration condition deviate more often from their planned analysis than teams in the analysis-blinding condition, and when they deviate, they do so on more aspects.
Disclosures
Preregistration and analysis blinding
Before collecting data, we preregistered the intended analyses on the OSF. These analyses were then verified and adjusted—if necessary—using the blinded version of the data. S. Hoogeveen acted as data manager (i.e., blinded the data set), and A. Sarafoglou verified and adjusted the data analysis. The final analysis pipeline was uploaded to the OSF project page before the analysis on the real data was carried out. Any deviations from the preregistration are mentioned in this article.
Data and materials
Table 1 shows an overview of important resources of the study. Readers can access the preregistration, the materials for the study, the blinded and real data (including relevant documentation), and the R code to conduct all analyses (including all figures) in our OSF folder at https://osf.io/vy8z7/.
Table 1. Overview of This Study’s Materials Available on OSF
Reporting
We report how we determined our sample size, all data exclusions, and all manipulations in the study. However, because this project was part of the MARP, we will not describe all measures in this study. Here, we describe only measures relevant to the research question. The description of the remaining measures can be found in Hoogeveen, Sarafoglou, Aczel, et al. (2022).
Ethical approval
The study was approved by the local ethics board of the University of Amsterdam (Registration No. 2019-PML-12707). All participants were treated in accordance with the Declaration of Helsinki.
Method
Participants and recruitment
The analysis teams were recruited through advertisements in various newsletters and email lists (e.g., the International Association for the Psychology of Religion, Cognitive Science of Religion, Society for Personality and Social Psychology, and the Society for the Psychology of Religion and Spirituality [Division 36 of the American Psychological Association]), on social media platforms (i.e., blogposts and Twitter), and through the authors’ personal networks. We invited researchers from all career stages (i.e., from doctoral student to full professor). Teams could include graduate and undergraduate students as long as each team also included a PhD candidate or a more senior researcher. Initially, 173 teams signed up to participate in the MARP. Of those teams, 127 submitted an analysis plan, and 120 completed the whole project. Of the final sample of 120 teams, 61 were assigned to the preregistration condition, and 59 were assigned to the analysis-blinding condition. As compensation, the members of each analysis team were included as coauthors on the MARP article. No teams were excluded from the study.
Sampling plan
The preregistered sample size target was set to a minimum of 20 participating teams, based on the number of teams recruited in the many-analysts project by Silberzahn and Uhlmann (2015). We did not set a maximum number of participating teams. The recruitment of teams ended on December 22, 2020.
Study design
The study used a between-subjects design (at the team level). Our dependent variables were (a) total hours worked, (b) perceived effort, (c) perceived frustration, and (d) deviation from the analysis plan. Our independent variable was the assigned analytic strategy, which had two levels (preregistration, analysis blinding).
Randomization
The assignment of teams to conditions was done with block randomization. After sign-up, each analysis team was randomly assigned to one of the two conditions in blocks of four so that the groups were of approximately equal size at all times. In four cases, members from different teams requested to collaborate. When those teams were assigned to different conditions and had not yet submitted an analysis plan, they were instructed not to fill out the preregistration template but to follow the instructions of the analysis-blinding condition instead. We assigned these merged teams to the analysis-blinding condition because the blinded data were already available to them.
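The following R sketch illustrates such a block-randomization scheme under our reading of the procedure; the function and variable names are ours, not from the project code.

```r
# Block randomization: within every block of four sign-ups,
# two teams are assigned to each condition.
assign_conditions <- function(n_teams, block_size = 4) {
  conditions <- c("preregistration", "analysis blinding")
  n_blocks <- ceiling(n_teams / block_size)
  # independently shuffle a balanced block (two of each condition) per block
  assignment <- unlist(lapply(seq_len(n_blocks), function(b) {
    sample(rep(conditions, block_size / 2))
  }))
  assignment[seq_len(n_teams)]  # truncate to the actual number of sign-ups
}

set.seed(2020)
table(assign_conditions(120))
```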
Materials
In Stage 1, teams received the research questions, a project description and a brief summary of the theoretical background on the relationship between religiosity and well-being, the original materials, the documentation for the MARP data, and instructions specific to their assigned condition. In Stage 2, teams were granted access to the MARP data. After sign-up and after completing Stages 1 and 2, the teams were instructed to fill out surveys, hereafter referred to as the presurvey, midsurvey, and postsurvey. The presurvey included questions about the background of the teams. The midsurvey and the postsurvey included questions about the hours worked and about the teams’ perceived level of frustration and effort during the process. The postsurvey also inquired whether and how the teams deviated from their submitted analysis plan. Only one survey per analysis team was required; the teams were instructed either to sum the responses of all team members (when indicating their hours worked) or to give joint answers reflecting the consensus within the team. The presurvey, midsurvey, and postsurvey were generated using Google Forms.
Project description and theoretical background
Teams received a five-page document with an overview of the MARP, the research questions, two paragraphs on the theoretical background on the relationship between religiosity and well-being, and a description of the measures and some features of the MARP data (e.g., number of participants, number of countries).
Original materials
The teams received the cross-cultural survey used to collect the MARP data. This survey was provided in English and contained all items and answer options.
MARP data and data documentation
The MARP data contained information from 10,535 participants from 24 countries, collected in 2019 as part of the cross-cultural religious replication project (see also Hoogeveen et al., 2021; Hoogeveen & van Elk, 2018). The data included measures of religiosity, well-being, perceived cultural norms of religion, and some demographics.
To achieve analysis blinding, we shuffled the key outcome variable, that is, the well-being scores. In the blinded data, we ensured that the scores at the country level remained intact to facilitate hierarchical modeling and outlier detection. That is, we shuffled well-being scores within countries so that the average well-being score for each country was the same in the real and blinded data. In addition, we ensured that the well-being scores within each individual remained intact; that is, the well-being scores associated with one individual were shuffled together.
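A minimal R sketch of this shuffling procedure is given below. It assumes a data frame marp with a country column and a block of well-being columns; the wb_ prefix is a hypothetical naming convention, not the actual column names in the MARP data.

```r
set.seed(2020)

# columns holding the key outcome (hypothetical 'wb_' prefix)
wb_cols <- grep("^wb_", names(marp), value = TRUE)

marp_blinded <- marp
for (cntry in unique(marp$country)) {
  rows <- which(marp$country == cntry)
  shuffled <- rows[sample(length(rows))]  # safe even for a single row
  # permute whole rows of the well-being block within this country:
  # country means stay intact, within-person scores stay together,
  # and the link to all other variables is destroyed
  marp_blinded[rows, wb_cols] <- marp[shuffled, wb_cols]
}
```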
The data documentation featured a detailed description of each of the 46 columns in the data. It disclosed the scaling of the items and the number of missing values, if any, in each variable.
Independent variable: assigned analytic strategy
Teams were randomly assigned to the preregistration condition or to the analysis-blinding condition. These conditions differed with respect to the instructions and materials the teams received in Stage 1. Teams in the preregistration condition received a document that briefly explained preregistration and a preregistration template (see Appendix). The template was a shortened version of the “OSF Preregistration” template from the Center for Open Science. It included only the aspects of preregistration related to the analysis plan, that is, (a) the operationalization of the variables, (b) the analytic approach, (c) outlier removal and the handling of missing cases, and (d) the inference criteria.
Teams in the analysis-blinding condition received a blinded version of the MARP data and a document that briefly explained what analysis blinding is, why analysis blinding can be beneficial, what analysts need to take into account when working with blinded data (e.g., analyses on blinded data may yield different results than when performed on the real data), and which blinding strategy was applied to the MARP data. Specifically, participants received the following information about the blinding strategy:
In this blinded dataset, we made sure that
• The relationship between well-being and all other independent variables is destroyed.
• Data on the country level are intact. This means that, for instance, the mean religiosity we measured in Germany is identical in the blinded version of the data as well as in the real data.
• All well-being scores are intact within a person.
• All religiosity scores are intact within a person.
Dependent variables: hours worked, experienced effort, experienced frustration, and deviations from the planned analysis
In the midsurvey and in the postsurvey, we asked participants to report the hours worked and the effort and frustration experienced in accomplishing the tasks of Stage 1 (i.e., writing and submitting the analysis plan) and Stage 2 (i.e., executing the analysis), respectively.
One item asked teams to indicate how many hours it took them to accomplish the tasks at the respective stage of the project. The hours of work required to complete a stage thus go beyond simply writing the preregistration or developing the analysis script; they also encompass potential research that went into finding the appropriate analysis strategies and discussions among team members. The teams responded with numerical values and were instructed to add up the work hours of all team members.
One item asked participants to indicate how hard the team had to work to accomplish the task during the respective stage. This item was answered on a 7-point Likert-type scale.
In the postsurvey, we asked teams whether they deviated from their analysis plan after they received the real data. For researchers in the preregistration condition, deviations from the analysis plan concerned deviations from the analysis described in the preregistration document. For researchers in the analysis-blinding condition, deviations from the analysis plan concerned adjustments of the analysis script they had developed for the blinded data set. If researchers answered “yes” to that question, they indicated on which of eight catalogued aspects they deviated. These aspects were (a) hypothesis, (b) included variables, (c) operationalization of dependent variables, (d) operationalization of independent variables, (e) exclusion criteria, (f) statistical test, (g) statistical model, and (h) direction of the effect.
The items concerning the deviations from the analysis plan were based on a subset of the catalogue presented in Claesen et al. (2021). In addition, the teams could describe in a text field which peculiarities caused them to deviate from their analysis plan.
Reflection on hours worked
As an additional exploratory variable, we measured whether the reported work hours exceeded the time the team had anticipated. This item was answered on a 5-point Likert-type scale.
Respondents’ research background
In the presurvey, five items asked respondents about their research background. The first item asked how many people the analysis team consisted of. In the final data set, this number was updated for teams that requested to collaborate; in these cases, the numbers of team members were summed. The second item asked the teams to describe the subfield or subfields of research represented in the team. The third item asked which positions were represented in the team. The answer options were (a) doctoral student, (b) postdoc, (c) assistant professor, (d) associate professor, and (e) full professor. The fourth item asked the teams to rate their theoretical knowledge on the topic of religion and well-being. The fifth item asked the teams to rate their knowledge of methodology and statistics. The fourth and fifth items were answered on a 5-point Likert-type scale.
Respondents’ prior beliefs
In the presurvey, one item asked respondents about their subjective beliefs about the plausibility of the research questions before analyzing the data. This item was answered on a 7-point Likert-type scale.
Procedure
We started advertising the MARP on September 11, 2020. After teams had signed up to the project, we asked them to complete the presurvey. The teams then received their analysis-team number, access to their OSF project folder, and all materials and instructions needed to complete Stage 1 of the project. To complete Stage 1, the teams had to upload their analysis plans to their OSF project page and complete the midsurvey. That is, researchers in the preregistration condition uploaded the filled-out preregistration template, and researchers in the analysis-blinding condition uploaded their analysis script. We then “checked out” the submitted analysis plans (i.e., created a file in their OSF project folder that cannot be edited or deleted). The deadline to complete Stage 1 was December 22, 2020. In Stage 2, the teams were then granted access to the real data. To finalize Stage 2 of the project, the teams had to complete the postsurvey. We also encouraged the teams to upload all relevant files, together with a brief “ReadMe” document and a summary of their results, to their project folder. We discouraged the open communication of analysis strategies or results (e.g., through Twitter) until after the official deadline of Stage 2 of the project, which was February 28, 2021.
Statistical model
We used Bayesian inference for all statistical analyses. As we noted in our preregistration, we aimed to collect at least strong evidence (i.e., a Bayes factor [BF] of at least 10) in favor of our hypotheses. Each hypothesis was tested against the null hypothesis that the respective outcomes are the same under both conditions. To test Hypotheses 1 and 2, we conducted one-sided Bayesian independent-samples t tests with hours worked and perceived effort, respectively, as dependent variables and analysis method as independent variable; to test Hypothesis 3, we conducted a one-sided Bayesian Mann-Whitney test with perceived frustration as dependent variable.
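The article reports that the analyses were conducted in JASP and R but does not name the R functions used for these tests. The sketch below shows one way to obtain a one-sided Bayesian independent-samples t test, using the BayesFactor package and toy data.

```r
library(BayesFactor)

# total hours worked per team in each condition (toy numbers)
hours_prereg <- c(20, 35, 18, 42, 25, 30)
hours_blind  <- c(22, 30, 28, 40, 21, 33)

# nullInterval = c(0, Inf) restricts the alternative to the predicted
# direction (preregistration teams work more hours), giving a one-sided BF
ttestBF(x = hours_prereg, y = hours_blind, nullInterval = c(0, Inf))
```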
To test Hypothesis 4, we fitted two zero-inflated Poisson regression models as defined by Lambert (1992) and implemented in McElreath (2016). This model assumes that with probability θ, a team reports zero deviations and that with probability 1 − θ, the number of reported deviations (i.e., zero or higher) follows a Poisson(λ) distribution. The first model included analysis method as predictor, and the second model did not. McElreath expressed the logit-transformed parameter θ′ as the additive term of an intercept and a predictor variable. Following his recommendations, we assigned a standard normal prior to both the intercept and the predictor parameter. Likewise, McElreath expressed the log-transformed parameter λ′ as the additive term of an intercept and a predictor variable, to which we assigned a Normal(0, 10) prior and a standard normal prior, respectively.
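The following sketch expresses this model using the rethinking package that accompanies McElreath’s textbook; the data are toy values, and the variable names are placeholders rather than the project’s actual code.

```r
library(rethinking)

# toy data: number of reported deviations per team and condition
# (0 = analysis blinding, 1 = preregistration)
d <- data.frame(
  n_dev     = c(0, 0, 2, 1, 0, 3, 0, 1),
  condition = c(0, 0, 1, 1, 0, 1, 0, 1)
)

m1 <- ulam(
  alist(
    n_dev ~ dzipois(p, lambda),            # zero-inflated Poisson likelihood
    logit(p)    <- a_p + b_p * condition,  # theta: probability of a structural zero
    log(lambda) <- a_l + b_l * condition,  # lambda: rate of the Poisson counts
    a_p ~ dnorm(0, 1),    # standard normal prior (intercept, theta)
    b_p ~ dnorm(0, 1),    # standard normal prior (predictor, theta)
    a_l ~ dnorm(0, 10),   # Normal(0, 10) prior (intercept, lambda)
    b_l ~ dnorm(0, 1)     # standard normal prior (predictor, lambda)
  ),
  data = d, chains = 4
)
precis(m1)
```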
We then estimated the log marginal likelihoods of these models using bridge sampling and computed the BF for these two models (Gronau et al., 2017, 2020). This BF compared the null hypothesis with the encompassing hypothesis that leaves all parameters free to vary. Afterward, we applied the unconditional encompassing method to the first model to estimate the proportion of prior and posterior samples in agreement with our hypothesis and again computed a BF (Gelfand et al., 1992; Hoijtink, 2011; Klugkist, 2008; Klugkist et al., 2005; Sedransk et al., 1985). This BF compared Hypothesis 4 with the encompassing hypothesis. Finally, we obtained the BF comparing Hypothesis 4 with the null hypothesis by multiplying the two BFs. The analysis was conducted in R (R Core Team, 2021).
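Assuming m1 (with the condition predictor) and m0 (the intercept-only version) are fits like the one sketched above, this pipeline could look as follows; the order constraint shown for Hypothesis 4 reflects a hypothetical coding (1 = preregistration) and is illustrative, not the project’s actual code.

```r
library(bridgesampling)

# log marginal likelihoods via bridge sampling; the Stan fits stored
# inside the ulam objects are passed on
ml1 <- bridge_sampler(m1@stanfit)
ml0 <- bridge_sampler(m0@stanfit)
bf_enc_null <- exp(ml1$logml - ml0$logml)  # BF: encompassing model vs. null

# encompassing method: proportion of posterior vs. prior samples satisfying
# the order constraint of Hypothesis 4 (preregistration teams have fewer
# structural zeros and a higher deviation rate)
post  <- rethinking::extract.samples(m1)
prior <- list(b_p = rnorm(1e6, 0, 1), b_l = rnorm(1e6, 0, 1))

prop_in <- function(s) mean(s$b_p < 0 & s$b_l > 0)
bf_h4_enc  <- prop_in(post) / prop_in(prior)  # Hypothesis 4 vs. encompassing
bf_h4_null <- bf_h4_enc * bf_enc_null         # Hypothesis 4 vs. null
```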
Deviations from the preregistration
The following deviations from the analysis plan were decided on the basis of the blinded data. In our preregistration, we mentioned that the catalogue listing the aspects on which the teams could deviate would span six items. However, when preparing the study materials, we decided to split the aspect “operationalization of variables” into “operationalization of dependent variables” and “operationalization of independent variables” and to add the aspect “statistical test.”
We preregistered that we would exclude no teams from the analyses. However, some teams did not complete all surveys, and thus we were unable to calculate all relevant outcome measures. These teams were excluded from the analysis of those hypotheses for which no outcome measures could be calculated.
Concerning Hypothesis 1, we preregistered to conduct a one-sided Bayesian independent-samples t test with total hours worked as dependent variable and analysis method as independent variable. On the basis of the blinded data, we decided to transform the hours worked to correct for the skewness of the data.
Concerning Hypothesis 2, we preregistered to conduct a one-sided Bayesian Mann-Whitney test with perceived effort as dependent variable and analysis method as independent variable. After inspecting the blinded data, we decided that a Bayesian independent-samples t test was appropriate for this variable.
Concerning Hypothesis 3, we preregistered to test this hypothesis using a one-sided Bayesian Mann-Whitney test with perceived frustration as dependent variable and analysis method as independent variable. We did not change the preregistered analysis plan. Even though we treated the variable perceived frustration as continuous, a Mann-Whitney test seemed most appropriate because the variable did not meet the normality assumption even after we applied transformations.
Results
Sample characteristics
The career stages and research backgrounds represented in each team are shown in Table 2. As is apparent from Figure 1, teams in both conditions reported less knowledge of the topic of religion and well-being (25% and 31% of teams reported [some] expertise on this topic in the preregistration and analysis-blinding conditions, respectively) than of methodology and statistics (75% and 89% of teams reported [some] expertise on this topic in the preregistration and analysis-blinding conditions, respectively).
Table 2. Positions and Domains Featured in the Analysis Teams per Condition
Note: Teams may include multiple members of the same position and in the same domain.

Figure 1. Responses to the survey questions on the teams’ reported knowledge regarding religion and well-being (left) and knowledge regarding methodology and statistics (right). In each panel, the left bar represents responses from teams who did analysis blinding, and the right bar represents responses from teams who preregistered.
Prior beliefs for Research Question 1 were slightly higher in the preregistration group than in the analysis-blinding group.
Exclusions
One team in the analysis-blinding condition and one team in the preregistration condition did not fill in the Stage 1 survey and therefore could not be included in the analysis. In addition, one team in the preregistration condition did not report its perceived effort in the survey from Stage 1 and was therefore excluded from the analysis regarding Hypothesis 2. Note that one team did not report deviations because it did not submit a final analysis.
Confirmatory analyses
Table 3 shows the descriptive statistics of the dependent variables for each condition for the entire project duration and separately for each stage.
Table 3. For Each Condition, Means and Standard Deviations for the Hours Worked (Workload), Perceived Effort, Perceived Frustration, and Reflection on Hours Worked
Note: Statistics are shown for the total project duration and separately for each stage. For each stage, the number represents the mean with the standard deviation in parentheses. For correlations, the number represents the median estimate for the Bayesian Pearson correlation coefficient, and the numbers in brackets represent the 95% credible interval. The last column shows the median estimate for the Bayesian Pearson correlation coefficient ρ for values in Stages 1 and 2.
The measures hours worked, perceived effort, and reflection on hours worked were positively correlated, yet not so strongly as to suggest that they measured the exact same concept. The Bayesian Kendall’s τ correlations were as follows: for hours worked and perceived effort, τ = .49, BF+0 = 2.6 × 1012; for hours worked and reflection on hours worked, τ = .32, BF+0 = 83,476; and for perceived effort and reflection on hours worked, τ = .40, BF+0 = 2.3 × 108.
Hours worked
Hypothesis 1 stated that the total hours worked in planning and executing the analysis are lower for teams in the analysis-blinding condition than for teams in the preregistration condition. We collected strong evidence for the null hypothesis, that is, that teams in both conditions take the same amount of time (Figure 2).

Figure 2. Reported total hours worked in Stages 1 and 2 for each analysis team. The upper panel shows (in orange) responses of teams in the preregistration condition. The lower panel shows (in green) responses of teams in the analysis-blinding condition. The data provide strong evidence in favor of the null hypothesis that teams in both conditions take an equal amount of time planning and executing the analysis. Points are jittered to enhance visibility.
Perceived effort and frustration
Hypothesis 2 stated that the perceived effort of planning and executing the analysis is lower for teams in the analysis-blinding condition than for teams in the preregistration condition. The data were inconclusive: We found no evidence either in favor of or against our hypothesis (Figure 3).

Figure 3. Responses to the survey questions about the perceived effort (left) and frustration (right) of planning and executing the analysis. The top panel shows responses of teams in the preregistration condition. The bottom panel shows responses of teams in the analysis-blinding condition. The data were inconclusive on whether analysis blinding was perceived as less effortful and provided moderate evidence against its being perceived as less frustrating. Points are jittered to enhance visibility.
Hypothesis 3 stated that the perceived frustration when planning and executing the analysis is lower for teams in the analysis-blinding condition than for teams in the preregistration condition. We collected moderate evidence for the null hypothesis.
Deviation from analysis plan
Hypothesis 4 stated that teams in the preregistration condition deviate more often from their planned analysis than teams in the analysis-blinding condition and that, when they deviate from their analysis plan, teams in the preregistration condition deviate on more aspects than teams in the analysis-blinding condition. An overview of the reported deviations is given in Table 4, and the number of deviations per condition is depicted in Figure 4. We collected strong evidence in favor of our hypothesis.
Table 4. Reported Deviations From Planned Analysis per Condition
Note: Teams may report multiple deviations.

Figure 4. Reported deviations from planned analysis per condition. The green bars represent teams in the analysis-blinding condition, and the orange bars represent teams in the preregistration condition. More teams in the analysis-blinding condition reported no deviations from their planned analysis, and if they had deviated, they did so on fewer aspects than teams in the preregistration condition.
The aspects teams most often deviated on were the exclusion criteria (11 teams), the variables included in the model (nine teams), the operationalization of the independent variables (eight teams), and the statistical model (eight teams). The difference between teams who did analysis blinding and teams who preregistered was most apparent for the exclusion criteria: Of the 11 teams, 10 were in the preregistration condition. In addition, for the operationalization of the independent variables, almost all deviations were reported by teams who preregistered (eight out of nine).
Exploratory analysis
Differences of the many-analysts’ conclusions per condition
Elaborate results of the many-analysts’ conclusions about the substantive research questions are reported in Hoogeveen, Sarafoglou, Aczel, et al. (2022). Here, we briefly show the analysis teams’ findings split per experimental condition. In Figure 5, the standardized effect sizes (βs) reported by the analysis teams are displayed per condition and research question. For Research Question 1 (“Do religious people self-report higher well-being?”), all teams in the blinding condition reported positive effect sizes for which the 95% CI excludes zero. The median reported β = 0.125, and the median absolute deviation (MAD) = 0.030. Likewise, for the teams in the preregistration condition, all teams reported positive effect sizes with 95% CIs excluding zero. The median reported β = 0.114, and the MAD = 0.039. For Research Question 2 (“Does the relation between religiosity and self-reported well-being depend on perceived cultural norms of religion?”), the majority of teams again reported positive effect sizes with CIs excluding zero. That is, in the blinding condition, 97.9% of the βs were positive, 66.0% of the intervals excluded zero, median β = 0.040, and MAD = 0.030. In the preregistration condition, 94.4% of the βs were positive, 64.8% of the intervals excluded zero, median β = 0.037, and MAD = 0.020.

Figure 5. Effect sizes (βs) with 95% confidence or credible intervals for the two research questions reported by the analysis teams in the Many-Analysts Religion Project. The top row shows the βs for the effect of religiosity on self-reported well-being (Research Question 1), and the bottom row shows βs for the effect of cultural norms of religion on the relation between religiosity and self-reported well-being (Research Question 2). Left are the βs for teams in the blinding condition (in green), and right are panels for teams in the preregistration condition (in orange). The βs are ordered from smallest to largest.
Total hours worked
We conducted an exploratory analysis to test whether the effect of total hours worked goes in the direction opposite to our predictions, that is, whether the total hours worked to plan and execute the task are higher for teams in the analysis-blinding condition than for teams in the preregistration condition. The data were inconclusive for this hypothesis, BF+0 = 1.511.
In addition, we compared the reported hours worked between the two project stages. Figure 6 illustrates the reported work hours separately for Stage 1 and Stage 2. The difference in total hours worked was largest in Stage 1 of the project, that is, when preregistering the analysis or analyzing the blinded data. Here, teams in the analysis-blinding condition took about twice as much time as teams in the preregistration condition.

Figure 6. Reported total hours worked in Stage 1 (top) and Stage 2 (bottom) for each analysis team. The upper panels show (in orange) responses of teams in the preregistration condition; the lower panels show (in green) responses of teams in the analysis-blinding condition. In Stage 1, teams who created an executable script using the blinded data required more time than teams who created a preregistration. In Stage 2, teams in both conditions required approximately the same amount of time to execute their analysis. Points are jittered to enhance visibility.
Reflection on hours worked
For Stage 1, 25.0% of teams who preregistered reported that completing the task was more work than anticipated, compared with 48.3% of teams who did analysis blinding. When executing the analysis (i.e., Stage 2 of the project), teams in both conditions needed approximately 15 hr to complete the task.
Independently coded deviations
In an additional exploratory analysis, we compared the deviations reported by the analysis teams with the deviations we identified ourselves. For this purpose, we (S. Hoogeveen and A. Sarafoglou) independently coded deviations from the analysis plan for each team (see Table 5). For the teams in the preregistration condition, we compared the analysis plan from the preregistration form with the responses from the postsurvey. Only when information did not emerge from the postsurvey did we review the authors’ report or the final analysis scripts. For teams in the analysis-blinding condition, we compared the analysis scripts for the blinded data with the analysis scripts for the real data. Initially, we evaluated three teams independently using the same checklist as presented in the postsurvey. Subsequently, we discussed the results and agreed on the following adjustments. We decided not to consider it a deviation if teams had planned to conduct their statistical analyses with multiple dependent variables but reported only one of them in the postsurvey, because we had explicitly instructed the teams to provide us with only one effect size. For aspects for which we did not know the answer (e.g., because the analysis plan was too vague), we coded the deviation as “Not Available” (NA). In addition, two teams were excluded because they did not submit a final analysis (although they completed the postsurvey and self-reported deviations). The intraclass correlation (ICC) between the two raters’ codings was satisfactory: ICC = .71. We resolved any disagreements through discussion and used the combined coding to test Hypothesis 4.
Table 5. Reported Deviations From Planned Analysis per Condition as Coded by Two Independent Raters
Note: Teams may report multiple deviations.
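The article does not specify how the ICC was computed. A minimal sketch of one plausible computation, using the irr package and hypothetical codings, is shown below.

```r
library(irr)

# hypothetical codings: number of deviations per team, coded by each rater
codings <- data.frame(
  rater1 = c(0, 2, 1, 0, 3, 1, 2, 0),
  rater2 = c(0, 2, 0, 0, 3, 2, 2, 0)
)

# two-way agreement ICC for single ratings
icc(codings, model = "twoway", type = "agreement", unit = "single")
```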
The results of this exploratory analysis are presented in Table 6. On the basis of the independent coding, we found extreme evidence for the hypothesis that teams in the analysis-blinding condition deviated less from their planned analysis than teams who preregistered.
Table 6. Robustness Checks for the Analysis of the Four Main Hypotheses
Note: For each hypothesis (columns) and robustness set (rows), the Bayes factor (BF) in favor of the restricted alternative hypothesis versus the null hypothesis is given. See the main text for an explanation of the different robustness sets. Empty cells indicate that the adjustments were not relevant for the particular hypothesis.
Robustness checks
In this study, we deviated from our preregistration at several points. First, we adapted our analyses to the properties of the data (e.g., transformations that were due to the skewness of the data). Second, we deviated from our sampling plan by assigning teams that merged to the analysis-blinding condition. Table 6 shows how the results for the four main hypotheses hold up under these adjustments.
Constraints on Generality
We believe that our results can be generalized to other research designs (i.e., experimental studies) and do not apply only to correlational studies. However, the outcomes of this study might depend on the complexity of the data and of the hypotheses researchers are investigating. Specifically, we expect data with a simpler structure than the MARP data (i.e., nonnested structure, no composite measures) to lead to fewer deviations from the analysis plans, whereas data with a more complex structure (e.g., requiring an extensive amount of preprocessing, such as in functional MRI analyses) may magnify the present results.
In addition, our results may not generalize to paradigms and topics with which analysis teams are very familiar. That is, researchers are better at anticipating analysis plans for paradigms they often work with than at developing an analysis plan for a completely new data set, new measures, and new theories. At the same time, most deviations in the present study concerned data exclusions, mostly related to unexpected peculiarities of the data that are unrelated to the topic or paradigm (e.g., some participants provided a nonsensical age). Moreover, we cannot determine to what extent the results of the current study generalize beyond multiteam projects. It is possible that researchers conducting their own studies need to perform more preparatory steps than researchers in our study, especially when preregistering or blinding their own projects. Specifically, we cannot draw conclusions about the perceived workload and convenience when researchers are required to preregister the whole study, including the study design, sampling plan, and materials, or when researchers need to blind a data set themselves before it is handed to the analysts.
Discussion
In the current study, we investigated whether analysis blinding has benefits over the preregistration of the analysis plan in terms of efficiency and convenience. We analyzed data from 120 teams participating in the MARP who either preregistered their analysis or created a reproducible script using blinded data. We hypothesized that analysis blinding would save researchers time and reduce their perceived effort and frustration to complete the project. In addition, we hypothesized that analysis blinding would lead to fewer deviations from the analysis plan.
One of the four hypotheses was supported: Compared with teams who preregistered, teams who did analysis blinding deviated less often from the analysis plan, and if they did deviate, they did so on fewer aspects. Teams in the analysis-blinding condition better anticipated their final analysis strategies, particularly with respect to exclusion criteria and the operationalization of the independent variables. We regard the finding that analysis blinding has a protective effect against deviations as good news for the field of metascience because (fear of) deviation is a well-known problem of preregistration (Claesen et al., 2021; Heirene et al., 2021; Nosek et al., 2019).
Contrary to our prediction, we found strong evidence against our hypothesis that analysis blinding would reduce the hours worked: Teams who did analysis blinding and teams who preregistered spent approximately the same amount of time planning and executing the analysis. We had assumed that teams who preregistered would need to work more hours because they were required to create a preregistration document in Stage 1 and then write and execute the analysis in Stage 2, whereas teams who did analysis blinding wrote their analysis scripts in Stage 1 and only had to execute them in Stage 2. This expected workload benefit for analysis blinding seemed especially plausible because some of the proposed analyses were quite complex (including factor analyses, structural equation models, and hierarchical regression models). Finally, we cannot draw conclusions about the hypotheses on perceived effort and frustration because the data did not provide strong evidence either in favor of or against them: The data suggested moderate evidence for the hypothesis that teams in both conditions experienced equal amounts of frustration and no evidence either way on whether analysis blinding was experienced as less effortful. Why were the hours worked approximately equal under preregistration and analysis blinding? Descriptives for Stage 1 showed that teams who preregistered were in fact quicker than teams who did analysis blinding. In itself, this result is not surprising: One would expect preregistration to be somewhat faster in Stage 1 and the expected benefit of analysis blinding to occur mostly in Stage 2. What was surprising, however, was how much faster the teams who preregistered were in Stage 1: They took only about half as much time as teams who did analysis blinding.
One explanation is that in the current study, the preregistration of the analysis was particularly simple. The literature recommends structured workflows and templates to assist researchers with their preregistrations (Nosek et al., 2019; van ’t Veer & Giner-Sorolla, 2016), and the MARP provided exactly such a highly structured workflow: The research questions were fixed, the teams were provided with a preregistration template, and they had access to the theoretical background of the research question and to comprehensive data documentation. In addition, because the teams analyzed preexisting data, they preregistered only their analysis plan instead of all aspects of the study (i.e., study design, sampling plan, materials).
Descriptives for Stage 2 showed that teams who preregistered and teams who did analysis blinding took about the same amount of time to execute the analysis. We speculate that this result may be due to imprecise communication with the teams on our part. To complete Stage 2, the teams were instructed to execute their planned analyses on the real data and to fill out the postsurvey to indicate their conclusions and summarize their results. We also told the teams what type of information was required to fill in the postsurvey and gave recommendations on how to organize their OSF folder, including adding a “ReadMe” file that documents the uploaded files and a brief summary of the main conclusions. The time associated with creating these files might have distorted our measure of hours worked: It may be that in Stage 2, most of the time was spent not on conducting the analyses but on writing the report, so that differences in workload related to the execution of the analysis went undetected. If true, this would imply that differences between the two methods may not be as relevant in real-world research, for which, again, most of the time may be spent on writing up the results rather than executing the analyses. To gain more insight into the time it takes teams to execute the analysis, future research should provide teams with instructions on how to document their files and results (or, more generally, how to complete the project) only after the teams have reported their hours worked.
The current study has several limitations, the first of which concerns the measurements. Although our measures of workload, effort, and frustration have high face validity and were taken from a previous study (Hart, 2006), their validity in the present context is unknown. The reported number of hours spent on the project, in particular, should be interpreted with caution because it was filled out in retrospect by one team member. Future projects could opt for a more objective measure and ask teams in advance to log their work hours (Parry et al., 2021).
The analysis teams, although coauthors of the article, may have been less invested in this large-scale collaboration project than if it were their own research. On the one hand, less emotional commitment to the research hypotheses may be advantageous because it lowers the motivation to engage in questionable research practices, such as significance seeking. On the other hand, lower investment may also have reduced the time and care the teams devoted to planning and executing their analyses.
We consider an analysis plan to be of high quality if it is “specific, precise, and exhaustive” (Wicherts et al., 2016, p. 2). The quality of the submitted preregistrations could be rated with the coding protocol used by Wicherts et al. (2016). However, to our knowledge, there exists no comparable coding protocol for submitted analysis code, checking, for instance, its clarity and reproducibility. Such a protocol would still have to be developed and validated so that the assessments of preregistrations and analysis scripts are comparable. Along the same lines, future research could assess the quality of the final analysis, for instance, by letting participating teams rate the work of their peers. However, such a quality check should be done with caution: Assessing the quality of an analysis imposes significant additional work on participating teams, is highly sensitive to subjective analytic preferences, and ignores theoretical considerations.
Although adherence to the analysis plan is desirable to ensure the confirmatory status of an analysis, we speculate that the teams’ deviations are not consequential. As the main results of the MARP show, almost all teams found a positive effect for Research Question 1. Thus, the fact that teams in the preregistration condition deviated from their analysis plans more often than teams in the analysis-blinding condition most likely had no practical consequences. The extent to which this pattern of inconsequential deviations also holds for other data and research questions (e.g., an experiment in which the null hypothesis is true) needs to be investigated in future studies.
The current study focused on planning and executing an analysis whose confirmatory status could be guaranteed. Thus, we are unable to determine how analysis blinding and preregistration compare with standard research. We deliberately decided not to include such a baseline condition because the teams answered a theoretically relevant research question, and thus, we saw the necessity to safeguard the confirmatory status of all analyses.
Regardless of our results, the decision whether to prefer preregistration or analysis blinding is always a matter of circumstance and research design. In the MARP, analysis blinding was particularly suitable because the data managers (i.e., the team with access to the real data) were completely independent of the analysis teams. From our subjective experience, researchers who had access to the blinded data also asked the data managers fewer questions in Stage 1 than researchers who had access only to the data documentation. We can therefore imagine that many-analysts projects in particular can benefit greatly from analysis blinding. It would also be worth considering giving researchers access to blinded data first when they want to perform reanalyses or meta-analyses rather than providing them directly with the real data.
In contrast, in very small research groups, there is often no guarantee that the analysis blinding has actually been done effectively. For instance, it cannot be ruled out that data managers and analysts discuss certain data patterns and thus develop new analyses that presumably lead to desirable results. Preregistrations allow for better control because they are time-stamped and it is possible to determine exactly in which time period the data were collected.
However, even in cases in which researchers solely preregister their study, the analysis plan can be developed on the basis of simulated data or on data from previous work (which was recommended, for instance, in Nosek et al., 2019). The resulting syntax can then be added to the preregistration document. Refining an analysis plan on simulated data helps researchers anticipate an analytic strategy and removes ambiguities from the preregistration.
We emphasize again, however, that researchers can also use preregistration and analysis blinding in combination. In a survey by Sarafoglou, Kovacs, et al. (2022), researchers reported that preregistration benefited multiple aspects of the research process, including the research hypothesis, study design, and preparatory work. We therefore regard it as most beneficial if researchers preregister the study but finalize the statistical analysis on a blinded version of the data—in fact, this was the procedure we used in the present report. To our knowledge, this is the first study that sought to investigate analysis blinding empirically in the social and behavioral sciences. Analysis blinding ties in with current methodological reforms for more transparency because it safeguards the confirmatory status of the analyses while simultaneously allowing researchers to explore peculiarities of the data and account for them in their analysis plan. Our results showed that analysis blinding and preregistration imply approximately the same amount of work but that in addition, analysis blinding reduced deviations from analysis plans. Thus, analysis blinding constitutes an important addition to the toolbox of effective methodological reforms to combat the crisis of confidence.
Appendix
Acknowledgements
The analyses were conducted in JASP (JASP Team, 2021) and in R (Version 4.0.3; R Core Team, 2021).
Transparency
Contributorship was documented with CRediT taxonomy using tenzing (Holcombe et al., 2020).
