Abstract
Preregistration can help to restrict researcher degrees of freedom and thereby ensure the integrity of research findings. However, its ability to restrict such flexibility depends on whether researchers specify their study plan in sufficient detail and adhere to this plan. Previous research indicates higher restrictiveness when preregistrations are based on structured versus unstructured template formats, although there is room for further improvement. In this study, we built on these findings and investigated the restrictiveness of preregistrations based on the Psychological Research Preregistration-Quantitative (PRP-QUANT) Template, an extensive template that aids the preregistration of quantitative studies in psychology. Preregistrations were sampled from PsychArchives and coded for their level of restrictiveness using the coding schemes of Bakker et al. and Heirene et al. We predicted that preregistrations based on the PRP-QUANT Template (
While conducting studies, researchers hold a substantial degree of flexibility in decision-making, often referred to as “researcher degrees of freedom” (RDF; Simmons et al., 2011; for an illustration, see Huntington-Klein et al., 2021). This flexibility can compromise the validity of findings and of the conclusions drawn from them, especially in the event of data-driven decisions or other forms of exploiting this flexibility (Simmons et al., 2011).
Preregistration, the practice of publishing a time-stamped research plan before data collection or analysis (see Parsons et al., 2022), helps limit RDF by predetermining and transparently disclosing decisions concerning the research process (as argued by Forstmeier et al., 2017; Hardwicke & Wagenmakers, 2023; Wicherts et al., 2016) and allows others to evaluate the severity of the hypothesis test (Lakens, 2019). In practice, it is not always possible to make all research decisions in advance and thus completely limit RDF, for example, if the focus is on hypothesis generation rather than testing. In these cases, brief preregistrations can already substantially increase transparency by signaling which decisions were made in advance and which were not. Nonetheless, whenever feasible, more extensive and detailed preregistrations may be particularly effective in restricting RDF (as proposed by Wicherts et al., 2016).
Preregistration templates, which prompt for information to include in the preregistration, can assist researchers in creating such restrictive preregistrations, but they vary in the level of detail that is requested. Bakker et al. (2020) compared preregistrations created using a structured versus an unstructured template format regarding their ability to restrict RDF. The inspected unstructured format was the “Standard Pre-Data Collection Registration” (https://osf.io/9j6d7), which inquires only about whether data have already been collected or examined, leaving all other descriptions open. This was compared with the structured format of the “OSF Preregistration” (formerly “Prereg Challenge Registration,” Version 4, https://osf.io/jea94), which consists of 26 items that assess the hypotheses, sampling plan, variables, design, and planned analyses in greater depth. To evaluate the inspected preregistrations’ restrictiveness, Bakker et al. devised an extensive coding scheme based on the RDF defined by Wicherts et al. (2016). Based on this, they found better, but not yet exhaustive, restriction of RDF with the structured compared with the unstructured template format (Bakker et al., 2020). Other studies that compared the OSF Preregistration Template with less extensive templates found similar results (Toth et al., 2021; Van Den Akker et al., 2023). These findings suggest that structured templates are associated with higher RDF restriction while also indicating room for further improvement.
Restrictiveness of Preregistrations Created With the Psychological Research Preregistration-Quantitative Template
In 2022, the “Psychological Research Preregistration-Quantitative (PRP-QUANT) Template” was published by a Joint Psychological Societies Preregistration Task Force (Bosnjak et al., 2022). It was developed based on the American Psychological Association’s Journal Article Reporting Standards (Appelbaum et al., 2018) and previous preregistration templates. In contrast to the OSF Template, whose scope covers various disciplines, the PRP-QUANT Template is specifically tailored to the field of psychology. Compared with previous templates, various items underwent description revisions, some items were divided into smaller subquestions, and new items were introduced. Because the PRP-QUANT Template is very extensive (including overall 45 items) and was specifically designed to prompt for many details and enable precise planning (see Bosnjak et al., 2022), our objective was to investigate whether it can indeed contribute to achieving higher restrictiveness.
By inspecting preregistrations created with this template, we investigated the extent to which it restricts RDF and which RDF are more restricted than others (Research Question 1) and compared its restrictiveness with the OSF Preregistration Template inspected by Bakker et al. (2020; Research Question 2). Because of its level of detail, we predicted that preregistrations created with the PRP-QUANT Template restrict RDF more than preregistrations based on the OSF Preregistration Template (Hypothesis 1).
Furthermore, we assessed whether peer review of preregistrations further restricts RDF (as suggested by Bakker et al., 2020; Research Question 3), for example, by reviewers identifying gaps in the preregistration and recommending that the authors provide additional information. To answer this question, we inspected PRP-QUANT preregistrations that were submitted to Leibniz Institute for Psychology’s (ZPID) service, PsychLab, to apply for a free-of-charge data collection. Because PsychLab aimed to promote preregistration by offering this incentive for high-quality preregistrations, the submitted preregistrations underwent evaluation by external reviewers before acceptance, assessing their (a) originality and incremental value, (b) relationship to the literature, (c) methodology, (d) quality of the questionnaire and definition of research constructs, and (e) implications of the proposed study. We compared PRP-QUANT preregistrations that were peer reviewed as part of this service with PRP-QUANT preregistrations published by authors without any additional review and predicted that peer-reviewed preregistrations restrict RDF more than non-peer-reviewed preregistrations (Hypothesis 2).
Adherence to the Preregistered Plan and Reporting of Deviations
Deviations from the preregistered plan can be useful and necessary for improving studies; however, it is important that such deviations are transparently reported to ensure interpretability. Given the emerging evidence of insufficient disclosure of deviations in research articles (e.g., Chan et al., 2004, 2008; Chen et al., 2019; Claesen et al., 2021; Goldacre et al., 2019; Ofosu & Posner, 2023; Van Den Akker et al., 2023; for a review, see TARG Meta-Research Group & Collaborators et al., 2023), we inspected the published research articles associated with the sampled PRP-QUANT preregistrations, following the procedure of Heirene et al. (2024), who investigated the restriction of RDF in gambling studies’ preregistrations. We descriptively assessed the extent to which researchers that used the PRP-QUANT Template adhered to their preregistered plan and how they reported deviations in their articles (Research Question 4).
Method
Transparency statement
This Stage 2 Registered Report (RR) was recommended by Peer Community in Registered Reports (PCI RR) on September 21, 2025 (Lakens, 2025). The recommendation letter and reviews are available at https://doi.org/10.24072/pci.rr.101013.
We report how we determined our sample size, all data exclusions, all inclusion/exclusion criteria, whether inclusion/exclusion criteria were established before data analysis, all manipulations, and all measures in the study. We meet Level 3 of the PCI RR bias control (PCI RR, n.d.). Our study design is displayed in Table A1 in the appendix. All study materials, including the R Markdown (RMD) file underlying this article (https://doi.org/10.23668/psycharchives.21201), analysis scripts (https://doi.org/10.23668/psycharchives.21202), coding schemes (https://doi.org/10.23668/psycharchives.16152), and the data, that is, the list of all included PRP-QUANT preregistrations and coded RDF (as a scientific-use file, https://doi.org/10.23668/psycharchives.16151), have been published alongside this article (https://doi.org/10.23668/psycharchives.21200) on PsychArchives. The Stage 1 RR, recommended by PCI RR on February 12, 2024 (Lakens, 2024a), and all materials and preliminary data are also available on PsychArchives at https://doi.org/10.23668/psycharchives.14119. All deviations from the Stage 1 RR are displayed in Table 1. For each deviation, a justification is provided.
Table 1. Deviations From the Stage 1 Registered Report
Note: For explanations of coding abbreviations (e.g., C4, D2), see Table 2. RR = registered report; RDF = researcher degrees of freedom.
Sample
In this observational study, we sampled preregistrations that were created with the PRP-QUANT Template and published in the digital research repository PsychArchives (https://psycharchives.org/). We conducted a search for PRP-QUANT preregistrations in PsychArchives using the corresponding metadata tag (“zpid.tags.visible:PRP-QUANT”) because the PRP-QUANT Template is made available through and closely linked to this repository (https://www.psycharchives.org/en/item/088c79cb-237c-4545-a9e2-3616d6cc8453). In addition, we inspected all studies conducted via ZPID’s PsychLab service by referring to our internal documentation and conducting a search on PsychArchives (“zpid.tags.visible:PsychLab”).
From all identified preregistrations, we included those in our coding that were based on the PRP-QUANT Template, written in English or German, publicly accessible (i.e., not under embargo), and empirical studies that included at least one testable hypothesis (see Bakker et al., 2020; Heirene et al., 2024). To inspect researchers’ adherence to the preregistered plan and reporting of deviations, we also searched for associated publications for all included preregistrations (e.g., by inspecting the PsychArchives record and conducting a Google search using the preregistration DOI).
For the Stage 1 RR, we performed an initial search to assess the feasibility of our search strategy, yielding a total of 74 eligible preregistrations (peer reviewed:
All PRP-QUANT preregistrations were compared with the 52 OSF preregistrations sampled by Bakker et al. (2020) to test Hypothesis 1 (accessible at Veldkamp et al., 2020). In the Stage 1 RR, our sample size of 74 PRP-QUANT preregistrations already surpassed that of Bakker et al., which they determined through a power analysis for a Wilcoxon-Mann-Whitney test with α = .05 and a power of .8 to detect a medium effect size of Cohen’s

Figure 1. Sensitivity curves. (a) Hypothesis 1 (PRP-QUANT vs. OSF preregistrations). (b) Hypothesis 2 (peer-reviewed vs. non-peer-reviewed PRP-QUANT preregistrations). The calculations were based on the preliminary sample sizes reported in the Stage 1 Registered Report. Power simulations were conducted in R (R Core Team, 2023). PRP-QUANT = Psychological Research Preregistration-Quantitative Template.
To test Hypothesis 2, we compared all PRP-QUANT preregistrations that were peer reviewed as part of PsychLab with the remaining PRP-QUANT preregistrations uploaded directly by researchers to PsychArchives without undergoing external review. For this comparison, the group sizes were limited by the number of available (non-)peer-reviewed preregistrations. However, the sensitivity curve in Figure 1b shows that even with the preliminary group sizes of 27 reviewed preregistrations and 47 nonreviewed preregistrations, we would still have had a power of .89 to detect small effects of
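The sensitivity analyses were run as power simulations in R. The Python sketch below only illustrates the general simulation logic for a one-tailed Wilcoxon-Mann-Whitney test with the preliminary group sizes (27 vs. 47); the assumed effect (a 0.8 SD shift between normal distributions) and all function names are illustrative assumptions, and the sketch ignores the nested structure of the actual restrictiveness scores.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def simulated_power(n1, n2, shift, n_sims=2000, alpha=0.05, seed=1):
    """Estimate power of a one-tailed Mann-Whitney U test by simulation.

    Illustrative sketch only; the study's simulations were run in R and
    modeled nested restrictiveness scores, which this simplification omits.
    """
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        group1 = rng.normal(loc=shift, size=n1)  # e.g., peer-reviewed group
        group2 = rng.normal(loc=0.0, size=n2)    # e.g., non-peer-reviewed group
        _, p = mannwhitneyu(group1, group2, alternative="greater")
        hits += p < alpha
    return hits / n_sims

# Preliminary group sizes from the Stage 1 RR; the 0.8 SD shift is made up.
print(round(simulated_power(27, 47, shift=0.8), 2))
```

Varying the assumed shift over a grid of effect sizes yields a sensitivity curve like those in Figure 1.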
To compare the study types of both samples, all preregistrations were coded as to whether an experiment, quasi-experiment, or nonexperiment (e.g., observational, correlational, survey) was preregistered. In both samples, experiments were the most common study type, but their percentage was higher in the OSF sample (PRP-QUANT = 49.51%, OSF = 73.08%). Consequently, nonexperiments were more prominent in PRP-QUANT (44.66%) than in OSF preregistrations (25%). The same was true for quasi-experiments (PRP-QUANT = 5.83%, OSF = 1.92%).
Measures and coding procedure
To ensure comparability, we used the protocols provided by Heirene et al. (2024), which they adapted from Bakker et al. (2020), to code restrictiveness in the PRP-QUANT preregistrations and adherence in their associated articles. These protocols are based on the 34 RDF defined by Wicherts et al. (2016), which encompass flexibility across five key stages: theorizing, design, collection, analyses, and reporting (see Table 2).
Table 2. Overview of RDF Inspected When Assessing Restrictiveness and Adherence
Note: Questions are abbreviated. The full coding scheme is available in the supplemental material. RDF = researcher degrees of freedom; T = theorizing; D = design; C = collection; A = analyses; R = reporting; IV = independent variable; DV = dependent variable; HARKing = hypothesizing after the results are known.
For assessing restrictiveness and adherence, we focused on the RDF that are applicable to preregistrations (cf. Table 2; restrictiveness: T1–A15, R6; adherence: T1–A15). For example, for the RDF “T1: Conducting exploratory research without any hypothesis,” restrictiveness was coded with the question “Is at least one hypothesis specified such that it is clear what are the IV(s) [independent variable(s)] and DV(s) [dependent variable(s)]?”; adherence was coded with “Are the hypotheses reported the same as in the preregistration?”
Overall, 23 questions were used to code restrictiveness (i.e., there were dependencies in that some questions informed multiple RDF). The coding was based on the dimensions outlined in Table 3. As an additional measure of restrictiveness, we assessed the clarity and distinctiveness of preregistered hypotheses, similar to Heirene et al. (2024). Specifically, we examined the number of preregistrations in which the number of hypotheses differed depending on whether they were interpreted as single or as several linked but autonomous predictions (e.g., in cases in which several predicted effects were mentioned in a single statement).
Table 3. Scoring of Restrictiveness, Adherence, and Deviation Type
Note: Scores were adapted from Heirene et al. (2024). When multiple hypotheses, variables, statistical models, and so on were described in the preregistration and relevant for an RDF, the overall score for that RDF was based on the lowest evaluation. For some RDF, only a subset of restrictiveness scores was possible (see coding scheme in the supplemental material). RDF = researcher degrees of freedom.
Scores of 3 were coded for comparability with Bakker et al. (2020) but were recoded to 2 because explicit statements that authors will adhere to their planned methods and avoid additional processes are not common in preregistrations. Note that the coding of the deviation types was slightly altered, as described in Table 1.
Twenty-four questions were used to code adherence. If an article comprised multiple studies, adherence was assessed based on the level of preregistrations (i.e., if an article included two preregistered studies, adherence was evaluated for each preregistration-article pair). We distinguished between three types of deviations from preregistration to article: modifying, additive, and omitting (see Table 3). If the methods presented in the article differed from those outlined in the preregistration, deviations were coded as modifying. They were labeled as additive if the article introduced information not included in the preregistration and as omitting if information provided in the preregistration was absent in the associated article. For modifying deviations, we furthermore examined in more detail whether they were disclosed and justified (i.e., whether the authors provided a reason for why the deviation occurred). The full coding scheme is available in the supplemental material (Spitzer et al., 2025b).
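The three deviation types amount to a small decision rule. The Python sketch below encodes the classification described above; the function and argument names are our own for illustration and do not come from the study's coding materials.

```python
def classify_deviation(in_prereg: bool, in_article: bool,
                       consistent: bool = True) -> str:
    """Classify one RDF in a preregistration-article pair.

    Encodes the scheme described in the text: modifying (information present
    in both documents but changed), additive (only in the article), omitting
    (only in the preregistration). Names are illustrative, not study code.
    """
    if in_prereg and in_article:
        return "adherent" if consistent else "modifying"
    if in_article:
        return "additive"   # article adds information absent from the preregistration
    if in_prereg:
        return "omitting"   # preregistered information is absent from the article
    return "unable to determine"  # no information in either document

print(classify_deviation(True, True, consistent=False))  # modifying
print(classify_deviation(False, True))                   # additive
```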
Each preregistration was coded independently by two persons (L. Spitzer, A. Kroeger). Inconsistencies were discussed and resolved in pairs. To assess intercoder reliability, a pilot coding phase was conducted using a randomly selected 10% of the sample, and Krippendorff’s α was calculated. We planned to proceed with the coding process if α exceeded the threshold of .7 and to revise the coding protocols and strategies by discussing ambiguities if the intercoder reliability fell below this threshold. For the restrictiveness coding, Krippendorff’s α was acceptable based on this criterion (α = .72). We therefore left the coding scheme unchanged after the pilot coding phase but added decision rules in a few places in which individual cases had previously been difficult to categorize (highlighted in the coding scheme; see Spitzer et al., 2025b). The adherence coding displayed more ambiguities and, consequently, a low intercoder reliability (α = .52). However, discussion revealed that this was not because of the coding scheme but rather because of the high complexity of both preregistrations and articles. The coding scheme was therefore not adapted. Instead, the coders discussed and resolved discrepancies as defined in the Stage 1 RR, increasing the accuracy of the agreed-on scores compared with the individual ones.
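For illustration, Krippendorff's α in the simplest case (two coders, nominal codes, no missing data) can be computed as below. This is a generic sketch of the reliability criterion, not the script used in the study.

```python
from collections import Counter

def krippendorff_alpha_nominal(coder1, coder2):
    """Krippendorff's alpha for two coders, nominal data, no missing values.

    alpha = 1 - (n - 1) * (observed disagreements) / (expected disagreements),
    computed from the coincidence matrix of paired codes.
    """
    o = Counter()  # coincidence matrix of ordered value pairs
    for a, b in zip(coder1, coder2):
        o[(a, b)] += 1
        o[(b, a)] += 1
    n_c = Counter()  # marginal frequency of each value
    for (c, _), count in o.items():
        n_c[c] += count
    n = sum(n_c.values())  # total pairable values (2 per coded unit)
    d_o = sum(count for (c, k), count in o.items() if c != k)
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k)
    if d_e == 0:  # all codes identical: perfect agreement by definition
        return 1.0
    return 1.0 - (n - 1) * d_o / d_e

# Proceed with coding only if reliability exceeds the preregistered threshold.
alpha = krippendorff_alpha_nominal([0, 0, 1, 1], [0, 0, 1, 0])
print(alpha > 0.7)  # with these toy codes (alpha = 0.53): False
```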
Data analysis
R packages and scripts
This article was written with the R package
Our analysis scripts are based on the scripts provided by Heirene et al. (2024). To adapt and test these, we used a blinded version of the OSF Preregistration data provided by Bakker et al. (2020) in which all numbers were replaced with random values within the coding range. A dummy data set was used for the coded PRP-QUANT preregistrations. The preliminary analysis scripts (Spitzer & Mueller, 2024a), the blinded/dummy data employed for testing them (Spitzer & Mueller, 2024c), and its corresponding RMD file (Spitzer & Mueller, 2024d) are available alongside the Stage 1 RR (Spitzer & Mueller, 2024e). The final analysis scripts (Spitzer et al., 2025a) and the R Markdown file that underlies this article—incorporating the code used to generate all outputs displaying the results (Spitzer et al., 2025c)—are accessible in the supplemental material.
Preprocessing
For each preregistration, the responses to the questions in our coding scheme were translated into restrictiveness scores for each RDF.
Subsequently, we adjusted all restrictiveness scores of 3 to 2 for both the PRP-QUANT and OSF preregistrations. A score of 3 required an explicit statement from authors that they would adhere to their planned methods and avoid additional processes. Heirene et al. (2024) reported that scores of 3 were rarely achieved because of the scarcity of these explicit statements from the authors and thus suggested this adjustment for future studies. To evaluate the impact of this decision on the results, we conducted sensitivity analyses by rerunning the hypothesis tests with the nonrecoded data and report differences.
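This recoding step amounts to a simple substitution. A minimal sketch in Python (the study's preprocessing was done in R):

```python
# Collapse restrictiveness scores of 3 ("explicit adherence statement")
# into 2, as suggested by Heirene et al. (2024); other scores are unchanged.
def recode_restrictiveness(scores):
    return [2 if score == 3 else score for score in scores]

print(recode_restrictiveness([0, 1, 3, 2, 3]))  # [0, 1, 2, 2, 2]
```

For the sensitivity analyses, the hypothesis tests are simply rerun on the original, nonrecoded scores.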
Restrictiveness
To assess the extent to which the PRP-QUANT Template restricts RDF (Research Question 1), we inspected the distribution of restrictiveness scores of PRP-QUANT preregistrations across all RDF. In addition, stacked bar plots of restrictiveness scores for each RDF are displayed for PRP-QUANT and OSF preregistrations in Figure 2 and for peer-reviewed and non-peer-reviewed PRP-QUANT preregistrations in Figure 3. We also examined the number of preregistrations in which the minimum and maximum number of hypotheses varied when viewed as single versus interconnected but independent predictions, providing means, standard deviations, medians, and minimum and maximum values for both interpretations.

Figure 2. Distribution of restrictiveness scores for Psychological Research Preregistration-Quantitative (PRP-QUANT) Template and OSF Template preregistrations.

Figure 3. Distribution of restrictiveness scores for (non-)peer-reviewed Psychological Research Preregistration-Quantitative (PRP-QUANT) Template preregistrations.
To test our two hypotheses (Research Question 2/Hypothesis 1: higher restrictiveness in PRP-QUANT than OSF preregistrations; Research Question 3/Hypothesis 2: higher restrictiveness in peer-reviewed than non-peer-reviewed preregistrations), we largely adopted the methods employed by Bakker et al. (2020) and Heirene et al. (2024). RDF that duplicated information (i.e., were based on the same questions as other RDF: C4, A5, A10, A12, R6) were excluded from these analyses.
First, we imputed missing values using a two-way imputation procedure based on row and column means. Specifically, the overall mean, the mean for each RDF, and the mean for each preregistration were computed based on available values, and missing values were imputed using the formula RDF mean + preregistration mean – overall mean (Bernaards & Sijtsma, 2000).
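The imputation formula can be sketched as follows (Python/NumPy; the study's scripts were written in R). Rows are taken to represent preregistrations and columns RDF, which is an assumption of this illustration.

```python
import numpy as np

def two_way_impute(scores):
    """Two-way imputation of missing restrictiveness scores.

    missing value = RDF (column) mean + preregistration (row) mean - overall mean,
    with all means computed from observed values only (Bernaards & Sijtsma, 2000).
    """
    x = np.asarray(scores, dtype=float)
    overall_mean = np.nanmean(x)
    prereg_means = np.nanmean(x, axis=1, keepdims=True)  # one mean per row
    rdf_means = np.nanmean(x, axis=0, keepdims=True)     # one mean per column
    imputed = prereg_means + rdf_means - overall_mean
    return np.where(np.isnan(x), imputed, x)

scores = [[1.0, 2.0],
          [1.0, np.nan]]
print(two_way_impute(scores))  # missing cell becomes 1 + 2 - 4/3 = 5/3
```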
To compare the restrictiveness scores between (a) PRP-QUANT and OSF preregistrations and (b) peer-reviewed and non-peer-reviewed PRP-QUANT preregistrations, we performed one-tailed nested Wilcoxon-Mann-Whitney tests using the R package
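Setting aside the nesting of scores within preregistrations, the basic one-tailed rank comparison can be illustrated with SciPy; the scores below are invented for illustration.

```python
from scipy.stats import mannwhitneyu

# Hypothetical restrictiveness scores; the actual analysis used a *nested*
# Wilcoxon-Mann-Whitney test in R that accounts for RDF being clustered
# within preregistrations, which this simplified sketch ignores.
prp_quant_scores = [2, 2, 1, 2, 2, 1, 2]
osf_scores = [1, 0, 1, 2, 0, 1, 1]

stat, p = mannwhitneyu(prp_quant_scores, osf_scores, alternative="greater")
print(f"U = {stat}, one-tailed p = {p:.3f}")
```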
Adherence
Adherence to the preregistered plans and reporting of deviations (Research Question 4) were analyzed descriptively. We focused on two aspects: the number of preregistration-article pairs with deviations and the total deviations across all pairs. At the level of preregistration-article pairs, we analyzed the number of studies that included modifying, additive, or omitting deviations. We provide the average number of deviations and their corresponding standard deviations and minimum and maximum values. At the level of total deviations across pairs, we report percentages and frequencies of different deviation types (see Table 6). For modifying deviations, we also assessed the proportion of justified, unjustified, and nondisclosed deviations.
Results
Restrictiveness
Overall restriction of RDF through the PRP-QUANT Template
Across all PRP-QUANT preregistrations, 968 of the 2,987 coded RDF were not restricted (32.41%), and 479 were partially restricted (16.04%). For 1,105 RDF, full restriction according to the used coding scheme was achieved (36.99%). In 435 cases (14.56%), RDF were not applicable for the coded preregistrations. Full restrictiveness was particularly prevalent for T1 (hypothesis), T2 (direction of hypothesis), D3 (multiple DV measures), and A5 (selected DV measure). Meanwhile, D2 (additional IVs), D4 (additional constructs), A7 (primary outcome selection), A10 (adding additional IVs), and R6 (hypothesizing after results are known) were often not restricted (i.e., they had the highest/lowest score for > 75% of coded RDF). The distribution of restrictiveness scores for PRP-QUANT compared with the OSF preregistrations is displayed in Figure 2.
Even though T1 and T2 (hypothesis and direction of hypothesis) reached a high level of restrictiveness according to the coding scheme, for 79 preregistrations (76.70%), we still identified that the hypotheses were not specified clearly. Specifically, the number of hypotheses differed depending on whether they were interpreted as single predictions (
Higher RDF restriction in PRP-QUANT than OSF preregistrations
Our first hypothesis was that preregistrations based on the PRP-QUANT Template constrain RDF more than preregistrations based on the OSF Preregistration Template. In line with our hypothesis, the PRP-QUANT preregistrations had a significantly higher restrictiveness than the OSF preregistrations,
Table 4. Comparisons Between PRP-QUANT and OSF Preregistration Restrictiveness Scores for Individual RDF
Note: Hypothesis tests were conducted with imputed data. The
A sensitivity analysis showed that recoding the restrictiveness scores from 3 to 2 did not affect the results of the nested Wilcoxon-Mann-Whitney test,
Higher restriction of RDF in peer-reviewed than in non-peer-reviewed preregistrations
Second, we predicted that peer-reviewed PRP-QUANT preregistrations restrict RDF more than non-peer-reviewed preregistrations created with the same format. Consistent with our hypothesis, restrictiveness was significantly higher for peer-reviewed preregistrations than non-peer-reviewed preregistrations,
Table 5. Comparisons Between Peer-Reviewed and Non-Peer-Reviewed PRP-QUANT Preregistration Restrictiveness Scores for Individual RDF
Note: Hypothesis tests were conducted with imputed data. The
As shown in a sensitivity analysis, recoding the restrictiveness scores from 3 to 2 had no effect on the nested Wilcoxon-Mann-Whitney test,
High occurrence of deviations
In all 19 preregistration-article pairs (100%), the preregistration, the article, or both were not specified in sufficient detail to completely assess the adherence between them. For 5.04% of RDF, no information was provided in the preregistration (UP scores per preregistration-article pair:
Two of the 19 inspected research articles contained no modifying deviations (10.53%); that is, the information provided in the preregistration and the article was consistent (not considering additive and omitting deviations). Meanwhile, 17 displayed modifying deviations (89.47%). In this group, eight articles contained declared deviations. On average, the articles included 1.06 declared and justified deviations (
Examining the adherence scores across preregistration-article pairs at the level of RDF, we observed that for 233 RDF, no deviations were present (51.10% of the 456 coded RDF). Meanwhile, a total of 59 modifying deviations were found (12.94%). Out of these, 15 were justified (25.42%), and five were not justified (8.47%). We identified a total of 39 undeclared deviations, which accounted for 66.10% of all modifying deviations (see Table 6). Undeclared deviations were most often related to the hypothesis (T1, present in 42.11% of the publications), the statistical models (A13, present in 36.84%), and the exclusion criteria (D5, present in 21.05%). In addition, we identified 23 additive (5.04%) and 48 omitting deviations (10.53%).
Table 6. Deviation Types Present in the PRP-QUANT Preregistrations by RDF
Note: Twenty-four questions were used to code adherence for 29 RDF (i.e., there were some dependencies in that the same questions informed multiple RDF). Duplicate answers were excluded from analyses. The table shows the percentage (frequency) of different deviation types made with respect to each RDF. Modifying = deviation occurred between preregistration and article (adherence = 0); additive = RDF was not restricted in the preregistration, but related information was described in the article (adherence = UP); omitting = RDF was restricted in the preregistration but not mentioned in the article (adherence = UA); unable to determine = no information in either the preregistration or the article (adherence = UB); NA = not applicable; RDF = researcher degrees of freedom; IVs = independent variables; DV = dependent variables; PRP-QUANT = Psychological Research Preregistration-Quantitative Template.
Exploratory analyses
In addition to the confirmatory analyses, we conducted two unplanned exploratory analyses to examine the influence of peer review on the preregistrations in greater detail.
First, it is possible that the peer review of some of the PRP-QUANT preregistrations contributed to their higher scores compared with the OSF sample (note, however, that the OSF preregistrations were also checked, but only for completeness, not quality; see Center for Open Science, n.d.). To investigate this further, we created a plot comparing the scores between PRP-QUANT and OSF preregistrations using only the non-peer-reviewed PRP-QUANT preregistrations (see Fig. 4). Visual inspection indicates that even for this subsample, preregistrations based on the PRP-QUANT Template tended to have descriptively higher restrictiveness for many RDF compared with the OSF sample.

Figure 4. Distribution of restrictiveness scores for nonreviewed Psychological Research Preregistration-Quantitative (PRP-QUANT) Template versus OSF Template preregistrations.
Second, we were interested in whether the deviation types differed between peer-reviewed and non-peer-reviewed PRP-QUANT preregistrations. Indeed, peer-reviewed preregistrations tended to show fewer deviations (see Table 7). This could indicate another positive effect of peer review: Reviewers might help improve a preregistration so that fewer deviations become necessary later (e.g., because a procedure does not work as intended or because information was missing from the preregistration).
Table 7. Deviation Types Present in the Non-Peer-Reviewed Versus Peer-Reviewed PRP-QUANT Preregistrations
Note: Twenty-four questions were used to code adherence for 29 RDF (i.e., there were some dependencies in that the same questions informed multiple RDF). Duplicate answers were excluded from analyses. The table shows the percentage (frequency) of different deviation types. Modifying = deviation occurred between preregistration and article (adherence = 0); additive = RDF was not restricted in the preregistration, but related information was described in the article (adherence = UP); omitting = RDF was restricted in the preregistration but not mentioned in the article (adherence = UA); unable to determine = no information in either the preregistration or the article (adherence = UB); NA = not applicable; RDF = researcher degrees of freedom; PRP-QUANT = Psychological Research Preregistration-Quantitative Template.
Discussion
In our study, we examined the extent to which preregistrations based on the extensive PRP-QUANT Template (Bosnjak et al., 2022) restrict RDF (Research Question 1). We compared these preregistrations with those using the earlier OSF Template (Research Question 2) and investigated whether restrictiveness could be further enhanced through peer review (Research Question 3). In addition, we evaluated the degree to which researchers adhered to their PRP-QUANT preregistrations in the related articles (Research Question 4).
Higher restrictiveness in PRP-QUANT and peer-reviewed preregistrations
Our results show that around a third of the RDF were fully restricted in PRP-QUANT preregistrations and that around half remained only partially restricted or unrestricted. Furthermore, even though T1 and T2 (hypothesis and direction of hypothesis) achieved high scores based on the coding scheme, we still found that hypotheses were not specified clearly in 76.70% of preregistrations. The reason for this discrepancy is that the coding scheme awarded high values if the IV and DV were defined clearly within single hypotheses, whereas our deeper investigation focused on the whole set of hypotheses, that is, on whether the minimum and maximum number of hypotheses might vary when viewed as single versus interconnected but independent predictions. For the latter, a high degree of ambiguity was found, meaning that readers might evaluate the same statements as fewer or more hypotheses based on their subjective perception. This aligns with earlier findings (Bakker et al., 2020; Heirene et al., 2024) and suggests that there is still room for improvement because flexibility persisted in these preregistrations, both for the RDF in general and for the hypotheses in particular.
However, compared with the earlier OSF preregistrations, 18 of the 23 tested RDF were more restricted in PRP-QUANT preregistrations (17 significantly so), which also resulted in an overall higher restrictiveness in the latter. Our effect size of
A higher restrictiveness was also found for 22 of the 23 tested RDF in peer-reviewed versus non-peer-reviewed preregistrations (of these, 14 comparisons were significant after correction). This suggests that peer review is indeed a valuable tool for enhancing the quality of preregistrations, a potential that is currently underused.
High occurrence of deviations and need for more transparent reporting
Only two of the 19 inspected research articles adhered completely to their preregistration, providing further evidence that deviations from preregistrations are common. Importantly, 13 articles contained undeclared deviations, which accounted for around two-thirds of all modifying deviations. Together with the facts that researchers report continued uncertainty about how to handle deviations and that readers of psychology articles do not typically inspect the preregistrations (Spitzer & Mueller, 2023), this highlights the need for a more transparent and potentially standardized handling of deviations. Fortunately, this issue has been recognized by the psychological-research community, and there have been first attempts to address it in the research literature. For example, Lakens (2024b) described in which cases it makes sense to deviate from the preregistered plan, and there are now also templates for reporting deviations (Spitzer & Mueller, 2024b; Willroth & Atherton, 2024).
In addition, we observed a high occurrence of both additive and omitting deviations. Additive deviations suggest that the preregistrations were either lacking in detail or incomplete, and omitting deviations may indicate outcome-reporting bias. Alternatively, they may reflect a shift in practice in which authors no longer provide a comprehensive description of all methods in the article but refer to the preregistration. This shows the importance of the preregistrations for fully understanding the evolution of the research, further underscoring the need for transparent reporting strategies.
Limitations
Our study has several limitations that need to be considered when interpreting these results. First, it is important to recognize that our coding does not provide a definitive assessment of restrictiveness but rather an approximation, because a definitive assessment would require a complete understanding of the garden of forking paths (i.e., of all possible decisions that could be made while conducting the research; see Gelman & Loken, 2013). In addition, we found the coding scheme to be overly strict in some cases, leading to lower scores than we deemed appropriate for some RDF (e.g., D2, "additional IVs," which explicitly asked whether preregistration authors indicated that no further covariates would be used and could be coded only as either 0 = no or 3 = yes).
Despite explicit decision rules, some ambiguity remained in the coding process, leading to low interrater reliability, particularly in the adherence assessment. To mitigate this, both coders discussed and resolved all discrepancies, ensuring that the final scores were more accurate than the initial individual ratings. Overall, although the coding scheme posed some challenges, we believe that it still provides a useful basis for comparing the two preregistration samples because both were coded using the same criteria. Nevertheless, it might be useful to revise the coding schemes further in the future. Additional RDF could then also be considered, such as how hypotheses are linked to theories and what conclusions are drawn from each statistical test (as suggested by Reviewer 2).
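One common way to quantify the interrater reliability discussed above is Cohen's kappa, which corrects raw percentage agreement for agreement expected by chance. The sketch below is purely illustrative: the two coders' ratings are made up, and kappa is one standard choice of statistic, not necessarily the one used in this study.

```python
# Illustrative only: hypothetical ratings from two coders on a 0-3 scale;
# Cohen's kappa corrects observed agreement for chance agreement.
from collections import Counter

coder1 = [0, 1, 2, 2, 3, 1, 0, 2, 3, 3]  # hypothetical restrictiveness codes
coder2 = [0, 1, 2, 1, 3, 1, 1, 2, 3, 2]

n = len(coder1)
# Observed agreement: share of items both coders scored identically.
observed = sum(a == b for a, b in zip(coder1, coder2)) / n
# Chance agreement: expected overlap given each coder's marginal distribution.
c1, c2 = Counter(coder1), Counter(coder2)
expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / n ** 2
kappa = (observed - expected) / (1 - expected)
print(f"agreement = {observed:.2f}, kappa = {kappa:.2f}")  # → agreement = 0.70, kappa = 0.60
```

Here the raw agreement of .70 shrinks to a kappa of .60 once chance agreement is taken into account, which is why chance-corrected statistics give a more conservative picture of coder consistency.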
We were also not able to blind coders to the identity of the templates because we coded only PRP-QUANT preregistrations and compared them with an existing sample of OSF preregistrations. This introduces the possibility of bias during coding, which we sought to minimize by employing a detailed and structured coding scheme adapted from earlier research (Heirene et al., 2024).
A further limitation concerns the procedure used to impute the NA values for the hypothesis tests, which favored groups with a higher proportion of NA values. If, for example, both compared groups (e.g., PRP-QUANT and OSF preregistrations) contained the same number of scores of 2, but one group additionally contained some scores of 1 while the other contained more NA values instead, the imputation procedure would favor the second group: Its imputed values would be formed based on its higher observed scores and would therefore be higher. Although this should be considered when interpreting the results (especially for RDF with an overall high number of NA values), it is also important to note that these values indicate that the authors of the coded preregistration had specified that an RDF was not relevant to them (e.g., in cases such as blinding). This favoring might therefore make sense in that NA values (i.e., deliberate indications that something is not relevant) are preferable to lower restrictiveness values.
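The favoring described above can be made visible with a minimal sketch. It assumes a simple within-group mean imputation; both the scores and the exact imputation rule are hypothetical and chosen only to illustrate the mechanism, not to reproduce the study's actual procedure.

```python
# Hypothetical sketch: within-group mean imputation (an assumed rule, not
# necessarily the study's exact procedure) favors the group whose missing
# values sit alongside higher observed scores.
def impute_with_group_mean(scores):
    """Replace None (NA) values with the mean of the group's observed scores."""
    observed = [s for s in scores if s is not None]
    mean = sum(observed) / len(observed)
    return [mean if s is None else s for s in scores]

group_a = [2, 2, 2, 1, 1]        # same number of 2s, plus some additional 1s
group_b = [2, 2, 2, None, None]  # same number of 2s, plus NA values instead

a = impute_with_group_mean(group_a)  # unchanged: no NAs to fill
b = impute_with_group_mean(group_b)  # NAs become 2.0, the mean of the observed 2s
print(sum(a) / len(a), sum(b) / len(b))  # → 1.6 2.0 (group B ends up higher)
```

Both groups share the same scores of 2, but group B's missing values are filled from its higher observed mean, so its overall restrictiveness score ends up higher than group A's.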
We cannot rule out the potential influence of confounding variables in our study. Foremost, the PRP-QUANT Template was introduced at the beginning of 2021, meaning that the PRP-QUANT preregistrations in our sample are more recent than the OSF preregistrations used for our comparison, which were published in 2016. In addition, our PRP-QUANT sample consisted partly of peer-reviewed preregistrations submitted in response to a call for free-of-charge data collection. It is conceivable that researchers put more effort into such preregistrations. However, the inspected OSF preregistrations were also part of a call, namely, the “Preregistration Challenge” organized by the Center for Open Science. Here, researchers also applied for funding, and the preregistrations were reviewed for completeness (but not quality; see Center for Open Science, n.d.). Both samples therefore appear to be comparable, although confounding influences cannot be ruled out.
Furthermore, for the deviation analyses, we note that we were able to identify articles for only a fraction of the preregistration sample, that is, only for the older ones from 2021 to 2022. No articles were identified for newer preregistrations, probably because they had not yet been published. This may have had an impact on the rate of identified deviations.
Finally, in the Stage 1 RR, we specified that we would consider all existing PRP-QUANT preregistrations published in the digital research repository PsychArchives by searching for the corresponding metadata tag. This was based on the assumption that the preregistrations in the archive were tagged accordingly. However, it turned out that 41 eligible preregistrations were missing the tag and were therefore erroneously excluded from our data set. Although we do not assume that the unidentified preregistrations differ systematically from the sampled ones, this might still be another confounding factor.
Future developments
Continuous evaluation of open-science practices, such as preregistration, is essential to ensure they achieve their intended goals. Future research in this area could inspect preregistrations based on other templates or compare them across different research areas (see Heirene et al., 2024). It might also be interesting to compare a current sample of OSF preregistrations with our PRP-QUANT sample to rule out the potential confounding influence of time. In addition, preregistration templates could also be evaluated directly, for example, regarding their usability (similar to our approach in Spitzer et al., 2024).
Another aspect we did not address in our study but that could be of interest would be a closer examination of the deviations from preregistrations to final articles. Specifically, we assessed whether modifying deviations were disclosed and justified but not whether they constituted an improvement in methodology. It could be interesting to explore why researchers choose to deviate and whether such changes ultimately enhance the quality of the study.
Meta-analytical investigations of preregistrations, especially comparisons between preregistrations and associated articles, could be facilitated by publishing preregistrations in machine-readable form (see Lakens & DeBruine, 2021). In addition, this could help ensure that preregistrations are published more in accordance with the FAIR (i.e., findable, accessible, interoperable, and reusable) principles (Wilkinson et al., 2016).
Conclusion
In our study, PRP-QUANT preregistrations were associated with greater RDF restriction than OSF preregistrations, suggesting that developing and using highly structured, detailed templates may effectively help reduce unwanted flexibility in preregistrations. Furthermore, restrictiveness was greater in peer-reviewed than non-peer-reviewed preregistrations, highlighting the potential benefit of peer review in this context. Meanwhile, deviations from preregistered plans—both declared and undeclared—were common in the inspected articles, emphasizing the persisting lack of transparent reporting.
Appendix
Study Design, Based on the Template Provided by Peer Community in Registered Reports
| Question | Hypothesis | Sampling plan | Analysis plan | Rationale for deciding the sensitivity of the hypothesis test | Interpretation given different outcomes | Theory that could be shown wrong by the outcomes |
|---|---|---|---|---|---|---|
| Research Question 1: To what extent does the PRP-QUANT Template restrict RDF, and which RDF are more restricted than others? | None | We sampled all PRP-QUANT preregistrations published on PsychArchives that contained the corresponding metadata tag. We included all preregistrations that met our inclusion criteria (i.e., preregistrations that were based on the PRP-QUANT Template, were written in English or German, were publicly accessible, were empirical studies, and included at least one testable hypothesis). An initial search identified 74, to which all other preregistrations published up to the start of coding were added (final sample: …). | The distribution of restrictiveness scores of PRP-QUANT preregistrations across all RDF was inspected. In addition, stacked bar plots of restrictiveness scores for each RDF are displayed for PRP-QUANT and OSF preregistrations and for peer-reviewed and non-peer-reviewed PRP-QUANT preregistrations. We also examined the number of preregistrations in which the minimum and maximum number of hypotheses varied when viewed as single versus interconnected but independent predictions, providing means, standard deviations, medians, and minimum and maximum values for both interpretations. | Descriptive analyses of the PRP-QUANT preregistrations’ restrictiveness scores were used to answer this research question. No hypothesis tests were conducted. | The results are reported descriptively. | N/A |
| Research Question 2: Are RDF more restricted in preregistrations created with the PRP-QUANT Template compared with the OSF Preregistration Template studied by Bakker et al. (2020)? | Hypothesis 1 (primary): Preregistrations created with the PRP-QUANT Template restrict RDF more (i.e., have higher restrictiveness scores) than preregistrations based on the format inspected by Bakker et al. (2020; i.e., the OSF Preregistration Template). | All included PRP-QUANT preregistrations ( … Stage 1 RR indicated that with the preliminary sample sizes (PRP-QUANT preregistrations: … | We conducted a nested one-tailed Wilcoxon-Mann-Whitney test to compare restrictiveness scores between PRP-QUANT and OSF preregistrations using the R package … model, template was treated as a fixed effect, and RDF was treated as a random effect. First, group-specific … of Cohen’s … | Bakker et al. (2020) determined their sample size of 53 by conducting a power analysis for a Wilcoxon-Mann-Whitney test with α = .05 and a power of .8 to detect a medium effect size … | We preregistered the following interpretation in Stage 1: If the preregistrations created with the PRP-QUANT format restrict RDF more (i.e., have an overall higher restrictiveness score) compared with the OSF preregistrations sampled by Bakker et al. (2020; support for Hypothesis 1), it will be concluded that the PRP-QUANT format is indeed more effective in reducing RDF than the previous format in the field of psychology. It therefore appears worthwhile to develop/use highly structured templates in the future. However, if contrary to our predictions, the PRP-QUANT preregistrations do not have significantly higher restrictiveness scores than the OSF ones, we will conclude that there is no evidence that the PRP-QUANT Template achieves a higher level of restrictiveness. We will also further examine for how many of the individual RDF restrictiveness is higher in PRP-QUANT than OSF preregistrations and will conclude that the benefit of the PRP-QUANT Template might be most pronounced for all RDF showing significant differences. | This test was not grounded in a clear-cut theory but was based on the assumption that employing more structured templates is linked to higher restrictiveness, as initially described by Bakker et al. (2020). Our objective was to examine whether a template even more structured and detailed than the one previously studied by Bakker et al. (2020) can even better restrict RDF. |
| Research Question 3: Can peer review of preregistrations help to restrict RDF? | Hypothesis 2 (secondary): Peer-reviewed preregistrations created with the PRP-QUANT Template restrict RDF more (i.e., have higher restrictiveness scores) than non-peer-reviewed preregistrations created with the same format. | All PRP-QUANT preregistrations that were reviewed were compared with the remaining non-peer-reviewed PRP-QUANT preregistrations. A sensitivity analysis conducted for the Stage 1 RR showed that with the preliminary group sizes of 27 reviewed and 47 nonreviewed preregistrations, we would have had a power of .89 to detect small effects of … | Similar to the analysis of Hypothesis 1, we conducted a one-tailed nested Wilcoxon-Mann-Whitney test to compare the restrictiveness scores between peer-reviewed versus non-peer-reviewed PRP-QUANT preregistrations (procedure is detailed above). Review status was treated as a fixed effect, and RDF was treated as a random effect. To determine significance, a criterion of α = .05 was applied. In addition, we conducted 23 more Wilcoxon-Mann-Whitney tests to compare the restrictiveness scores for the individual RDF. For these follow-up tests, … | For this comparison, the group sizes were limited by the number of available (non-)peer-reviewed preregistrations. However, our sensitivity analysis in the Stage 1 RR indicated that we still had a high power to detect even small effects (e.g., a power of .89 to detect effects of … | We preregistered the following interpretation in Stage 1: If our analysis reveals that peer-reviewed preregistrations exhibit a higher level of restrictiveness (i.e., have an overall higher restrictiveness score) compared with non-peer-reviewed preregistrations (supporting Hypothesis 2), we will conclude that peer review is indeed a valuable tool for enhancing the quality of preregistrations, a potential that is currently underused. If we find no significant difference in the overall restrictiveness between peer-reviewed and non-peer-reviewed preregistrations, we will conclude that there is insufficient evidence to support the necessity of peer review for achieving high restrictiveness. As for Hypothesis 1, we will also inspect for how many of the individual RDF restrictiveness is higher in peer-reviewed than non-peer-reviewed preregistrations. Based on these analyses, we will conclude that the benefit of peer review for increasing restrictiveness might be most evident for RDF exhibiting significant differences. | This test was also not based on a formulated theory but rather on the observation made by Bakker et al. (2020) that peer review could potentially have a positive effect on the restrictiveness of preregistrations. |
| Research Question 4: To what degree do researchers who used the PRP-QUANT Template adhere to their preregistered plan, what deviations occur, and how are these reported? | None | We searched for associated publications for all included preregistrations by examining the PsychArchives record of each preregistration and searching for the preregistration DOI on the internet ( … | Researchers’ adherence to their preregistered plans and reporting of deviations were analyzed descriptively. We focused on two aspects: the number of preregistration-article pairs with deviations and the total deviations across all pairs. At the level of preregistration-article pairs, we analyzed the number of studies that included modifying, additive, or omitting deviations. We provide the average number of deviations and their corresponding standard deviations and minimum and maximum values. At the deviations level, we calculated percentages and frequencies of different types of deviations for each RDF and overall across all preregistration-article pairs, presenting the results in a table. For modifying deviations, we also assessed the proportion of justified, unjustified, and nondisclosed deviations. | Descriptive analyses of the PRP-QUANT preregistrations’ adherence and deviation type scores were used to answer this research question. No hypothesis tests were conducted. | The results are reported descriptively. | N/A |
Note: PRP-QUANT = Psychological Research Preregistration-Quantitative Template; RDF = researcher degrees of freedom; RR = registered report.
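As a rough illustration of the Wilcoxon-Mann-Whitney comparisons described in the analysis plans above: the sketch below uses invented restrictiveness scores and a plain (non-nested) one-tailed test in Python, whereas the reported analyses were run in R and additionally nested RDF within preregistrations.

```python
# Rough illustration only: invented per-RDF restrictiveness scores (0-3) and
# a plain one-tailed Wilcoxon-Mann-Whitney test; the study's actual analyses
# were nested and conducted in R.
from scipy.stats import mannwhitneyu

prp_quant_scores = [3, 2, 3, 1, 2, 3, 2, 3]  # hypothetical scores
osf_scores = [1, 2, 0, 1, 2, 1, 0, 2]

# One-sided test of whether PRP-QUANT scores are stochastically greater.
stat, p = mannwhitneyu(prp_quant_scores, osf_scores, alternative="greater")
print(f"U = {stat}, p = {p:.4f}")
```

With these invented data, the one-sided test rejects at α = .05, mirroring the direction of the hypothesis tests in the table (higher restrictiveness for PRP-QUANT preregistrations).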
Acknowledgements
The grammar of individual text sections was improved with the help of artificial intelligence, but no text sections were generated by artificial intelligence. Registered reports (RRs) involving existing data at Peer Community in Registered Reports: For our study, we compared a new data set coded from PRP-QUANT preregistrations with existing data from Bakker et al. (2020). We assume a bias level of 3. For our Stage 1 RR, we had already downloaded the data from Bakker et al.; however, we did not look at them and instead blinded these data sets in order to write and test our analysis scripts (the script used for blinding is available in the supplemental material of the Stage 1 RR; Spitzer & Mueller, 2024a). In addition, we had already downloaded the PRP-QUANT preregistrations that existed to date for the Stage 1 RR submission but did not begin coding until receiving in-principle acceptance. For additional supporting information, see Spitzer and Mueller (2024c).
Transparency
