Abstract
Preregistration can help to restrict researcher degrees of freedom and thereby ensure the integrity of research findings. However, its ability to restrict such flexibility depends on whether researchers specify their study plan in sufficient detail and adhere to this plan. Previous research indicates higher restrictiveness when preregistrations are based on structured versus unstructured template formats, although there is room for further improvement. In this study, we built on these findings and investigated the restrictiveness of preregistrations based on the Psychological Research Preregistration-Quantitative (PRP-QUANT) Template, an extensive template that aids the preregistration of quantitative studies in psychology. Preregistrations were sampled from PsychArchives and coded for their level of restrictiveness using the coding schemes of Bakker et al. and Heirene et al. We predicted that preregistrations based on the PRP-QUANT Template (
While conducting studies, researchers hold a substantial degree of flexibility in decision-making, often referred to as “researcher degrees of freedom” (RDF; Simmons et al., 2011; for an illustration, see Huntington-Klein et al., 2021). This flexibility can compromise the validity of findings and of the conclusions drawn from them, especially in the event of data-driven decisions or other forms of exploiting this flexibility (Simmons et al., 2011).
Preregistration, the practice of publishing a time-stamped research plan before data collection or analysis (see Parsons et al., 2022), helps limit RDF by predetermining and transparently disclosing decisions concerning the research process (as argued by Forstmeier et al., 2017; Hardwicke & Wagenmakers, 2023; Wicherts et al., 2016) and allows others to evaluate the severity of the hypothesis test (Lakens, 2019). In practice, it is not always possible to make all research decisions in advance and thus completely limit RDF, for example, if the focus is on hypothesis generation rather than testing. In these cases, brief preregistrations can already substantially increase transparency by signaling which decisions were made in advance and which were not. Nonetheless, whenever feasible, more extensive and detailed preregistrations may be particularly effective in restricting RDF (as proposed by Wicherts et al., 2016).
Preregistration templates, which prompt for information to include in the preregistration, can assist researchers in creating such restrictive preregistrations, but they vary in the level of detail that is requested. Bakker et al. (2020) compared preregistrations created using a structured versus an unstructured template format regarding their ability to restrict RDF. The inspected unstructured format was the “Standard Pre-Data Collection Registration” (https://osf.io/9j6d7), which inquires only about whether data have already been collected or examined, leaving all other descriptions open. This was compared with the structured format of the “OSF Preregistration” (formerly “Prereg Challenge Registration,” Version 4, https://osf.io/jea94), which consists of 26 items that assess the hypotheses, sampling plan, variables, design, and planned analyses in greater depth. To evaluate the inspected preregistrations’ restrictiveness, Bakker et al. devised an extensive coding scheme based on the RDF defined by Wicherts et al. (2016). Based on this, they found better, but not yet exhaustive, restriction of RDF with the structured compared with the unstructured template format (Bakker et al., 2020). Other studies that compared the OSF Preregistration Template with less extensive templates found similar results (Toth et al., 2021; Van Den Akker et al., 2023). These findings suggest that structured templates are associated with higher RDF restriction while also indicating room for further improvement.
Restrictiveness of Preregistrations Created With the Psychological Research Preregistration-Quantitative Template
In 2022, the “Psychological Research Preregistration-Quantitative (PRP-QUANT) Template” was published by a Joint Psychological Societies Preregistration Task Force (Bosnjak et al., 2022). It was developed based on the American Psychological Association’s Journal Article Reporting Standards (Appelbaum et al., 2018) and previous preregistration templates. In contrast to the OSF Template, whose scope covers various disciplines, the PRP-QUANT Template is specifically tailored to the field of psychology. Compared with previous templates, various items underwent description revisions, some items were divided into smaller subquestions, and new items were introduced. Because the PRP-QUANT Template is very extensive (including overall 45 items) and was specifically designed to prompt for many details and enable precise planning (see Bosnjak et al., 2022), our objective was to investigate whether it can indeed contribute to achieving higher restrictiveness.
By inspecting preregistrations created with this template, we investigated the extent to which it restricts RDF and which RDF are more restricted than others (Research Question 1) and compared its restrictiveness with the OSF Preregistration Template inspected by Bakker et al. (2020; Research Question 2). Because of its level of detail, we predicted that preregistrations created with the PRP-QUANT Template restrict RDF more than preregistrations based on the OSF Preregistration Template (Hypothesis 1).
Furthermore, we assessed whether peer review of preregistrations further restricts RDF (as suggested by Bakker et al., 2020; Research Question 3), for example, by reviewers identifying gaps in the preregistration and recommending that the authors provide additional information. To answer this question, we inspected PRP-QUANT preregistrations that were submitted to Leibniz Institute for Psychology’s (ZPID) service, PsychLab, to apply for a free-of-charge data collection. Because PsychLab aimed to promote preregistration by offering this incentive for high-quality preregistrations, the submitted preregistrations underwent evaluation by external reviewers before acceptance, assessing their (a) originality and incremental value, (b) relationship to the literature, (c) methodology, (d) quality of the questionnaire and definition of research constructs, and (e) implications of the proposed study. We compared PRP-QUANT preregistrations that were peer reviewed as part of this service with PRP-QUANT preregistrations published by authors without any additional review and predicted that peer-reviewed preregistrations restrict RDF more than non-peer-reviewed preregistrations (Hypothesis 2).
Adherence to the Preregistered Plan and Reporting of Deviations
Deviations from the preregistered plan can be useful and necessary for improving studies; however, it is important that such deviations are transparently reported to ensure interpretability. Given the emerging evidence of insufficient disclosure of deviations in research articles (e.g., Chan et al., 2004, 2008; Chen et al., 2019; Claesen et al., 2021; Goldacre et al., 2019; Ofosu & Posner, 2023; Van Den Akker et al., 2023; for a review, see TARG Meta-Research Group & Collaborators et al., 2023), we inspected the published research articles associated with the sampled PRP-QUANT preregistrations, following the procedure of Heirene et al. (2024), who investigated the restriction of RDF in gambling studies’ preregistrations. We descriptively assessed the extent to which researchers that used the PRP-QUANT Template adhered to their preregistered plan and how they reported deviations in their articles (Research Question 4).
Method
Transparency statement
This Stage 2 Registered Report (RR) was recommended by Peer Community in Registered Reports (PCI RR) on September 21, 2025 (Lakens, 2025). The recommendation letter and reviews are available at https://doi.org/10.24072/pci.rr.101013.
We report how we determined our sample size, all data exclusions, all inclusion/exclusion criteria, whether inclusion/exclusion criteria were established before data analysis, all manipulations, and all measures in the study. We meet Level 3 of the PCI RR bias control (PCI RR, n.d.). Our study design is displayed in Table A1 in the appendix. All study materials, including the R Markdown (RMD) file underlying this article (https://doi.org/10.23668/psycharchives.21201), analysis scripts (https://doi.org/10.23668/psycharchives.21202), coding schemes (https://doi.org/10.23668/psycharchives.16152), and the data, that is, the list of all included PRP-QUANT preregistrations and coded RDF (as a scientific-use file, https://doi.org/10.23668/psycharchives.16151), have been published alongside this article (https://doi.org/10.23668/psycharchives.21200) on PsychArchives. The Stage 1 RR, recommended by PCI RR on February 12, 2024 (Lakens, 2024a), and all materials and preliminary data are also available on PsychArchives at https://doi.org/10.23668/psycharchives.14119. All deviations from the Stage 1 RR are displayed in Table 1. For each deviation, a justification is provided.
Table 1. Deviations From the Stage 1 Registered Report
Note: For explanations of coding abbreviations (e.g., C4, D2), see Table 2. RR = registered report; RDF = researcher degrees of freedom.
Sample
In this observational study, we sampled preregistrations that were created with the PRP-QUANT Template and published in the digital research repository PsychArchives (https://psycharchives.org/). We conducted a search for PRP-QUANT preregistrations in PsychArchives using the corresponding metadata tag (“zpid.tags.visible:PRP-QUANT”) because the PRP-QUANT Template is made available through and closely linked to this repository (https://www.psycharchives.org/en/item/088c79cb-237c-4545-a9e2-3616d6cc8453). In addition, we inspected all studies conducted via ZPID’s PsychLab service by referring to our internal documentation and conducting a search on PsychArchives (“zpid.tags.visible:PsychLab”).
From all identified preregistrations, we included those in our coding that were based on the PRP-QUANT Template, written in English or German, publicly accessible (i.e., not under embargo), and empirical studies that included at least one testable hypothesis (see Bakker et al., 2020; Heirene et al., 2024). To inspect researchers’ adherence to the preregistered plan and reporting of deviations, we also searched for associated publications for all included preregistrations (e.g., by inspecting the PsychArchives record and conducting a Google search using the preregistration DOI).
For the Stage 1 RR, we performed an initial search to assess the feasibility of our search strategy, yielding a total of 74 eligible preregistrations (peer reviewed:
All PRP-QUANT preregistrations were compared with the 52 OSF preregistrations sampled by Bakker et al. (2020) to test Hypothesis 1 (accessible at Veldkamp et al., 2020). In the Stage 1 RR, our sample size of 74 PRP-QUANT preregistrations already surpassed that of Bakker et al., which they determined through a power analysis for a Wilcoxon-Mann-Whitney test with α = .05 and a power of .8 to detect a medium effect size of Cohen’s

Figure 1. Sensitivity curves. (a) Hypothesis 1 (PRP-QUANT vs. OSF preregistrations). (b) Hypothesis 2 (peer-reviewed vs. non-peer-reviewed PRP-QUANT preregistrations). The calculations were based on the preliminary sample sizes reported in the Stage 1 Registered Report. Power simulations were conducted in R (R Core Team, 2023). PRP-QUANT = Psychological Research Preregistration-Quantitative Template.
To test Hypothesis 2, we compared all PRP-QUANT preregistrations that were peer reviewed as part of PsychLab with the remaining PRP-QUANT preregistrations uploaded directly by researchers to PsychArchives without undergoing external review. For this comparison, the group sizes were limited by the number of available (non-)peer-reviewed preregistrations. However, the sensitivity curve in Figure 1b shows that even with the preliminary group sizes of 27 reviewed preregistrations and 47 nonreviewed preregistrations, we would still have had a power of .89 to detect small effects of
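The sensitivity analyses were run as power simulations in R. The Python sketch below only illustrates the general simulation logic for a one-tailed Wilcoxon-Mann-Whitney test with the preliminary group sizes (27 vs. 47); the assumed effect (a 0.8 SD shift between normal distributions) and all function names are illustrative assumptions, and the sketch ignores the nested structure of the actual restrictiveness scores.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def simulated_power(n1, n2, shift, n_sims=2000, alpha=0.05, seed=1):
    """Estimate power of a one-tailed Mann-Whitney U test by simulation.

    Illustrative sketch only; the study's simulations were run in R and
    modeled nested restrictiveness scores, which this simplification omits.
    """
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        group1 = rng.normal(loc=shift, size=n1)  # e.g., peer-reviewed group
        group2 = rng.normal(loc=0.0, size=n2)    # e.g., non-peer-reviewed group
        _, p = mannwhitneyu(group1, group2, alternative="greater")
        hits += p < alpha
    return hits / n_sims

# Preliminary group sizes from the Stage 1 RR; the 0.8 SD shift is made up.
print(round(simulated_power(27, 47, shift=0.8), 2))
```

Varying the assumed shift over a grid of effect sizes yields a sensitivity curve like those in Figure 1.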
To compare the study types of both samples, all preregistrations were coded as to whether an experiment, quasi-experiment, or nonexperiment (e.g., observational, correlational, survey) was preregistered. In both samples, experiments were the most common study type, but their percentage was higher in the OSF sample (PRP-QUANT = 49.51%, OSF = 73.08%). Consequently, nonexperiments were more prominent in PRP-QUANT (44.66%) than in OSF preregistrations (25%). The same was true for quasi-experiments (PRP-QUANT = 5.83%, OSF = 1.92%).
Measures and coding procedure
To ensure comparability, we used the protocols provided by Heirene et al. (2024), which they adapted from Bakker et al. (2020), to code restrictiveness in the PRP-QUANT preregistrations and adherence in their associated articles. These protocols are based on the 34 RDF defined by Wicherts et al. (2016), which encompass flexibility across five key stages: theorizing, design, collection, analyses, and reporting (see Table 2).
Table 2. Overview of RDF Inspected When Assessing Restrictiveness and Adherence
Note: Questions are abbreviated. The full coding scheme is available in the supplemental material. RDF = researcher degrees of freedom; T = theorizing; D = design; C = collection; A = analyses; R = reporting; IV = independent variable; DV = dependent variable; HARKing = hypothesizing after the results are known.
For assessing restrictiveness and adherence, we focused on the RDF that are applicable to preregistrations (cf. Table 2; restrictiveness: T1–A15, R6; adherence: T1–A15). For example, for the RDF “T1: Conducting exploratory research without any hypothesis,” restrictiveness was coded with the question “Is at least one hypothesis specified such that it is clear what are the IV(s) [independent variable(s)] and DV(s) [dependent variable(s)]?”; adherence was coded with “Are the hypotheses reported the same as in the preregistration?”
Overall, 23 questions were used to code restrictiveness (i.e., there were dependencies in that some questions informed multiple RDF). The coding was based on the dimensions outlined in Table 3. As an additional measure of restrictiveness, we assessed the clarity and distinctiveness of preregistered hypotheses, similar to Heirene et al. (2024). Specifically, we examined the number of preregistrations in which the number of hypotheses differed depending on whether they were interpreted as single or as several linked but autonomous predictions (e.g., in cases in which several predicted effects were mentioned in a single statement).
Table 3. Scoring of Restrictiveness, Adherence, and Deviation Type
Note: Scores were adapted from Heirene et al. (2024). When multiple hypotheses, variables, statistical models, and so on were described in the preregistration and relevant for an RDF, the overall score for that RDF was based on the lowest evaluation. For some RDF, only a subset of restrictiveness scores was possible (see coding scheme in the supplemental material). RDF = researcher degrees of freedom.
Scores of 3 were coded for comparability with Bakker et al. (2020) but were recoded to 2 because explicit statements that authors will adhere to their planned methods and avoid additional processes are not common in preregistrations. Note that the coding of the deviation types was slightly altered, as described in Table 1.
Twenty-four questions were used to code adherence. If an article comprised multiple studies, adherence was assessed based on the level of preregistrations (i.e., if an article included two preregistered studies, adherence was evaluated for each preregistration-article pair). We distinguished between three types of deviations from preregistration to article: modifying, additive, and omitting (see Table 3). If the methods presented in the article differed from those outlined in the preregistration, deviations were coded as modifying. They were labeled as additive if the article introduced information not included in the preregistration and as omitting if information provided in the preregistration was absent in the associated article. For modifying deviations, we furthermore examined in more detail whether they were disclosed and justified (i.e., whether the authors provided a reason for why the deviation occurred). The full coding scheme is available in the supplemental material (Spitzer et al., 2025b).
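The three deviation types amount to a small decision rule. The Python sketch below encodes the classification described above; the function and argument names are our own for illustration and do not come from the study's coding materials.

```python
def classify_deviation(in_prereg: bool, in_article: bool,
                       consistent: bool = True) -> str:
    """Classify one RDF in a preregistration-article pair.

    Encodes the scheme described in the text: modifying (information present
    in both documents but changed), additive (only in the article), omitting
    (only in the preregistration). Names are illustrative, not study code.
    """
    if in_prereg and in_article:
        return "adherent" if consistent else "modifying"
    if in_article:
        return "additive"   # article adds information absent from the preregistration
    if in_prereg:
        return "omitting"   # preregistered information is absent from the article
    return "unable to determine"  # no information in either document

print(classify_deviation(True, True, consistent=False))  # modifying
print(classify_deviation(False, True))                   # additive
```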
Each preregistration was coded independently by two persons (L. Spitzer, A. Kroeger). Inconsistencies were discussed and resolved in pairs. To assess intercoder reliability, a pilot coding phase was conducted using a randomly selected 10% of the sample, and Krippendorff’s α was calculated. We planned to proceed with the coding process if α exceeded the threshold of .7 and to revise the coding protocols and strategies by discussing ambiguities if the intercoder reliability fell below this threshold. For the restrictiveness coding, Krippendorff’s α was acceptable based on this criterion (α = .72). We therefore left the coding scheme unchanged after the pilot coding phase but added decision rules in a few places in which individual cases had previously been difficult to categorize (highlighted in the coding scheme; see Spitzer et al., 2025b). The adherence coding displayed more ambiguities and, consequently, a low intercoder reliability (α = .52). However, discussion revealed that this was not because of the coding scheme but rather because of the high complexity of both preregistrations and articles. The coding scheme was therefore not adapted. Instead, the coders discussed and resolved discrepancies as defined in the Stage 1 RR, increasing the accuracy of the agreed-on scores compared with the individual ones.
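For illustration, Krippendorff's α in the simplest case (two coders, nominal codes, no missing data) can be computed as below. This is a generic sketch of the reliability criterion, not the script used in the study.

```python
from collections import Counter

def krippendorff_alpha_nominal(coder1, coder2):
    """Krippendorff's alpha for two coders, nominal data, no missing values.

    alpha = 1 - (n - 1) * (observed disagreements) / (expected disagreements),
    computed from the coincidence matrix of paired codes.
    """
    o = Counter()  # coincidence matrix of ordered value pairs
    for a, b in zip(coder1, coder2):
        o[(a, b)] += 1
        o[(b, a)] += 1
    n_c = Counter()  # marginal frequency of each value
    for (c, _), count in o.items():
        n_c[c] += count
    n = sum(n_c.values())  # total pairable values (2 per coded unit)
    d_o = sum(count for (c, k), count in o.items() if c != k)
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k)
    if d_e == 0:  # all codes identical: perfect agreement by definition
        return 1.0
    return 1.0 - (n - 1) * d_o / d_e

# Proceed with coding only if reliability exceeds the preregistered threshold.
alpha = krippendorff_alpha_nominal([0, 0, 1, 1], [0, 0, 1, 0])
print(alpha > 0.7)  # with these toy codes (alpha = 0.53): False
```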
Data analysis
R packages and scripts
This article was written with the R package
Our analysis scripts are based on the scripts provided by Heirene et al. (2024). To adapt and test these, we used a blinded version of the OSF Preregistration data provided by Bakker et al. (2020) in which all numbers were replaced with random values within the coding range. A dummy data set was used for the coded PRP-QUANT preregistrations. The preliminary analysis scripts (Spitzer & Mueller, 2024a), the blinded/dummy data employed for testing them (Spitzer & Mueller, 2024c), and its corresponding RMD file (Spitzer & Mueller, 2024d) are available alongside the Stage 1 RR (Spitzer & Mueller, 2024e). The final analysis scripts (Spitzer et al., 2025a) and the R Markdown file that underlies this article—incorporating the code used to generate all outputs displaying the results (Spitzer et al., 2025c)—are accessible in the supplemental material.
Preprocessing
For each preregistration, the responses to the questions in our coding scheme were translated into restrictiveness scores for each RDF.
Subsequently, we adjusted all restrictiveness scores of 3 to 2 for both the PRP-QUANT and OSF preregistrations. A score of 3 required an explicit statement from authors that they would adhere to their planned methods and avoid additional processes. Heirene et al. (2024) reported that scores of 3 were rarely achieved because of the scarcity of these explicit statements from the authors and thus suggested this adjustment for future studies. To evaluate the impact of this decision on the results, we conducted sensitivity analyses by rerunning the hypothesis tests with the nonrecoded data and report differences.
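This recoding step amounts to a simple substitution. A minimal sketch in Python (the study's preprocessing was done in R):

```python
# Collapse restrictiveness scores of 3 ("explicit adherence statement")
# into 2, as suggested by Heirene et al. (2024); other scores are unchanged.
def recode_restrictiveness(scores):
    return [2 if score == 3 else score for score in scores]

print(recode_restrictiveness([0, 1, 3, 2, 3]))  # [0, 1, 2, 2, 2]
```

For the sensitivity analyses, the hypothesis tests are simply rerun on the original, nonrecoded scores.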
Restrictiveness
To assess the extent to which the PRP-QUANT Template restricts RDF (Research Question 1), we inspected the distribution of restrictiveness scores of PRP-QUANT preregistrations across all RDF. In addition, stacked bar plots of restrictiveness scores for each RDF are displayed for PRP-QUANT and OSF preregistrations in Figure 2 and for peer-reviewed and non-peer-reviewed PRP-QUANT preregistrations in Figure 3. We also examined the number of preregistrations in which the minimum and maximum number of hypotheses varied when viewed as single versus interconnected but independent predictions, providing means, standard deviations, medians, and minimum and maximum values for both interpretations.

Figure 2. Distribution of restrictiveness scores for Psychological Research Preregistration-Quantitative (PRP-QUANT) Template and OSF Template preregistrations.

Figure 3. Distribution of restrictiveness scores for (non-)peer-reviewed Psychological Research Preregistration-Quantitative (PRP-QUANT) Template preregistrations.
To test our two hypotheses (Research Question 2/Hypothesis 1: higher restrictiveness in PRP-QUANT than OSF preregistrations; Research Question 3/Hypothesis 2: higher restrictiveness in peer-reviewed than non-peer-reviewed preregistrations), we largely adopted the methods employed by Bakker et al. (2020) and Heirene et al. (2024). RDF that duplicated information (i.e., were based on the same questions as other RDF: C4, A5, A10, A12, R6) were excluded from these analyses.
First, we imputed missing values using a two-way imputation procedure based on row and column means. Specifically, the overall mean, the mean for each RDF, and the mean for each preregistration were computed based on available values, and missing values were imputed using the formula RDF mean + preregistration mean – overall mean (Bernaards & Sijtsma, 2000).
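The imputation formula can be sketched as follows (Python/NumPy; the study's scripts were written in R). Rows are taken to represent preregistrations and columns RDF, which is an assumption of this illustration.

```python
import numpy as np

def two_way_impute(scores):
    """Two-way imputation of missing restrictiveness scores.

    missing value = RDF (column) mean + preregistration (row) mean - overall mean,
    with all means computed from observed values only (Bernaards & Sijtsma, 2000).
    """
    x = np.asarray(scores, dtype=float)
    overall_mean = np.nanmean(x)
    prereg_means = np.nanmean(x, axis=1, keepdims=True)  # one mean per row
    rdf_means = np.nanmean(x, axis=0, keepdims=True)     # one mean per column
    imputed = prereg_means + rdf_means - overall_mean
    return np.where(np.isnan(x), imputed, x)

scores = [[1.0, 2.0],
          [1.0, np.nan]]
print(two_way_impute(scores))  # missing cell becomes 1 + 2 - 4/3 = 5/3
```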
To compare the restrictiveness scores between (a) PRP-QUANT and OSF preregistrations and (b) peer-reviewed and non-peer-reviewed PRP-QUANT preregistrations, we performed one-tailed nested Wilcoxon-Mann-Whitney tests using the R package
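Setting aside the nesting of scores within preregistrations, the basic one-tailed rank comparison can be illustrated with SciPy; the scores below are invented for illustration.

```python
from scipy.stats import mannwhitneyu

# Hypothetical restrictiveness scores; the actual analysis used a *nested*
# Wilcoxon-Mann-Whitney test in R that accounts for RDF being clustered
# within preregistrations, which this simplified sketch ignores.
prp_quant_scores = [2, 2, 1, 2, 2, 1, 2]
osf_scores = [1, 0, 1, 2, 0, 1, 1]

stat, p = mannwhitneyu(prp_quant_scores, osf_scores, alternative="greater")
print(f"U = {stat}, one-tailed p = {p:.3f}")
```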
Adherence
Adherence to the preregistered plans and reporting of deviations (Research Question 4) were analyzed descriptively. We focused on two aspects: the number of preregistration-article pairs with deviations and the total deviations across all pairs. At the level of preregistration-article pairs, we analyzed the number of studies that included modifying, additive, or omitting deviations. We provide the average number of deviations and their corresponding standard deviations and minimum and maximum values. At the level of total deviations across pairs, we report percentages and frequencies of different deviation types (see Table 6). For modifying deviations, we also assessed the proportion of justified, unjustified, and nondisclosed deviations.
Results
Restrictiveness
Overall restriction of RDF through the PRP-QUANT Template
Across all PRP-QUANT preregistrations, 968 of the 2,987 coded RDF were not restricted (32.41%), and 479 were partially restricted (16.04%). For 1,105 RDF, full restriction according to the used coding scheme was achieved (36.99%). In 435 cases (14.56%), RDF were not applicable for the coded preregistrations. Full restrictiveness was particularly prevalent for T1 (hypothesis), T2 (direction of hypothesis), D3 (multiple DV measures), and A5 (selected DV measure). Meanwhile, D2 (additional IVs), D4 (additional constructs), A7 (primary outcome selection), A10 (adding additional IVs), and R6 (hypothesizing after results are known) were often not restricted (i.e., they had the highest/lowest score for > 75% of coded RDF). The distribution of restrictiveness scores for PRP-QUANT compared with the OSF preregistrations is displayed in Figure 2.
Even though T1 and T2 (hypothesis and direction of hypothesis) reached a high level of restrictiveness according to the coding scheme, for 79 preregistrations (76.70%), we still identified that the hypotheses were not specified clearly. Specifically, the number of hypotheses differed depending on whether they were interpreted as single predictions (
Higher RDF restriction in PRP-QUANT than OSF preregistrations
Our first hypothesis was that preregistrations based on the PRP-QUANT Template constrain RDF more than preregistrations based on the OSF Preregistration Template. In line with our hypothesis, the PRP-QUANT preregistrations had a significantly higher restrictiveness than the OSF preregistrations,
Table 4. Comparisons Between PRP-QUANT and OSF Preregistration Restrictiveness Scores for Individual RDF
Note: Hypothesis tests were conducted with imputed data. The
A sensitivity analysis showed that recoding the restrictiveness scores from 3 to 2 did not affect the results of the nested Wilcoxon-Mann-Whitney test,
Higher restriction of RDF in peer-reviewed than in non-peer-reviewed preregistrations
Second, we predicted that peer-reviewed PRP-QUANT preregistrations restrict RDF more than non-peer-reviewed preregistrations created with the same format. Consistent with our hypothesis, restrictiveness was significantly higher for peer-reviewed preregistrations than non-peer-reviewed preregistrations,
Table 5. Comparisons Between Peer-Reviewed and Non-Peer-Reviewed PRP-QUANT Preregistration Restrictiveness Scores for Individual RDF
Note: Hypothesis tests were conducted with imputed data. The
As shown in a sensitivity analysis, recoding the restrictiveness scores from 3 to 2 had no effect on the nested Wilcoxon-Mann-Whitney test,
High occurrence of deviations
In all 19 preregistration-article pairs (100%), the preregistration, the article, or both were not specified in sufficient detail to completely assess the adherence between them. For 5.04% of RDF, no information was provided in the preregistration (UP scores per preregistration-article pair:
Two of the 19 inspected research articles contained no modifying deviations (10.53%); that is, the information provided in the preregistration and the article was consistent (not considering additive and omitting deviations). Meanwhile, 17 displayed modifying deviations (89.47%). In this group, eight articles contained declared deviations. On average, the articles included 1.06 declared and justified deviations (
Examining the adherence scores across preregistration-article pairs at the level of RDF, we observed that for 233 RDF, no deviations were present (51.10% of the 456 coded RDF). Meanwhile, a total of 59 modifying deviations were found (12.94%). Out of these, 15 were justified (25.42%), and five were not justified (8.47%). We identified a total of 39 undeclared deviations, which accounted for 66.10% of all modifying deviations (see Table 6). Undeclared deviations were most often related to the hypothesis (T1, present in 42.11% of the publications), the statistical models (A13, present in 36.84%), and the exclusion criteria (D5, present in 21.05%). In addition, we identified 23 additive (5.04%) and 48 omitting deviations (10.53%).
Table 6. Deviation Types Present in the PRP-QUANT Preregistrations by RDF
Note: Twenty-four questions were used to code adherence for 29 RDF (i.e., there were some dependencies in that the same questions informed multiple RDF). Duplicate answers were excluded from analyses. The table shows the percentage (frequency) of different deviation types made with respect to each RDF. Modifying = deviation occurred between preregistration and article (adherence = 0); additive = RDF was not restricted in the preregistration, but related information was described in the article (adherence = UP); omitting = RDF was restricted in the preregistration but not mentioned in the article (adherence = UA); unable to determine = no information in either the preregistration or the article (adherence = UB); NA = not applicable; RDF = researcher degrees of freedom; IVs = independent variables; DV = dependent variables; PRP-QUANT = Psychological Research Preregistration-Quantitative Template.
Exploratory analyses
In addition to the confirmatory analyses, we conducted two unplanned exploratory analyses to examine the influence of peer review on the preregistrations in greater detail.
First, it is possible that the peer review of some of the PRP-QUANT preregistrations contributed to their higher scores compared with the OSF sample (note, however, that the OSF preregistrations were also checked, but only for completeness, not quality; see Center for Open Science, n.d.). To investigate this further, we created a plot comparing the scores between PRP-QUANT and OSF preregistrations using only the non-peer-reviewed PRP-QUANT preregistrations (see Fig. 4). Visual inspection indicates that even for this subsample, preregistrations based on the PRP-QUANT Template tended to have descriptively higher restrictiveness for many RDF compared with the OSF sample.

Figure 4. Distribution of restrictiveness scores for nonreviewed Psychological Research Preregistration-Quantitative (PRP-QUANT) Template versus OSF Template preregistrations.
Second, we were interested in whether the deviation types differed between peer-reviewed and non-peer-reviewed PRP-QUANT preregistrations. Indeed, peer-reviewed preregistrations tended to show fewer deviations (see Table 7). This could indicate another positive effect of peer review: Reviewers might help improve a preregistration so that fewer deviations become necessary later (e.g., because a procedure does not work as intended or because information was missing from the preregistration).
Table 7. Deviation Types Present in the Non-Peer-Reviewed Versus Peer-Reviewed PRP-QUANT Preregistrations
Note: Twenty-four questions were used to code adherence for 29 RDF (i.e., there were some dependencies in that the same questions informed multiple RDF). Duplicate answers were excluded from analyses. The table shows the percentage (frequency) of different deviation types. Modifying = deviation occurred between preregistration and article (adherence = 0); additive = RDF was not restricted in the preregistration, but related information was described in the article (adherence = UP); omitting = RDF was restricted in the preregistration but not mentioned in the article (adherence = UA); unable to determine = no information in either the preregistration or the article (adherence = UB); NA = not applicable; RDF = researcher degrees of freedom; PRP-QUANT = Psychological Research Preregistration-Quantitative Template.
Discussion
In our study, we examined the extent to which preregistrations based on the extensive PRP-QUANT Template (Bosnjak et al., 2022) restrict RDF (Research Question 1). We compared these preregistrations with those using the earlier OSF Template (Research Question 2) and investigated whether restrictiveness could be further enhanced through peer review (Research Question 3). In addition, we evaluated the degree to which researchers adhered to their PRP-QUANT preregistrations in the related articles (Research Question 4).
Higher restrictiveness in PRP-QUANT and peer-reviewed preregistrations
Our results show that around a third of the RDF were fully restricted in PRP-QUANT preregistrations and that around half remained only partially restricted or unrestricted. Furthermore, even though T1 and T2 (hypothesis and direction of hypothesis) achieved high scores based on the coding scheme, we still found that hypotheses were not specified clearly in 76.70% of preregistrations. The reason for this discrepancy is that the coding scheme awarded high values if the IV and DV were defined clearly within single hypotheses, whereas our deeper investigation focused on the whole set of hypotheses, that is, on whether the minimum and maximum number of hypotheses might vary when viewed as single versus interconnected but independent predictions. For the latter, a high degree of ambiguity was found, meaning that readers might evaluate the same statements as fewer or more hypotheses based on their subjective perception. This aligns with earlier findings (Bakker et al., 2020; Heirene et al., 2024) and suggests that there is still room for improvement because flexibility persisted in these preregistrations, both for the RDF in general and for the hypotheses in particular.
However, compared with the earlier OSF preregistrations, 18 of the 23 tested RDF were more restricted in PRP-QUANT preregistrations (17 significantly so), which also resulted in an overall higher restrictiveness in the latter. Our effect size of
A higher restrictiveness was also found for 22 of the 23 tested RDF in peer-reviewed versus non-peer-reviewed preregistrations (of these, 14 comparisons were significant after correction). This suggests that peer review is indeed a valuable tool for enhancing the quality of preregistrations, a potential that is currently underused.
High occurrence of deviations and need for more transparent reporting
Only two of the 19 inspected research articles adhered completely to their preregistration, providing further evidence that deviations from preregistrations are common. Importantly, 13 articles contained undeclared deviations, which accounted for around two-thirds of all modifying deviations. Together with the facts that researchers report continued uncertainty about how to handle deviations and that readers of psychology articles do not typically inspect the preregistrations (Spitzer & Mueller, 2023), this highlights the need for a more transparent and potentially standardized handling of deviations. Fortunately, this issue has been recognized by the psychological-research community, and there have been first attempts to address it in the research literature. For example, Lakens (2024b) described in which cases it makes sense to deviate from the preregistered plan, and there are now also templates for reporting deviations (Spitzer & Mueller, 2024b; Willroth & Atherton, 2024).
In addition, we observed a high occurrence of both additive and omitting deviations. Additive deviations suggest that the preregistrations were either lacking in detail or incomplete, and omitting deviations may indicate outcome-reporting bias. Alternatively, they may reflect a shift in practice in which authors no longer provide a comprehensive description of all methods in the article but refer to the preregistration. This shows the importance of the preregistrations for fully understanding the evolution of the research, further underscoring the need for transparent reporting strategies.
Limitations
Our study has several limitations that need to be considered when interpreting these results. First, it is important to recognize that our coding does not provide a definitive assessment of restrictiveness but rather an approximation, because a definitive assessment would require a complete understanding of the garden of forking paths (i.e., of all possible decisions that could be made while conducting the research; see Gelman & Loken, 2013). In addition, we found the coding scheme to be overly strict in some cases, leading to lower scores than we deemed appropriate for some RDF (e.g., D2, "additional IVs," which explicitly asked whether preregistration authors indicated that no further covariates would be used and could be coded only as either 0 = no or 3 = yes).
Despite explicit decision rules, some ambiguity remained in the coding process, leading to low interrater reliability, particularly in the adherence assessment. To mitigate this, both coders discussed and resolved all discrepancies, ensuring that the final scores were more accurate than the initial individual ratings. Overall, although the coding scheme posed some challenges, we believe that it still provides a useful basis for comparing the two preregistration samples because both were coded using the same criteria. Nevertheless, it might be useful to revise the coding schemes further in the future. Additional RDF could then also be considered, such as how hypotheses are linked to theories and what conclusions are drawn from each statistical test (as suggested by Reviewer 2).
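One common way to quantify the interrater reliability discussed above is Cohen's kappa, which corrects raw percentage agreement for agreement expected by chance. The sketch below is purely illustrative: the two coders' ratings are made up, and kappa is one standard choice of statistic, not necessarily the one used in this study.

```python
# Illustrative only: hypothetical ratings from two coders on a 0-3 scale;
# Cohen's kappa corrects observed agreement for chance agreement.
from collections import Counter

coder1 = [0, 1, 2, 2, 3, 1, 0, 2, 3, 3]  # hypothetical restrictiveness codes
coder2 = [0, 1, 2, 1, 3, 1, 1, 2, 3, 2]

n = len(coder1)
# Observed agreement: share of items both coders scored identically.
observed = sum(a == b for a, b in zip(coder1, coder2)) / n
# Chance agreement: expected overlap given each coder's marginal distribution.
c1, c2 = Counter(coder1), Counter(coder2)
expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / n ** 2
kappa = (observed - expected) / (1 - expected)
print(f"agreement = {observed:.2f}, kappa = {kappa:.2f}")  # → agreement = 0.70, kappa = 0.60
```

Here the raw agreement of .70 shrinks to a kappa of .60 once chance agreement is taken into account, which is why chance-corrected statistics give a more conservative picture of coder consistency.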
We were also not able to blind coders to the identity of the templates because we coded only PRP-QUANT preregistrations and compared them with an existing sample of OSF preregistrations. This introduces the possibility of bias during coding, which we sought to minimize by employing a detailed and structured coding scheme adapted from earlier research (Heirene et al., 2024).
A further limitation concerns the procedure used to impute the NA values for the hypothesis tests, which favored groups with a higher proportion of NA values. If, for example, both compared groups (e.g., PRP-QUANT and OSF preregistrations) contained the same number of scores of 2, but one group additionally contained some scores of 1 while the other contained more NA values instead, the imputation procedure would favor the second group: Its imputed values would be formed based on its higher observed scores and would therefore be higher. Although this should be considered when interpreting the results (especially for RDF with an overall high number of NA values), it is also important to note that these values indicate that the authors of the coded preregistration had specified that an RDF was not relevant to them (e.g., in cases such as blinding). This favoring might therefore make sense in that NA values (i.e., deliberate indications that something is not relevant) are preferable to lower restrictiveness values.
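The favoring described above can be made visible with a minimal sketch. It assumes a simple within-group mean imputation; both the scores and the exact imputation rule are hypothetical and chosen only to illustrate the mechanism, not to reproduce the study's actual procedure.

```python
# Hypothetical sketch: within-group mean imputation (an assumed rule, not
# necessarily the study's exact procedure) favors the group whose missing
# values sit alongside higher observed scores.
def impute_with_group_mean(scores):
    """Replace None (NA) values with the mean of the group's observed scores."""
    observed = [s for s in scores if s is not None]
    mean = sum(observed) / len(observed)
    return [mean if s is None else s for s in scores]

group_a = [2, 2, 2, 1, 1]        # same number of 2s, plus some additional 1s
group_b = [2, 2, 2, None, None]  # same number of 2s, plus NA values instead

a = impute_with_group_mean(group_a)  # unchanged: no NAs to fill
b = impute_with_group_mean(group_b)  # NAs become 2.0, the mean of the observed 2s
print(sum(a) / len(a), sum(b) / len(b))  # → 1.6 2.0 (group B ends up higher)
```

Both groups share the same scores of 2, but group B's missing values are filled from its higher observed mean, so its overall restrictiveness score ends up higher than group A's.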
We cannot rule out the potential influence of confounding variables in our study. Foremost, the PRP-QUANT Template was introduced at the beginning of 2021, meaning that the PRP-QUANT preregistrations in our sample are more recent than the OSF preregistrations used for our comparison, which were published in 2016. In addition, our PRP-QUANT sample consisted partly of peer-reviewed preregistrations submitted in response to a call for free-of-charge data collection. It is conceivable that researchers put more effort into such preregistrations. However, the inspected OSF preregistrations were also part of a call, namely, the “Preregistration Challenge” organized by the Center for Open Science. Here, researchers also applied for funding, and the preregistrations were reviewed for completeness (but not quality; see Center for Open Science, n.d.). Both samples therefore appear to be comparable, although confounding influences cannot be ruled out.
Furthermore, for the deviation analyses, we note that we were able to identify articles for only a fraction of the preregistration sample, that is, only for the older ones from 2021 to 2022. No articles were identified for newer preregistrations, probably because they had not yet been published. This may have had an impact on the rate of identified deviations.
Finally, in the Stage 1 RR, we specified that we would consider all existing PRP-QUANT preregistrations published in the digital research repository PsychArchives by searching for the corresponding metadata tag. This was based on the assumption that the preregistrations in the archive were tagged accordingly. However, it turned out that 41 eligible preregistrations were missing the tag and were therefore erroneously excluded from our data set. Although we do not assume that the unidentified preregistrations differ systematically from the sampled ones, this might still be another confounding factor.
Future developments
Continuous evaluation of open-science practices, such as preregistration, is essential to ensure they achieve their intended goals. Future research in this area could inspect preregistrations based on other templates or compare them across different research areas (see Heirene et al., 2024). It might also be interesting to compare a current sample of OSF preregistrations with our PRP-QUANT sample to rule out the potential confounding influence of time. In addition, preregistration templates could also be evaluated directly, for example, regarding their usability (similar to our approach in Spitzer et al., 2024).
Another aspect we did not address in our study but that could be of interest would be a closer examination of the deviations from preregistrations to final articles. Specifically, we assessed whether modifying deviations were disclosed and justified but not whether they constituted an improvement in methodology. It could be interesting to explore why researchers choose to deviate and whether such changes ultimately enhance the quality of the study.
Meta-analytical investigations of preregistrations, especially comparisons between preregistrations and associated articles, could be facilitated by publishing preregistrations in machine-readable form (see Lakens & DeBruine, 2021). In addition, this could help ensure that preregistrations are published more in accordance with the FAIR (i.e., findable, accessible, interoperable, and reusable) principles (Wilkinson et al., 2016).
Conclusion
In our study, PRP-QUANT preregistrations were associated with greater RDF restriction than OSF preregistrations, suggesting that developing and using highly structured, detailed templates may effectively help reduce unwanted flexibility in preregistrations. Furthermore, restrictiveness was greater in peer-reviewed than non-peer-reviewed preregistrations, highlighting the potential benefit of peer review in this context. Meanwhile, deviations from preregistered plans—both declared and undeclared—were common in the inspected articles, emphasizing the persisting lack of transparent reporting.
Appendix
Study Design, Based on the Template Provided by Peer Community in Registered Reports
| Question | Hypothesis | Sampling plan | Analysis plan | Rationale for deciding the sensitivity of the hypothesis test | Interpretation given different outcomes | Theory that could be shown wrong by the outcomes |
|---|---|---|---|---|---|---|
| Research Question 1: To what extent does the PRP-QUANT Template restrict RDF, and which RDF are more restricted than others? | None | We sampled all PRP-QUANT preregistrations published on PsychArchives that contained the corresponding metadata tag. We included all preregistrations that met our inclusion criteria (i.e., preregistrations that were based on the PRP-QUANT Template, were written in English or German, were publicly accessible, were empirical studies, and included at least one testable hypothesis). An initial search identified 74, to which all other preregistrations published up to the start of coding were added (final sample: …). | The distribution of restrictiveness scores of PRP-QUANT preregistrations across all RDF was inspected. In addition, stacked bar plots of restrictiveness scores for each RDF are displayed for PRP-QUANT and OSF preregistrations and for peer-reviewed and non-peer-reviewed PRP-QUANT preregistrations. We also examined the number of preregistrations in which the minimum and maximum number of hypotheses varied when viewed as single versus interconnected but independent predictions, providing means, standard deviations, medians, and minimum and maximum values for both interpretations. | Descriptive analyses of the PRP-QUANT preregistrations’ restrictiveness scores were used to answer this research question. No hypothesis tests were conducted. | The results are reported descriptively. | N/A |
| Research Question 2: Are RDF more restricted in preregistrations created with the PRP-QUANT Template compared with the OSF Preregistration Template studied by Bakker et al. (2020)? | Hypothesis 1 (primary): Preregistrations created with the PRP-QUANT Template restrict RDF more (i.e., have higher restrictiveness scores) than preregistrations based on the format inspected by Bakker et al. (2020; i.e., the OSF Preregistration Template). | All included PRP-QUANT preregistrations ( … Stage 1 RR indicated that with the preliminary sample sizes (PRP-QUANT preregistrations: … | We conducted a nested one-tailed Wilcoxon-Mann-Whitney test to compare restrictiveness scores between PRP-QUANT and OSF preregistrations using the R package … model, template was treated as a fixed effect, and RDF was treated as a random effect. First, group-specific … of Cohen’s … | Bakker et al. (2020) determined their sample size of 53 by conducting a power analysis for a Wilcoxon-Mann-Whitney test with α = .05 and a power of .8 to detect a medium effect size … | We preregistered the following interpretation in Stage 1: If the preregistrations created with the PRP-QUANT format restrict RDF more (i.e., have an overall higher restrictiveness score) compared with the OSF preregistrations sampled by Bakker et al. (2020; support for Hypothesis 1), it will be concluded that the PRP-QUANT format is indeed more effective in reducing RDF than the previous format in the field of psychology. It therefore appears worthwhile to develop/use highly structured templates in the future. However, if contrary to our predictions, the PRP-QUANT preregistrations do not have significantly higher restrictiveness scores than the OSF ones, we will conclude that there is no evidence that the PRP-QUANT Template achieves a higher level of restrictiveness. We will also further examine for how many of the individual RDF restrictiveness is higher in PRP-QUANT than OSF preregistrations and will conclude that the benefit of the PRP-QUANT Template might be most pronounced for all RDF showing significant differences. | This test was not grounded in a clear-cut theory but was based on the assumption that employing more structured templates is linked to higher restrictiveness, as initially described by Bakker et al. (2020). Our objective was to examine whether a template even more structured and detailed than the one previously studied by Bakker et al. (2020) can even better restrict RDF. |
| Research Question 3: Can peer review of preregistrations help to restrict RDF? | Hypothesis 2 (secondary): Peer-reviewed preregistrations created with the PRP-QUANT Template restrict RDF more (i.e., have higher restrictiveness scores) than non-peer-reviewed preregistrations created with the same format. | All PRP-QUANT preregistrations that were reviewed were compared with the remaining non-peer-reviewed PRP-QUANT preregistrations. A sensitivity analysis conducted for the Stage 1 RR showed that with the preliminary group sizes of 27 reviewed and 47 nonreviewed preregistrations, we would have had a power of .89 to detect small effects of … | Similar to the analysis of Hypothesis 1, we conducted a one-tailed nested Wilcoxon-Mann-Whitney test to compare the restrictiveness scores between peer-reviewed versus non-peer-reviewed PRP-QUANT preregistrations (procedure is detailed above). Review status was treated as a fixed effect, and RDF was treated as a random effect. To determine significance, a criterion of α = .05 was applied. In addition, we conducted 23 more Wilcoxon-Mann-Whitney tests to compare the restrictiveness scores for the individual RDF. For these follow-up tests, … | For this comparison, the group sizes were limited by the number of available (non-)peer-reviewed preregistrations. However, our sensitivity analysis in the Stage 1 RR indicated that we still had a high power to detect even small effects (e.g., a power of .89 to detect effects of … | We preregistered the following interpretation in Stage 1: If our analysis reveals that peer-reviewed preregistrations exhibit a higher level of restrictiveness (i.e., have an overall higher restrictiveness score) compared with non-peer-reviewed preregistrations (supporting Hypothesis 2), we will conclude that peer review is indeed a valuable tool for enhancing the quality of preregistrations, a potential that is currently underused. If we find no significant difference in the overall restrictiveness between peer-reviewed and non-peer-reviewed preregistrations, we will conclude that there is insufficient evidence to support the necessity of peer review for achieving high restrictiveness. As for Hypothesis 1, we will also inspect for how many of the individual RDF restrictiveness is higher in peer-reviewed than non-peer-reviewed preregistrations. Based on these analyses, we will conclude that the benefit of peer review for increasing restrictiveness might be most evident for RDF exhibiting significant differences. | This test was also not based on a formulated theory but rather on the observation made by Bakker et al. (2020) that peer review could potentially have a positive effect on the restrictiveness of preregistrations. |
| Research Question 4: To what degree do researchers who used the PRP-QUANT Template adhere to their preregistered plan, what deviations occur, and how are these reported? | None | We searched for associated publications for all included preregistrations by examining the PsychArchives record of each preregistration and searching for the preregistration DOI on the internet ( … | Researchers’ adherence to their preregistered plans and reporting of deviations were analyzed descriptively. We focused on two aspects: the number of preregistration-article pairs with deviations and the total deviations across all pairs. At the level of preregistration-article pairs, we analyzed the number of studies that included modifying, additive, or omitting deviations. We provide the average number of deviations and their corresponding standard deviations and minimum and maximum values. At the deviations level, we calculated percentages and frequencies of different types of deviations for each RDF and overall across all preregistration-article pairs, presenting the results in a table. For modifying deviations, we also assessed the proportion of justified, unjustified, and nondisclosed deviations. | Descriptive analyses of the PRP-QUANT preregistrations’ adherence and deviation type scores were used to answer this research question. No hypothesis tests were conducted. | The results are reported descriptively. | N/A |
Note: PRP-QUANT = Psychological Research Preregistration-Quantitative Template; RDF = researcher degrees of freedom; RR = registered report.
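As a rough illustration of the Wilcoxon-Mann-Whitney comparisons described in the analysis plans above: the sketch below uses invented restrictiveness scores and a plain (non-nested) one-tailed test in Python, whereas the reported analyses were run in R and additionally nested RDF within preregistrations.

```python
# Rough illustration only: invented per-RDF restrictiveness scores (0-3) and
# a plain one-tailed Wilcoxon-Mann-Whitney test; the study's actual analyses
# were nested and conducted in R.
from scipy.stats import mannwhitneyu

prp_quant_scores = [3, 2, 3, 1, 2, 3, 2, 3]  # hypothetical scores
osf_scores = [1, 2, 0, 1, 2, 1, 0, 2]

# One-sided test of whether PRP-QUANT scores are stochastically greater.
stat, p = mannwhitneyu(prp_quant_scores, osf_scores, alternative="greater")
print(f"U = {stat}, p = {p:.4f}")
```

With these invented data, the one-sided test rejects at α = .05, mirroring the direction of the hypothesis tests in the table (higher restrictiveness for PRP-QUANT preregistrations).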
Acknowledgements
The grammar of individual text sections was improved with the help of artificial intelligence, but no text sections were generated by artificial intelligence. Registered reports (RRs) involving existing data at Peer Community in Registered Reports: For our study, we compared a new data set coded from PRP-QUANT preregistrations with existing data from Bakker et al. (2020). We assume a bias level of 3. For our Stage 1 RR, we had already downloaded the data from Bakker et al.; however, we did not look at them and instead blinded these data sets in order to write and test our analysis scripts (the script used for blinding is available in the supplemental material of the Stage 1 RR; Spitzer & Mueller, 2024a). In addition, we had already downloaded the PRP-QUANT preregistrations that existed to date for the Stage 1 RR submission but did not begin coding until receiving in-principle acceptance. For additional supporting information, see Spitzer and Mueller (2024c).
Transparency
