Abstract
The Transparency and Openness Promotion (TOP) Guidelines describe modular standards that journals can adopt to promote open science. The TOP Factor quantifies the extent to which journals adopt TOP in their policies, but there is no validated instrument to assess TOP implementation. Moreover, raters might assess the same policies differently. Instruments with objective questions are needed to assess TOP implementation reliably. In this study, we examined the interrater reliability and agreement of three new instruments for assessing TOP implementation in journal policies (instructions to authors), procedures (manuscript-submission systems), and practices (journal articles). Independent raters used these instruments to assess 339 journals from the behavioral, social, and health sciences. We calculated interrater agreement (IRA) and interrater reliability (IRR) for each of 10 TOP standards and for each question in our instruments (13 policy questions, 26 procedure questions, 14 practice questions). IRA was high for each standard in TOP; however, IRA might have been high by chance because most standards were not implemented by most journals. No standard had “excellent” IRR. Three standards had “good,” one had “moderate,” and six had “poor” IRR. Likewise, IRA was high for most instrument questions, and IRR was moderate or worse for 62%, 54%, and 43% of policy, procedure, and practice questions, respectively. Although results might be explained by limitations in our process, instruments, and team, we are unaware of better methods for assessing TOP implementation. Clarifying distinctions among different levels of implementation for each TOP standard might improve its implementation and assessment (study protocol: https://doi.org/10.1186/s41073-021-00112-8).
Because the decision to publish research findings is often related to their magnitude and statistical significance, an undesirable proportion of published findings are likely to be false. For example, the ability to replicate results is sometimes used as a proxy for whether they are true (Goodman et al., 2016), and scientists have been concerned for decades about the lack of replication in the literature (Greenwald, 1976; Nosek et al., 2022; Open Science Collaboration, 2015; Rosenthal, 1979; Sterling, 1959). There is increasing recognition that transparent and open research practices can increase trust in empirical science and increase the likelihood that published results are true (Miguel et al., 2014). Scientists endorse positive norms related to transparency and openness, and social changes may be contributing to greater uptake of technologies and rules that promote better scientific practices (Agnoli et al., 2021; Christensen et al., 2020; Lindsay, 2017; Spellman, 2015; Tenopir et al., 2015).
Academic publishers and journal editors can facilitate greater transparency and openness in the published literature by implementing policies in their instructions to authors that promote open-science standards (Mayo-Wilson et al., 2021). In 2015, scientists representing multiple disciplines developed the Transparency and Openness Promotion (TOP) Guidelines, which provide standards on open-science policies for scientific journals (Nosek et al., 2015). TOP includes eight modular standards on transparency (design and analysis reporting guidelines), reproducibility (data, code, and materials sharing), prospective registration (study and analysis plan preregistration), and rewarding researchers for engaging in open science (conducting replications and citing data, code, and materials). The TOP Factor—a measure of journal implementation of TOP—includes two additional standards related to (a) publication bias and (b) “open science badges” that acknowledge open research practices.
Journal policies might operationalize TOP standards at “levels” that differ in the stringency of requirements for journals, peer reviewers, and authors (Nosek et al., 2015). Level 1 policies promote open research practices, typically by requiring authors to disclose whether they used an open research practice. Level 2 policies involve stronger expectations for authors without added requirements for editors and reviewers, typically by requiring that authors use open research practices. Level 3 policies require that journals also invest resources to verify that authors used open research practices. Journal policies that do not implement TOP are assigned Level 0. Using data sharing as an example, journals could require authors to disclose whether data are publicly accessible (Level 1), require authors to archive data in trusted repositories (Level 2), verify that reported analyses can be reproduced independently using the publicly accessible data (Level 3), or say nothing about data sharing (Level 0). The TOP Factor is calculated as the sum of the levels of implementation across all 10 standards.
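To make the scoring concrete, here is a minimal R sketch of how a journal's TOP Factor is computed; the standard names are abbreviations of ours, and the level values are hypothetical:

```r
# Hypothetical levels of implementation (0-3) for the 10 TOP Factor standards
standard_levels <- c(
  data_citation            = 1,
  data_transparency        = 2,
  code_transparency        = 0,
  materials_transparency   = 1,
  reporting_guidelines     = 2,
  study_preregistration    = 1,
  analysis_preregistration = 0,
  replication              = 0,
  publication_bias         = 1,
  open_science_badges      = 0
)

# The TOP Factor is the sum of the levels across all 10 standards (range: 0-30)
top_factor <- sum(standard_levels)
top_factor  # 8 for this hypothetical journal
```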
Journal procedures and practices should be aligned with journal policies. Here, we define "procedures" to include the manuscript-submission systems used to submit articles and related information about those articles and their authors. We define journal "practices" as information reported in journal articles (Mayo-Wilson et al., 2021). Journal procedures that complement stringent policies could enable uniform policy adherence and facilitate systematic monitoring of standards related to transparency and openness (Aalbersberg et al., 2018). For example, journals could implement procedures that require authors to complete certain tasks during manuscript submission, such as entering structured data elements into submission systems (e.g., entering a URL in a field for the location of a data set). In addition, journals might create templates for empirical articles that facilitate disclosing information in journal articles. For example, for all manuscripts that are potentially eligible for open-science badges, Advances in Methods and Practices in Psychological Science requires a dedicated "Disclosures" section in the main text on preregistration, transparent reporting, and data, code, and materials sharing. By contrast, stringent policies that are not complemented by equally stringent procedures and practices might be less effective and more difficult to monitor. For example, many journal policies require that authors register clinical trials prospectively, yet a study of high-impact clinical-psychology journals found that two of the three journals that required trial registration published unregistered trials (Cybulski et al., 2016).
Reliable assessment of journal implementation is essential to TOP's goals of increasing transparency and openness. Designed as a complement to the journal impact factor, the TOP Factor is a quantifiable metric for journal quality that focuses on the degree to which journals promote scholarly norms of transparency and openness. Yet because TOP involves a complex scoring system, raters might differ in their assessments of the same journals. Without standardized instruments, guidance, and training, the TOP Factor has questionable credibility and utility. For example, groups that have assessed the extent to which journals implement TOP have identified difficulties in understanding the distinctions among different levels of implementation and in determining how the language used by journals corresponds to TOP (Cashin et al., 2021; Hansford et al., 2021; Spitschan et al., 2020). Crowdsourcing efforts to rate journals have used bespoke methods or subjective rater judgments that are not methodologically reproducible. Although the interrater reliability (IRR) of TOP ratings is unknown, anecdotal evidence suggests that differences in the interpretation and rating of journal policies are common. Given the growing use of TOP as a framework to change journal behaviors, reliable instruments with objective and clear questions are needed.
Objectives
In this study, we systematically assessed the interrater agreement (IRA) and the IRR of three instruments for assessing implementation of the TOP standards in journal policies (instructions to authors), procedures (manuscript-submission systems), and practices (journal articles). Two related articles reported results from other parts of the overall study. One describes the level of TOP uptake at the journals included in our study (Grant et al., 2022). The other describes a survey of journal editors that aimed to identify facilitators and barriers to uptake (Naaman et al., 2023).
Disclosures
We published the protocol for this study in a peer-reviewed journal (Mayo-Wilson et al., 2021). Readers can access the code and documentation (Kianersi et al., 2022b), materials (Kianersi et al., 2022a), deidentified data (Mayo-Wilson, Grant, Kianersi, & Naaman, 2022; Naaman et al., 2023), and other resources at https://osf.io/txyr3/. In the Supplemental Material available online, we summarized descriptive findings on journal publishers and submission systems, our methods for interpreting the magnitude of IRR estimates, comparisons of our journal ratings with those posted on the Center for Open Science website, and our journal-rating timelines. We reported how we determined our sample size, all data exclusions, all manipulations, and all measures in the study. This study was reviewed by the Institutional Review Board (IRB) at Indiana University and determined to be exempt human-subjects research (IRB No. 10201).
Method
Modeling best practices in systematic reviewing (Higgins et al., 2019), we developed a structured process and instruments for rating TOP implementation in journal policies, procedures, and practices (Mayo-Wilson et al., 2021). This report follows the Guidelines for Reporting Reliability and Agreement Studies (GRRAS; Kottner et al., 2011).
Eligible journals
We included journals that have published empirical evaluations on the effectiveness of social and psychological interventions. To identify eligible journals, we first searched for federal evidence clearinghouses in a previous study (Mayo-Wilson, Grant, & Supplee, 2022). Federal clearinghouses rate the quality of published empirical evaluations on intervention effects to distinguish and disseminate information about "evidence-based" interventions for public policy and local decision-making. In the current study, we included all journals that published at least one report of an evaluation used by a federal clearinghouse to support the highest rating possible for an intervention (i.e., a "top tier" evidence designation). We did not restrict reports by date when identifying eligible journals. We included journals that have changed publisher or changed name since publishing an eligible report. We excluded journals that have ceased operation entirely.
We initially identified eight clearinghouses, from which we identified the 339 eligible journals in our sample (Grant et al., 2022; Mayo-Wilson, Grant, & Supplee, 2022). Two clearinghouses (Pathways to Work and Prevention Services Clearinghouse) became active during our project after we had generated the list of eligible journals (Mayo-Wilson, Grant, & Supplee, 2022), and we did not search the new clearinghouses for additional journals.
TRUST policy, procedure, and practice rating instruments
For each rating instrument, the principal investigators (PIs; S. P. Grant and E. Mayo-Wilson) developed concise questions with detailed instructions for each TOP standard (Center for Open Science, 2014; Kepes et al., 2020; Nosek et al., 2015). To facilitate reliability, these questions were intended to be objective, single-barreled (i.e., asking about only one aspect of a standard), and consistent in structure, and to include "yes-or-no" responses only (Polanin et al., 2019). Each "yes" response indicated that a policy, procedure, or practice implemented a particular aspect of a TOP standard.
The policy rating instrument included 41 yes-no questions and two multiple-choice questions. The procedure rating instrument included 60 yes-no questions and no multiple-choice questions. Both instruments were divided into 10 sections (with two to eight questions in each section). Each section evaluated one TOP standard and concluded with an open text-box field in which raters were instructed to copy and paste relevant text from eligible journal documents. The practice rating instrument included 26 yes-no questions and one multiple-choice question, with seven open text-box fields. Although instructions assumed some knowledge of quantitative research methods and publication processes, they also included examples of common scenarios to facilitate reliable ratings.
We programmed the instruments for journal policies and procedures into Research Electronic Data Capture (REDCap; Harris et al., 2009, 2019). We programmed the instrument for journal practices in EPPI-Reviewer (Thomas et al., 2020). To promote efficiency and to ensure consistency of the data, the instruments were completed online using display logic. That is, some questions were always displayed, and other questions were displayed conditionally on each rater's previous responses. Our rating instruments to assess journal policies (TRUST Team, 2019a), procedures (TRUST Team, 2019b), and practices (TRUST Team, 2021) are freely available on OSF (https://osf.io/txyr3/).
To assess IRR and IRA, we analyzed the questions in each rating instrument that were answered for all journals by all research assistants (RAs). These included 13 policy questions, 26 procedure questions, and 14 practice questions. Because of display logic, we excluded questions that appeared only when triggered by responses to previous questions.
Most policy, procedure, and practice questions (50 of 53) included in the analyses of the current study were dichotomous. One policy question had four options that could each be either "checked" or "unchecked." Two questions about practices had three possible options: "no"; "yes, study was not registered"; and "yes, study was registered." We combined the two "yes" responses for analysis.
Statistical power
After identifying 339 eligible journals, we calculated the lowest expected value for κ (lower-bound estimate for a 95% confidence interval [CI]) under a range of scenarios for the fixed sample size (Table 1). We calculated precision in RStudio (RStudio Team, 2021) using the kappaSize package (Version 1.2; Rotondi, 2018). Depending on the prevalence of yes responses, our sample size was large enough to capture κ point estimates as small as 0.2 and lower-bound estimates as small as 0.04.
Table 1. Power Analysis
Note: Values in the table show lower-bound estimates for a 95% confidence interval that can be detected for a fixed sample size of 339 journals by different κ point estimates, number of raters, and prevalence of yes responses (yes responses indicate the presence of policies, procedures, and practices that support transparency and openness).
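As an illustration of this precision calculation, the following is a minimal sketch using the kappaSize package (Version 1.2) named above. The scenario values are hypothetical, and note that kappaSize computes a required sample size rather than a lower bound directly:

```r
# Minimal sketch using the kappaSize package (Rotondi, 2018).
# CIBinary() returns the number of subjects needed for the lower bound of a
# 95% CI around an anticipated kappa (kappa0) to equal kappaL.
library(kappaSize)

# Hypothetical scenario: anticipated kappa = 0.60, lower bound = 0.45,
# 20% prevalence of yes responses, two raters, alpha = .05
CIBinary(kappa0 = 0.60, kappaL = 0.45, props = 0.20, raters = 2, alpha = 0.05)

# With the sample size fixed at 339 journals, one can instead vary kappaL
# until the required sample size reaches 339; that value of kappaL is the
# detectable lower-bound estimate of the kind reported in Table 1.
```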
Raters
Study PIs sent targeted emails to graduate students and their supervisors to recruit paid RAs to rate journal policies, procedures, and practices. The PIs trained the recruited RAs by introducing the project aims, answering general questions about transparency and openness, and discussing each question on the instruments. We anticipated that journal policies would be more subjective and more difficult to rate than journal procedures and practices. For this reason, we assigned two RAs to evaluate each procedure and practice, and we assigned three RAs to rate each journal policy.
We pilot tested preliminary versions of the instruments with RAs by assessing a small number of journals. During pilot testing, we held weekly group meetings to solicit feedback and to assess and promote agreement among RAs. Questions in the instruments with high disagreement were identified for further discussion and potential revision. When we identified questions with disagreements attributable to wording, structure, and/or instructions, we revised the questions to improve clarity and to promote future agreement. In the current study, we report the IRR and IRA of the revised versions of the TRUST policy, procedure, and practice rating instruments.
Identifying policy, procedure, and practice documents
To assess policies, two independent RAs identified eligible policy documents for each journal, specifically “instructions to authors” and related explanatory documents concerning manuscript submissions. RAs independently searched each journal’s website, downloaded a PDF version of each document, dated each document, and stored documents in a folder on Google Drive. RAs then met and reconciled any discrepancies in the eligible documents. Unresolved discrepancies were discussed and resolved during weekly meetings with the PIs.
To assess procedures, RAs located each journal’s online submission system and, when possible, initiated manuscript submissions. RAs simulated manuscript submission steps, took screenshots of each step, and saved those screenshots on Google Drive for rating. Because manuscript-submission systems might ask questions related to transparency and openness that depend on answers to previous questions (display logic), the RAs answered questions such that all relevant questions and fields would appear. A few journals required manuscript submissions by email. For these journals, RAs downloaded the journals’ submission instructions as PDF files. In weekly group meetings, RAs discussed issues about procedure identification with the PIs.
To evaluate practices, we used methods similar to article-identification and data-extraction procedures used in systematic reviews. We included journal articles reporting quantitative evaluations of interventions that were intended to modify processes and systems that are social and behavioral in nature and that are hypothesized to improve health or social outcomes (Grant et al., 2018). Because we aimed to describe journal practices (i.e., the proportion of journals that were transparent and open) rather than author practices (e.g., the proportion of articles in each journal that were transparent and open), we did not sample multiple articles from each journal. Instead, two RAs independently hand-searched each eligible journal by screening titles and abstracts. When they identified potentially eligible articles, they entered citation information (i.e., volume number, issue number, first page number, and DOI) using a REDCap form. RAs could identify more than one article for each journal (Mdn = 7). Articles identified by RAs were retrieved for full-text review by one of the PIs, who reviewed the articles in reverse chronological order (i.e., starting with the most recent). Because we anticipated that transparent and open journals might use templates or other procedures to achieve consistent practices, a PI identified one eligible article per journal for rating. Questions about inclusion were resolved through discussion with the other PI.
Rating setting
All ratings were conducted independently online, and RAs were aware that ratings would be compared. RAs were not masked to journal names. The RAs met with the PIs weekly. In these meetings, PIs answered RAs’ questions about the instruments and any problems that they faced during the rating of a specific journal question. After the start of the rating process, we updated the policy rating instrument and added six questions that we also completed for journals rated previously. Four of the added questions asked whether a policy applied to all studies or only certain kinds of studies, and two questions concerned items included in the TOP Factor that were not part of the TOP Guidelines. We initially conducted the weekly meetings in person, and we moved to online Zoom meetings in 2020 because of the COVID-19 pandemic.
Reconciliation
After RAs completed their ratings, one RA identified disagreements in the data set. The PIs reviewed all disagreements and selected the reconciled ratings to be used for analysis.
Scoring policies
To determine the TOP levels for all standards in a journal policy, we used algorithms based on a published rubric for the TOP Factor (Center for Open Science, 2016a). The algorithms are included in our protocol (Mayo-Wilson et al., 2021) and are freely available online (TRUST Team, 2019a). We did not calculate levels for journal procedures and practices because levels of implementation apply to journal policies only.
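The published algorithms are more detailed, but a simplified R sketch of the level-assignment logic for one standard (data transparency, following the example in the introduction; the function and argument names are ours) might look like this:

```r
# Simplified sketch of level assignment for the data-transparency standard.
# Argument names are hypothetical; the full algorithms are in the protocol
# (Mayo-Wilson et al., 2021) and on OSF (TRUST Team, 2019a).
score_data_transparency <- function(requires_disclosure,       # authors must disclose whether data are accessible
                                    requires_archiving,        # authors must archive data in a trusted repository
                                    verifies_reproducibility)  # journal verifies that analyses can be reproduced
{
  if (verifies_reproducibility) return(3L)  # Level 3
  if (requires_archiving)       return(2L)  # Level 2
  if (requires_disclosure)      return(1L)  # Level 1
  0L                                        # Level 0: policy says nothing about data sharing
}

score_data_transparency(TRUE, FALSE, FALSE)  # Level 1
```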
Consistency check for journal procedures
In the reconciled procedure-ratings data set, we checked whether journals with the same publisher and submission system had the same procedure ratings. For instance, we expected that journals published by Wiley and accepting manuscripts through the ScholarOne submission system would use similar procedures and thus have similar ratings. To examine the reliability of our ratings and to improve the quality of our data, one author (S. Kianersi) used Python to identify journals with the same publisher and submission system that were rated differently. A second author (K. Naaman) reviewed inconsistent ratings, verified whether the procedures were actually the same or different, and recommended changes to improve consistency and accuracy. One PI (S. P. Grant) then reviewed and finalized the ratings used for analysis.
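That step was implemented in Python; an equivalent minimal sketch in R, assuming a data frame `ratings` with one row per journal and hypothetical column names, is:

```r
# Sketch of the consistency check. `ratings` is assumed to have columns
# `publisher`, `system`, and reconciled yes/no ratings q1, q2, ... per journal.
library(dplyr)

inconsistent_pairs <- ratings |>
  group_by(publisher, system) |>
  # count distinct ratings per question within each publisher-submission pair
  summarise(across(starts_with("q"), n_distinct), .groups = "drop") |>
  # flag pairs in which journals did not all receive the same rating
  filter(if_any(starts_with("q"), ~ .x > 1))
```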
Comparison with external data sources
To further assess the reliability of our reconciled ratings, we compared the TOP Factor scores we calculated with TOP Factor scores published elsewhere. We found no overlapping journals with the results from three previous reports (Cashin et al., 2021; Hansford et al., 2021; Spitschan et al., 2020). We identified some overlapping journals rated on OSF by April 8, 2022. OSF ratings were completed at different times (which were not recorded in the data set) by staff at the Center for Open Science and volunteers at hackathon-style events, or from journal and publisher submissions (personal communication). To rate journals, the Center for Open Science refers raters to the published rubric that we used for our study (Center for Open Science, 2016a). To our knowledge, ratings published by the Center for Open Science were not done in duplicate, and they were not completed using structured instruments. We considered ratings on OSF to be the best available data set for comparison, but these ratings were not a "gold standard" for determining the transparency and openness of journals.
Statistical analysis
Descriptive analyses on raters
We calculated the number of journals and the number of questions that each RA rated, and we calculated the proportion of times that each RA agreed with the reconciled rating (i.e., the rating agreed on by all RAs or the rating selected after reconciliation by a PI). For policy questions, we also calculated the proportion of times that each RA was in the minority (i.e., rated a question differently from the other two policy RAs); we did not calculate this proportion for procedure or practice ratings data because there were only two raters.
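As an illustration, a minimal R sketch of these descriptive statistics with hypothetical ratings:

```r
# `r1`, `r2`, `r3` are one policy question's ratings from the three RAs;
# `final` is the reconciled rating. One element per journal (hypothetical).
r1 <- c("yes", "no", "no", "yes"); r2 <- c("yes", "no", "no", "no")
r3 <- c("no",  "no", "no", "no");  final <- c("yes", "no", "no", "no")

# Proportion of an RA's ratings that match the reconciled rating
agree_with_final <- function(rater, final) mean(rater == final)

# An RA is "in the minority" when the other two RAs agree with each other
# but not with that RA
in_minority <- function(rater, other1, other2) mean(other1 == other2 & rater != other1)

agree_with_final(r1, final)  # 0.75
in_minority(r3, r1, r2)      # 0.25
```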
IRR analysis
To estimate reliability, we used Fleiss's κ and the intraclass correlation coefficient (ICC; Koo & Li, 2016; Kozlowski & Hattrup, 1992). We used Fleiss's κ statistic when evaluating the IRR for each journal policy, procedure, and practice question (Fleiss, 1971). Because the level of implementation for each TOP standard is measured on an ordinal scale (range = 0–3), we used the ICC when assessing the IRR for TOP standard implementation (Kottner et al., 2011; Shrout & Fleiss, 1979). We used the two-way random-effects ICC model in our analysis, treating both journals and RAs as random effects (McGraw & Wong, 1996; Shrout & Fleiss, 1979). We used the "absolute agreement" definition and implemented the "single rater" type in our ICC analysis (Koo & Li, 2016). We categorized the κ statistic and ICC values using published guidelines (Koo & Li, 2016; Landis & Koch, 1977).
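For reference, a minimal sketch of both estimators using the irr package listed under Software (the ratings are hypothetical):

```r
# `q_ratings` and `level_ratings` are journals x raters matrices (hypothetical)
library(irr)

# Fleiss's kappa for one yes/no instrument question rated by three raters
q_ratings <- matrix(c("yes", "yes", "no",
                      "no",  "no",  "no",
                      "yes", "yes", "yes",
                      "no",  "no",  "no"),
                    ncol = 3, byrow = TRUE)
kappam.fleiss(q_ratings)

# Two-way random-effects, absolute-agreement, single-rater ICC for the
# ordinal TOP levels (0-3) assigned by three raters
level_ratings <- matrix(c(2, 2, 1,
                          0, 0, 0,
                          3, 2, 3,
                          1, 1, 1),
                        ncol = 3, byrow = TRUE)
icc(level_ratings, model = "twoway", type = "agreement", unit = "single")
```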
IRA analyses
For all IRA analyses, we evaluated the overall and specific agreement proportions (Cicchetti & Feinstein, 1990; de Vet et al., 2017; Kottner et al., 2011; Kozlowski & Hattrup, 1992). Overall agreement proportion was defined as the number of cases in which raters agreed exactly relative to the total number of ratings. Specific agreement proportion was the observed agreement relative to each of the yes or no rating categories (for Equations 1–3, see Table 2). We reported overall agreement for each TOP standard. In addition, in a sensitivity analysis for evaluating IRR for each TOP standard, we also reported the information-based measure of disagreement and its 95% CI (Costa-Santos et al., 2010; Henriques, Antunes, Bernardes, et al., 2013; see Table S1 in the Supplemental Material).
Table 2. Abbreviations and Definitions
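As a concrete illustration of these quantities, a minimal R sketch for two raters, following the standard definitions cited above (the 2 × 2 counts are hypothetical):

```r
# Overall and specific agreement proportions for two raters, computed from a
# 2x2 table with a = both rated yes, d = both rated no, b and c = disagreements:
#   overall agreement      = (a + d) / (a + b + c + d)
#   specific agreement yes = 2a / (2a + b + c)
#   specific agreement no  = 2d / (2d + b + c)
agreement_props <- function(a, b, c, d) {
  c(overall      = (a + d) / (a + b + c + d),
    specific_yes = 2 * a / (2 * a + b + c),
    specific_no  = 2 * d / (2 * d + b + c))
}

agreement_props(a = 10, b = 3, c = 2, d = 85)
# overall = 0.95, specific_yes = 0.80, specific_no = 0.97
```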
Software
Data processing, management, visualization, and some of the descriptive analyses were done in Python (Van Rossum & Drake, 2009). We conducted the IRA and IRR analyses in R using RStudio (R Development Core Team, 2021; RStudio Team, 2021). We used the obs.agree package in R for IRA analysis and the irr package for IRR analysis (Gamer et al., 2019; Henriques, Antunes, & Costa-Santos, 2013). Annotated data-processing and analysis-code notebooks are available at OSF (https://osf.io/xtdb6/). As recommended in GRRAS, we used the lower bound of the 95% CI to interpret the IRA and ICC measures.
Results
We rated 339 journals using the policy instrument, 335 journals using the procedure instrument, and 322 journals using the practice instrument (Fig. 1). Fifteen RAs were recruited and started rating policy, procedure, and practice documents in December 2019, May 2020, and November 2020, respectively (see Fig. S1 in the Supplemental Material). RAs completed all ratings by January 2021 and performed comparably with each other. The percentage agreement with the reconciled rating used for analysis was 95% to 100% across all RAs and instruments (Table 3). Moreover, the percentage of times each RA disagreed with the other two RAs on questions in the policy instrument was low, between 1% and 3%.

Fig. 1. Journal flow diagram.
Table 3. Rater Agreement on Journal Policy, Procedure, and Practice Ratings
Note: TOP = Transparency and Openness Promotion; NA = not applicable.
The original TOP Guidelines questions (n = 11) were rated by seven raters at the Indiana University Bloomington campus, and the additional TOP Factor questions (n = 2) were rated by four raters at the Indiana University Indianapolis campus.
We did not calculate the proportion of times that a rater’s rating was in the minority for procedure or practice ratings data because there were only two raters.
Journal policies (instructions to authors)
For journal policies, we present estimates of agreement and IRR for the 13 individual questions in our instrument, for the scores on the 10 standards in the TOP Factor, and for our final ratings compared with an external data source (Center for Open Science, 2016b).
Agreement and reliability of individual questions
For all 13 policy questions, overall agreement and specific agreement on the no responses exceeded 90%. The prevalence of yes responses was low (Table 4), ranging from 1% to 23% (Mdn = 2%). Specific agreement on yes responses was inconsistent; there was lower specific agreement for questions with few yes responses. Fleiss's κ values ranged from −0.008 to 0.903 (M = 0.507, SD = 0.371) and were statistically significantly different from 0 for most policy questions (10/13; 77%). Whereas the IRR statistics were "almost perfect" for four questions (4/13; 31%) and "substantial" for one question (1/13; 8%), IRR values were "moderate" or worse for most questions (8/13; 62%).
Table 4. Interrater Agreement and Interrater Reliability for Questions in the Policy Rating Instrument Among the Three Raters
Note: Order of questions might not match that in the protocol study because the policy instrument was modified after publication of the protocol study. IRA = interrater agreement; IRR = interrater reliability; PE = point estimate; CI = confidence interval.
There was no skip logic for the questions. Hence, all raters responded (yes or no) to these questions for all journals.
Interpretation based on the following: < 0.00 = poor; 0.00–0.20 = slight; 0.21–0.40 = fair; 0.41–0.60 = moderate; 0.61–0.80 = substantial; 0.81–1.00 = almost perfect (Landis & Koch, 1977).
Additional TOP Factor Question 2 had four possible responses (no, “Preregistration” badge, “Open Data” badge, and “Open Materials” badge). We evaluated the IRA/IRR for the last three responses. Here, the number of yes responses column shows the number of times that raters checked the box for each of the possible responses.
Agreement and reliability for TOP standards
Agreement on assessments for journal level of implementation was high for all standards (Table 5), ranging from 79% for “data transparency” to 99% for “open science badges.” However, as with individual questions, IRR varied across the standards. The ICC ranged from −0.004 for “registration of analysis plan” to 0.883 for “open science badges” (M = 0.479, SD = 0.303). In addition, no standard had excellent reliability: three standards had “good” (3/10; 30%), one “moderate” (1/10; 10%), and six “poor” IRR (6/10; 60%).
Table 5. Interrater Reliability and Interrater Agreement for TOP Standard Levels of Implementation
Note: IRA = interrater agreement; IRR = interrater reliability; TOP = Transparency and Openness Promotion; CI = confidence interval; ICC = intraclass correlation coefficient.
Interpretation is based on the lower bound of the 95% CI (< 0.50 = poor; 0.50–0.75 = moderate; 0.75–0.90 = good; >0.90 = excellent reliability; Koo & Li, 2016).
The TOP rubric refers to "data and materials." We assessed standards for citing data and statistical code (rows in italics) separately, and we assigned the higher of those two values to the citation standard.
Typically, a journal would receive Level 2 for a standard if it requires that authors actually use an open-science practice. The exceptions are the standards for preregistering a study and its analysis plan and the replication standard. In those cases, Level 2 states that the journal verifies compliance with the preregistered study plan or analysis plan or reviews replication studies blinded to their results.
Agreement with an external data source
In April 2022, we identified ratings on OSF for 1,575 journal policies, including TOP Factor scores for individual standards and for the total TOP Factor (Center for Open Science, 2016b). Of the 339 journal policies rated in our study, 134 (40%) were also rated on OSF (Fig. 2). Total TOP Factor scores were the same for 56 (42%) journals. Scores on OSF were higher for 49 (37%) journals, and scores in our study were higher for 29 (22%) journals. The mean absolute difference between the scores was 2, and ICC was 0.602 (95% CI = [0.426, 0.734]), suggesting a “moderate” reliability between the two scores (see Tables S2 and S3 in the Supplemental Material).

Fig. 2. Overlap between the OSF and TRUST total Transparency and Openness Promotion Factor scores (134 journals). We added noise to overlapping circles, which appear as darker areas in the plot.
Journal procedures (manuscript-submission systems)
For the 26 procedure questions, overall agreement and specific agreement on the no responses exceeded 90% for 23 and 24 questions, respectively. The prevalence of yes responses was low (Table 6), ranging from 0% to 29% (Mdn = 3%). As with journal policies, the specific agreement on yes responses was inconsistent such that there was lower specific agreement for questions with few yes responses. Fleiss’s κ values ranged from −0.017 to 1.000 (M = 0.457, SD = 0.390) and were statistically significantly different from 0 for most procedure questions (18/26; 69%). IRR values were moderate or worse for most questions (14/26; 54%), although some were almost perfect (8/26; 31%) and substantial (4/26; 15%).
Table 6. Interrater Agreement and Interrater Reliability for Questions in the Procedure Rating Instrument Between the Two Raters
Note: IRA = interrater agreement; IRR = interrater reliability; PE = point estimate; CI = confidence interval; NA = not applicable.
There was no skip logic for the questions. Hence, all raters responded (yes or no) to these questions for all journals.
Interpretation based on the following: poor < 0.00; slight = 0.00–0.20; fair = 0.21–0.40; moderate = 0.41–0.60; substantial = 0.61–0.80; almost perfect = 0.81–1.00 (Landis & Koch, 1977).
Consistency check for journal procedures
To identify potential errors in our ratings of journal procedures, we checked for consistency across journals with the same publisher and submission system ("publisher–submission system pairs"). Three questions did not have any inconsistency across the publisher–submission system pairs (3/26; 12%; Table 7). The largest number of inconsistent reconciled ratings was for the question regarding whether the submission process included a field for indicating data availability (16/26; 62%). After correcting inconsistencies caused by errors in our ratings, our ratings improved for nine of 26 questions (35%). Ratings did not change for 14 questions because procedures truly varied across journals with both the same publisher and submission system (14/26; 54%).
Table 7. Inconsistency of Procedures Within Publisher–Submission Systems
There was no skip logic for the questions. Hence, all raters responded (yes or no) to these questions for all journals.
This is the number of times there was an inconsistent case (different ratings for journals within the same publisher–submission system) for a question.
Total number (denominator in proportion) of publisher–submission system combinations was 108.
Journal practices (journal articles)
Overall agreement was above 86% for questions on journal practices. The prevalence of yes responses for questions in the practice instrument was low (Table 8), ranging from 0% to 16% (Mdn = 5%). As with journal policies and procedures, specific agreement on yes responses was lower than agreement on the no responses, and it was notably low for questions with few yes responses. However, the practice instrument was more reliable than the policy and procedure instruments. Fleiss's κ estimates ranged from 0.242 to 1.000 (M = 0.612, SD = 0.211) and were statistically significantly different from 0 for most practice questions (12/14; 86%). The IRR statistics were almost perfect for two questions (2/14; 14%), substantial for four (4/14; 29%), moderate for five (5/14; 36%), and fair for one (1/14; 7%); IRR values were not poor for any question. We could not calculate Fleiss's κ for two questions because there were no yes responses.
Table 8. Interrater Agreement and Interrater Reliability for Questions in the Practice Rating Instrument Between the Two Raters
Note: IRA = interrater agreement; IRR = interrater reliability; PE = point estimate; CI = confidence interval; NA = specific agreement on yes responses and Fleiss's κ could not be calculated because there were no yes responses.
There was no skip logic for the questions. Hence, all raters responded (yes or no) to these questions for all journal articles.
Interpretation based on the following: poor < 0.00; slight = 0.00–0.20; fair = 0.21–0.40; moderate = 0.41–0.60; substantial = 0.61–0.80; almost perfect = 0.81–1.00 (Landis & Koch, 1977).
Response options for Questions 4a and 5a were “no”; “yes, study was not registered”; and “yes, study was registered.” The latter two options were combined in analyses.
Discussion
This study demonstrated the feasibility of using a structured process and instruments to assess journal implementation of the TOP Guidelines. Trained graduate RAs rated instructions to authors, manuscript-submission systems, and articles published in a large cohort of journals. They performed comparably with each other and had high interrater agreement on the level of implementation in journal policies (i.e., TOP Level 0, 1, 2, or 3). In addition, we found high overall interrater agreement on the questions about specific aspects of TOP in all three of our instruments. Agreement was particularly high when journals did not implement TOP standards in their instructions to authors, submission systems, and journal articles. Although raters generally agreed when journals had not implemented a standard, they found it difficult to identify the level at which each TOP standard had been implemented. That is, specific agreement was low when journals had language related to TOP. Most questions had moderate or worse IRR, and we did not have excellent reliability in rating the level of implementation of any TOP standard. For some items, such as analysis plans, differences in reliability suggest that raters might be better able to evaluate the transparency and openness of journal articles than of policies and procedures. Finally, our ratings sometimes differed from TOP Factor scores assigned by others, but we were unable to determine the timing of those ratings, and it is possible that some journals updated their policies between the two ratings.
Our study highlights several obstacles to monitoring journal implementation of standards to promote open research. In particular, it was time-consuming to train RAs, identify relevant documents, and rate those documents. The rigorous processes used in our study would be expensive to scale and to sustain. In addition, it might be challenging to keep journal ratings up to date if journals change their policies, procedures, and practices without public notice. Although some TOP standards could be monitored automatically, our study suggests that automated surveillance would be challenging. Most manuscript-submission systems do not request structured information about the items needed to assess TOP implementation (e.g., study registration number, link to data set). We also found that TOP implementation varies across journals with both the same publisher and submission system. Because information needed to assess transparency and openness does not appear consistently and because reports often lack explanations and metadata needed to evaluate supplemental materials, it might not be possible to design a simple automated program to assess TOP implementation. Machine learning could address some of these challenges in theory; however, the small number of examples of implementation at Levels 1 to 3 would make it difficult to train a model. More examples of stringent policies, procedures, and practices may be needed for machines to recognize and differentiate among more stringent practices.
Revising the TOP Guidelines
Our study shows that TOP, like other guidelines (Logullo et al., 2020), does not translate easily into a measure for research on journal quality. As noted by others, TOP standards and their corresponding levels of implementation are difficult to assess because of their complex requirements described in multiple clauses (Cashin et al., 2021; Hansford et al., 2021; Spitschan et al., 2020). We found it challenging to parse compound requirements into clear and concise questions, and it was challenging to identify those facts that would allow us to determine objectively whether each standard had been implemented. Conversely, some constructs that are separated in TOP could be difficult to distinguish consistently in practice, such as registering a protocol and registering an analysis plan.
We sometimes struggled to distinguish different levels of TOP implementation because journals provided information on some but not all requirements. For example, it was unclear whether journals met Level 2 for data transparency when instructions to authors stated that data sharing in a trusted digital repository was required for publication but the instructions to authors did not state whether data sets must include all variables described in the manuscript. It was also difficult to rate TOP implementation when journals had multiple policies that applied to different study designs. For example, it was unclear whether journals met Level 2 for methods transparency when they required systematic reviews to adhere to the PRISMA Guidelines (Page et al., 2021) but did not require adherence to reporting guidelines for other study designs.
Rating TOP implementation assumes that journals have a single coherent set of policies, but we also found contradictions in the information journals provide to authors. That is, instructions to authors for a single journal might appear across multiple web pages and include hyperlinks to pages with additional hyperlinks. Dispersed instructions are difficult to interpret and follow, and multiple pages provide opportunities for policies to become internally inconsistent when pages are updated by different journal staff at different times. Furthermore, instructions to authors sometimes had ambiguous and confusing language. We particularly struggled with varying use of modal verbs in both journal documents and current TOP guidance, for example, using "should" instead of "must" to describe policies that appear to require an open research practice (see Box 1).
Box 1.
Examples of Ambiguous Language in Policy Documents
1. “Data and programs should be archived in the [repository].”
2. “Citations should include the type of material submitted.”
3. “[Journal] strongly encourages that all datasets on which the conclusions of the paper rely should be available to readers.”
4. “A CONSORT Statement includes recommendations, a checklist of items that should be included in a comprehensive report, and a participant flow diagram.”
To support future research about TOP implementation, the Center for Open Science and the broader scientific community could adopt a standard process and instruments, including training materials, that explain how to use TOP to assess journal quality. Currently, TOP is described across several websites, guidance documents, and rubrics that are subtly but meaningfully different. For example, citation standards are sometimes described as applying to data sets only, and they are sometimes described as applying to data sets, code, and research materials. Moreover, these resources provide information for journals, including examples of language that journals could incorporate in their instructions to authors, but they do not provide tools to evaluate policy language written by others.
We anticipate that our process and instruments could help in the development of these resources to improve TOP and its implementation. We translated TOP into factual questions, display logic, and scoring algorithms. Like checklists included in guidelines such as CONSORT and PRISMA, operationalizing TOP in this way might facilitate understanding and communication with editors and authors. By providing a structured interpretation of the guidelines, our instruments and algorithms could also help readers identify areas of agreement and disagreement in their interpretations of TOP. For example, if the differences we identified between Level 1 and Level 2 for a given standard are not what the developers intended, then TOP could be revised to clarify how the levels should be distinguished. Finally, current resources address journal policies only. Our process and instruments could help in the development of official guidance on implementing TOP standards in journal procedures (i.e., manuscript-submission systems) and practices (i.e., information reported in articles).
Limitations
In addition to the issues identified above, limited reliability in our study could be explained by limitations in our process, instruments, and team. To minimize these sources of error, the two PIs, who have contributed to the development of TOP standards, led the design of the process and instruments, trained RAs in their use, and provided close supervision throughout the study. We deconstructed criteria for TOP standards into factual questions with detailed instructions, and we programmed the instruments using software with validity checks. We also checked for consistency across publisher–submission system pairs, which we hypothesized would have similar ratings. These methods improved on previous approaches that asked raters to consider all criteria simultaneously and to record the level of implementation for each standard in spreadsheets. For these reasons, we suspect that different raters or methods would not produce more reliable ratings. It is also a limitation that journals in our study publish research for which some but not all TOP standards are relevant. We focused on applicable study designs by including randomized experiments and other quantitative evaluations of intervention effectiveness. Finally, we included only one article per journal to rate practices. If some positive practices we identified were not representative of all articles in the included journals, then our results might overestimate TOP practices.
Conclusions
TOP aims to align scientific ideals with research practice. Unfortunately, it has not been implemented widely by journals in the behavioral, social, and health sciences (Grant et al., 2022; Mayo-Wilson, Grant, & Supplee, 2022) despite journal endorsement and widespread community support (Naaman et al., 2023). We conclude that revising TOP might improve its interpretability and use. Although we found high agreement among raters using our instruments, our experiences throughout this study indicate that limited reliability might arise from ambiguities in TOP and associated instructions. Standardized processes and instruments for assessing TOP implementation, accompanied by training materials, could advance efforts to implement, assess, and monitor open research policies, procedures, and practices. Monitoring transparency practices would be easier if journals were to collect structured data and metadata about research reports and associated elements such as registrations, data sets, and code.
Acknowledgements
Additional TRUST collaborators include Lauren Supplee (Child Trends, Bethesda, Maryland), Emily Fortier (Indiana University, Indianapolis, Indiana), Madison Haralovich (Indiana University School of Dentistry, Indianapolis, Indiana), and Nick Liu (Indiana University Network Sciences Institute, Bloomington, Indiana).
Transparency
Action Editor: David A. Sbarra
Editor: David A. Sbarra