Sage Journals: Discover world-class research

Abstract

Peer review has become the gold standard in scientific publishing as a selection method and a refinement scheme for research reports. However, despite its pervasiveness and conferred importance, relatively little empirical research has been conducted to document its effectiveness. Further, there is evidence that factors other than a submission’s merits can substantially influence peer reviewers’ evaluations. We report the results of a metascientific field experiment on the effect of the originality of a study and the statistical significance of its primary outcome on reviewers’ evaluations. The general aim of this experiment, which was carried out in the peer-review process for a conference, was to demonstrate the feasibility and value of metascientific experiments on the peer-review process and thereby encourage research that will lead to understanding its mechanisms and determinants, effectively contextualizing it in psychological theories of various biases, and developing practical procedures to increase its utility.

Keywords

assessment decision making evidence field study file drawer publication bias metascience peer review open data open materials preregistered

Recent work estimating the robustness of psychological research has given the research community pause for thought because a substantial proportion of published psychological investigations have not been successfully replicated (Camerer et al., 2018; Open Science Collaboration, 2015). How fundamental to this problem is the role of peer review? A population of reviewers who systematically value significant results more than nonsignificant results would incentivize researchers to use questionable research practices, such as p-hacking and cherry-picking (Simmons, Nelson, & Simonsohn, 2011), or may deter researchers from even trying to write up and publish nonsignificant results, and thus contribute to the file-drawer problem (Rosenthal, 1979). In a similar vein, reviewers influence whether replicating studies itself is valued, or whether emphasis is placed on originality. Despite the impactful role of peer review, little is known about its effectiveness and determinants (particularly in psychological science). In the study reported here, we aimed to contribute to closing this research gap by testing reviewers’ evaluations of significant versus nonsignificant results and original research versus replications in the peer-review process for a conference. Further, and more generally, we wanted to demonstrate the feasibility and value of metascientific field experiments on the peer-review process.

Peer Review: Objectives and Evidence

In writing about peer review, it has almost become a customary practice to adapt the famed statement by Winston Churchill on democracy as a form of government—to declare that peer review is the worst form of academic quality assessment, except for all the other forms that have been tried (see, e.g., Rennie, 2003a; Robin & Burke, 1987; Smith, 2006). Since the mid-20th century, peer review has been considered the gold standard of quality assurance in scientific publishing (Burnham, 1990; Spier, 2002). Through this process, peers influence which research is ever presented to the public and which remains in academics’ file drawers. Thus, depending on one’s perspective, peer reviewers can be considered the gatekeepers (Simmons et al., 2011), bottleneck (Pöschl, 2012), or hostage takers (Hammerschmidt, 1994) of scientific knowledge.

Jefferson, Wager, and Davidoff (2002) suggested that the two major functions of peer review are (a) selection and (b) refinement of scientific manuscripts. Although the empirical study of peer review, particularly through proper randomized control trials, has been very limited, there is some evidence documenting its underlying mechanisms. Descriptively, and probably unsurprisingly to any academic whose work has ever undergone academic peer review, there is often substantial disagreement among reviewers’ evaluations of a given manuscript (Bornmann, Mutz, & Daniel, 2010). This may be due to reviewers’ differences in competence, preferences and emphasis, or familiarity with relevant theory or methodology, but also may be due to reviewers’ idiosyncratic understanding of their role and the function of peer review itself (Bedeian, 2003). Overall convergence of evaluations would improve with a greater number of reviewers, but Forscher, Cox, Devine, and Brauer’s (2019) results suggest that achieving acceptable levels of reliability consistently (in grant reviews) would routinely require a double-digit number of reviewers, necessarily overburdening the pool of volunteers.

Perhaps more worryingly, the available evidence on the usefulness of peer review and the effectiveness of the entire process with regard to its two primary functions is somewhat mixed (Jefferson, Alderson, Wager, & Davidoff, 2002). Reassuringly, a range of studies suggest that research reports’ quality generally improves from their initially submitted versions to the published articles (Goodman, Berlin, Fletcher, & Fletcher, 1994), which indicates that manuscripts undergoing peer review do benefit from it. However, its utility as a selection method is challenged by surmounting evidence of bias against or in favor of manuscript and author characteristics not immediately relevant to the quality of the research.

Factors other than submissions’ merits can substantially influence peer reviewers’ evaluations of manuscripts and grant proposals. These factors include, but are not limited to, the conformity of study results to reviewers’ own predispositions (Ernst & Resch, 1994); the presence of formulas and equations even when they are meaningless (Eriksson, 2012); the statistical significance of reported results (Atkinson, Furlong, & Wampold, 1982; Emerson et al., 2010; Tsou, Schickore, & Sugimoto, 2014); the reviewers’ familiarity with the research program reported (Heesen & Romeijn, 2019); whether the research is a replication study (Tsou et al., 2014); reviewers’ resistance against innovations and unconventional theory, methods, and practice (Rennie, 2003b); characteristics, such as sex (Wood & Wessely, 2003) and prestige (Okike, Hug, Kocher, & Leopold, 2016), that are conveyed by unblinding authors’ names; and blinding of reviewers’ identities (Godlee, Gale, & Martyn, 1998). These biases, in turn, incentivize researchers to use some research practices that may be orthogonal or even detrimental to scientific ideals. They may also provide disincentives to beneficial behavior; for example, researchers may be disinclined to pursue publication of sound research if they believe that it is likely to be met with negative reviews (Cooper, DeNeve, & Charlton, 1997; Coursol & Wagner, 1986; Franco, Malhotra, & Simonovits, 2014). Given the available evidence, Heesen and Bright (2019) argued that abolishing prepublication peer review in its current form would have neutral or positive net value for the incentive structure in science and for individual researchers’ behavior.

Proposed solutions to ameliorate some systematic problems with peer review are aimed at increasing transparency of review processes (e.g., Wicherts, Kievit, Bakker, & Borsboom, 2012). Some of these measures have been employed by individual journals, such as F1000Research and Meta-Psychology, which publish the complete history of submitted manuscripts (including all versions and signed peer reviews) and encourage commenting on articles during and after publication. Other initiatives, such as JournalReviewer.org and SciRev.org, provide academics with a repository of authors’ experiences with review processes for the benefit of scholars considering submitting their work to journals.

The Need for Experimental Metascience in Peer Review

Although these measures may be helpful in increasing the transparency of academic publishing, systematic research is needed to substantiate peer review’s utility. Empirical evidence that would disentangle the mechanisms involved in editorial decision making is sparse, mostly because the process is so opaque that opportunities to study it from the outside are limited (Couzin-Frankel, 2013).

Considering its regulatory impact, making peer review a field of scientific study itself is inevitable, as the costs—financial, opportunity, and otherwise—of operating a dysfunctional quality-assurance system in scientific publishing are potentially enormous. A system that routinely selects bad science shifts competitive resources (funds, personnel, journal space, time, attention) toward degenerative research programs (Lakatos, 1969), allows false paradigms to persist (Akerlof & Michaillat, 2018), and causes other research lines that were not selected despite their value to remain unexplored, unfunded, and unpublished (Smaldino & McElreath, 2016).

However, as in other domains where effective fixes to a dysfunctional procedure need to be developed, a rigorous research program on the mechanisms of peer review cannot and should not be limited to observational and survey studies reifying pressing concerns: Experimental (field) research with carefully designed manipulations is necessary to identify and ideally isolate causes of human behavior and to implement interventions to modify these contingencies for the better. Although the number of studies testing interventions in peer review is slowly increasing (Malicˇki, von Elm, & Marušic, 2014), such studies are still rather underrepresented, and with few exceptions (e.g., Epstein, 1990; Mahoney, 1977), they are restricted to peer review in biomedicine. Certainly, although one may reasonably conclude that dysfunctional selection mechanisms could have much graver consequences in biomedical publishing than in other sciences, it is somewhat surprising to see peer review receiving relatively little attention by behavioral researchers given that this practice and the interactions of biases shaping it are ultimately psychological research objects (Mahoney, 1976).

The Present Study

We report results from a preregistered experimental study conducted in the regular peer-review process for a scientific conference on research in media psychology. This study was conducted to document the extent to which media psychologists show preferences for (a) original studies over direct replications and (b) statistically significant over statistically nonsignificant findings. Although our broad hypotheses concerned peer review in general, the hypotheses tested in our field study were specific to the conference context:

Compared with a conference submission reporting a replication study, original research has a higher chance of being accepted for presentation and will score higher on standardized reviewing criteria.

Compared with a conference submission reporting a statistically nonsignificant effect (p > .05), a submission reporting a statistically significant effect (p < .05) has a higher chance of being accepted for presentation and will score higher on standardized reviewing criteria.

Disclosures

Preregistration

The study’s rationale, hypotheses, stimulus materials, and measures were preregistered on the Open Science Framework (https://osf.io/t6hs9/). Regrettably, we did not preregister an analysis plan. We report all the analyses we conducted to test our hypotheses and welcome researchers to further explore the available data.

Data, materials, and online resources

All stimulus materials (https://osf.io/zxthq/); data (https://osf.io/zajx3/), including a codebook (https://osf.io/d5zj3/); and analysis scripts (https://osf.io/cfnvu/) underlying this report are available on the Open Science Framework. Note, however, that in order to protect the reviewers’ anonymity, the data set does not include their written comments about the submissions or information about their personal characteristics (see Additional Variables, later in the method section).

Reporting

We report how we determined our sample size, all data exclusions, all manipulations, and all measures in the study.

Ethical approval

This protocol was not approved by an institutional review board because, regrettably, such a committee did not exist at the first author’s institution at the time and obtaining ethical approval for social-science research, with few exceptions, is not required by law in Germany. We did, however, consult with an institutional review board that was not formally responsible.

Method

In a 2 × 2 between-subjects experiment, we manipulated the originality and statistical significance of the research reported in a fictitious abstract submitted to a small- to medium-sized conference and sent the abstract to voluntary reviewers. They evaluated it as part of the regular double-blind peer-review process and submitted a recommendation with a standardized reviewing form. The reviewers also evaluated regularly submitted abstracts that were, whenever possible, assigned to them on the basis of their expertise (although, depending on the volume of submissions, it is not unusual for reviewers to be assigned abstracts only tangentially relevant to their own research). The fabricated abstract was assigned to all reviewers regardless of their expertise.

Setting

This field experiment was conducted during the peer-review process of the ninth biennial international conference of the Media Psychology Division of the German Psychological Association (DGPs), held in September 2015 in Tübingen. DGPs is a nonprofit association with more than 4,000 members working in higher education, either in psychology or in a neighboring field. Its goal is to advance and expand scientific psychology. As an organization, it aims to represent psychology as a scientific discipline and promotes psychology’s role in policy making and the public sphere (Deutsche Gesellschaft für Psychologie, n.d.). The Media Psychology Division is an interdisciplinary section of DGPs dedicated to study human behavior, thought, and affect in the context of media use. At the time of the study, the second author was acting as the division’s chair, the third author as vice chair, and the first author as an early-career representative. All three authors were also the conference chairs. For the context of this study, it is relevant to note that the division (and its leadership at the time) can be characterized as rather progressive with regard to open-science ideals and practices. This is reflected, for example, in events at the conference, such as a keynote speech on open science given by Neuroskeptic (a pseudonymous science blogger) and a discussion panel on open science in media psychology, and in the promotion of transparent research practices by the division’s members (Elson & Przybylski, 2017; Krämer, 2015).

Stimulus materials

We designed a base study abstract that fit a media-psychology conference and was simple enough to allow nonspecialists to review it. The abstract was titled “Pictures of Misery: Effects of Facebook Use on Body Image,” and it described a simple laboratory experiment investigating how viewing other people’s Facebook pictures affects individuals’ body dissatisfaction. From the base version, we derived four variants for the experimental conditions: an original study with statistically significant findings, a replication study with statistically significant findings, an original study with statistically nonsignificant findings, and a replication study with statistically nonsignificant findings (see Table 1 for the text of two of these versions).

Table 1.

The Abstract Versions Used for the Original Study With Significant Results and the Replication Study With Nonsignificant Results

Body dissatisfaction (BD) has reached normative levels among girls and young women. It has been identified as one of the most robust risk factors for eating disorders, low self-esteem, and depression. One major cause of BD is the portrayal of female bodies in mass media consistently reinforcing thinness as a normative ideal central to attractiveness. Recent research has suggested social networks are a new potential source of BD. With more than one billion registered accounts, Facebook is the world’s most popular social network. The majority of Facebook activity consists of consuming other people’s posts and photos, which are known to portray users in a highly self-enhancing way.

Original study with significant results: Facebook users are therefore exposed to a constant stream of carefully selected displays of beauty corroborating unrealistic standards (Smith & Chen, 2014). The present study was carried out to investigate how viewing other’s [sic] Facebook pictures affects BD.

Replication with nonsignificant results: Facebook users are therefore exposed to a constant stream of carefully selected displays of beauty corroborating unrealistic standards. The present study was a direct replication of Smith and Chen (2014) investigating how viewing other’s [sic] Facebook pictures affects BD, using the same stimulus materials, instruments, and sample size.

In an experiment, 63 female undergraduates were asked to log into Facebook and interact with another participant’s (actually a confederate) profile which included pictures showing either an idealized (Body Mass Index = 19) or nonidealized (BMI = 29) body type. Afterwards they filled out the Body Dissatisfaction Scale.

Original study with significant results: Results show that participants exposed to idealized beauty standards reported significantly greater BD than participants exposed to nonidealized beauty standards, F(1,61)=11.97, p < .001.

Replication with nonsignificant results: Results show that participants exposed to idealized beauty standards did not significantly differ in their reported BD from participants exposed to nonidealized beauty standards, F(1,61)=0.57, p=.453, failing to replicate the results by Smith and Chen (2014).

Although Facebook is a useful tool for social networking and maintaining relationships, there can be serious disadvantages to using it.

Original study with significant results: Our results confirm that exposure to self-enhancing photos on Facebook affects body dissatisfaction.

Replication with nonsignificant results: However, our results do not confirm that exposure to self-enhancing photos on Facebook affects body dissatisfaction.

Further systematic research is needed to isolate the conditions under which social network use can serve as a psychological asset versus a psychological liability.

Note: Passages that differed between the two versions are indented and identified with labels in italic type. Readers may infer the text of the other two versions from what is shown here. All the versions are available at https://osf.io/zxthq/.

Note that the reference mentioned in all four versions was also fabricated (no detailed information other than the alleged authors’ names and the publication year was provided, which made it hard for reviewers to verify that it does not actually exist). The formatting requirements of the conference limited the abstract’s length to 300 words. The review process was double-blind.

Dependent variables

The dependent variables relevant to our preregistered hypotheses were five evaluation criteria, overall recommendation, and a total score. Additionally, the conference’s submission system allowed reviewers to recommend accepting a submission as a poster instead of the presentation type preferred by the author (a talk in the case of the fabricated abstract) and encouraged reviewers to provide written comments. Neither poster recommendations nor written comments were part of our preregistered study plan, and they are not discussed further in this article.

Evaluation

On a scale from 0 to 10, in 2-point steps (0, 2, 4, 6, 8, 10), each abstract was evaluated on the following five criteria: significance to media psychology, quality of the writing, sophistication of the theory and conceptualization, appropriateness of the methods and research design, and quality of the presentation and discussion of results.

Recommendation

Additionally, reviewers were asked to provide an overall recommendation on a scale from 0, definitely reject, to 10, definitely accept.

Total score

For each review, a total score was computed from the evaluation ratings (each worth 10%) and the recommendation rating (worth 50%). In the case of regularly submitted abstracts, the mean of total scores across the assigned reviewers ultimately determined acceptance to the conference.

Additional variables

The reviewers were asked to indicate their own familiarity with each assigned abstract’s topic (on a scale from 0 to 10, in 2-point steps). Further, three potentially relevant, publicly available personal characteristics of all the reviewers were collected from their university websites: gender, academic rank (doctoral candidate, postdoc, junior or assistant professor, associate or full professor), and academic age (years since approval of the doctoral dissertation).

Sample

We used the Media Psychology Division’s member list and department websites of division members to identify 197 German researchers working in the field of media psychology and invited these researchers to participate in the conference’s review process via e-mail (see https://osf.io/jv5ac/). Further, after the conference’s submission deadline (March 1, 2015), researchers who had submitted an abstract and were not among the initial 197 were also invited to serve as a reviewer.

For ethical reasons, the invitation emphasized that the division would be working on evaluating and improving its own peer-review process and that, for this reason, reviewers’ workload would be higher than usual (no further details on the nature of the evaluation were provided). Thus, researchers volunteering as reviewers were made aware that they had also opted in to participate in the evaluation. In total, 142 experts agreed to participate in the review process. Reviewers were assigned no more than four submissions to review; one of these was the manipulated abstract. Of the 142 reviewers, 7 were excluded a priori from participation in the experiment because they were either aware of our plans for the study or involved in designing the study. Seven additional reviewers, all from the same department, noticed the manipulation during a meeting in which senior researchers provided early-career researchers with guidance regarding their peer-review assignments for the conference.¹ In addition, 1 reviewer did not submit an assessment. Thus, the final sample consisted of 127 reviewers. Of these, 45.7% were women, 26.7% were doctoral candidates, 35.4% were postdocs, 3.9% were junior or assistant professors, and 33.9% were associate or full professors. The self-reported familiarity with the topic of the fictitious abstract was normally distributed, with a moderate mean slightly above the scale’s midpoint (M = 6.35, SD = 1.95, Mdn = 6, range = 2–10).

As we were also the organizing committee of the conference, one further ethical concern was that we would be able to inspect each individual reviewer’s recommendations, including the one that constituted data for this study. To ensure that our professional relationships with the reviewers would not be affected, we arranged for a hypothesis-blind research assistant to export and subsequently delete the reviews of the fictitious abstract from the conference’s submission system. After collecting the personal data on the reviewers from their university websites the assistant purged the names of the reviewers from the exported file before finally handing over the fully anonymized data to us.

Reviewers were debriefed at the conference and through a written report about the results of the study that was disseminated after the conference.

Statistical power and sensitivity

Note that no a priori power analysis was conducted because we knew that the pool of potential reviewers was rather limited and continued data collection would not be possible given the planning and scheduling of the conference. Further, the researchers we invited might be relatively close to a total population sample of German media psychologists. Our preregistered plan stated that if the final sample of eligible reviewers was less than 120 (before data collection), we would discard the originality manipulation and run the study as a 2 × 1 between-subjects experiment.

The final sample of 127 reviewers provided 80% power to detect bivariate correlations greater than .245² (90% power for r > .282, and 95% power for r > .312 ). The study had 80% power to detect main effects (standardized mean differences) with a d value greater than 0.503 (90% power for d > 0.582 and 95% power for d > 0.647).

Results

The descriptive statistics for each variable in each condition are reported in Table 2. Figure 1 displays the relative response frequencies for each evaluation criterion and the overall recommendation in each condition. The distribution of the total score in each condition is displayed in Figure 2. Zero-order correlations are reported in Table 3.

Table 2.

Descriptive Statistics

Statistic	Dependent variable
Statistic	Significance to media psychology	Quality of writing	Sophistication of theory	Appropriateness of methods and design	Presentation and discussion of results	Overall recommendation	Total score
Replication, significant results (n = 34)
Mean	7.12 (2.16)	7.59 (1.69)	6.18 (2.48)	7.29 (2.08)	7.12 (1.98)	7.21 (2.20)	7.13 (1.83)
Median	6	8	6	8	7	7	7.1
Range	2–10	4–10	2–10	2–10	4–10	1–10	2.7–10
Original study, significant results (n = 30)
Mean	7.00 (2.08)	7.93 (1.53)	5.93 (1.86)	6.40 (2.49)	6.93 (1.55)	6.97 (1.99)	6.90 (1.70)
Median	8	8	6	6	8	8	7.2
Range	2–10	6–10	2–8	0–10	2–10	2–10	3.4–9.4
Replication, nonsignificant results (n = 31)
Mean	6.45 (2.41)	7.61 (1.82)	5.68 (2.43)	7.16 (2.46)	6.32 (1.94)	6.48 (2.50)	6.56 (1.99)
Median	6	8	6	8	6	7	6.7
Range	0–10	2–10	2–10	0–10	2–10	1–10	2.2–9.1
Original study, nonsignificant results (n = 32)
Mean	6.25 (2.26)	7.56 (1.41)	5.69 (2.10)	5.50 (2.38)	5.81 (2.24)	6.00 (2.34)	6.08 (1.92)
Median	6	8	6	6	6	7	6.5
Range	2–10	4–10	2–10	0–10	2–10	2–10	2.4–9.6

Note: Numbers inside parentheses are standard deviations.

Fig. 1.

Relative response frequencies for each dependent variable by condition. The condition names have been abbreviated as follows: R, S = replication, significant results; O, S = original study, significant results; R, NS = replication, nonsignificant results; O, NS = original study, nonsignificant results.

Fig. 2.

Empirical cumulative distribution of the total score in each condition.

Table 3.

Zero-Order Correlations

Variable	1	2	3	4	5	6	7	8
1. Self-reported familiarity with the topic	—
2. Significance to media psychology	.249	—
3. Quality of writing	.138	.455	—
4. Sophistication of theory	.186	.511	.539	—
5. Appropriateness of methods and design	.123	.465	.448	.484	—
6. Presentation and discussion of results	.057	.539	.512	.567	.500	—
7. Overall recommendation	.196	.702	.496	.616	.731	.676	—
8. Total score	.204	.761	.617	.722	.777	.757	.972	—

Overall, the evaluations and recommendations showed relatively small differences between conditions. Compared with reviewers in the nonsignificant-results conditions, those in the significant-results conditions evaluated the study’s significance to media psychology (ω² = .017) and the quality of the presentation and discussion of results (ω² = .050) slightly more positively and gave slightly more favorable recommendations for acceptance (ω² = .026). As a result, total scores were higher in the significant-results conditions than in the nonsignificant-results conditions (ω² = .026). The originality of the study affected only the ratings of the appropriateness of the methods and research design: Replication studies were rated higher than original studies (ω² = .063). There were no appreciable interaction effects. All significance tests are reported in detail in Table 4. Given the number of tests, there was an inflated probability of false positives for a default criterion of Cronbach’s α = .05. Therefore, for all main effects, we conducted two-tailed equivalence tests with lower and upper bounds of Cohen’s d = −0.50 and +0.50 (which corresponded to our sensitivity analysis at 80% power) and α = .05 (see Table 5). These tests underscore that, because of our sample size, the precision of effect-size estimates was rather low.

Table 4.

Analysis of Variance Table for All the Dependent Variables

Dependent variable and test	SS	F(1, 123)	p	ω²
Significance to media psychology
Originality	1.111	0.224	.637	−.006
Statistical significance	15.847	3.189	.077	.017
Originality × Statistical Significance	0.056	0.011	.916	−.008
Quality of writing
Originality	0.639	0.243	.623	−.006
Statistical significance	0.900	0.342	.559	−.005
Originality × Statistical Significance	1.239	0.471	.494	−.004
Sophistication of theory
Originality	0.553	0.110	.740	−.007
Statistical significance	4.466	0.891	.347	−.001
Originality × Statistical Significance	0.508	0.101	.751	−.007
Appropriateness of methods and design
Originality	53.239	9.624	.002	.063
Statistical significance	8.166	1.476	.227	.004
Originality × Statistical Significance	4.662	0.843	.360	−.001
Presentation and discussion of results
Originality	4.670	1.230	.270	.002
Statistical significance	28.862	7.601	.007	.050
Originality × Statistical Significance	0.841	0.221	.639	−.006
Overall recommendation
Originality	4.921	0.957	.330	.000
Statistical significance	22.448	4.367	.039	.026
Originality × Statistical Significance	0.474	0.092	.762	−.007
Total score
Originality	4.643	1.340	.249	.003
Statistical significance	15.181	4.381	.038	.026
Originality × Statistical Significance	0.512	0.148	.701	−.007

Table 5.

Summary of the Equivalence Tests for All Main Effects on the Dependent Variables

Means compared	Mean difference in raw scores	Raw-score upper and lower bounds	t	df	p
Significance to media psychology
Original study (M = 6.61, SD = 2.19) vs. replication (M = 6.80, SD = 2.29)	−0.18 [–0.84, 0.47]	±1.12	2.347	125.00	.010
Significant results (M = 7.06, SD = 2.11) vs. nonsignificant results (M = 6.35, SD = 2.32)	0.71 [0.06, 1.36]	±1.11	−1.001	123.52	.159
Quality of writing
Original study (M = 7.74, SD = 1.47) vs. replication (M = 7.60, SD = 1.74)	0.14 [–0.33, 0.61]	±0.80	−2.325	123.24	.010
Significant results (M = 7.75, SD = 1.61) vs. nonsignificant results (M = 7.59, SD = 1.61)	0.16 [–0.31, 0.64]	±0.81	−2.249	124.97	.013
Sophistication of theory
Original study (M = 5.81, SD = 1.97) vs. replication (M = 5.94, SD = 2.45)	−0.13 [–0.78, 0.52]	±1.11	2.488	121.66	.007
Significant results (M = 6.06, SD = 2.20) vs. nonsignificant results (M = 5.68, SD = 2.25)	0.38 [–0.27, 1.03]	±1.11	−1.854	124.80	.033
Appropriateness of methods and design
Original study (M = 5.93, SD = 2.45) vs. replication (M = 7.23, SD = 2.25)	−1.29 [–1.99, –0.60]	±1.18	−0.278	122.85	.609
Significant results (M = 6.87, SD = 2.31) vs. nonsignificant results (M = 6.32, SD = 2.54)	0.56 [–0.16, 1.27]	±1.21	−1.523	123.39	.065
Presentation and discussion of results
Original study (M = 6.35, SD = 2.00) vs. replication (M = 6.74, SD = 1.99)	−0.38 [–0.97, 0.20]	±1.00	1.732	124.63	.043
Significant results (M = 7.03, SD = 1.78) vs. nonsignificant results (M = 6.06, SD = 2.09)	0.97 [0.40, 1.54]	±0.97	−0.012	121.26	.495
Overall recommendation
Original study (M = 6.47, SD = 2.22) vs. replication (M = 6.86, SD = 2.36)	−0.39 [–1.07, 0.28]	±1.14	1.848	124.97	.034
Significant results (M = 7.09, SD = (2.09) vs. nonsignificant results (M = 6.24, SD = 2.41)	0.86 [0.19, 1.52]	±1.13	−0.682	121.93	.248
Total score
Original study (M = 6.48, SD = 1.84) vs. replication (M = 6.86, SD = 1.91)	−0.38 [–0.93, 0.17]	±0.94	1.671	124.98	.049
Significant results (M = 7.02, SD = 1.76) vs. nonsignificant results (M = 6.32, SD = 1.95)	0.71 [0.16, 1.25]	±0.93	−0.675	123.28	.250

Note: Sample sizes for all tests were as follows: original study—n = 62; replication—n = 65; significant results—n = 64; nonsignificant results—n = 63. Values in parentheses are standard deviations. Values in brackets are 90% confidence intervals. The upper and lower bounds for the equivalence test are the raw-score values corresponding to Cohen’s d = ±0.50.

Discussion

This study had two purposes: The first was to investigate if two characteristics of a research report, its originality and the statistical significance of its main effect, affect peer review. The second, and more important, purpose of this study was to demonstrate the value, necessity, and feasibility of metascientific field experiments on the peer-review process.

We observed some evidence for a small bias in favor of significant results. At least for this particular conference, though, it is unlikely that the effect was large enough to notably affect acceptance rates. We did not observe an aversion to replication studies documented elsewhere (Zwaan, Etz, Lucas, & Donnellan, 2018), which could be a tangible result of the continued debate regarding (media) psychology’s robustness and the value of replications that had started a few years before this experiment was conceived. One practical outcome with regard to the Media Psychology Division is that submissions to the 11th conference (held in 2019) were no longer allowed to include study results.

Lessons and limitations

There are contextual and design constraints on the generalizability of our observations. We used only (four slight variations of) one abstract with one specific research question. A fictitious abstract from another research domain, or even a different abstract of the same quality in the same domain, might have received different evaluations. Ideally, we would have used multiple abstract sets as a within-subjects factor to allow controlling for person (i.e., reviewer) characteristics (e.g., general strictness), and we would have treated the variation in abstracts as a random factor to determine between-abstract heterogeneity (to have a stronger basis for generalizing to other possible abstracts).

Each version of the abstract was shorter than 300 words, and therefore the information available to thoroughly evaluate these submissions on each criterion was limited. For instance, the entire statistical summary of the empirical evidence was reduced to an F value, degrees of freedom, and a corresponding p value (which is not uncommon for a conference abstract). The abstract reported no further tests of evidentiary value (e.g., equivalence tests or Bayesian analyses); no supporting information, such as an effect-size estimate, standard error, or confidence interval; and no descriptive statistics (means, standard deviations)—all of which are typically considered when interpreting data and evaluating a full-length study report. The absence of a statistical description of the underlying data certainly made it difficult for reviewers to properly assess the outcome of the study, and this ambiguity was likely greater for the statistically nonsignificant results than for the significant results (which means the information carried was asymmetrical across conditions). This may have affected the reviewers’ evaluation scores, in particular with regard to the quality of presentation and discussion of results.

The sample was rather small for a 2 × 2 between-subjects design, allowing us to reliably detect only main effects that were at least half a standard deviation in magnitude. Certainly, there could have been effects our study was simply not equipped to detect, particularly given the low reliability of peer-review instruments (Bornmann et al., 2010). Conversely, the low precision of the effect-size estimates limits the confidence that can be placed in both our findings of modest effects and our findings of negligible effects: Without taking any other design limitations or contextually introduced biases into account, the modest effects we observed could in fact be null, or in some cases could even be effects in the opposite direction, whereas the seemingly negligible effects could actually be substantial. However, as mentioned earlier, the full population of German-speaking media psychologists was probably exhausted to a large degree by the recruiting process. The necessary conclusion for future field experiments in relatively small research areas is that one should plan clean, parsimonious designs that rigorously test simple, incremental hypotheses.

The Media Psychology Division’s conference is certainly less competitive than larger conferences or other publication venues. Accordingly, the regularly submitted abstracts had a rather generous mean total score of 7.2 (on a scale from 0 to 10). Thus, it is unclear to what extent the findings can be generalized to peer-review processes in which the reviewers are more critical of the work they are assigned. In those publication venues, reviewers are usually selected on the basis of their expertise, whereas in our study, the fictitious abstract was assigned to any volunteer. Although self-reported expertise was only weakly correlated with any dependent variable in our study, it could be an important factor to consider when studying biases against nonsignificant findings, particularly in failed replications (see, e.g., Ernst & Resch, 1994).

Almost all the previous empirical research we have discussed in this article was conducted on peer review of journal submissions (and in some cases, grant applications), and therefore we only cautiously integrate our own observations with the literature. Certainly, differences between disciplines and subdisciplines regarding the perceived value of publishing in journals versus books versus proceedings affect the immediate relevance of this field experiment.

Finally, it is also conceivable that the conference’s theme, the reputation of the division’s leadership as open-science advocates, or awareness of other metascientific research in which we were involved affected the reviewers’ responses to direct replications (which had been rarely presented at previous instances of this conference).

A blueprint for metascientific field experiments on peer review

As regards the second purpose of this study, we note that there are, of course, practical and ethical challenges to this type of research that need careful consideration. Experiments on peer review face the same ethical challenges as does any field study in which subjects are not provided complete information about the research design or even may be deceived. Studies guiding evidence-based practice by rigorously comparing several policies (e.g., different instructions to reviewers, different levels of blinding) in an A/B design may face strong objections by the community (i.e., the subject pool), even when the untested implementation of either A or B would be unobjectionable (Meyer et al., 2019). Objections to experimentation (principled or not) naturally depend on the design, but we argue not only that experimental research on peer review can be conducted ethically, but also that there is an ethical obligation to conduct such research considering the costs of maintaining an unchecked quality-assurance procedure (Meyer, 2015). Simply put, the alternatives to evidence-based peer-review procedures—including the status quo, with its documented failures (Jefferson, Alderson, et al., 2002)—have not been subjected to systematic testing.

Given the nature of conferences, and their deadlines for submissions, reviews, and eventually presentations, our study had particular constraints that would be less relevant in a field setting where continuous data collection is possible (e.g., journal review). Although reviewers may not be told a priori to which condition specifically they were assigned, or what exactly is part of the study (which may be more obvious in typical laboratory experiments), psychologists studying peer review can design their research in a way that allows potential reviewers to make an informed choice about their participation. We propose a model in which a pool of reviewers (e.g. everyone registered with a journal’s submission system) is informed that some experimental studies will be conducted within a specified time frame in the future. This initial information sheet could outline the types of study designs planned or characteristics that might be experimentally manipulated (and refer to the resulting additional workload beyond that required by conventional review, if any), without specifying which of these will be realized. Reviewers may then pick and choose how much time they would be willing to commit to this research, and which of the potential manipulations they do or do not consent to. For example, some reviewers may be unconcerned about reviewing real manuscripts under procedural manipulations (such as omitting Results sections from manuscripts), but less inclined to review entirely fictitious manuscripts that serve no purpose other than the research itself. Both opt-in (van Rooyen, Godlee, Evans, Smith, & Black, 1998) and opt-out (Okike et al., 2016) procedures seem practically feasible and ethically unproblematic.

Naturally, when metascientists use real manuscripts as stimulus materials to investigate peer review, another challenge is to prevent the experimental manipulations from inducing systematic or selective disadvantages for authors. For example, it is conceivable that some design decisions would affect the strictness of reviews (e.g., by guiding the attention of reviewers to certain manuscript characteristics or prompting them to submit more critical remarks) but not the quality of the selection process or the refinement of submissions. Depending on the magnitude and risk of the manipulation, it may be necessary to use conventional or accepted review procedures to guide editorial decision making and to use the experimental reviews exclusively for research purposes. This approach may also reduce the community’s objections to the experimental study of peer review (Meyer et al., 2019). Metascientists may even find some benefit in this approach, as the status quo reviews could be used as natural control-group data (possibly even without obtaining consent).

Conclusions

We hope that this study encourages psychologists, as individuals and on institutional levels (associations, journals, conferences), to conduct experimental research on peer review, and that the preregistered field experiment we have reported may serve as a blueprint of the type of research we argue is necessary to cumulatively build a rigorous knowledge base on the peer-review process. We believe it prudent to eventually develop and implement evidence-based interventions that address documented shortcomings in academic publishing. We also believe that an improved understanding of peer review will increase the sustainability of the quality-management system in its entirety and reduce strain on the army of volunteer reviewers, as the current pronounced randomness of the process provides an incentive to ignore comments (even when they are valuable) and resubmit manuscripts elsewhere unchanged.

Going back to Churchill, we note that peer review cannot be described, with sufficient certainty, as the worst form of academic quality assessment except for all other forms, because very little is known about its performance in general, and because it is not particularly clear how well peer review fares against other forms of academic quality assessment, simply because not many have been tried in a systematic way. Experimental field research on peer review is necessary to understand its mechanisms, effectively contextualize it in psychological theories of various biases, and develop practical procedures to increase its utility. If peer review is maintained as the primary mechanism of arbitration in the competitive selection of research reports and funding, then the scientific community needs to make sure it is not arbitrary.

Supplemental Material

Elson_Rev_Open_Practices_Disclosure – Supplemental material for Metascience on Peer Review: Testing the Effects of a Study’s Originality and Statistical Significance in a Field Experiment

Supplemental material, Elson_Rev_Open_Practices_Disclosure for Metascience on Peer Review: Testing the Effects of a Study’s Originality and Statistical Significance in a Field Experiment by Malte Elson, Markus Huff and Sonja Utz in Advances in Methods and Practices in Psychological Science

Footnotes

Acknowledgements

We sincerely thank Johannes Breuer and James Ivory for their help in designing the stimulus materials. We further thank Sebastian Strauß for his assistance with the data preparation.

Transparency

Action Editor: Simine Vazire

Editor: Daniel J. Simons

Author Contributions

M. Elson, M. Huff, and S. Utz jointly generated the idea for the study. M. Elson designed the study and collected the data. M. Elson wrote the analysis code and analyzed the data. M. Elson wrote the first draft of the manuscript, and all three authors critically edited it. All the authors approved the final submitted version of the manuscript.

ORCID iD

Malte Elson

Notes

References

Akerlof

G. A.

Michaillat

(2018). Persistence of false paradigms in low-power sciences. Proceedings of the National Academy of Sciences, USA, 115, 13228–13233. doi:10.1073/pnas.1816454115

Atkinson

D. R.

Furlong

M. J.

Wampold

B. E.

(1982). Statistical significance, reviewer evaluations, and the scientific process: Is there a (statistically) significant relationship? Journal of Counseling Psychology, 29, 189–194. doi:10.1037//0022-0167.29.2.189

Bedeian

A. G.

(2003). The manuscript review process: The proper roles of authors, referees, and editors. Journal of Management Inquiry, 12, 331–338. doi:10.1177/1056492603258974

Bornmann

Mutz

Daniel

H.-D.

(2010). A reliability-generalization study of journal peer reviews: A multilevel meta-analysis of inter-rater reliability and its determinants. PLOS ONE, 5(12), Article e14331. doi:10.1371/journal.pone.0014331

Burnham

J. C.

(1990). The evolution of editorial peer review. JAMA, 263, 1323–1329. doi:10.1001/jama.1990.03440100023003

Camerer

C. F.

Dreber

Holzmeister

T.-H.

Huber

Johannesson

. . . Wu

(2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2, 637–644. doi:10.1038/s41562-018-0399-z

Cooper

DeNeve

Charlton

(1997). Finding the missing science: The fate of studies submitted for review by a human subjects committee. Psychological Methods, 2, 447–452. doi:10.1037/1082-989X.2.4.447

Coursol

Wagner

E. E.

(1986). Effect of positive findings on submission and acceptance rates: A note on meta-analysis bias. Professional Psychology: Research and Practice, 17, 136–137. doi:10.1037//0735-7028.17.2.136

Couzin-Frankel

(2013). Secretive and subjective, peer review proves resistant to study. Science, 341, 1331. doi:10.1126/science.341.6152.1331

10.

Deutsche Gesellschaft für Psychologie. (n.d.). What is the German Psychological Society (DGPs)? Retrieved from https://www.dgps.de/index.php?id=83&L=1

11.

Elson

Przybylski

A. K.

(2017). The science of technology and human behavior. Journal of Media Psychology, 29(1), 1–7. doi:10.1027/1864-1105/a000212

12.

Emerson

G. B.

Warme

W. J.

Wolf

F. M.

Heckman

J. D.

Brand

R. A.

Leopold

S. S.

(2010). Testing for the presence of positive-outcome bias in peer review. Archives of Internal Medicine, 170, 1934–1939. doi:10.1001/archinternmed.2010.406

13.

Epstein

W. M.

(1990). Confirmational response bias among social work journals. Science, Technology & Human Values, 15, 9–38. doi:10.1177/016224399001500102

14.

Eriksson

(2012). The nonsense math effect. Judgment and Decision Making, 7, 746–749.

15.

Ernst

Resch

K.-L.

(1994). Reviewer bias: A blinded experimental study. Journal of Laboratory and Clinical Medicine, 124, 178–182.

16.

Forscher

P. S.

Cox

W. T. L.

Devine

P. G.

Brauer

(2019). How many reviewers are required to obtain reliable evaluations of NIH R01 grant proposals? PsyArXiV. Retrieved from https://psyarxiv.com/483zj/

17.

Franco

Malhotra

Simonovits

(2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345, 1502–1505. doi:10.1126/science.1255484

18.

Godlee

Gale

C. R.

Martyn

C. N.

(1998). Effect on the quality of peer review of blinding reviewers and asking them to sign their reports: A randomized controlled trial. JAMA, 280, 237–240. doi:10.1001/jama.280.3.237

19.

Goodman

S. N.

Berlin

J. A.

Fletcher

S. W.

Fletcher

R. H.

(1994). Manuscript quality before and after peer review and editing at Annals of Internal Medicine. Annals of Internal Medicine, 121, 11–21. doi:10.7326/0003-4819-121-1-199407010-00003

20.

Hammerschmidt

D. E.

(1994). The vagaries of peer review: A new study and our experience. Journal of Laboratory and Clinical Medicine, 124, 146–148.

21.

Heesen

Bright

L. K.

(2019). Is peer review a good idea? The British Journal for the Philosophy of Science. Advance online publication. doi:10.1093/bjps/axz029

22.

Heesen

Romeijn

J.-W.

(2019). Epistemic diversity and editor decisions: A statistical Matthew effect. Philosophers’ Imprint, 19(39). Retrieved from http://hdl.handle.net/2027/spo.3521354.0019.039

23.

Jefferson

Alderson

Wager

Davidoff

(2002). Effects of editorial peer review: A systematic review. JAMA, 287, 2784–2786. doi:10.1001/jama.287.21.2784

24.

Jefferson

Wager

Davidoff

(2002). Measuring the quality of editorial peer review. JAMA, 287, 2786–2790. doi:10.1001/jama.287.21.2786

25.

Krämer

N. C.

(2015). Editor-in-chief transition: Achievement and future plans. Journal of Media Psychology, 27, 1–2. doi:10.1027/1864-1105/a000147

26.

Lakatos

(1969). Criticism and the methodology of scientif-ic research programmes. Proceedings of the Aristotelian Society, 69, 149–186. doi:10.1093/aristotelian/69.1.149

27.

Mahoney

M. J.

(1976). Scientist as subject: The psychological imperative. Cambridge, MA: Ballinger.

28.

Mahoney

M. J.

(1977). Publication prejudices: An experimental study of confirmatory bias in the peer review system. Cognitive Therapy and Research, 1, 161–175. doi:10.1007/BF01173636

29.

Malicˇki

von Elm

Marušic

(2014). Study design, publication outcome, and funding of research presented at international congresses on peer review and biomedical publication. JAMA, 311, 1065–1067. doi:10.1001/jama.2014.143

30.

Meyer

M. N.

(2015). Two cheers for corporate experimentation: The A/B illusion and the virtues of data-driven innovation. Colorado Technology Law Journal, 13, 273–332.

31.

Meyer

M. N.

Heck

P. R.

Holtzman

G. S.

Anderson

S. M.

Cai

Watts

D. J.

Chabris

C. F.

(2019). Objecting to experiments that compare two unobjectionable policies or treatments. Proceedings of the National Academy of Sciences, USA, 116, 10723–10728. doi:10.1073/pnas.1820701116

32.

Okike

Hug

K. T.

Kocher

M. S.

Leopold

S. S.

(2016). Single-blind vs double-blind peer review in the setting of author prestige. JAMA, 316, 1315–1316. doi:10.1001/jama.2016.11014

33.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349, Article aac4716. doi:10.1126/science.aac4716

34.

Pöschl

(2012). Multi-stage open peer review: Scientific evaluation integrating the strengths of traditional peer review with the virtues of transparency and self-regulation. Frontiers in Computational Neuroscience, 6, Article 33. doi:10.3389/fncom.2012.00033

35.

Rennie

(2003a). Editorial peer review: Its development and rationale. In Godlee

Jefferson

(Eds.), Peer review in health sciences (2nd ed., pp. 1–13). London, England: BMJ Books.

36.

Rennie

(2003b). Innovation and peer review. In Godlee

Jefferson

(Eds.), Peer review in health sciences (2nd ed., pp. 76–90). London, England: BMJ Books.

37.

Robin

E. D.

Burke

C. M.

(1987). Peer review in medical journals. Chest, 91, 252–255. doi:10.1378/chest.91.2.252

38.

Rosenthal

(1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86, 638–641. doi:10.1037//0033-2909.86.3.638

39.

Simmons

J. P.

Nelson

L. D.

Simonsohn

(2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. doi:10.1177/0956797611417632

40.

Smaldino

P. E.

McElreath

(2016). The natural selection of bad science. Royal Society Open Science, 3(9), Article 160384. doi:10.1098/rsos.160384

41.

Smith

(2006). Peer review: A flawed process at the heart of science and journals. Journal of the Royal Society of Medicine, 99, 178–182. doi:10.1258/jrsm.99.4.178

42.

Spier

(2002). The history of the peer-review process. Trends in Biotechnology, 20, 357–358. doi:10.1016/S0167-7799(02)01985-6

43.

Tsou

Schickore

Sugimoto

C. R.

(2014). Unpublishable research: Examining and organizing the “file drawer.” Learned Publishing, 27, 253–267. doi:10.1087/20140404

44.

van Rooyen

Godlee

Evans

Smith

Black

. (1998). Effect of blinding and unmasking on the quality of peer review: A randomized trial. JAMA, 280, 234–237. doi:10.1001/jama.280.3.234

45.

Wicherts

J. M.

Kievit

R. A.

Bakker

Borsboom

(2012). Letting the daylight in: Reviewing the reviewers and other ways to maximize transparency in science. Frontiers in Computational Neuroscience, 6, Article 20. doi:10.3389/fncom.2012.00020

46.

Wood

Wessely

(2003). Peer review of grant applications: A systematic review. In Godlee

Jefferson

(Eds.), Peer review in health sciences (2nd ed., pp. 14–44). London, England: BMJ Books.

47.

Zwaan

R. A.

Etz

Lucas

R. E.

Donnellan

M. B.

(2018). Making replication mainstream. Behavioral & Brain Sciences, 41, Article e120. doi:10.1017/S0140525X17001972

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.65 MB