Abstract

Imagine the following scenario: Ph.D. student David has run a series of studies trying to find a positive effect of brain stimulation on language comprehension in stroke patients. After three studies with null findings, he has changed the design in various ways and is overjoyed when the fourth study gives a statistically significant effect. His article is published in a prestigious, high-impact journal, with David as first author and his eminent supervisor as last author. The university press office promotes the study, and it is featured on National Public Radio. Two weeks later, when preparing slides for a talk at the Society for Neuroscience, David finds that the groups were miscoded, and in fact the sham-treatment group obtained higher posttraining scores than the stimulation group.
When I use fictitious examples like this in seminars and ask the audience, “What should David do?” the usual response is that, of course, David should come clean, admit the error, and ask for the article to be retracted. But there is typically nervousness in the room. It is pointed out that there are massive pressures on him not to do so: The general perception is that admission of error will mean that the reputation of both David and his supervisor will be in tatters, with David’s prospects for a future career badly damaged.
Yet there are real-life examples of scientists admitting to honest errors that show that this doom-laden scenario is unrealistic. In a recent study, Azoulay, Bonatti, and Krieger (2017) considered how reputation is affected by retraction, by comparing subsequent citations of earlier published articles for authors who had and who had not had an article retracted. Retraction due to researchers’ misconduct led to a drop in subsequent citations of their earlier work, but there was a smaller effect when honest error was involved, and no evidence of reputational damage for junior researchers. In an interview study of 14 authors whose articles were retracted after they notified the journal of errors, Hosseini, Hilhorst, de Beaufort, and Fanelli (2018) found that, contrary to the interviewees’ expectations, self-retraction did not damage their reputation and in some cases improved it. This fits with more informal evidence suggesting that there can be reputational advantage from going public in correcting an error: You demonstrate that you are someone who values scientific accuracy over your success in publishing (Retraction Watch, 2017). Nevertheless, there may be pressures from institutions or senior colleagues to hide errors, and journal editors are not always supportive. Hosseini et al. noted: “Many authors expected rapid, empathic and detailed responses from journal editors, but reported receiving short, unsympathetic and sometimes unpleasant ones instead” (p. 200).
The thought of having to retract an article can instill fear into the heart of scientists, who see it as equivalent to being named and shamed. There are currently few incentives for honesty, and keeping quiet about an error will often seem the easiest option. Recognizing that the threat of bad consequences could act as a deterrent to honest admission of error, Retraction Watch instituted the Doing the Right Thing award to “honor those who clean up the scientific literature” (Oransky & Marcus, 2017). I give some examples of researchers who have publicized their own errors in Box 1.
Box 1. Examples of Researchers Who Highlighted Errors in Their Own Work
• With six coauthors, Richard Mann, a postdoctoral researcher using statistical methods to study behavioral ecology, had published an article on behavior in prawns in PLOS Computational Biology. He shared the data set with a colleague who was looking for data to test out some ideas on numerical integration. On his blog, Mann (2013) described the moment when the colleague phoned him to tell him of a fatal error in his analysis. As stated in the retraction notice (Mann et al., 2012),
Where each of 102 experiments should have been down-sampled to half the original size for computational efficiency, instead the number of experiments in the data set was repeatedly halved 102 times. . . . results and conclusions were based on only one experimental study, rather than the 102 reported in the paper.
The article was retracted, and the analysis was redone, giving similar findings. Mann (2013) stated that although he had a terrible few months, he did not suffer any long-term stigma.
• A story in Nature News (Gewin, 2015) documented how Pamela Ronald, a professor in plant pathology, became concerned when two of her postdocs could not replicate findings she had published, in 1995, in two high-profile articles on the basis of the immune response in rice. She notified the journal editors and then devoted the next 18 months to trying to locate the source of the discrepancy. It turned out that the strains of microbes she had been using were mislabeled, and in 2013 the articles were retracted. In 2015, Ronald published an article correctly identifying the source of the immune response. She has changed her lab procedures so that three independent researchers now validate new experimental approaches.
• Senior neuroscientist Russ Poldrack wrote computer code to classify a set of brain images into classes according to the task being performed. He had submitted a manuscript based on this analysis for publication, when a student collaborator told him that after obtaining far lower classification accuracy on the same data set, he found an error in the code. Poldrack’s (2013) response was to write a blog post about this experience, encouraging everyone to share code, use better methods for checking code, and talk about their errors.
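The kind of slip described in the first example of Box 1 is easy to reproduce in a few lines. The following is a minimal sketch, not the code from the retracted article: the function names, the toy data, and the way the halving goes wrong are all assumptions made for illustration. It contrasts the intended operation (halving the size of each experiment) with the faulty one (halving the number of experiments once per experiment), and shows a cheap check that would flag the difference.

```python
def downsample_each(experiments):
    """Intended behavior: keep every second observation within each experiment."""
    return [observations[::2] for observations in experiments]


def halve_dataset_repeatedly(experiments):
    """Faulty behavior (hypothetical reconstruction): the loop halves the
    list of experiments itself, once for every experiment in the data set."""
    remaining = list(experiments)
    for _ in range(len(experiments)):
        remaining = remaining[: max(1, len(remaining) // 2)]
    return remaining


# Toy data: 102 experiments with 10 observations each (made-up values).
experiments = [list(range(10)) for _ in range(102)]

intended = downsample_each(experiments)
faulty = halve_dataset_repeatedly(experiments)

print(len(intended), len(intended[0]))  # 102 5
print(len(faulty), len(faulty[0]))      # 1 10


def check_experiment_count(original, processed):
    """Sanity check: down-sampling must not change the number of experiments."""
    return len(original) == len(processed)


print(check_experiment_count(experiments, intended))  # True
print(check_experiment_count(experiments, faulty))    # False
```

Both versions run without raising an error, which is why a one-line check on the shape of the processed data, or a second pair of eyes on shared code, is often the only thing that catches such a mistake.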
There are two further points to take from the David scenario. As awful and embarrassing as it is to admit to error, the alternative, hiding a known error, has to be worse. The person who does this is entering into a Faustian pact to reject science in favor of personal ambition. As data fraudster Diederik Stapel openly admitted, once you embark on this process, it is difficult to stop, but it creates considerable internal conflict (Stapel, 2014, pp. 128–131).
The second point is that although errors can never be eliminated, they can be reduced by adoption of open-science practices. Even in situations in which the raw data cannot be made completely open in a repository, usually because of confidentiality issues, it is often possible to deposit a version that has been modified to remove identifiable information, so that other researchers can reproduce what was done (UK Data Service, n.d.). For sensitive data, a data-sharing agreement may be needed in addition to anonymization (Medical Research Council, 2017). Practical suggestions for data sharing in psychology were recently proposed by Gilmore, Kennedy, and Adolph (2018).
Regardless of the level of security that is required, there should be no barriers to researchers making their analysis code open, so that the analysis steps can be checked. The example from Russ Poldrack in Box 1 illustrates how easy it is even for an experienced scientist to make an error in coding that has serious consequences for results. People often worry that if they make their code and data open, errors will be found, but that is really the whole point: We need to make code and data open because this is how the errors can be found. Also, if you know that your code and data will be open, you are likely to check and double-check them far more rigorously than if you know they will never be seen by anyone else. So open practices reduce the likelihood of error. Furthermore, errors in analysis scripts are extremely common among scientists who have taught themselves to program (Merali, 2010), so errors are likely to be present. But to encourage people to share scripts, we must remove any stigma associated with detection of errors. This is not condoning sloppy science: It is just accepting the reality that we are all fallible.
Of course, making analysis programs open does not guarantee that they are free from errors. A result may be reproducible—in the sense that we obtain consistent results when the same data are run through the same program—but it may still be wrong. An example of widely used neuroimaging software that was discovered to include a bug only after many years of use was reported by Eklund, Nichols, and Knutsson (2016a). The authors noted in a subsequent correction that they were not implying that all analyses using the software were erroneous, but rather meant that it was not possible to establish which were. They explained, “Due to lamentable archiving and data-sharing practices, it is unlikely that problematic analyses can be redone” (Eklund, Nichols, & Knutsson, 2016b, para. 3). Quite simply, making code and data open does not prevent errors, but it does make it possible to detect them. And as amply documented elsewhere, it brings other benefits to researchers, in terms of improving their science as well as enhancing recognition of their work (Markowetz, 2015; McKiernan et al., 2016).
Errors in Someone Else’s Work: How to Respond
The prior discussion of errors in one’s own work should give clues about how to respond when you find errors in another person’s work. You would not want to be pilloried for an honest error, so do not pillory others for simple mistakes. In a comment on a blog post on this topic, Weil (2014) put it very well: . . . my first prominent publication was a note tearing down someone else’s work. That work had appeared in a major journal and caused quite a stir — but the apparent results were the product of a careless (not dishonest, just careless) mistake in the analysis. The note pointing this out was not derogatory in tone, nor was it intended to shame, but was doubtless embarrassing to the authors. Now that I am much older, a little wiser, and a little kinder (and a lot more employed, and thus less vulnerable to jerks) I would send the authors my analysis of their math first and give them the opportunity to correct. And I hope that my colleagues would give me the same consideration if (when?) I make a stupid mistake.
Life, however, is not always so simple. The researcher whose error is remarked on may respond with anger, denial, or silence. This is, of course, a normal human reaction, but it is not a sensible response if the error is unambiguous, as it can damage the author’s reputation for integrity. In theory, it should be possible to resolve such issues via the journal that published the original article, but in practice, this process seldom proceeds smoothly. Allison, Brown, George, and Kaiser (2016) described how their own attempts to correct substantial errors in other researchers’ work met with inaction or delaying tactics by authors and editors, and even demands for payment to publish a letter pointing out the errors. At the time of writing this Commentary, it was possible to put the record right by adding a comment in PubMed Commons (Bastian, 2014). The comment was linked to the abstract of the original article on PubMed and became part of the scientific record. The first two examples in Box 2 illustrate how both authors and other researchers have used PubMed Commons to record a correction. However, despite its utility, PubMed Commons was not widely used by commenters and was discontinued in February 2018, though the comments remain archived (National Center for Biotechnology Information, 2018).
Box 2. Examples of Postpublication Commentary on PubMed Commons
1. Author adding minor corrections: https://hypothes.is/search?q=tag%3APubMedCommonsArchive+28436345
Jim van Os noted some numerical errors in a table in an article he had published.
2. Reviewer correcting an error: https://hypothes.is/search?q=tag%3APubMedCommonsArchive+28461468
Pavel Nesmiyanov noted that β-endorphin, oxytocin, and dopamine were wrongly described as neuropeptides in a journal article. Although the authors did not respond on PubMed Commons, an erratum was published in the journal.
3. Reviewer critiquing methods: https://hypothes.is/search?q=tag%3APubMedCommonsArchive+29153326
Franck Ramus criticized the small sample size of a study on neurobiological correlates of dyslexia. The authors responded, defending the small sample size and arguing that their analyses were driven by an a priori hypothesis derived from a previous study.
4. Reviewer critiquing methods: https://hypothes.is/search?q=tag%3APubMedCommonsArchive+28706072
Serge Ahmed suggested that a study of planning in ravens needed an additional control for learning of the affective value of objects.
5. Reviewer noting overhyped interpretation of results: https://hypothes.is/search?q=tag%3APubMedCommonsArchive+28735725
Clive Bates noted that a study finding an association between vaping and smoking tobacco in adolescents had been widely interpreted in the media as showing a causal link. He added a link to a more detailed critique of the study.
6. Reviewer raising more serious concerns: https://hypothes.is/search?q=tag%3APubMedCommonsArchive+17688420
David Nunan noted previously raised concerns about duplicate data in an article on the role of diet in congestive heart failure.
Errors in Interpretation of Data
Research results may seem suspect because of concerns about methodology, rather than straightforward errors in calculation or scripting. For instance, a study may lack a control group, be underpowered, use an unreliable measure, or have a major confound. There may be strong suspicion that the author has engaged in p-hacking. These are not simple errors that can be corrected, but they affect the conclusions that can be drawn. All of these are situations in which PubMed Commons provided a venue for raising the concerns, as illustrated in Box 2, Examples 3 through 5. With the disappearance of PubMed Commons, there are limited options remaining to researchers who want to engage in postpublication peer review, given that few journals have options for commenting. For researchers who do not have access to a blog, an alternative platform, PubPeer, is likely to become the method of choice for postpublication peer review. An important difference from PubMed Commons is that commenters can be anonymous. Probably because of this, PubPeer has been far more popular than PubMed Commons, but it is also noted for a harsh style of criticism that can include accusations of malpractice (Dolgin, 2018). This is unfortunate because it leads to the impression that postpublication peer review typically involves a personal attack. Harsh criticism can polarize debate and make many people reluctant to engage. PubMed Commons was also used to draw attention to malpractice, but typically such comments described the problem without engaging in personal attack (see Box 2, Example 6).
My recommendation is that when errors are found, the starting position should be that methodological weaknesses are due to ignorance rather than bad faith. Consider, for instance, p-hacking. The dangers of this practice were pointed out many years ago (de Groot, 2014), but it has been normative for decades in many branches of science, including psychology. Before he moved on to fraud, Stapel (2014) engaged in p-hacking, noting: What I did wasn’t whiter than white, but it wasn’t completely black either. It was grey, and it was what everyone did. (p. 102)
Even now that it has been prominently demonstrated that p-hacking is a major cause of false positive findings (Simmons, Nelson, & Simonsohn, 2011), many researchers still do not recognize how seriously it can distort results (Nuzzo, 2014). Furthermore, it is likely that p-hacking is deemed acceptable, because it involves paltering, that is, using a truthful statement (e.g., that the p value associated with a contrast is < .05) to mislead by failing to provide relevant contextual information (e.g., that this comparison was one of numerous comparisons and would not be statistically significant if correction were made for multiple contrasts; Rogers, Zeckhauser, Gino, Norton, & Schweitzer, 2017).
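The arithmetic behind the parenthetical example above can be made concrete. This is a minimal sketch under stated assumptions: the number of contrasts (20) and the reported p value (.03) are invented for illustration, the tests are treated as independent, and Bonferroni correction is used only because it is the simplest adjustment, not because the cited authors prescribe it.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across n independent
    tests when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


alpha = 0.05
n_contrasts = 20   # hypothetical number of contrasts that were actually run
reported_p = 0.03  # hypothetical p value of the one contrast that was reported

print(round(familywise_error_rate(alpha, n_contrasts), 3))    # 0.642
print(reported_p < alpha)                                     # True
print(reported_p < bonferroni_threshold(alpha, n_contrasts))  # False
```

With 20 contrasts, a "significant" result somewhere is more likely than not even when nothing is going on, so the truthful statement that p < .05 misleads unless the number of contrasts is reported alongside it.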
Scientists are particularly prone to paltering when it comes to citing the results of other researchers. The process of conducting a literature review is likely to be affected by confirmation bias, that is, seeking and remembering evidence that supports one’s position, and ignoring or forgetting evidence that does not (Nickerson, 1998). Rogers et al. (2017) showed that people judge such omission as less dishonest than inclusion of untrue information, and it is often unwitting, but the consequences can be substantial (Greenberg, 2009). One way of counteracting bias in literature reviews is to require that they follow the format of a systematic review, in which criteria for deciding which reports to include are specified in advance (Gough, Oliver, & Thomas, 2017; Wicherts, 2017).
Failure to Replicate: An Unreliable Indicator of Fallibility
I have focused so far on situations in which there are either honest errors in the data or analysis or methodological weaknesses that compromise conclusions that can be drawn. A much more complicated scenario arises when there is difficulty in replicating a published result. This has become a hot topic in science in recent years (Munafò et al., 2017), and failure to replicate findings in psychology was brought to the fore by an influential study published in Science (Open Science Collaboration, 2015). These developments coincided with growing awareness of p-hacking as an endemic problem for psychology (Simmons et al., 2011), which made it easy to conclude that results that were not replicated were indicative of bad science. The key point to note is that although erroneous data, erroneous inferences, and failure to control bias can lead to results that are not replicated, one cannot assume that failure to replicate is necessarily the result of any of these types of error. In psychology, we are dealing with probabilistic phenomena, so random noise is always a factor affecting results: Our statistical methods are designed to guard against Type I and Type II errors, but there is an inevitable trade-off, so some statistically significant differences will be false positives, and some failures to find an effect will be false negatives (see Box 3 for a list of possible reasons for failure to replicate). Replication is important precisely because our confidence in the robustness of a given finding cannot depend on a single study.
Box 3. Possible Reasons for Failure to Replicate a Scientific Result
• The initial result was a false positive due to chance variation (Type I error)
• The replication study failed to detect a true effect because of chance variation (Type II error)
• The results are sensitive to contextual factors
• The method requires specific expertise that the researcher conducting the replication lacks
• The initial results rested on data-entry, computational, or statistical errors
• The initial results were obtained using questionable research practices, such as p-hacking
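The first two reasons in the list above can be quantified with a short power calculation. The sketch below rests on simplifying assumptions that are not in the original text: a two-group comparison analyzed with a two-sided z test, a known standard deviation of 1, and illustrative values for the true effect size (0.5) and sample size (20 per group).

```python
from math import sqrt
from statistics import NormalDist


def power_two_group_z(effect_size, n_per_group, alpha=0.05):
    """Probability that a two-sided z test comparing two group means is
    statistically significant, given a true standardized effect size."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the effect is real.
    noncentrality = effect_size * sqrt(n_per_group / 2)
    upper_tail = 1 - std_normal.cdf(z_crit - noncentrality)
    lower_tail = std_normal.cdf(-z_crit - noncentrality)
    return upper_tail + lower_tail


# Illustrative values only: a true effect of 0.5 SD with 20 participants per group.
power = power_two_group_z(effect_size=0.5, n_per_group=20)
print(round(power, 2))  # about 0.35

# Chance that an original study is significant and an identical, independent
# replication is not, even though the effect is real and both are done well.
print(round(power * (1 - power), 2))  # about 0.23

# With no true effect, the chance of a significant result equals alpha (Type I error).
print(round(power_two_group_z(effect_size=0.0, n_per_group=20), 2))  # 0.05
```

Under these assumptions, nearly a quarter of pairs of studies would show a "successful" original followed by a "failed" replication purely through chance variation, without any error or questionable practice on either side.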
So, the question arises as to how researchers should respond when there is a failure to replicate prior work. Given the range of reasons for nonreplication, it should not be assumed that a failure to replicate a result is evidence of poor science in the original study. Nevertheless, it is important to uncover reasons for discrepant findings. Ideally, the two sets of researchers should work together to consider how to reconcile the discrepancy. If the original researchers believe that contextual factors or researcher expertise are critical to obtaining their result, then it is up to them to specify more carefully the conditions under which the effect obtains, rather than simply put forward hypothetical explanations for a null result. When there is a failure to replicate a finding, it is bad if the first response is to disparage the original researchers as incompetent, malign, or fraudulent, but it is just as bad if researchers whose findings were not replicated dismiss the critics as lacking in expertise or having malevolent motives. Again, the kudos will go to the researchers who show integrity in putting scientific truth before their own career ambitions. As a positive example, consider Finkel’s (2016) reflections on a failure to replicate one of his studies: “Although I am surprised by the failure of the manipulation check and disappointed that the results of the [Registered Replication Report] did not confirm the causal effects my colleagues and I originally reported, I deeply respect the process” (p. 766).
Deliberate Omission, Misrepresentation, and Misconduct
I turn now to those unfortunate situations in which it is hard to avoid concluding that a researcher is acting in bad faith. A particularly insidious kind of behavior involves deliberate selective citation of the literature, or cherry-picking. As is the case with other methodological errors, it can be difficult to distinguish deliberate misconduct from unwitting omission. No person should be pilloried for occasional bias in a review’s coverage: Even if one strenuously attempts to avoid bias, searches to identify publications on a topic may miss relevant articles because positive findings garner far more citations than null findings (Greenberg, 2009). Citation bias morphs into misconduct when there is a persistent pattern of an author ignoring contrary evidence, even when it is readily available and drawn to his or her attention. Worse still are cases in which cited studies are inaccurately portrayed. These are standard ploys by authors promoting pseudoscientific views (Grimes & Bishop, 2017) and need to be robustly challenged. However, to do so effectively, it may be necessary to trawl through a huge amount of material to reveal the distortion and lack of substance in the claims, and meanwhile, amplified by confirmation bias and social media, the original article may have propagated a wildfire of misinformation that is hard to extinguish (Lewandowsky, Ecker, & Cook, 2017).
The next level after distortion of research findings is outright invention of fake data. It is generally assumed that this is rare, though it is difficult to get accurate estimates of the frequency of this deception because of its very nature. A researcher who suspects misconduct by another scientist is placed in an uncomfortable position, and there is little formal guidance as to how to proceed. Simonsohn (2013, p. 1886), who used statistical methods to uncover the fraudulent work of two psychologists, summarized his recommended steps as follows:
1. Replicate the analyses across multiple studies before suspecting foul play
2. Compare suspect studies with similar ones by other authors
3. Extend the analyses to raw data
4. Contact the authors privately and transparently, and give them ample time to consider your concerns
5. Offer to discuss matters with a trusted statistically savvy advisor
6. Give the authors more time
7. If suspicions remain, convey them only to entities tasked with investigating such matters, and do so as discreetly as possible
Investigating suspected misconduct is extremely important work, but it is not for the fainthearted. An accusation of fraud is serious business and requires rock-solid evidence, which can take hours of careful work to discover. Although one would hope that academic institutions would take seriously an accusation of misconduct against a staff member, they can be slow to act; it is, of course, important that they consider the possibility that they are dealing with an unjustified attack by people with vested interests or fixed ideas. Such attacks do occur, but malign intent should not be the default assumption, unless there are several “red flags” (Lewandowsky & Bishop, 2016). Although there are some notable cases of good practice by institutions (e.g., Høj, 2013), there are also many historical instances of their closing ranks to protect an eminent researcher (Judson, 2004). This is shortsighted, as the ultimate reputational damage from being revealed to be supporting a dishonest researcher is far worse than any bad publicity from early disclosure of a problem. But the scientist who is trying to put things right can find it to be a lonely and dispiriting process, as Heathers (2017) documented on his blog. Furthermore, when we are dealing with genuine fraudsters, we can expect them to use every method possible to avoid discovery, because they have built a career on deceit. They are likely to be obstructive and may well attack back, accusing the people who are raising questions of ulterior motives. As do whistle-blowers in other areas of life, the people who detect fraud tend to get little thanks from the community whose interests they serve.
General Principles for Responding to Fallibility
Thankfully, accusations of deliberate misconduct in science are rare, but the spotlight has started to shine increasingly on fallibility in psychology, and some hitherto well-established findings are now looking less solid (e.g., O’Donnell et al., 2018). My general rule is that we should never use mockery or personal abuse against other scientists who make honest errors: Such behavior just reinforces people’s unwillingness to be open about errors. Nor should we assume that failure to replicate a result is a sign of poor science in the original study; rather, it is an indication that more work needs to be done to establish whether, and under what conditions, the result is robust. But good researchers will not hesitate to note flaws in their own scientific work and the work of others. Criticism is the bedrock of the scientific method. It should not be personal: If one has to point to problems with someone’s data, methods, or conclusions, this should be done without implying that the person is stupid or dishonest. This is important, because the alternative is that many people will avoid engaging in robust debate because of fears of interpersonal conflict—a recipe for scientific stasis. If wrong ideas or results are not challenged, we let down future generations who will try to build on a research base that is not a solid foundation. Worse still, when the research findings have practical applications in clinical or policy areas, we may allow wrongheaded interventions or policies to damage the well-being of individuals or society. As open science becomes increasingly the norm, we will find that everyone is fallible. The reputations of scientists will depend not on whether there are flaws in their research, but on how they respond when those flaws are noted.
Acknowledgements
This Commentary is based on a talk given on July 7, 2017, at a meeting on Reproducible Science for Early Career Researchers, at the University of Cardiff. I thank David Mehler for inviting me to present the talk at the meeting and for proposing this topic. I am also grateful to Kendal Smith for constructive comments on a preprint version of this article.
Action Editor
Daniel J. Simons served as action editor for this article.
Author Contributions
D. V. M. Bishop is the sole author of this article and is responsible for its content.
Declaration of Conflicting Interests
The author(s) declared that there were no conflicts of interest with respect to the authorship or the publication of this article.
