Abstract

There is evidence suggesting that to the extent that scientists subscribe to a philosophy of science, most are likely to cite Popper’s falsificationism as their practiced methodology (e.g., Mulkay & Gilbert, 1981; Sovacool, 2005). At the same time, well-known lessons from the history of science demonstrate that real scientific practice bears little resemblance to Popperian prescriptions (e.g., Kuhn, 1962/2012), and that this has almost certainly benefited scientific progress (e.g., Lakatos, 1978). In this Commentary, we provide a brief outline of the four main models for establishing epistemic justification in science and explain the essential role that establishing null results plays in each. Popper’s falsificationist version of the hypothetico-deductive (HD) model of science is one of the models we discuss. We argue that although strict adherence to Popper’s HD model would not solve all the problems the “replication crisis” has thrown at the scientific community, scientists could do worse than follow his advice about bold conjectures and risky tests in establishing the absence (or presence) of effects. Null-hypothesis significance testing (NHST) as it is typically practiced is neither bold nor risky. We also add a plea for recognition of scientific activity outside of the HD model, which will help solve some of the deeper problems that have led to neglect of negative results.
Four Models of Science
Broadly speaking, there are four main models for establishing epistemic justification in science: induction, abduction, Bayesianism, and hypothetico-deductivism (see Box 1 for a glossary of terms). In all of these models, the pursuit of negative or null results is essential for an accurate and cumulative science. In both inductive and abductive methods, the discovery of the absence of an effect could lead to an extensive revision of any previous conclusions. Similarly, the Bayesian approach advocates updating one’s beliefs given all of the available evidence, positive or negative. In Popper’s HD method, scientists should try to falsify, rather than confirm, their hypotheses, and thus it is essential that they design tests that may show an absence of the phenomena their hypotheses predict. Each of these models of scientific inference is controversial, but required in at least some areas of scientific reasoning, and all rely on the availability of full evidence for, and against, the presence of effects.
Box 1. Glossary of Terms
It is common to equate “absence of an effect” with “null” or “negative” results, but they are not necessarily equivalent. In all the models of epistemic justification that we discuss, the presence of an effect may serve as contradictory, or negative, evidence just as much as the absence of an effect can. In Popper’s (1983) view, the absence of an effect can falsify a hypothesis that predicted its presence, or the presence of an effect can falsify a hypothesis that predicted its absence: “A theory is . . . falsifiable if and only if there exists at least one potential falsifier—at least one possible basic statement that conflicts with it logically” (p. xx).
Induction
Induction is the mode of inference whereby a general claim is inferred from observation of previous cases. Although induction puzzles philosophers who argue that an inductive inference can be considered reliable only if there is a good reason to assume that the future will be like the past, induction is often practically unavoidable.
Induction clearly plays a role in science when statistical inferences are based on samples, but it also comes into play at the more abstract level of formulating hypotheses, when a researcher supposes that hypothesis X is worth examining in a new context because the new context seems relevantly similar to previously tested contexts. For example, if you have found on several occasions that an intervention is beneficial to people in Paris, Washington, and Nairobi, then you should consider the possibility that it will be helpful to people in Jakarta. Induction is a risky enterprise to the extent that it relies on the ability to successfully document a set of cases relevantly similar in regard to the conclusion being made. This is very difficult in many sciences, psychology included. The discovery of an absence of an effect will be important because inductive inferences, although often sensible to make with the limited information at hand, can be completely undermined by a single contradictory example or a small amount of contradictory evidence.
Abduction
Abductive reasoning, first discussed in detail by the philosopher Charles Peirce (1955), is now sometimes called inference to the best explanation. Like induction, it is controversial; epistemologists take issue with what is meant by “best,” “explanation,” and also “inference” (and to a much lesser extent, “to”). Abductive reasoning is applied when a researcher decides that some hypothesis must be correct or is worth serious investigation because it does a better job of explaining the observable evidence than any of the alternatives do. Charles Darwin (1859/1958, as cited in Haig, 2014), for example, explicitly and famously appealed to such reasoning:
“It can hardly be supposed that a false theory would explain, in so satisfactory a manner as does the theory of natural selection, the several large classes of facts above specified [e.g., the geographical distribution of species, the sterility of hybrid species]. It has recently been objected that this is an unsafe method of arguing. But it is a method used in judging common events of life, and has often been used by the greatest natural philosophers. The undulatory theory of light has thus been arrived at; and the belief of the revolution of the earth on its own axis was until lately supported by hardly any direct evidence.” (p. 106)
The controversy regarding abduction is partly due to the fact that it can be used in at least two distinct ways. Abduction can be used to argue that a scientific hypothesis should be accepted as true, in the logic of justification, or to support the claim that a scientific hypothesis is worth pursuing, in the logic of discovery (see McKaughan, 2008, for a discussion of these and other uses of abduction).
Abduction employed in the logic of justification (to argue that a hypothesis is true) is commonly represented as follows (Josephson & Josephson, 1996):
Premise 1: Hypothesis H explains data (facts, observations).
Premise 2: No other hypothesis can explain the data as well as H does.
Conclusion: Therefore, H is probably true.
The main difficulty in successful abductive reasoning in the logic of justification is how to argue for the second premise. One possible way is to argue by a type of disjunctive syllogism (i.e., the inference “A or B, not A, therefore B”), for example:
Premise 1: The only available explanations for the data are the hypotheses H1, H2, . . . , Hn.
Premise 2: H1 is a better explanation than H2, . . . , Hn.
Conclusion: Therefore, no other hypothesis can explain the phenomenon as well as H1 does.
Searching for and being aware of the absence of an effect is extremely important for justifying both premises. Premise 1 requires that one can make a good judgment about the plausible alternative explanations that should be considered. To do this requires a thorough knowledge of the observational phenomenon to be explained, and if results indicating the absence of an effect are not included in an account of the phenomenon, then that would bias the set of hypotheses that are considered. Premise 2, which is based on the reasons why the candidate hypothesis does a better explanatory job than the other contenders, could be justified in a number of ways depending on the topic, but very often will involve considering inconsistencies between the explanations and the evidence. The relative inconsistency between explanations and evidence will look very different if negative results are not available for consideration.
Abduction is also important in generating hypotheses and in deciding which hypotheses are worth pursuing, the logic of discovery. The structure of hypothesis-postulating abduction is closer to the following:
Premise 1: Phenomenon P is surprising.
Premise 2: Hypothesis H would do a good job of explaining P.
Conclusion: Therefore, H is a hypothesis worth pursuing.
This is a weak inference, in that the conclusion that H is “worth pursuing,” in the logic of discovery, is not as strong as the conclusion that H is “true,” in the logic of justification. However, consideration of all evidence, including results of no effect, is still required to perform such hypothesis generation efficiently.
Part of abductive reasoning is knowing what type of criteria to use when evaluating a set of potential explanations or hypotheses. In developing such criteria, philosophers of science typically turn to explanatory virtues (Lipton, 2004), which include properties such as how well an explanation unifies different areas of knowledge, how specific and clear the proposed underlying mechanism is, and how elegant, parsimonious, and precise the hypothesis is (see Table 1 for a more comprehensive list).
Table 1. Commonly Discussed Explanatory Virtues
Note: All of these virtues are hotly debated among philosophers of science (see Lipton, 2004, for further discussion).
Bayesianism
Bayesian philosophy of science claims that scientists should assign probabilities to propositions (i.e., theories or parameter estimates) according to the degree of belief associated with those propositions and update these probabilities in the light of new evidence in accordance with Bayes theorem (Good, 1968). Bayesian epistemology says that a reasoner faced with evidence (E) should increase his or her belief in a hypothesis (H) when the probability of that hypothesis conditioned on the evidence, P(H|E), is greater than the reasoner’s prior belief in the hypothesis, P(H), and should decrease the degree of his or her belief in the hypothesis when the conditional probability is lower than the prior belief. This updating should be done in accordance with Bayes theorem (Good, 1976; Talbott, 2016). Bayes theorem forms the basis of a range of tools allowing for normative statements to be made about how one should behave upon the receipt of new evidence insofar as priors are specifiable (Earman, 1992). In Bayesian philosophy of science, it is clear that evidence for the absence of effects must be considered alongside evidence of nonzero effects in order to make accurate inferences. Without complete access to evidence, accurate calculation of rational beliefs and updating are not possible.
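The updating rule described above can be illustrated with a small worked example. The following is a minimal sketch for a single binary hypothesis; the prior and the likelihood values are hypothetical numbers chosen only for illustration, and the function name `posterior` is not taken from any cited source. It shows how a belief that rises after a positive result should fall again once a null result is taken into account, and how leaving the null result out gives a different answer.

```python
def posterior(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Return P(H|E) from Bayes theorem for a hypothesis H and its negation."""
    # Total probability of the evidence, P(E), over H and not-H.
    p_evidence = p_e_given_h * prior + p_e_given_not_h * (1.0 - prior)
    return p_e_given_h * prior / p_evidence


# Hypothetical starting belief in H.
prior = 0.5

# Hypothetical positive result: more probable if H is true (0.8) than if it is false (0.3).
after_positive = posterior(prior, p_e_given_h=0.8, p_e_given_not_h=0.3)

# Hypothetical null result: less probable if H is true (0.2) than if it is false (0.7).
after_null = posterior(after_positive, p_e_given_h=0.2, p_e_given_not_h=0.7)

print(f"Belief after the positive result only: {after_positive:.3f}")
print(f"Belief after the positive and null results: {after_null:.3f}")
```

Under these assumed numbers, a reasoner who never sees the null result is left with a belief in H that is too high, which is the sense in which incomplete access to evidence prevents accurate updating.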
A distinction should be made between Bayesian epistemology and Bayesian statistical methods. Bayesian epistemology is not incompatible with traditional frequentist statistical methods (Good, 1976), and Bayesian statistical methods can be used with or without the acceptance of Bayesian epistemology.
Hypothetico-deductivism
The HD model of the scientific method relies on testing predictions deduced from hypotheses. Loosely speaking, there are two main forms of this model: the traditional version that dates back centuries and the newer Popperian falsificationism. The latter is largely regarded as the superior form (see Nola & Sankey, 2007).
According to the traditional version of the HD method, observation of what a hypothesis predicts provides support for the hypothesis. Traditional HD inferences take the following form:
Premise 1: If hypothesis H is true, then we should observe evidence E.
Premise 2: We observe evidence E.
Conclusion: Therefore, hypothesis H is supported.
This inference is based on the fallacy of affirming the consequent, unless one assumes an unstated premise such as “the evidence is very unlikely given the negation of the hypothesis.” There are more sophisticated versions of this confirmationist approach to the HD method (e.g., Betz, 2013). However, most of the time, when philosophers talk about the HD method, they mean Popper’s falsificationism, which relies on the valid inference of modus tollens (although see Cohen, 1994, regarding how valid modus tollens remains once the premises are probabilistic):
Premise 1: If hypothesis H is true, then we should observe evidence E.
Premise 2: We do not observe evidence E.
Conclusion: Therefore, hypothesis H has been falsified.
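The logical difference between the two argument forms can be checked mechanically with a truth table. The following is a minimal sketch, assuming both forms are read as strictly deductive arguments over two propositions, H and E (so the traditional conclusion is treated as “H is true” rather than “H is supported”). The helper names `implies` and `is_valid` are illustrative only.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material conditional: 'if a then b'."""
    return (not a) or b


def is_valid(premises, conclusion) -> bool:
    """An argument form is valid if no truth assignment makes every
    premise true while the conclusion is false."""
    for h, e in product([True, False], repeat=2):
        if all(premise(h, e) for premise in premises) and not conclusion(h, e):
            return False
    return True


# Traditional HD read deductively: if H then E; E; therefore H.
affirming_the_consequent = is_valid(
    premises=[lambda h, e: implies(h, e), lambda h, e: e],
    conclusion=lambda h, e: h,
)

# Falsificationist HD: if H then E; not E; therefore not H.
modus_tollens = is_valid(
    premises=[lambda h, e: implies(h, e), lambda h, e: not e],
    conclusion=lambda h, e: not h,
)

print("Affirming the consequent is valid:", affirming_the_consequent)
print("Modus tollens is valid:", modus_tollens)
```

The counterexample for the first form is the case in which H is false and E is true: both premises hold, yet the conclusion does not. The check covers only the deductive case and does not address the probabilistic premises raised by Cohen (1994).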
Falsificationism arose as a solution to the problem of induction but is not by itself enough for a model of science (Salmon, 1981). Imagine two equally falsifiable hypotheses, one that has just been proposed and one that has already survived many attempts to falsify it. For example, the hypothesis that a patient’s illness can be cured by a medication that has never failed to cure a patient is, in principle, no more or less falsifiable than the hypothesis that the illness can be cured by a brand-new medication that has yet to be tested. However, it is surely more rational for the patient to take the medication that has been tested and never failed.
To avoid this conundrum, Popper (1968) added another element to his philosophy of science: corroboration. “So long as a theory withstands detailed and severe tests . . . we may say that it has ‘proved its mettle’ or that it is ‘corroborated’ by past experience” (p. 33). Here, Popper referred to a “theory” being tested, but what is really tested are empirical consequences of theories (their predictions). A theory can become more corroborated if the predictions that follow from it survive more severe tests. Corroboration can therefore come in degrees, depending on the severity and number of tests that a hypothesis has survived. Popper (1968) explained the need for corroboration as follows:
“We shall take [a theory] as falsified only if we discover a reproducible effect which refutes the theory. In other words, we only accept the falsification if a low-level empirical hypothesis which describes such an effect is proposed and corroborated.” (p. 66)
Mayo and Spanos (2010) have provided an account of the connection between frequentist statistical testing and Popper’s ideas of the severity of tests, but detailed discussion of this account is beyond the scope of the current article.
It is also worth noting that for corroboration, Popper had in mind multiple different tests, more like conceptual replications than direct replications. This is probably because he was working primarily with examples from physics, a discipline in which direct replication was simply an assumed part of the process.
In Popper’s account, there is no logical difference between how to test for the presence of an effect and how to test for its absence, and certainly, the logical asymmetry between falsifying and confirmatory evidence is not a reason to believe that there is a similar asymmetry between establishing the presence or absence of an effect. The historical fixation on positive results therefore cannot be blamed on Popper. It more likely stems from the seductive appeal of the term statistically significant, which so readily expands to encompass the substantive space of “important” and “interesting” (Gigerenzer, Krauss, & Vitouch, 2004). “Statistically nonsignificant” is universally equated to the opposite, “unimportant” and “dull,” even though significance is a statistical concept, whereas importance reflects a value judgment, and the extent to which conclusions of significance and importance overlap will depend on the context.
Falsificationism and NHST
There is a superficial fit between the language of Popper’s falsificationism and NHST. From the 1960s through the 1990s, Paul Meehl (1967, 1978), Michael Oakes (1986), and other scientists and philosophers argued that these linguistic parallels make it easy for scientists to think they are following Popper’s method by doing NHST. For example, researchers reporting NHST results refer to “rejecting” the null hypothesis, and such statements are easily mistaken as indicating that it has been “falsified.” Researchers who use NHST are warned not to speak of “accepting” null hypotheses but rather to say that they “fail to reject” the null hypotheses, which seems to (but does not) resemble Popper’s asymmetry between establishing falsification with certainty and never proving (only corroborating) a hypothesis. Oakes (1986) suggested that this superficial fit has afforded NHST unwarranted philosophical justification and may have prolonged its stranglehold over many scientific fields. Table 2 outlines some of the ways in which typical NHST practice violates the principles of Popperian falsificationism.
Table 2. Ways in Which Null-Hypothesis Significance Testing (NHST) Violates Popperian Falsificationism
Crucial to Popper’s account of falsificationism is that bold conjectures, potentially falsifiable statements or hypotheses, are subjected to risky tests, tests that have the best chance of exposing the conjectures to be false if indeed they are. A conjecture may well predict the absence of an effect, and as long as there is a possible observation that logically conflicts with there being an absence of that effect, that hypothesis will be considered falsifiable, and remains on solid scientific ground.
In the great majority of uses of NHST, the hypothesis being tested is not really a prediction of a theory, but rather a nil null hypothesis, that there is “no difference between groups” or “no effect” (Bakker, van Dijk, & Wicherts, 2012). The only thing that this typical application of NHST allows one to say is how likely it is for data at least as extreme as the observed data to be produced under the statistical null model. If the probability of obtaining the observed or more extreme data is sufficiently low (almost always p < .05) under the statistical null model, one typically “rejects” the null model and claims the presence of an effect or difference, usually in support of the theory of interest. The problem here is not that the absence of an effect has been posited per se, but rather that the mechanical application of nil null hypotheses creates further distance between statistical and substantive hypotheses, and in psychology, connecting statistical and substantive hypotheses is already often difficult (Meehl, 1978).
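To make concrete what a p value from a nil null test does and does not say, the following is a minimal sketch of an exact two-sided sign test. The scenario (20 paired comparisons, 15 of which favor the treatment) and the function name `two_sided_p_value` are hypothetical and chosen only for illustration. The calculation counts how probable data at least as extreme as the observed data are under the statistical null model, and nothing more.

```python
from math import comb


def two_sided_p_value(successes: int, trials: int) -> float:
    """Exact two-sided p value under the nil null model that each trial
    is a success with probability 0.5 (i.e., 'no difference')."""
    observed_distance = abs(successes - trials / 2)
    # Count every outcome at least as far from the null expectation
    # as the observed outcome, in either direction.
    extreme_outcomes = sum(
        comb(trials, k)
        for k in range(trials + 1)
        if abs(k - trials / 2) >= observed_distance
    )
    return extreme_outcomes / 2 ** trials


# Hypothetical data: 15 of 20 paired comparisons favor the treatment.
p = two_sided_p_value(successes=15, trials=20)
print(f"p = {p:.4f}")
print("Null model rejected at .05:", p < 0.05)
```

The result is a statement about the data under the null model. Nothing in the calculation refers to the substantive theory, which is the distance between statistical and substantive hypotheses that the paragraph above describes.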
Furthermore, even if one suspends disbelief and accepts the nil null as a candidate substantive theory, the test that it is likely to be exposed to, if it resembles the tests in the vast majority of the psychological literature, will be anything but risky. The average statistical power of psychology research has been estimated to be under 50% for the average effect sizes seen in the field (Cohen, 1962; Szucs & Ioannidis, 2017). When power is this low, null hypotheses stand little chance of being rejected even when they are in fact false, conditions Popper would not have endorsed. Of course, p-hacking and other questionable research practices often intervene to prevent these failures to reject.
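The link between low power and unrisky tests can be illustrated numerically. The following is a minimal sketch that assumes a two-group comparison analyzed with a two-sided z test and a known standard deviation of 1, which is a simplification of the tests typically used. The effect size (0.4) and the sample sizes (20 and 100 per group) are hypothetical values, not estimates from the cited studies, and the function names are illustrative.

```python
import math
import random
from statistics import NormalDist, fmean


def z_test_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Analytic power of a two-sided, two-group z test with unit variance."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic when the true effect is effect_size.
    shift = effect_size * math.sqrt(n_per_group / 2)
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


def simulated_power(effect_size: float, n_per_group: int, alpha: float = 0.05,
                    n_sims: int = 2000, seed: int = 1) -> float:
    """Monte Carlo check: proportion of simulated studies that reject the null."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = math.sqrt(2 / n_per_group)
    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / standard_error
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sims


small_study = z_test_power(effect_size=0.4, n_per_group=20)
large_study = z_test_power(effect_size=0.4, n_per_group=100)
print(f"Power with 20 per group:  {small_study:.2f}")
print(f"Power with 100 per group: {large_study:.2f}")
print(f"Simulated power, 20 per group: {simulated_power(0.4, 20):.2f}")
```

Under these assumptions, the smaller study rejects a false null hypothesis in only a minority of attempts, so a failure to reject tells one very little about whether the effect is absent.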
Ideally, to minimize departures from Popperian falsificationism, researchers would move away from testing nil null hypotheses and instead subject their real statistical hypotheses (developed from strong theoretical foundations and strongly and explicitly connected to their substantive hypotheses) to rigorous (risky) testing. But in order to take these steps, researchers would need theories that produce hypotheses that are more specific and bolder than “any nonzero effect.” Popper’s philosophy of science unfortunately provides little guidance on generating theories, and exactly how to develop theories of this kind remains an outstanding challenge for the current scientific-statistical reform. A diffuse hypothesis of “any nonzero effect” cannot be legitimately rejected using NHST (Greenland, 2012) because any given test may be underpowered to detect a yet smaller instantiation of “anything but zero” than the one found. One way around this is to employ techniques that allow for a hypothesis of no effect to be supported, and some of the other authors in this Invited Forum have outlined such techniques.
Beyond Presence and Absence: The Next Challenge for Scientific Reform
Despite the importance of evidence of the absence of effects in all major approaches to the philosophy of science, it is hard to argue that experiments demonstrating the absence of an effect that no one especially expected to be there are necessarily worthy of publication or particularly interesting. Showing that carrot soup does not cure cancer is not very interesting unless there has been some reason to think that it does. It might be interesting if a previously published report claimed that it does, or if an existing theory predicts that it does. Trying to articulate criteria for what constitutes an interesting absence may at first seem overwhelming, perhaps because it makes salient how little is known about the process of theory and hypothesis generation.
Scientists might agree that a hypothesis should follow from a theory, but this evades the question of how one should judge what makes a theory interesting. For a long time, scientists have (mis)used the statistical significance of results to answer this question. It is now apparent that the price paid for this practice is publication bias (Ferguson & Heene, 2012). The development of solutions to publication bias, such as Registered Reports (RRs), forces scientists to confront questions about judging the interestingness of hypotheses by other means (Nosek & Lakens, 2014). The reviewer guidelines for RR submissions to Royal Society Open Science list “importance of the research question(s)” as the first criterion (Royal Society Open Science, 2017). But how to make such judgments remains largely a matter of unarticulated expertise, not just for RRs but in science more generally.
The HD model falls short on answering questions such as “what makes an interesting hypothesis?” and “where do new theories come from?” Popper wrote about what makes a hypothesis “surprising,” but his treatment of this subject is underdeveloped for the current purpose. This is a common criticism of the HD method, but in fact Popper (1968) was very clear about the scope of his work:
“The initial stage, the act of conceiving or inventing a theory, seems to me neither to call for logical analysis nor to be susceptible of it. The question how it happens that a new idea occurs to [someone]—whether it is a musical theme, a dramatic conflict, or a scientific theory—may be of great interest to empirical psychology; but it is irrelevant to the logical analysis of scientific knowledge.” (p. 31)
Because the critical question of judging the interestingness of a hypothesis is largely outside the domain of the HD model, it is important for statistical-scientific reform efforts to introduce other philosophical perspectives. Although scientists certainly agree about the ways in which the HD model is currently broken (e.g., publication bias, questionable research practices, lack of incentive for replication studies; see Munafò et al., 2017), an idealized HD model is not itself sufficient to move researchers toward a scientific utopia (to borrow a phrase).
Answers to questions about hypothesis and theory generation are unlikely to arise from the HD model. Further philosophical work on inductive and abductive reasoning may help scientists and philosophers develop a better understanding of theory-generation processes (e.g., Haig, 2014). One potentially useful challenge lies in operationalizing explanatory virtues (see Table 1) as criteria for both judging the interestingness of other researchers’ hypotheses and aiding hypothesis generation. Explanatory virtues have been discussed by many philosophers over many decades (e.g., van Fraassen, 1980). Some early efforts to operationalize them do exist (e.g., Lipton, 2004), but again, these are underdeveloped for the current purposes.
Part of the solution will be to (a) expand what is considered legitimate scientific activity to include exploratory research that is explicitly presented as exploratory and (b) value the inductive and abductive reasoning supporting this work. Kerr (1998) advocated recognition of exploratory research to reduce HARKing (i.e., hypothesizing after the results are known). Other authors have recommended explicit identification of confirmatory and exploratory statistical tests (e.g., Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012). Early institutional efforts to formalize these recommendations can be seen in BMJ Open Science’s and Cortex’s inclusion of exploratory reports as an article type (McIntosh, 2017). Further efforts to create space for legitimate non-HD scientific activity should be welcomed by the scientific community.
Conclusion
No matter which philosophy one subscribes to, being able to show the absence of effects is essential for accurately testing and refining theories. In what is arguably the most widely endorsed philosophy of science among working scientists (Popper’s HD method), bold conjectures and risky tests are key to corroborating or falsifying theories. Unfortunately, the approach to statistical testing that is most commonly used, NHST, is especially ineffectual at this task, and has supported the censoring of null results from the scientific literature.
Researchers could do worse than to pay closer attention to the boldness of conjectures and riskiness of tests in scientific practice. But better adherence to the HD model will not help the scientific community answer one of the fundamental questions posed by the current methodological crisis: What makes a hypothesis, theory, or prediction—regardless of whether it forecasts the presence or absence of an effect or phenomenon—scientifically interesting? For these answers, scientists need to look further afield and work toward better articulating the explanatory virtues of scientific hypotheses in order to judge their importance without relying on the statistical significance of results.
Footnotes
Action Editor
Daniel J. Simons served as action editor for this article.
Author Contributions
F. Fidler and A. Barnett conceived of the plan for this manuscript. S. Kambouris, A. Kruger, and F. Singleton Thorn contributed to background research in specialized areas. F. Singleton Thorn, A. Barnett, A. Kruger, S. Kambouris, and F. Fidler all contributed to drafting and editing the manuscript.
Declaration of Conflicting Interests
The author(s) declared that there were no conflicts of interest with respect to the authorship or the publication of this article.
