Abstract
Statistical hypothesis testing is common in research, but a conventional understanding sometimes leads to mistaken application and misinterpretation. The logic of hypothesis testing presented in this article provides for a clearer understanding, application, and interpretation. Key conclusions are that (a) the magnitude of an estimate on its raw scale (i.e., not calibrated by the standard error) is irrelevant to statistical testing; (b) which statistical hypotheses are tested cannot generally be known a priori; (c) if an estimate falls in a hypothesized set of values, that hypothesis does not require testing; (d) if an estimate does not fall in a hypothesized set, that hypothesis requires testing; (e) the point in a hypothesized set that produces the largest p value is used for testing; and (f) statistically significant results constitute evidence, but insignificant results do not and must not be interpreted as evidence for or against the hypothesis being tested.
Introduction
Current concepts of statistical testing can lead to mistaken ideas among researchers such as (a) the raw-scale magnitude of an estimate is relevant, (b) the classic Neyman–Pearson approach constitutes formal testing, which in its misapplication can lead to mistaking statistical insignificance for evidence of no effect, (c) one-tailed tests are tied to point null hypotheses, (d) one- and two-tailed tests can be arbitrarily selected, (e) two-tailed tests are informative, and (f) power-defined intervals or data-specific intervals constitute formal test hypotheses. In this article, I challenge convention regarding hypothesis testing that leads to such mistaken ideas, and I provide a coherent conceptualization and logic for testing that avoids such mistakes.
A recent book and related works by Ziliak and McCloskey (McCloskey & Ziliak, 1996, 2008; Ziliak & McCloskey, 2004a, 2004b, 2008a, 2008b) declare statistical significance is invalid for scientific inquiry. Critics responded (Engsted, 2009; Hoover & Siegler, 2008a, 2008b; Spanos, 2008). Ziliak and McCloskey, and their critic Spanos, imply the raw-scale magnitudes of parameters are relevant to hypothesis testing. This confuses the goal of hypothesis testing with that of parameter estimation; I provide a careful distinction between the two and argue the raw-scale magnitude of an estimate is irrelevant to testing.
The Neyman and Pearson (1933) approach to addressing hypotheses is often offered as a formal logic of testing (Neutens & Rubinson, 2002b; Portney & Watkins, 2000; Rothman & Greenland, 1998; Spanos, 1999b), with the unfortunate consequence that statistically insignificant findings are interpreted as evidence for no association. I argue that the Neyman–Pearson approach is not a formal hypothesis testing strategy, and I present a generalization of Fisher’s approach that is a formal strategy.
The one-tailed test is often presented as having a point null hypothesis (Neutens & Rubinson, 2002b; Portney & Watkins, 2000; Rothman & Greenland, 1998; Spanos, 1999b). Such a presentation does not constitute a general framework; I present a general characterization that allows each competing hypothesis to be a set of parameter values.
The two-tailed test is perhaps the most common test. I argue that, as a formal process, the two-tailed test (indeed, a test of any point hypothesis) is almost never informative.
Intervals associated with power, confidence intervals (CIs), and data-specific measures are sometimes offered as defining hypotheses (Mayo, 2010; Mayo & Spanos, 2010; Neyman, 1957). I argue they do not constitute formal test hypotheses.
The goal of this article is to provide a coherent understanding and approach to formal statistical hypothesis testing for the researcher who seeks to use this inference tool without confusion. This article does not discuss alternative methods for using empirical evidence in inference such as the direct interpretation of CIs or Bayesian methods, nor is this article intended as an argument in favor of a particular method. The following sections define hypotheses and hypothesis testing, distinguish the goal of hypothesis testing from that of parameter estimation, present a logic of testing, and discuss its scope.
Hypothesis Testing Versus Parameter Estimation
By the term hypothesis, I mean a formal proposition whose truth or falsity is unknown. An empirical hypothesis is one for which empirical evidence can, in principle, bear on judgments of its truth or falsity. A statistical hypothesis is an empirical hypothesis about distribution parameters of random variables defined by a data generating process.
To properly understand Frequentist statistical hypothesis testing, it is important to understand that the relevant random variables represent the distribution of possible values that a data generating process could obtain, and not actual data. In this sense, data and corresponding estimates are realizations of the underlying random variables but are not themselves random variables. Hence, the sample mean statistic has a distribution of possible values, whereas the mean of a given sample is a number.
A statistical hypothesis should be stated in terms of distribution parameters of random variables, and not in data-specific terms. If a statement includes reference to data, then it will either not be a hypothesis or it will be uninformative. As an example, consider the claim that “there will be a significant result.” In the first case, in terms of the data generating process, there is a particular probability that what is claimed will occur, and the claim is therefore neither true nor false and thereby not a hypothesis. In the second case, in terms of the resulting data, the claim will be either true or false, but it is a proposition by virtue of a necessary numeric characteristic only (of course the result will be either statistically significant or not). The same hypothesis applies to any data generating process, and knowing whether it is true or false is uninformative regarding the data generating process under investigation.
Hypothesis testing is a process by which we can inform judgments of the truth or falsity of a hypothesis. Formal statistical hypothesis testing is a method that compares the data-specific value of a statistic to the statistic’s sampling distribution as implied by the hypothesized values of a statistical hypothesis. There are two largely substitutable methods in their common usage. One is to define a set of values in the statistic’s range that correspond to sufficiently rare events under the hypothesis-specific distribution (often termed the rejection region); if the data-specific value of the statistic is found to be in this set, the data are considered evidence against the underlying hypothesis. This is a strictly categorical method of testing. The second is to calculate the probability of obtaining data at least as extreme as that actually obtained from the data generating process under the assumption that the statistical hypothesis is true (commonly termed the p value); if the p value is sufficiently small compared with an a priori set level (commonly called the significance level), the data are considered evidence against the hypothesis being tested. This is also a categorical method of testing; however, the p value can also provide a continuous measure of evidence for the hypothesis being tested. The most common use of formal testing is to adopt the categorical approach with its designations of results being “statistically significant” or “statistically insignificant”; I will discuss testing in these terms.
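As a minimal numerical sketch of the second (p value) method, consider the following Python fragment. It assumes, purely for illustration, a normal sampling distribution for the statistic, a known standard error, and the conventional .05 significance level; none of these numbers come from the article.

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_value_upper(estimate: float, hypothesized: float, se: float) -> float:
    """Probability that the data generating process yields a statistic at least
    as large as `estimate`, under the hypothesized parameter value, assuming a
    normal sampling distribution with standard error `se`."""
    z = (estimate - hypothesized) / se
    return 1.0 - normal_cdf(z)

# Categorical use: compare the p value with an a priori significance level.
p = p_value_upper(estimate=1.2, hypothesized=0.0, se=0.5)
significant = p < 0.05  # "statistically significant" vs. "statistically insignificant"
```

The same p value can, alternatively, be read as a continuous measure of evidence rather than dichotomized, as discussed later in the article.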
By this definition, the classic Neyman–Pearson test (NPT), in which we set our acceptable Type I and Type II error rates and proceed as if the null were true or false according to our test, is not hypothesis testing: Notwithstanding Neyman and Pearson’s reference to testing, it is a decision rule, as Neyman and Pearson themselves state (Neyman & Pearson, 1933). Its goal is to decide whether or not to act as if a hypothesis is true, not to judge whether the hypothesis is true: The former is suited to model specification; the latter is suited to generating scientific understanding.
However, Fisher’s (1956) approach to statistical inference, in which we use data as evidence for or against the truth of a claim, provides a basis for hypothesis testing by the definition used here. What I provide is in one sense a generalization and in another sense a restriction of Fisher’s approach. It is a generalization because Fisher’s approach has focused primarily on point hypotheses, whereas the logic I present applies to set hypotheses in general. It is a restriction because Fisher’s approach does not explicitly state an alternative, whereas the logic I present addresses sets of hypotheses that partition a parameter space—an idea that Neyman and Pearson (1928a, 1928b) initiated with the introduction of the formal alternative hypothesis.
It is important to distinguish the goals of testing and estimation. The goal of hypothesis testing is to make a judgment regarding the truth or falsity of a hypothesis, whereas the goal of estimation is to make a judgment regarding the value of a parameter.
If I know whether a hypothesis is true or false, I have achieved the goal of hypothesis testing. Suppose I am interested in the hypothesis that “the average annual health care expenditure among men is greater than that among women”: It is either true or false, it cannot be nearly true or mostly false. If an honest omniscient being were to tell me “your hypothesis is true,” then my goal has been achieved. Knowing the magnitudes of the averages or their difference adds nothing more to achieving this goal.
Suppose you are testing the hypothesis that the effect of a policy intervention is zero, and the result is a substantively trivial but statistically significant difference. A criticism is that you have identified a statistically significant yet substantively insignificant effect. Such a critique is not an indictment against your hypothesis test. Your statement that the data provide evidence the hypothesis is false is not compromised by the raw-scale size of the effect. The two statements, “10⁻¹⁰⁰ ≠ 0” and “10¹⁰⁰ ≠ 0,” are both true: There is no degree of truth that varies with the magnitude. What then is the objection? The critic is objecting to your goal of testing the hypothesis and is instead presumably seeking evidence regarding the magnitude of the estimate; apparently to this critic the values in the CI, whether it crosses zero or not, are too small to be useful.
Suppose you are interested in estimating an odds ratio, and your data produce a 95% CI of [0.9, 2]. A policy maker may consider the costs and benefits of the program at different values across the interval, or she may take a more rigorous approach and apply statistical decision theory (Berger, 1985). A critic, however, points out that the CI crosses 1 and states that you cannot rule out there is no effect. Such a statement is not an indictment against your estimation. What then is the objection? The critic is objecting to your goal of estimation and is instead presumably seeking to judge whether there is a difference; not being able to rule out an odds ratio of 1 disallows such a judgment.
These examples presume the researcher is interested in either hypothesis testing or estimation, but the researcher may be interested in both. Suppose we are considering the incremental cost–benefit of a program modification. But, regardless of the magnitudes of the cost and benefit differences, if there is a decrease in benefit, then the modification will not be adopted, and if there is an increase in benefit with a corresponding decrease in cost, then the modification will be adopted. In this case, it is only if there is increasing benefit and increasing costs that we need to know the values of these changes to determine whether to adopt the modification. Consequently, it is only when there is sufficient evidence supporting the last hypothesis that we need to pursue the goal of estimation.
A Logic of Statistical Testing
The logic presented here requires the a priori specification of hypotheses in terms of mutually exclusive and mutually exhaustive sets of possible values for distribution parameters of random variables reflecting a data generating process. A priori specification is an epistemological requirement for results to have evidential value; however, as discussed in this section, this requirement does not apply to the act of testing. See Figure 1 for examples regarding a parameter that can possibly take values on the extended real line (i.e., any number from negative infinity to positive infinity): Panel A depicts the set of hypotheses that underlie typical one-tailed tests; Panel B depicts the hypotheses that underlie two-tailed tests; Panel C represents how three hypotheses might be expressed. Each set of values represents a hypothesis regarding the parameter. For example, partitioning the real line into the set of negative values and the set of nonnegative values can represent a comprehensive set of hypotheses (e.g., H1 and H2) regarding a parameter µ (e.g., H1: µ < 0 and H2: µ ≥ 0). The set of hypotheses can include substantively derived hypotheses as well as a catchall negation of these hypotheses. We wish to determine in which set of values is the true parameter (i.e., which hypothesis regarding the value of the parameter is true).

Figure 1. Example specifications of sets of hypotheses that are mutually exclusive but together make up the full set of possible parameter values (i.e., sets of hypotheses that partition the set of possible values).
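The partition idea can be made concrete in a short sketch. Here each hypothesis is represented as a predicate over parameter values; the hypothesis names and the two-set partition are illustrative choices, not notation from the article.

```python
# Each hypothesis is a set of parameter values, represented as a predicate.
# Together the hypotheses partition the real line: every value belongs to
# exactly one set.
hypotheses = {
    "H1: mu < 0": lambda mu: mu < 0,
    "H2: mu >= 0": lambda mu: mu >= 0,
}

def containing_hypothesis(estimate: float) -> str:
    """Return the name of the unique hypothesis whose set contains the estimate."""
    matches = [name for name, in_set in hypotheses.items() if in_set(estimate)]
    assert len(matches) == 1, "a partition assigns each value to exactly one set"
    return matches[0]

conforms_to = containing_hypothesis(0.4)  # the estimate conforms to H2
```

The mutual exclusivity and exhaustiveness required by the logic correspond to the assertion inside `containing_hypothesis`: every possible estimate lands in exactly one hypothesized set.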
If an estimate falls within the set of values specified by a hypothesis, I will say the estimate conforms to that hypothesis.
I define an estimate to be consistent with a hypothesis if the estimate could plausibly have been produced by a data generating process whose parameter takes some value in that hypothesis’s set.
Figure 2 presents the possibilities for the single-parameter two-hypotheses case (H and ¬H) defined on the real line. Panel A depicts the estimate being consistent with both hypotheses: there are plausible parameter values in both sets, and the data cannot adjudicate between them.

Figure 2. Examples of consistent and inconsistent combinations for a one-tailed hypothesis and its negation in which the parameter space is the real line.
Panel B depicts the estimate being consistent with H but inconsistent with ¬H: Hypothesis H could well have produced the data, but ¬H is not likely to have (if we consider the area under the curve beyond the estimate, it is sufficiently small under the distribution implied by ¬H).
Because data are consistent with a hypothesis to which the estimate conforms, it is not necessary to statistically test such a hypothesis; however, a statistical test is required of those hypotheses to which the estimate does not conform. In this case, because the estimate will conform to only one hypothesis, if there are N hypotheses, then N − 1 hypotheses must be statistically tested. In the case where the estimate does not conform to any hypothesis, all hypotheses must be tested.
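The bookkeeping implied by this paragraph can be sketched as follows; the set representations and the three-hypothesis example are illustrative assumptions.

```python
def hypotheses_to_test(estimate, hypotheses):
    """Hypotheses whose sets do not contain the estimate must be statistically
    tested to rule them out; a hypothesis to which the estimate conforms
    need not be tested."""
    return [name for name, in_set in hypotheses.items() if not in_set(estimate)]

three = {
    "theta < 0": lambda t: t < 0,
    "theta == 0": lambda t: t == 0,
    "theta > 0": lambda t: t > 0,
}

# With N = 3 hypotheses and an estimate conforming to one of them,
# N - 1 = 2 hypotheses remain to be statistically tested.
to_test = hypotheses_to_test(0.7, three)
```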
Steps in the Application of the Testing Logic
The application of this logic proceeds in four steps as stated in Table 1. The interpretation of results depends on which of two cases apply.
Steps in the Application of the Logic of Statistical Testing.
Case 1: Data are consistent with all hypotheses. If the data are consistent with all hypotheses, then there exists at least one plausible parameter value in each of the hypothesized sets of values. In this case, the data do not provide evidence for or against any of the hypotheses.
Case 2: Data are inconsistent with at least one hypothesis. If the data are consistent with one hypothesis but inconsistent with all others, then the data provide evidence for the hypothesis and evidence against the others. When there are more than two hypotheses, the data can provide evidence for or against sets of hypotheses. In such a case, the data cannot adjudicate between the hypotheses in the set of hypotheses with which the data are consistent but can rule out the hypotheses with which the data are inconsistent.
The One-Tailed Test
Hypotheses regarding a single parameter often take the form of directional hypotheses such as H0: θ ≤ 0 versus H1: θ > 0. Which hypothesis should we statistically test? Suppose at Step 3, we obtain a positive estimate: the estimate conforms to H1, and it is therefore H0 that must be statistically tested.
Because the hypotheses specify sets of possible values, they are not necessarily expressed in terms of the specific parameter value used to calculate the p value. If the hypotheses are H0: θ ≤ 0 versus H1: θ > 0, and if in Step 3 we obtain a positive estimate, then H0 is tested using the value in its set that produces the largest p value, which here is the boundary point θ = 0.
Returning to our example with a positive estimate, Figure 3 depicts the test of H0: the sampling distribution is centered on the boundary value θ = 0, the point in H0 most favorable to that hypothesis.

Figure 3. Statistical testing of an estimate against hypothesis H0 using the limit point on the boundary between hypothesis subspaces (use of the distribution depicted by the solid line).
Directional hypotheses are sometimes introduced as a point hypothesis and a directional hypothesis such as H0: θ = 0 versus H1: θ > 0 (Neutens & Rubinson, 2002a; Spanos, 1999a). As indicated by the preceding discussion, this specification is not a general description of directional hypotheses and would only be appropriate when the parameter space is legitimately restricted to the indicated range and the point hypothesis is part of an a priori specification of the partition.
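The boundary-point rule for the one-tailed test can be illustrated numerically. The sketch below again assumes a normal sampling distribution and invented numbers; it shows that over H0: θ ≤ 0 the p value is largest at the boundary θ = 0, so that point governs the test.

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_value_upper(estimate: float, theta: float, se: float) -> float:
    """Probability of a statistic at least as large as `estimate` when the
    parameter equals `theta` (normal sampling distribution assumed)."""
    return 1.0 - normal_cdf((estimate - theta) / se)

estimate, se = 1.1, 0.5
# Over H0: theta <= 0, the p value increases with theta, so its supremum on
# the set is attained at the boundary value theta = 0. Any interior point of
# H0 yields a smaller p value and thus a less favorable case for H0.
p_at_boundary = p_value_upper(estimate, 0.0, se)
p_interior = p_value_upper(estimate, -1.0, se)
```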
The Two-Tailed Test
Perhaps the most common statistical test is the two-tailed test of a scalar parameter. In this case, the hypotheses include a single point and its negation. Step 1 of the preceding logic is to express the hypotheses that constitute the relevant partition of the parameter space: for example, H0: θ = 0 versus H1: θ ≠ 0. However, we know a priori that the estimate is almost certainly going to conform to H1: The chance of an estimate equaling 0 (to the precision of the computer, and certainly to the infinite decimal place) is almost certainly 0. This has two consequences: First, we know a priori that we will almost certainly be statistically testing the null hypothesis H0. Second, it is not possible to accrue evidence for the point hypothesis H0 in a continuous parameter space. To have evidence for H0, one would need an estimated value in H0 that would be unlikely under all points in H1. Even if we obtained an estimate exactly equal to 0, for any finite sample there will be some positive number c such that θ = 10⁻ᶜ (which, not being 0, is clearly in H1) would make the estimate of 0 plausible (ignoring that a p value does not actually exist for continuous parameter spaces if H0 is a point on the real line). Moreover, except in ideal cases such as perfect randomized experiments, the presumption of a point hypothesis is itself implausible: the parameter is almost certainly not exactly equal to the specified value. The conclusion is that a formal point hypothesis (in a continuous parameter space) will not likely be empirically informative: It cannot be confirmed, and it seldom needs to be disconfirmed.
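The claim that even an estimate of exactly 0 cannot disconfirm H1 can be checked numerically. With the same illustrative normal-sampling assumption as before, a parameter value of 10⁻⁸ (clearly in H1) makes an estimate of 0 entirely plausible:

```python
import math

def two_sided_p(estimate: float, theta: float, se: float) -> float:
    """Two-sided p value assuming a normal sampling distribution:
    probability of a statistic at least as far from theta as the estimate."""
    z = abs(estimate - theta) / se
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

# An estimate of exactly 0 is plausible under a nonzero theta in H1:
# the two-sided p value under theta = 1e-8 is essentially 1.
p = two_sided_p(estimate=0.0, theta=1e-8, se=0.5)
```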
Other Sources of Conceptual Errors
Another mistaken concern regards overly powerful tests. This concern stems from confusing statistical with substantive goals and thereby confusing the metric of statistical distance (based on the standard error, reflecting variation in the data generating process) with that of a substantive determination (typically associated with the raw scale of the variables). Indeed, if we had full information, we would know whether the hypothesis was actually true or false, or we would know the actual value of the parameter. Such knowledge is not to be shunned.
A related issue is the a posteriori interpretation of results that accounts for power. Suppose we have sufficient power to discern a deviation δ from a point hypothesis θ0. If results are significantly different from θ0, we might infer that θ is at least θ0 ± δ; if results are not significant, then we might infer that θ is no more than θ0 ± δ (Neyman, 1957). Do these results, however, constitute a formal test of the implied set of hypotheses θ ∈ [θ0 − δ, θ0 + δ] and θ ∉ [θ0 − δ, θ0 + δ]? No. The rejection region of the statistic can be between θ0 and δ, or greater than δ but not statistically significantly different from δ (see Figure 4, in which θ0 = 0). If we wish to test these new hypotheses, we would follow the steps in the testing logic, which would lead to collecting data and a test centering the sampling distribution on either θ0 − δ or θ0 + δ (whichever is closest to the new estimate). But, given the less than perfect power in the new data generating process, we will end up with another discernible deviation δ* around this point. If we again follow the a posteriori interpretation above, we would be led to the implied hypotheses θ ∈ [θ0 − δ − δ*, θ0 + δ + δ*] and θ ∉ [θ0 − δ − δ*, θ0 + δ + δ*]. As we keep pursuing these new implied hypotheses, taking into account the discernible deviation due to power, we ultimately (through infinite iterations, assuming the sequence of standard errors has a nonzero lower bound) get to the implied statement that θ ∈ (−∞, ∞), which we presume to be true a priori: we have arrived at a trivial truth rather than an informative hypothesis.

Figure 4. Power-specified discernible effects do not define formal hypotheses. If an estimate is anywhere in the Reject θ = 0 region, one would reject H0. However, estimates in the range denoted as significant can nonetheless fall short of the discernible deviation δ, so rejection does not establish that θ lies outside [−δ, δ].
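The regress described above can be sketched in a few lines. The deviations δ, δ*, … are given an illustrative constant value here, standing in for any sequence of discernible deviations bounded away from zero; the point is only that the implied interval widens without bound.

```python
# Each a posteriori reinterpretation widens the implied interval
# [-half_width, half_width] by the next discernible deviation. With the
# standard errors bounded below, the deviations do not shrink to zero, so
# the interval diverges toward the trivial statement theta in (-inf, inf).
delta_sequence = [0.5] * 20  # illustrative: a sequence with a nonzero lower bound
half_width = 0.0
widths = []
for d in delta_sequence:
    half_width += d
    widths.append(half_width)
# widths is strictly increasing; after k steps the implied interval has
# half-width k * 0.5, growing without bound as the iterations continue.
```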
Can we consider a similar construction of implied hypotheses using data-specific concepts such as CIs or severity measures (Mayo, 2010; Mayo & Spanos, 2010)? For example, based on the CI might we consider our results as formally testing the implied hypotheses θ ∈ [L, U] and θ ∉ [L, U], where L and U denote the calculated confidence limits? No: such intervals are functions of the data and therefore cannot serve as an a priori partition of the parameter space.
We could, however, use these data-specific values as part of a formal test. For example, regarding the CI, we could use the underlying interval estimator (L, U) as a test statistic and would thereby require its sampling distribution to calculate the probability of the statistic taking on values at least as extreme as the calculated data-specific CI. Here it is important to remember that the CI as calculated from data is not a statistic (i.e., not a random variable) upon which a formal test can be based: It is a single realization from the distribution of an underlying interval statistic (L, U). A formal test would be based on the distribution of (L, U). If such a test were constructed, the logic of its use would follow that as described in this article. Nonetheless, these types of data-specific values are not commonly used in formal statistical testing as defined here.
Why Test Hypotheses?
Statistical hypothesis testing is common when a researcher wishes to determine a substantive claim. If the truth or falsity of the substantive claim can be identified with the truth or falsity of a statistical hypothesis, then hypothesis testing can be used to inform judgments about the substantive claim. This is the basis for hypothesis-driven science. For example, Kan (2007) derived and tested statistical hypotheses from the claim that time inconsistent preferences with hyperbolic discounting explains lack of self-control among smokers. Cook, Orav, Liang, Guadagnoli, and Hicks (2011) tested the hypothesis that disparities in the placement of implantable cardioverter-defibrillators (ICD) can be explained by the underutilization of ICD implantation among clinically appropriate racial/ethnic minorities and women and the overutilization of the procedure among clinically inappropriate Whites and men. Veazie (2006) derived and tested statistical hypotheses from the claim that variation in individuals’ perceptions of those with chronic medical conditions is explained by Ames’s (2004a, 2004b) theory of social inferences. It is the fact that the estimate is consistent with a hypothesized set of parameter values and inconsistent with others that constitutes evidence for the hypothesis, not the raw-scale distance (i.e., magnitude) from the boundary between such sets.
Statistical hypothesis testing is also used when the goal of estimation is of interest only if the parameter is in a particular range of values. The cost–benefit example presented above is one such case. Another is when researchers determine whether a variable predicts an outcome. Stating that something “will either go up or not go up” clearly does not constitute an informative prediction, and stating that something “will either go up or down” (e.g., an inference from a significant two-tailed test) is not much better. Consequently, identifying predictors typically requires isolating a direction. In this case, it is reasonable for the researcher to first address the three hypotheses that (1) the parameter is greater than zero, (2) the parameter is less than zero, and (3) the parameter is equal to zero. Because, in a continuous parameter space, the third hypothesis is a point on the boundary between the first two, testing this set of hypotheses reduces in practice to essentially testing the disjunction of one of the first two with the third. If the estimate conforms to (1), then it is statistically tested against the disjunction of (2) and (3). If the estimate conforms to (2), then it is statistically tested against the disjunction of (1) and (3). If an adequate judgment regarding the truth or falsity of these hypotheses can be made, then the researcher continues with the estimation goal and interprets point or interval estimates accordingly.
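The three-hypothesis procedure for identifying a predictor's direction reduces, as described above, to testing one disjunction or the other depending on where the estimate falls. A minimal sketch (the return strings are illustrative labels, not the article's notation):

```python
def directional_test_plan(estimate: float) -> str:
    """Given the a priori hypotheses (1) theta > 0, (2) theta < 0, and
    (3) theta = 0, return which disjunction the estimate must be tested
    against; the estimate conforms to the remaining hypothesis."""
    if estimate > 0:
        return "test the disjunction of (2) and (3): theta <= 0"
    if estimate < 0:
        return "test the disjunction of (1) and (3): theta >= 0"
    return "estimate conforms to (3); test (1) and (2)"

plan = directional_test_plan(0.8)
```

Only if the disjunction can be ruled out does the researcher proceed to the estimation goal with a defensible direction in hand.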
Discussion
The objective of this article was to present a coherent Frequentist logic of testing. To do so, I distinguished the goal of hypothesis testing from that of estimation, and presented a logic for the former that does not confuse it with the latter. The key points include (a) hypotheses are expressed as a partition of the parameter space specifying the distribution of random variables associated with a data generating process, (b) which of the a priori specified hypotheses are statistically tested cannot generally be known before the parameter is estimated (the exception being when a point hypothesis is involved), (c) the parameter estimate is consistent with a hypothesis to which the parameter estimate conforms and thereby this hypothesis does not require statistical testing, (d) all hypotheses to which the estimate does not conform are subject to statistical testing to rule them out as alternative explanations, (e) the element in the hypothesis’ set of values that produces the largest p value is used to test the hypothesis, and (f) except in the case of a point hypothesis, an estimate can provide either evidence for or against hypotheses (or sets of hypotheses), or remain ambiguous regarding them.
When testing hypotheses, researchers should report whether there is either evidence for or against hypotheses. Moreover, ambiguous findings (i.e., insignificant findings) should not be reported as evidence from a formal test for a hypothesis. For example, the common practice of treating insignificant results of a formal two-tailed test as evidence that there is no effect should be avoided. Instead, it should be acknowledged that the data cannot distinguish hypotheses or cannot rule out certain alternatives.
In this article, I focused on hypotheses about a single parameter. The presented logic naturally extends to hypotheses regarding multiple parameters as well (e.g., hypotheses regarding two parameters θ and γ such as H1: [θ > 0 and γ > 0] and H2: [θ ≤ 0 or γ ≤ 0]). See the online appendix for a description of hypothesis testing with multiple parameters.
For clarity of presentation, I adopted the standard concept of a threshold (i.e., significance level) to categorically determine whether data are consistent with a hypothesis; an approach that leads to the common use of categorical statements such as having a “significant result” or an “insignificant result.” This does not preclude the determination of a set of thresholds to define multiple categories of evidence such as weak evidence (e.g., perhaps .1 ≥ p > .05), moderate evidence (e.g., perhaps .05 ≥ p > .01), and strong evidence (e.g., perhaps p ≤ .01). Note, however, that such thresholds are arbitrary, relative to a scientist’s judgment, or conventional, relative to expectations of a community of scholars (e.g., as broad as a discipline or field of study, and as narrow as a specific journal). It should also be clear that it is not necessary to adopt formal thresholds at all in the application of the presented logic: A scientist may directly interpret the evidential value of the p value: for example, notwithstanding the conventional .05 significance level, a scientist may consider p values of .052 and .048 as essentially equivalent in their evidential bearing, perhaps judging both indicate the data are inconsistent with the hypothesis being tested. Moreover, the logic described here can be applied by considering the p value as a continuous measure of consistency with a value contained in the hypotheses with which the data do not conform. Nonetheless, unlike Bayesian methods, the logic of formal Frequentist hypothesis testing does not imply statements of mathematical probability reflecting subjective beliefs. Consequently, the p value (i.e., the probability that a data generating process would produce a statistic value as extreme as that observed given hypothesized distributional characteristics) requires interpretation by the scientist in light of the context and scientific goals.
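The multi-threshold reading of the p value sketched above might be coded as follows; the cutoffs are the article's own illustrative examples and are, as noted, arbitrary or conventional rather than principled.

```python
def evidence_category(p: float) -> str:
    """Map a p value to an evidence category using illustrative thresholds
    (the article's example cutoffs: .1, .05, and .01)."""
    if p <= 0.01:
        return "strong evidence against the tested hypothesis"
    if p <= 0.05:
        return "moderate evidence against the tested hypothesis"
    if p <= 0.1:
        return "weak evidence against the tested hypothesis"
    return "ambiguous: not evidence for or against the tested hypothesis"
```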
A final point of clarification may be helpful. I have mentioned the need for a priori specification of hypotheses but also the fact that one cannot determine which hypothesis (or hypotheses) will be statistically tested before observing the estimate (except when including a point hypothesis). These do not conflict. The first is an epistemic requirement for using test results as evidence. The second is a logical consequence of the first bearing on the process of establishing evidence. The first means you should not use the data to determine whether you are addressing, for example, the hypotheses H1: θ ≤ 0 and H2: θ > 0, or you are addressing the hypotheses H1: θ = 0 and H2: θ ≠ 0. This specification should be determined a priori. Notice this precludes arbitrarily doubling your power given your results. Suppose, however, I were to decide ahead of time that I will statistically test H2: θ > 0 and I subsequently obtain a positive estimate. The estimate conforms to H2, so by the preceding logic H2 does not require testing; it is H1 that must be tested. What is properly specified a priori is the partition of hypotheses, not which of them will be statistically tested.
By following the logic of formal statistical testing presented here, a researcher does not confuse the goal of testing with that of estimation and can thereby avoid the conflict inherent in interpreting results of testing and interpreting raw-scale magnitudes of estimation. However, a limitation of hypothesis testing is that it provides evidence solely for the truth or falsity of the specified hypotheses. It is the responsibility of the researcher to justify the value of knowing this. For the case in which knowing the truth or falsity of a hypothesis is not important, formal hypothesis testing is not an appropriate goal: estimation, or other informative means, without the pretense of formal testing, may be the better objective.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research and/or authorship of this article.
