Abstract

Researchers may want to know whether an observed statistical relationship is either meaningfully negative, meaningfully positive, or small enough to be considered practically equivalent to zero. Such a question cannot be addressed with standard null hypothesis significance testing or standard equivalence testing. Three-sided testing (TST) is a procedure to address such questions by simultaneously testing whether a relationship is significantly bounded below, within, or above predetermined smallest effect sizes of interest. TST is a natural extension of the standard procedure of two one-sided tests (TOST) for equivalence testing. TST offers a more comprehensive decision framework than TOST with no penalty to error rates or statistical power. In this article, we give a nontechnical introduction to TST; provide commands for conducting TST in R, Jamovi, and Stata; and provide a Shiny app for easy implementation. Whenever a meaningful smallest effect size of interest can be specified, TST should be combined with null hypothesis significance testing as a standard frequentist testing procedure.

Keywords

three-sided testing, equivalence test, hypothesis test, TST, NHST, TOST, SESOI, tutorial, R, Jamovi, Stata, open materials

Researchers often use standard null hypothesis significance testing (NHST) to evaluate the null hypothesis ( $H_{0}$ ) that a given statistical relationship does not exist. If $H_{0}$ is rejected in standard NHST, then researchers conclude that the relationship of interest is nonzero and existent.

A problem with standard NHST is that researchers can only ever reject $H_{0}$ ; they can never accept it. Put differently, standard NHST never lets researchers conclude that the relationship of interest is zero. This is because statistically insignificant results are ambiguous in standard NHST. Large $p$ values can signal precisely measured null relationships or large relationships that are imprecisely measured. Standard NHST makes no distinction between these cases.

This limitation of standard NHST leads to two substantial problems. First, researchers commonly misinterpret statistically insignificant results as evidence that the relationship of interest is negligibly small even when the chance of a meaningfully large, noisily estimated relationship is high (Aczel et al., 2018; Fitzgerald, 2025; Gates & Ealing, 2019; Greenland et al., 2016). Second, researchers who are familiar with only standard NHST lack an inferential method to conclude that a relationship is null, which they may often want to do. To provide just a few examples, researchers may want to test whether a drug has no negative side effects, whether a new treatment works as well as existing treatments, whether two populations are similar on some attribute, and so on. None of these predictions can be tested with standard NHST. Figure 1a demonstrates the problem visually. The first (blue) and second (orange) estimates would yield identical statistical significance conclusions in standard NHST even though the null hypothesis is clearly better supported by the first (blue) estimate than by the second (orange) estimate.

Fig. 1.

Contrasting standard (a) null hypothesis significance testing (NHST) with (b) two one-sided testing (TOST). Four identical estimates are tested with either standard NHST or TOST. The 1 – $α$ confidence intervals are plotted. (a) Statistical inferences for test results analyzed with standard NHST. (b) Statistical inferences for test results analyzed with TOST. $D_{L}$ and $D_{U}$ represent the lower and upper smallest effect size of interest (SESOI) bounds, respectively. The red $H_{0}$ regions are more extreme than the SESOI, and the $H_{1}$ region is closer to zero than the SESOI.

A remedy to this situation is the frequentist procedure of two one-sided tests (TOST) for equivalence testing (see Lakens et al., 2018). The TOST procedure allows researchers to test whether a relationship is smaller than the “smallest effect size of interest” (SESOI) for that relationship. In this procedure, a relationship is considered to be “practically equal to zero” if it can be significantly bounded both (a) above $D_{L}$ and (b) below $D_{U}$ using one-sided tests, in which $D_{L}$ and $D_{U}$ are the lower and upper SESOI bounds, respectively. For example, under the TOST procedure, one would conclude that the first (blue) estimate in Figure 1b is practically equal to zero, but one would not reach this conclusion for the second (orange), third (green), and fourth (red) estimates. The TOST procedure gives researchers a method to distinguish imprecisely estimated and potentially meaningful relationships from precisely estimated and trivially small relationships.

However, the TOST procedure has its own limitation: In TOST, it is not possible to accept the hypothesis that the relationship of interest is large enough to care about—that is, that it is more extreme than the SESOI. Put differently, the TOST procedure does not distinguish between estimates that significantly exceed the SESOI (e.g., the third green estimate in Fig. 1b) and estimates whose relationship with the SESOI is uncertain (e.g., the second orange and fourth red estimates in Fig. 1b).

A common but flawed solution to this problem is to combine standard NHST with the TOST procedure (e.g., see Campbell & Gustafson, 2018). This allows researchers to test both whether the relationship of interest is different from zero and whether it is smaller than the SESOI. However, this procedure does not actually test if the relationship is large enough to care about. Even a combination of standard NHST and equivalence-testing approaches cannot tell whether there is significant evidence that a relationship is larger than its SESOI. For example, the bottom red estimate in Figure 1 is significantly greater than zero (Fig. 1a) and not significantly smaller than the SESOI (Fig. 1b). According to common practice, this would often be interpreted as a meaningfully large result even though it obviously does not indicate strong evidence for a relationship larger than the SESOI. Consequently, it is not safe to assume that a significant standard-NHST result plus a statistically insignificant TOST result indicates evidence for a practically large relationship.

To obtain significant evidence that a relationship is larger than its SESOI, researchers need minimum-effects tests (see Murphy & Myors, 1999). Such procedures can test for inferiority, assessing whether a relationship is at least as negative as its lower SESOI bound (see Fig. 2a). They can also test for superiority, assessing whether a relationship is at least as positive as its upper SESOI bound (see Fig. 2c).

Fig. 2.

Hypotheses for (constituent tests of) the three-sided testing (TST) approach. (a) Hypotheses for an inferiority test. (b) Hypotheses for an equivalence test, such as the two one-sided tests (TOST) procedure. (c) Hypotheses for a superiority test. (d) Hypotheses for the TST procedure. $D_{L}$ and $D_{U}$ represent the lower and upper bounds of the smallest effect size of interest, respectively.

In practice, researchers may often want to test for superiority, inferiority, and equivalence all at the same time. Fortunately, there is a formal procedure that lets one conduct all three tests simultaneously while controlling error rates across all tests, known as three-sided testing (TST). TST is simply a combination of TOST and minimum-effects tests. Initially developed in the field of medical statistics (Goeman et al., 2010), TST is a relevant procedure in a broad range of disciplines, including psychology. In any research project in which analyses include tests for statistical equivalence, TST is a uniform improvement over the more widely adopted TOST procedure. TST permits the same power for equivalence testing as the TOST procedure while simultaneously allowing researchers to test for significant evidence that relationships are bounded outside of their SESOIs, which the TOST procedure alone cannot do. TST thus gives researchers a comprehensive testing framework to assess the practical significance of statistical relationships.

What Is the TST Procedure?

The TST procedure tests three mutually exclusive hypotheses, all of which are related to the same relationship of interest and each of which should lead to a meaningfully different conclusion if accepted:

(a) $H_{Inf}$: relationship of interest $< D_{L}$ (inferiority)

(b) $H_{Eq}$: $D_{L} \leq$ relationship of interest $\leq D_{U}$ (equivalence)

(c) $H_{Sup}$: $D_{U} <$ relationship of interest (superiority)

$D_{L}$ and $D_{U}$ refer to the lower and upper SESOI bounds, respectively. In words, one is testing whether the relationship is less than the lower SESOI bound, greater than the upper SESOI bound, or contained within these two bounds. These hypotheses can be tested separately using a two-sided test against the lower SESOI bound to test inferiority (Fig. 2a), a TOST procedure to test for equivalence (Fig. 2b), and a two-sided test against the upper SESOI bound to test for superiority (Fig. 2c). Although the inferiority and superiority hypotheses are technically one-sided, two-sided tests on these hypotheses are required to control error rates from simultaneous tests on all three hypotheses (Goeman et al., 2010; see also TST Using One-Sided Tests).

The TST procedure simply entails running all the tests listed above at once and then using the combined result of all tests to make an overall inference about the practical significance of the effect (see Fig. 2d). For a more technical introduction to TST, see Goeman et al. (2010). In this article, we present an identical formulation of the original TST procedure, combining a TOST procedure, an inferiority test, and a superiority test to make the procedure more easily understandable for researchers who are already familiar with the TOST procedure for equivalence testing. In what follows, we offer two hypothetical examples of psychological-research projects in which TST could be used for effective data-driven decision-making.

Example 1: comparing treatment options for insomnia

Suppose researchers are studying the effects of a new treatment for insomnia. The standard care for individuals suffering from chronic insomnia is sleep-inducing medication combined with cognitive-behavioral therapy for insomnia (CBT-I). However, a new therapeutic approach has been proposed as a substitute for CBT-I: acceptance and commitment therapy for insomnia (ACT-I).

The researchers would like to know whether it is preferable or even clinically responsible to offer ACT-I instead of standard CBT-I care to insomnia patients. Suppose then that they commission a large-scale randomized controlled trial to study the effectiveness of ACT-I relative to standard insomnia treatment. Insomnia patients in the control group receive the standard course of CBT-I treatment, and patients in the treatment group receive ACT-I instead. The outcome of interest for this trial is the difference in average minutes of sleep per night at endline between patients receiving ACT-I and patients receiving CBT-I. The research team sets an SESOI for this trial based on the number of additional minutes of sleep per night that patients would consider large enough to warrant a change in treatment course. The researchers run a longitudinal survey of patients receiving standard care and find that 15 min per night seems to be the smallest change in sleep duration that patients would consider a large enough benefit to merit changing their course of treatment. This implies that the SESOI for differences in insomnia care is 15 min of sleep per night and that $D_{L}$ and $D_{U}$ are, respectively, −15 min and 15 min of sleep per night.

The researchers develop an official evidence-based policy for the use of ACT-I in insomnia treatment based on TST results:

Inferiority: If ACT-I leads to a reduction of more than 15 min in average nightly sleep duration compared with CBT-I, then therapists will be discouraged from offering ACT-I.

Equivalence: If ACT-I leads to a change in nightly sleep duration within $\pm 15$ min of CBT-I, then therapists may choose between the approaches based on personal experience and patient preference.

Superiority: If ACT-I leads to an increase of more than 15 min in average nightly sleep duration compared with CBT-I, then ACT-I will be recommended to replace CBT-I as the standard care offered to insomnia patients.

Inconclusive: If the practical significance of differences between the sleep improvements yielded by ACT-I and CBT-I is inconclusive, then a replication study will be conducted, and results from the two studies will be pooled to provide a conclusive test. If this pooled study still does not yield a precise result, then the mean minutes of sleep gained or lost will be used as a temporary basis for policy until more data become available, and no additional recommendations will yet be made.

Example 2: behavioral research on sales tactics

Consider a set of researchers who are advising a services company. The company sells service packages to clients, offering small, medium, or large packages (increasing in price). Company leadership has noticed that some branches attempt to anchor clients on purchasing the medium package by telling clients that the medium package is the most popular purchase. There is debate among company leadership as to whether this sales tactic is a good idea. Anchoring clients on the medium package will increase sales if this tactic draws clients away from purchasing the small package or no package at all, but will decrease sales if the medium package instead substitutes for the large package, or if the anchoring tactic discourages purchases of any package at all.

Suppose, then, that the company runs an experiment to assess the effects of the anchoring tactic in sales pitches. Company branches are randomly assigned into a treatment group that uses the anchoring tactic and a control group that does not. The company estimates that the cost of coordinating use of a single sales tactic across branches would be $10,000 per month. They therefore decide that any difference in branch-level monthly sales below $10,000 is practically equal to zero and consequently that any difference in such sales exceeding $10,000 is practically meaningful. This implies that the SESOI for this intervention in this company is $10,000 and that $D_{L}$ and $D_{U}$ are, respectively, −$10,000 and $10,000.

The company’s official stance on the anchoring tactic will depend on the practical significance of the tactic’s impact on sales:

Inferiority: If the tactic leads to a loss in revenue larger than $10,000 per month, then the company will conclude that sales pitches with anchoring are inferior to pitches without anchoring and will ban use of the anchoring tactic in sales pitches across all branches.

Equivalence: If the effect is significantly bounded between −$10,000 and $10,000 per month, then the company will conclude that the tactic’s effect is practically equal to zero and will allow branches to freely choose whether to use the tactic.

Superiority: If the tactic’s effect is significantly greater than $10,000 per month, then the company will conclude that the sales tactic is superior and will mandate its use in sales pitches throughout the company.

Inconclusive: If none of these conclusions can be reached with statistical significance, then the company will delay its decision until more data become available (to ensure error rates are controlled in this case, TST should ideally be combined with a sequential-analysis procedure; see Lakens, 2022).

How to Execute and Interpret TST

In what follows, we start by discussing methods for specifying an SESOI for TST. We then present two identical methods for conducting TST. The methods can be applied to any statistical estimate for which (a) a standard error can be computed and (b) test statistics can be reasonably assumed to be Student $t$ or normally distributed. We direct readers to Jané et al. (2024) for parametric-standard-error formulas across a wide range of standardized effect sizes that can be used in this testing framework. We then give examples of how to conduct TST in practice using the ShinyTST app (https://jack-fitzgerald.shinyapps.io/shinyTST/). In the Supplemental Material available online, we also show how to conduct TST in R (https://osf.io/hr863) and Jamovi (https://osf.io/ncxvg). These supplementary files can easily be adapted to fit different data sets and specifications.

Specifying an SESOI

The first step in TST is to specify a meaningful SESOI. Just as in equivalence testing, specifying an SESOI is the most challenging aspect of TST. In general, how you should proceed with SESOI specification depends very much on the specific question you want to answer and the research field you are working in. No one approach will fit all contexts. To give an impression of the range of SESOI-specification strategies available, we outline four concepts that may be helpful to researchers seeking to define SESOIs in a wide variety of research contexts. In what follows, we focus our recommendations on methods that reduce researcher degrees of freedom whenever possible and highlight methods that are susceptible to “SESOI-hacking.”

Unit interpretability and standardized effect sizes

SESOIs are difficult to specify unless one’s parameter of interest is “unit-interpretable,” that is, unless it is measured in a unit that can be meaningfully reasoned about. For example, the effect of a binary treatment on a dummy variable indicating a depression diagnosis is unit-interpretable because people can easily interpret what it means to switch from control to treatment conditions and can relatively easily interpret what it means for the probability of depression to increase or decrease. Likewise, a regression of annual salary on years of education yields a unit-interpretable parameter because people can easily reason about how many more or fewer dollars per year they will make if they undertake 1 additional year of education.

A common example of a parameter that is not unit-interpretable is a group difference in a Likert scale. Not only does the difference between levels of a Likert scale differ depending on the number of items on the scale and the specific wording of the labels on each item, but also, the meaning of a linear move from one item to the next will conceptually differ depending on the construct measured by the Likert scale.

Whenever possible, it is preferable to establish SESOIs based on unit-interpretable effect sizes. Unfortunately, most standardized-effect-size measures applied in psychology are not unit-interpretable. Some examples of unit-interpretable standardized-effect-size measures include odds ratios (which are often estimated in biomedical research) and elasticities (which are regularly estimated in economics).

Although standardized effect sizes are often less unit-interpretable than “raw” effect sizes, they nonetheless have unique interpretability properties. Because standardized effect sizes are often used across studies, they make effect sizes comparable across studies. This specific utility of standardized effect sizes was directly referenced in historical American Psychological Association statistical-reporting guidelines, which encouraged the use of effect-size measures that could be easily compared with those of prior studies (Wilkinson & American Psychological Association Task Force on Statistical Inference, 1999).

Standardized mean differences, such as Glass’s (1976) $\Delta$, Hedges’s (1981) $g$, and Cohen’s (1988) $d$, can be reasoned about in a unit-interpretable way by considering the extent to which a given input shifts one’s outcome to a different quantile of the population distribution. Standardized mean differences are computed by dividing the mean difference in an outcome between two groups by some measure of the outcome’s standard deviation. Hedges’s $g$ and Cohen’s $d$ divide the difference in these two groups’ mean outcomes by a pooled standard deviation of the outcome (computed using data from both groups), whereas Glass’s $\Delta$ is computed by dividing that mean difference by the outcome standard deviation in the control group (which aligns more with practice in other social sciences, e.g., economics). Such standardized mean differences tend to be $t$-distributed. This is useful because the cumulative distribution function of the $t$ distribution can be used to convert a standardized mean difference into a shift in distributional quantiles. For example, if Glass’s $\Delta = 2$, then one knows that the treatment group’s average outcome approximately sits in the 97.5th percentile of the control group’s distribution of that outcome. Thus, if a researcher can meaningfully reason about the benefits and costs associated with moving an individual from one quantile to another on the distribution of a given outcome variable, then standardized mean differences can be rendered a unit-interpretable effect-size measure so long as a reasonable justification is provided based on this distributional interpretation and this distributional difference is linked to costs and benefits (see Cost–Benefit Analysis).
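The quantile conversion described above can be sketched in a few lines. This is only an illustration: it assumes the outcome is approximately normally distributed in the control group and uses the standard normal cumulative distribution function as a large-sample stand-in for the $t$ distribution. The function name `smd_to_percentile` is hypothetical.

```python
from statistics import NormalDist


def smd_to_percentile(smd: float) -> float:
    """Percentile of the control-group distribution at which the
    treatment-group mean sits, for a standardized mean difference
    scaled by the control-group standard deviation.

    Assumption: the outcome is approximately normal in the control group.
    """
    return 100 * NormalDist().cdf(smd)


# A standardized mean difference of 2 places the treatment-group mean
# near the top of the control-group distribution.
print(round(smd_to_percentile(2.0), 1))  # 97.7
```

Under this normal approximation, a difference of 2 standard deviations maps to roughly the 97.7th percentile, close to the 97.5th percentile quoted in the text.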

This distributional approach is far superior to using benchmark values (such as those proposed by Cohen, 1988) to set SESOIs, a practice that methodological guides on equivalence testing and SESOI determination often discourage (see Anvari & Lakens, 2021; Bakker et al., 2019; Funder & Ozer, 2019; Lakens et al., 2018). In many disciplines, there is documented evidence that canonical small effect-size benchmarks are larger than a substantial proportion of published effect sizes (Gignac & Szodorai, 2016; Kraft, 2020; Lovakov & Agadullina, 2021; Paterson et al., 2016). Even SESOIs set using benchmark values based on the distribution of published effect sizes for a given parameter are problematic. Principally, the distribution of published effect sizes is well known to be biased in favor of statistically significant results (Brodeur et al., 2023; Franco et al., 2014; Moniz et al., 2025). This bias, in turn, inflates published effect sizes, which are considerably larger than those from (robustness) replications (see Brodeur et al., 2024; Camerer et al., 2016, 2018; Gelman & Carlin, 2014; Open Science Collaboration, 2015; Yang et al., 2023). Furthermore, even in a discipline with no publication bias, a published effect size that is practically meaningful in one context may be practically negligible in another; SESOIs for individual studies must be idiosyncratically determined for a specific study’s context. In addition, the large, heterogeneous menu of published estimates typically available to researchers can create avenues for researcher degrees of freedom and opportunities to “hack” the SESOI.

In many contexts relevant to psychological research, one can convert unit-uninterpretable parameters into unit-interpretable parameters by dichotomizing outcomes based on a practically meaningful division. For example, scales used to diagnose mental-health conditions, such as stress or depression, can be dichotomized at thresholds used for clinical diagnosis of those conditions. One practice common in economics is to dichotomize variables using a dummy indicating whether the variable is above or below the median value in the sample. Treatment effects on these dichotomized outcomes are unit-interpretable as effects on diagnosis probability or as effects on the probability of being “above median” on a certain outcome. A key drawback to this approach is that dichotomizing a continuous variable can reduce power by discarding informative variation (Ragland, 1992); so if possible, it is useful for researchers designing new studies to measure constructs in units that render the end parameter that will be estimated in their study unit-interpretable.

Eliciting SESOIs from surveys

Regardless of whether a given parameter is unit-interpretable, the SESOI for that parameter can be reliably established by eliciting judgments on the smallest practically meaningful effect size from independent samples. In psychological research, Anvari and Lakens (2021) termed this an “anchor-based” approach for determining SESOIs. However, anchor-based approaches for SESOI determination actually constitute a broader class of SESOI-determination methods originating in the medical literature, which “anchor” SESOIs to some other estimate (e.g., the medical impact of a major life event or aging a certain number of years; see Lydick & Epstein, 1993). Because such anchor-based methods for setting SESOIs can rely on many possible anchors (see Devji et al., 2020), allowing researchers to select their own anchor creates considerable researcher degrees of freedom and risks of Region of Practical Equivalence (ROPE)-hacking. However, in this broader class of approaches, the SESOI-determination method that Anvari and Lakens identified as anchor-based most closely aligns with methods for determining minimal (clinically) important differences.

Minimal important differences are often elicited from patients or physicians by surveying the smallest effect size that would be large enough to justify changing a course of treatment (Ferreira et al., 2012). For instance, the insomnia-treatment experiment described above uses such a survey to identify the smallest change in sleep that patients would consider clinically meaningful. Outside of the clinical context, Anvari and Lakens (2021) recommended surveying people about the smallest effect size that they would find perceptible. A more generally applicable method in cases in which direct stakeholders are difficult to identify or reach is surveying researchers for their judgments on SESOIs; for a more comprehensive guide on such surveys, see Fitzgerald (2025). Regardless of whether they are elicited from stakeholders or experts, the key component that can make SESOIs elicited from surveys credible is that such SESOIs are determined from a data source that is independent of the research team and their data.

Smallest measurable differences

One potentially useful SESOI is the smallest measurable difference for a given instrument. For example, Brañas-Garza et al. (2021) used equivalence testing to show that the effect of hypothetical stakes on the number of risky choices made on a multiple-price list is bounded beneath one risky choice. The SESOI could also be set to 1 for a single Likert item or the number of correct answers on a test because this is similarly equivalent to the smallest measurable difference in those outcomes. In cases in which such measures are present on both sides of a regression equation, the commensurate smallest measurable difference would be equivalent to a 1-point increase in the independent variable being associated with a 1-point change in the dependent variable (i.e., a regression coefficient of $\pm 1$ ).

A 1-point shift in a composite Likert scale, that is, a scale composed using a (weighted) average, is not the same as the smallest measurable difference for that scale and should not be interpreted as one. Consider the widely used nine-item Patient Health Questionnaire (PHQ-9) for depression diagnosis, which is composed of 4-point Likert items (Kroenke et al., 2001). The PHQ-9 scale is often reported as a raw sum of points across its nine items, ranging from 0 to 27. At this unit scale, a 1-point difference is the smallest measurable difference in the PHQ-9. However, the PHQ-9 could in principle be converted into a composite scale by dividing this sum total by 9, effectively creating an average score per item; many Likert scales are transformed in this way in psychological research. It would be inappropriate to interpret a 1-point difference in this composite scale as the smallest measurable difference for the PHQ-9. Consider that Kroenke et al. (2001) found that the PHQ-9’s raw standard deviation is 6.1 in patients with major depressive disorder. Because dividing the sum by 9 also divides its standard deviation by 9, a 1-point difference in a composite PHQ-9 scale would amount to nearly 1.5 SD (9/6.1 ≈ 1.48), whereas a 1-point difference in the raw scale would be just over 0.16 SD.
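The arithmetic behind this comparison can be checked directly. The short sketch below uses only the figures quoted above (a raw standard deviation of 6.1 and nine items); the variable names are illustrative.

```python
RAW_SD = 6.1   # raw PHQ-9 standard deviation quoted from Kroenke et al. (2001)
N_ITEMS = 9    # number of PHQ-9 items

# Dividing the summed score by 9 also divides its standard deviation by 9.
composite_sd = RAW_SD / N_ITEMS

# Size of a 1-point difference, expressed in standard deviations.
raw_point_in_sd = 1 / RAW_SD
composite_point_in_sd = 1 / composite_sd

print(round(raw_point_in_sd, 2))        # 0.16
print(round(composite_point_in_sd, 2))  # 1.48
```

The same 1-point difference is therefore about nine times larger, in standard-deviation terms, on the composite scale than on the raw scale.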

Cost–benefit analysis

In applied settings in which effects can be tied to specific costs and benefits (for example, when a certain intervention has a known cost and unit-interpretable benefits have a distinct financial valuation), it may be possible to specify SESOIs based on cost–benefit analysis. For example, the sales-tactic experiment described above uses a cost–benefit analysis to set the SESOI. When the SESOI is set in this way, one can explicitly test whether a given observed effect is large enough for intervention benefits to outweigh costs. See Fitzgerald (2025) for more detailed guidance on practical significance testing with SESOIs determined through cost–benefit analysis.

TST using one-sided tests

The first way to conduct TST is by obtaining $p$ values from the four constituent tests involved in TST and then interpreting the joint test result. In the statistical software of your choice, for any relationship you wish to test, run all four of the one-sided tests that make up the inferiority, superiority, and equivalence tests specified in Figures 2a to 2c. Let $μ$ represent the relationship of interest. Then, the four relevant one-sided tests are as follows:

Inferiority: $H_{0}: μ \geq D_{L}$. $H_{A}: μ < D_{L}$.

Lower-bound equivalence test: $H_{0}: μ < D_{L}$. $H_{A}: μ \geq D_{L}$.

Upper-bound equivalence test: $H_{0}: μ > D_{U}$. $H_{A}: μ \leq D_{U}$.

Superiority: $H_{0}: μ \leq D_{U}$. $H_{A}: μ > D_{U}$.
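As a minimal sketch of the four tests listed above, the one-sided $p$ values can be computed from an estimate and its standard error. The sketch assumes normally distributed test statistics (a large-sample stand-in for Student $t$). The function name `tst_p_values` and the example numbers are hypothetical and are not taken from the article's supplementary files.

```python
from statistics import NormalDist


def tst_p_values(estimate, se, d_lower, d_upper):
    """One-sided p values for the four constituent tests of TST.

    Assumption: the test statistic (estimate - bound) / se is
    approximately standard normal under each null hypothesis.
    """
    z = NormalDist()
    z_lower = (estimate - d_lower) / se  # distance from the lower SESOI bound
    z_upper = (estimate - d_upper) / se  # distance from the upper SESOI bound
    return {
        "inferiority": z.cdf(z_lower),            # H_A: mu < D_L
        "equivalence_lower": 1 - z.cdf(z_lower),  # H_A: mu >= D_L
        "equivalence_upper": z.cdf(z_upper),      # H_A: mu <= D_U
        "superiority": 1 - z.cdf(z_upper),        # H_A: mu > D_U
    }


# Hypothetical numbers: an estimated difference of 2 with a standard
# error of 5, tested against SESOI bounds of -15 and 15.
p_values = tst_p_values(estimate=2.0, se=5.0, d_lower=-15.0, d_upper=15.0)
for name, p in p_values.items():
    print(f"{name}: {p:.4f}")
```

The inferiority test and the lower-bound equivalence test use the same test statistic with opposite alternatives, so their $p$ values sum to 1. The same holds for the superiority test and the upper-bound equivalence test.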

Running these four tests will yield four one-sided $p$ values. We combine all four results to form one overall conclusion about the relationship of interest based on the TST procedure. Suppose that we set a significance level of $α = 5\%$. The procedure will yield one of the following four conclusions concerning the relationship of interest:

Practically significant and negative: If the one-sided inferiority-test statistic is significant at level $α / 2$ (i.e., $2p < .05$ or $p < .025$), then treat the relationship as inferior to the lower SESOI bound. The equivalence and superiority tests will yield insignificant results.

Practically equal to zero: If both of the two one-sided tests for equivalence are significant at level $α$ (i.e., $p < .05$), then treat the relationship as if it is no farther from zero than the SESOI. The inferiority and superiority tests will yield insignificant results.

Practically significant and positive: If the one-sided superiority-test statistic is significant at level $α / 2$ (i.e., $2p < .05$ or $p < .025$), then treat the relationship as superior to the upper SESOI bound. The inferiority and equivalence tests will yield insignificant results.

Inconclusive: If none of the above conditions hold, then remain uncertain about the practical significance of the relationship.
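The four-way decision rule above can be written as a short function. This is a sketch only: the function name `tst_conclusion`, the dictionary keys, and the example $p$ values are hypothetical and chosen for illustration.

```python
def tst_conclusion(p, alpha=0.05):
    """Map four one-sided p values to a TST conclusion.

    `p` is a dict with keys 'inferiority', 'equivalence_lower',
    'equivalence_upper', and 'superiority' (hypothetical naming).
    """
    # Inferiority and superiority are each tested at alpha / 2.
    if p["inferiority"] < alpha / 2:
        return "practically significant and negative"
    if p["superiority"] < alpha / 2:
        return "practically significant and positive"
    # Equivalence requires BOTH one-sided tests to be significant at alpha.
    if max(p["equivalence_lower"], p["equivalence_upper"]) < alpha:
        return "practically equal to zero"
    return "inconclusive"


# Hypothetical p values for an estimate close to zero and precisely measured.
near_zero = {"inferiority": 0.9997, "equivalence_lower": 0.0003,
             "equivalence_upper": 0.0047, "superiority": 0.9953}
# Hypothetical p values for an estimate just above the upper bound.
borderline = {"inferiority": 0.9999, "equivalence_lower": 0.0001,
              "equivalence_upper": 0.9600, "superiority": 0.0400}

print(tst_conclusion(near_zero))   # practically equal to zero
print(tst_conclusion(borderline))  # inconclusive
```

In the second example, the superiority $p$ value of .04 would pass a test at level $α$ but not at level $α / 2$, so the result stays inconclusive.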

Notice that even though we have set a significance level of 5%, the $α$ threshold that we consider for the inferiority and superiority tests is half that, set at 2.5%. This is necessary to adjust for multiple-hypothesis testing. If our relationship of interest is truly bounded between $D_{L}$ and $D_{U}$, then it is possible to make two Type I errors: one in which we mistakenly conclude that the relationship is bounded above $D_{U}$ and one in which we mistakenly conclude that it is bounded below $D_{L}$. Because the TST procedure involves simultaneous tests both for inferiority and for superiority, a simple Bonferroni correction across these two tests (which halves the effective $α$ for these tests) effectively controls the size of the test at our nominal significance level (Goeman et al., 2010). An identical method is to double the one-sided $p$ values for the inferiority and superiority tests, which is the method employed in existing statistical software; of course, for inferiority/superiority tests that produce $p > .5$, this adjusted $p$ value is mechanically set to $p = 1$.
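The two formulations of the correction can be compared in a few lines. This sketch assumes nothing beyond the rule stated above; the helper name `adjusted_p` is hypothetical.

```python
def adjusted_p(one_sided_p):
    """Double a one-sided inferiority/superiority p value, capped at 1."""
    return min(1.0, 2 * one_sided_p)


alpha = 0.05
for p in (0.01, 0.02, 0.03, 0.30, 0.70):
    halved_alpha_rule = p < alpha / 2          # compare raw p with alpha / 2
    doubled_p_rule = adjusted_p(p) < alpha     # compare doubled p with alpha
    # Both rules always reach the same decision.
    assert halved_alpha_rule == doubled_p_rule
    print(p, adjusted_p(p), doubled_p_rule)
```

Reporting the doubled $p$ value lets readers compare every TST result against the same nominal $α$.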

No multiple-hypothesis-testing corrections are required for the lower- and upper-bound equivalence tests because the TOST procedure already requires that both one-sided tests are significant at level $α$ before a relationship is deemed practically equal to zero (see Berger & Hsu, 1996). Even though the inferiority and superiority tests in TST must be conducted at significance level $α / 2$ to control TST’s error rates, we can still safely conduct TST’s equivalence test at significance level $α$ . This is a useful property of TST because it implies that TST can allow researchers to augment TOST with minimum-effects testing for inferiority and superiority without sacrificing any power or error-rate control for the equivalence test.
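The decision rule above can be sketched in a few lines of base R. This is a minimal illustration, not the implementation used by ShinyTST, eqtesting, or tsti: the function name tst_one_sided, its arguments, and the example inputs are hypothetical, and a symmetric SESOI (so that $D_{L} = - D_{U}$) is assumed.

```r
# Sketch of the TST decision rule using one-sided tests.
# tst_one_sided() and its argument names are hypothetical, not from a package.
tst_one_sided <- function(estimate, se, sesoi, df, alpha = 0.05) {
  d_l <- -abs(sesoi)  # lower SESOI bound
  d_u <-  abs(sesoi)  # upper SESOI bound

  # Inferiority and superiority tests: one-sided p values, doubled and
  # capped at 1 (the Bonferroni adjustment across these two tests).
  p_inferior <- min(1, 2 * pt((estimate - d_l) / se, df))
  p_superior <- min(1, 2 * pt((estimate - d_u) / se, df, lower.tail = FALSE))

  # Equivalence (TOST): both one-sided tests must be significant at alpha,
  # so the larger of the two p values is the equivalence p value.
  p_above_lower <- pt((estimate - d_l) / se, df, lower.tail = FALSE)
  p_below_upper <- pt((estimate - d_u) / se, df)
  p_equivalence <- max(p_above_lower, p_below_upper)

  conclusion <- if (p_inferior < alpha) {
    "practically significant and negative"
  } else if (p_superior < alpha) {
    "practically significant and positive"
  } else if (p_equivalence < alpha) {
    "practically equal to zero"
  } else {
    "inconclusive"
  }

  list(p_inferior = p_inferior, p_superior = p_superior,
       p_equivalence = p_equivalence, conclusion = conclusion)
}

# Hypothetical inputs: estimate 4, standard error 2, SESOI 10, 100 df
tst_one_sided(estimate = 4, se = 2, sesoi = 10, df = 100)$conclusion
```

Because the doubled p values are compared directly with $α$, no further halving of the threshold is needed inside the function.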

TST using confidence intervals

An identical procedure for conducting TST is to compute two confidence intervals (CIs) for the relationship of interest—one at $100 (1 - 2 α)$ % confidence and one at $100 (1 - α)$ % confidence. One can then inspect where these intervals fall relative to the upper and lower SESOI bounds. Assuming $α = 5 %$ , the wider 95% CI is used to evaluate whether the inferiority and superiority tests yield statistically significant results, and the 90% CI is used to assess whether the equivalence test yields statistically significant results.

To start, compute the $100 (1 - 2 α)$ % CI and $100 (1 - α)$ % CI for an estimate of your relationship of interest. Then, compare the upper and lower CI bounds with the upper and lower SESOI bounds. As described in TST Using One-Sided Tests, this CI-based procedure will yield one of the following four conclusions about the relationship of interest (assuming $α = 5 %$ ):

Practically significant and negative: If the $100 (1 - α)$ % CI (i.e., the 95% CI) falls entirely below the lower SESOI bound, then treat the relationship as inferior to the lower SESOI bound.

Practically equal to zero: If the $100 (1 - 2 α)$ % CI (i.e., the 90% CI) falls entirely within the SESOI bounds, then treat the relationship as if it is no farther from zero than the SESOI.

Practically significant and positive: If the $100 (1 - α)$ % CI (i.e., the 95% CI) falls entirely above the upper SESOI bound, then treat the relationship as superior to the upper SESOI bound.

Inconclusive: If none of the above conditions hold, then remain uncertain about the practical significance of the relationship.

Figure 3 illustrates this CI procedure for a significance level of $α = 5 %$ . For the equivalence test, what matters is that the entire 90% CI falls within the $H_{E q}$ region. For example, under TST, one would conclude that the third estimate from the top in Figure 3 is practically equal to zero even though the 95% CI overlaps with the SESOI bounds. In contrast, inferiority and superiority tests are significant only if the entire 95% CI falls outside the SESOI bounds. For instance, in the fifth estimate from the top in Figure 3, its 95% CI crosses the lower SESOI threshold, meaning that one cannot conclude at a 5% significance level that this relationship is inferior to $D_{L}$ .

Fig. 3.

Illustration of how different test results should be interpreted in three-sided testing (TST) at a significance level of $α = 5 %$ . Estimates are plotted alongside 90% confidence intervals (CIs; thicker bands) and 95% CIs (thinner bands). The 95% TST CIs (green bands) are displayed for the bottom estimates only to avoid visual clutter.
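The CI-based rule can be sketched in base R as follows. The function name tst_by_ci, its arguments, and the four example estimates are hypothetical and chosen only so that each of the four conclusions appears once; this is an illustration of the rule, not the code behind any of the software discussed in this article.

```r
# Sketch of the CI-based TST decision rule (tst_by_ci is a hypothetical name).
tst_by_ci <- function(estimate, se, d_l, d_u, df, alpha = 0.05) {
  # 100(1 - alpha)% CI: used for the inferiority and superiority tests
  ci_wide <- estimate + c(-1, 1) * qt(1 - alpha / 2, df) * se
  # 100(1 - 2 * alpha)% CI: used for the equivalence test
  ci_narrow <- estimate + c(-1, 1) * qt(1 - alpha, df) * se

  if (ci_wide[2] < d_l) {
    "practically significant and negative"
  } else if (ci_wide[1] > d_u) {
    "practically significant and positive"
  } else if (ci_narrow[1] > d_l && ci_narrow[2] < d_u) {
    "practically equal to zero"
  } else {
    "inconclusive"
  }
}

# Hypothetical estimates sharing one standard error and SESOI bounds of -5 and 5
estimates <- c(-9, -3, 1, 8)
sapply(estimates, tst_by_ci, se = 1.5, d_l = -5, d_u = 5, df = 200)
```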

The TST CI

Although the $100 (1 - α)$ % and $100 (1 - 2 α)$ % CIs can be used to intuitively obtain statistical-significance conclusions under the TST framework (as shown in Fig. 3), fully inverting the TST procedure shows that TST yields a unique, potentially asymmetric CI whose bounds depend on the location of both the estimate’s effect size and classic CI bounds. Let $\hat{μ}$ be an estimate of the relationship of interest, $S E (\hat{μ})$ be that estimate’s standard error, and $t_{q}$ be the $q$ th quantile of the $t$ distribution. Then, per Goeman et al. (2010), in the notation of this article, the $100 (1 - α)$ % TST CI’s lower bound can be written as

$$\mathrm{TSTCI}_{1 - α}^{-} = \begin{cases} \hat{μ} + t_{α}\, SE(\hat{μ}) & \text{if } \hat{μ} < D_{L} - t_{α}\, SE(\hat{μ}) \\ D_{L} & \text{if } D_{L} - t_{α}\, SE(\hat{μ}) \leq \hat{μ} \leq D_{L} - t_{α / 2}\, SE(\hat{μ}) \\ \hat{μ} + t_{α / 2}\, SE(\hat{μ}) & \text{if } D_{L} - t_{α / 2}\, SE(\hat{μ}) < \hat{μ} < D_{U} - t_{α / 2}\, SE(\hat{μ}) \\ D_{U} & \text{if } D_{U} - t_{α / 2}\, SE(\hat{μ}) \leq \hat{μ}, \end{cases}$$

(1)

and its upper bound can be written as

$$\mathrm{TSTCI}_{1 - α}^{+} = \begin{cases} D_{L} & \text{if } \hat{μ} \leq D_{L} - t_{1 - α / 2}\, SE(\hat{μ}) \\ \hat{μ} + t_{1 - α / 2}\, SE(\hat{μ}) & \text{if } D_{L} - t_{1 - α / 2}\, SE(\hat{μ}) < \hat{μ} < D_{U} - t_{1 - α / 2}\, SE(\hat{μ}) \\ D_{U} & \text{if } D_{U} - t_{1 - α / 2}\, SE(\hat{μ}) \leq \hat{μ} \leq D_{U} - t_{1 - α}\, SE(\hat{μ}) \\ \hat{μ} + t_{1 - α}\, SE(\hat{μ}) & \text{if } D_{U} - t_{1 - α}\, SE(\hat{μ}) < \hat{μ}. \end{cases}$$

(2)

Figure 3 displays the $100 (1 - α)$ % TST CI’s mechanics using triple-banded CIs. In words, and following Equations 1 and 2, (a) the $100 (1 - α)$ % TST CI hugs any bound of the $100 (1 - α)$ % classic CI contained between $D_{L}$ and $D_{U}$ ; (b) when the $100 (1 - α)$ % classic CI’s lower bound is less than $D_{L}$ , the $100 (1 - α)$ % TST CI’s lower bound hugs either $D_{L}$ or the lower bound of the $100 (1 - 2 α)$ % classic CI, whichever of the two is lesser; (c) when the $100 (1 - α)$ % classic CI’s upper bound is greater than $D_{U}$ , the $100 (1 - α)$ % TST CI’s upper bound hugs either $D_{U}$ or the upper bound of the $100 (1 - 2 α)$ % classic CI, whichever of the two is greater; and (d) when both bounds of the $100 (1 - α)$ % classic CI are less than $D_{L}$ (greater than $D_{U}$ ), the $100 (1 - α)$ % TST CI’s upper (lower) bound hugs $D_{L}$ ( $D_{U}$ ).

Note that the regions rejected by the TST CI yield the same practical-significance conclusions as the decision rule discussed in TST Using CIs, the latter of which is much simpler to interpret. The TST CI does not in any way change conclusions arising from the TST procedure discussed in TST Using CIs and TST Using One-Sided Tests but, rather, constrains the precision that can be reliably reported for the estimate when TST is implemented. Intuitively, this is because TST induces a power trade-off: Although TST provides researchers with more power to reach statistical-significance conclusions across multiple tests, this comes at the cost of reducing the precision with which researchers can bound the relationship of interest if it significantly exceeds the SESOI bounds.
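Equations 1 and 2 translate directly into code. The sketch below is a plain base R transcription under the article's notation; the function name tst_ci and its arguments are hypothetical, and it is not the routine used by ShinyTST, eqtesting, or tsti. It relies on the symmetry of the $t$ distribution, so that $t_{α} = - t_{1 - α}$ and $t_{α / 2} = - t_{1 - α / 2}$.

```r
# Sketch of the TST CI in Equations 1 and 2 (tst_ci is a hypothetical name).
tst_ci <- function(estimate, se, d_l, d_u, df, alpha = 0.05) {
  t_half <- qt(1 - alpha / 2, df)  # t_{1 - alpha/2}, equal to -t_{alpha/2}
  t_full <- qt(1 - alpha, df)      # t_{1 - alpha},   equal to -t_{alpha}

  # Lower bound (Equation 1), one branch per case
  lower <- if (estimate < d_l + t_full * se) {
    estimate - t_full * se
  } else if (estimate <= d_l + t_half * se) {
    d_l
  } else if (estimate < d_u + t_half * se) {
    estimate - t_half * se
  } else {
    d_u
  }

  # Upper bound (Equation 2), one branch per case
  upper <- if (estimate <= d_l - t_half * se) {
    d_l
  } else if (estimate < d_u - t_half * se) {
    estimate + t_half * se
  } else if (estimate <= d_u - t_full * se) {
    d_u
  } else {
    estimate + t_full * se
  }

  c(lower = lower, upper = upper)
}

# Estimate 21, standard error 2.68, SESOI bounds of -15 and 15, 498 df:
# the lower bound is 15 and the upper bound is about 25.416
tst_ci(estimate = 21, se = 2.68, d_l = -15, d_u = 15, df = 498)
```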

In the software applications described below, both classic and TST CIs are computed automatically for convenience. When reporting TST results, we recommend plotting triple-banded CIs that show the $100 (1 - α)$ % classic CI, the $100 (1 - 2 α)$ % classic CI, and the $100 (1 - α)$ % TST CI in estimate visualizations (as in Figs. 4–8) and reporting the standard error and $100 (1 - α)$ % TST CI in text and tables (for examples, see Reporting TST Results).

Fig. 4.

ShinyTST example for an estimate of 21, standard error of 2.68, and smallest effect size of interest of 15.

Fig. 5.

ShinyTST example for an estimate of 10, standard error of 2.68, and smallest effect size of interest of 15.

Fig. 6.

ShinyTST example for an estimate of −20, standard error of 2.68, and smallest effect size of interest of 15.

Fig. 7.

ShinyTST example for an estimate of −12, standard error of 2.68, and smallest effect size of interest of 15.

Fig. 8.

A partitioning of the parameter space permitting a combination of three-sided testing (TST) and standard null hypothesis significance testing. Estimates are displayed along with $100 (1 - 2 α)$ % classic confidence intervals (CIs; thicker bands), $100 (1 - α)$ % classic CIs (thinner bands), and $100 (1 - α)$ % TST CIs (green bands). Four meaningful test regions can be rejected. $H_{-}$ and $H_{+}$ are both rejected if the $100 (1 - 2 α)$ % classic CI is bounded between $D_{L}$ and $D_{U}$ . If the $100 (1 - α)$ % classic CI is entirely bounded below $H_{0 +}$ (above $H_{0 -}$ ), then $H_{0 +}$ ( $H_{0 -}$ ) is rejected. In cases such as the top red estimate, one may not be able to conclude whether the effect is smaller or larger than $D$ but can establish the direction of the effect (negative in this case). Partitioning the parameter space in this way does not reduce power for the equivalence test; for example, the fourth relationship is still significantly bounded between $D_{L}$ and $D_{U}$ despite the fact that its $100 (1 - α)$ % CI crosses zero.

TST using the ShinyTST application

To enable as many readers as possible to apply TST in their own data analyses, we have developed a stand-alone, point-and-click application. The ShinyTST app requires four inputs and has one optional input. Required inputs include an estimate (e.g., a mean difference or a correlation coefficient), the standard error of that estimate, the SESOI (in the same units as the estimate), and the significance level $α$ . The optional input is the degrees of freedom for the test (which is required for exact tests rather than asymptotically approximate tests). Based on these inputs, the application computes the $100 (1 - α)$ % and $100 (1 - 2 α)$ % classic CIs, the $100 (1 - α)$ % TST CI, and TST test statistics and $p$ values, providing an appropriate conclusion based on the test results. ShinyTST is available online at https://jack-fitzgerald.shinyapps.io/shinyTST/. In this section, we demonstrate how to use ShinyTST and interpret the results it provides through several applied examples.

Consider again our hypothetical experiment on treatment options for insomnia. Based on differences in hours slept per night in each treatment condition, we want to decide if the novel ACT-I is practically inferior, equivalent, or superior to standard CBT-I treatment. We can conduct TST in the ShinyTST app to help us make this decision.

Example 1: a practically significant relationship

Suppose that in our experiment we observe that compared with the control group receiving CBT-I, patients treated with ACT-I sleep 21 more min per night at endline, with a standard error of 2.68 min. Our significance level is 5%, and as discussed in Example 1: Comparing Treatment Options for Insomnia, our SESOI is 15 min of sleep per night. We have 500 participants in our experiment, implying that our tests are conducted with 498 degrees of freedom.

Figure 4 displays where to input these parameters in ShinyTST and the resulting output. Following the TST procedure, we would conclude that this relationship is superior to our SESOI. The “Relevant test” output tells us that the relevant test to look at in this case is the superiority test because the observed estimate is more positive than the upper SESOI bound. The color of the “Relevant test” output signals which classic CI is relevant for drawing a conclusion. In this case, the red-colored 95% classic CI is the relevant one, and we can see in both the numeric output and the accompanying graph that the lower 95% classic CI bound is greater than the upper SESOI bound of 15 min of nighttime sleep. That is, the entire 95% classic CI falls above the SESOI. Accordingly, the test $p$ value falls below our $α$ of 5%. Note that ShinyTST has already doubled the superiority test’s one-sided $p$ value, so we do not need to multiply $p$ or divide $α$ by 2 in this case. This same functionality is provided in the tst function in the eqtesting R package and in the tsti Stata command. As reported in the output, these results indicate that the relationship is practically significant and positive. In other words, in this example, there is significant evidence that ACT-I is a superior insomnia treatment to CBT-I.
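As a rough hand check of this output, the 95% classic CI and the superiority-test statistic can be worked out directly. The sketch assumes that the .975 quantile of the $t$ distribution with 498 degrees of freedom is approximately 1.965:

$$\begin{aligned} \text{95\% classic CI} &= 21 \pm 1.965 \times 2.68 \approx [15.73,\ 26.27], \\ t_{\text{superiority}} &= \frac{21 - 15}{2.68} \approx 2.24. \end{aligned}$$

The lower bound of about 15.73 lies above the upper SESOI bound of 15, and doubling the one-sided $p$ value for $t \approx 2.24$ gives roughly .026, in line with the value given for this example in Reporting TST Results.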

Example 2: a relationship practically equal to zero

Now suppose instead that patients receiving ACT-I sleep 10 min more than the control group, keeping everything else the same as before. The estimate is smaller than the SESOI, but is it precise enough that we can confidently rule out practically meaningful differences in sleep between treatments? Figure 5 displays the TST results for this estimate.

In this case, we would conclude that the relationship is practically equal to zero. That is, the difference in average sleep times between ACT-I and CBT-I patients is significantly bounded within $\pm 15$ min of nighttime sleep. The “Relevant test” output indicates that the blue-colored 90% classic CI is the relevant CI for this test (because the point estimate is within the SESOI bounds). This 90% classic CI falls entirely within the SESOI bounds. The $p$ -value output now displays the $p$ value of the equivalence test, which is accordingly beneath 5%. As reported in the output, this indicates that the difference in average sleep times is practically equal to zero. In other words, in this example, ACT-I and CBT-I yield practically equivalent sleep improvements.

Example 3: inconclusive results

Now suppose we observe that at endline, patients receiving ACT-I sleep an average of 20 min less per night than patients receiving CBT-I, again holding all else constant. This estimate is more negative than our lower SESOI bound, but is it estimated precisely enough that we should conclude that ACT-I is a clinically irresponsible treatment? Figure 6 displays the TST results for this estimate.

In this case, we arrive at an inconclusive result. As in Figure 4, the “Relevant test” output indicates that the red-colored 95% classic CI is relevant for this estimate because the estimate is more negative than the lower SESOI bound. The $p$ -value output now displays the (adjusted) $p$ value of the inferiority test, which exceeds 5%. Accordingly, the 95% classic CI intersects one of the SESOI bounds (specifically the lower bound). As reported in the output, given a significance threshold of $α = 5 %$ , these results indicate that the practical significance of the difference in sleep times between ACT-I and CBT-I treatments is inconclusive.

Finally, suppose instead that ACT-I patients sleep 12 min less per night than CBT-I patients on average, again holding all else constant. As in Example 2, this estimate is smaller in magnitude than the SESOI. But is it precisely bounded enough that we should give clinicians free rein to select between ACT-I and CBT-I at will? Figure 7 displays TST results for this estimate.

This case also yields an inconclusive result. As in Figure 5, the “Relevant test” output indicates that the blue-colored 90% classic CI is relevant for this estimate because the estimate is between the SESOI bounds. This estimate’s 90% classic CI intersects one of the SESOI bounds, and the TST $p$ value accordingly exceeds 5%. Just as in the case displayed in Figure 6, results are inconclusive. In both cases, this implies that we do not yet have sufficient data to make certain conclusions about the relative efficacy of ACT-I and CBT-I for treating insomnia.

TST in R, Jamovi, and Stata

Although our point-and-click application is easy to use, statistical software is better for reproducible analyses. We therefore supplement the ShinyTST app and the tutorials in this article with examples of how to conduct TST in both R and Jamovi. These examples work with data from a case similar to that discussed in Example 1: A Practically Significant Relationship and, respectively, use the tst command in the eqtesting R package (Fitzgerald, 2025) and the TOSTER module in Jamovi (Caldwell, 2022; Lakens, 2017). We direct readers more familiar with Stata to the tsti command, which can be downloaded from Statistical Software Components.

Supplementary material SM1 (https://osf.io/6wbxh) demonstrates how to conduct TST in R using either four one-sided tests or CIs. In addition, the tutorial demonstrates how to conduct TST using the tst function in the eqtesting R package (Fitzgerald, 2025). The tst function is an R-based version of the ShinyTST app. Finally, the tutorial demonstrates how to adapt the tst function to your own data and how to retrieve the statistics used as input to tst from various statistical models in R.

Supplementary material SM2 (https://osf.io/ncxvg) provides a Jamovi file that contains the same simulated data as the R script and a TST analysis conducted using Jamovi’s TOSTER module. In the Jamovi file, the “Results” pane contains two identical TOST independent sample $t$ -test analyses. In the first analysis, the “Hypothesis” parameter is set to “Equivalence Test,” which will run the TOST procedure for the difference between groups. The 90% classic CI is given in the “Effect Sizes” table and can be visualized with the “Plot Effect Sizes” parameter. In the second analysis, the “Hypothesis” parameter is set to “Minimal Effects Test,” which will run an inferiority test (“TOST Lower” in the “TOST Results” table) and a superiority test (“TOST Upper” in the “TOST Results” table) for the difference between groups. In this analysis, the “Alpha level” parameter is set to .025 so that 95% classic CIs are computed in the table and plots.

Reporting TST Results

When reporting TST results, include the following information:

The relevant test (inferiority, equivalence, or superiority): This information is automatically provided by the ShinyTST app, the tst command in R, and the tsti command in Stata, but this can also be inferred directly by examining the point estimate’s relationship with the SESOI bounds.

The TST $p$ value: If the superiority (inferiority) test is the relevant test, then this is 2 times the one-sided $p$ value of the superiority (inferiority) test. The $p$ values reported by the ShinyTST app, the tst R command, and the tsti Stata command are already adjusted and do not need to be doubled. If the equivalence test is the relevant test, then the TST $p$ value is the larger of the two one-sided $p$ values for the equivalence tests. The TOST $p$ value reported by the ShinyTST app, the tst R command, and the tsti Stata command is the larger of these two equivalence-testing $p$ values.

The standard error and $100 (1 - α)$ % TST CI: This can be textually reported as the standard error and TST CI bounds, and the TST CI can be visually reported either alone or along with the $100 (1 - α)$ % and $100 (1 - 2 α)$ % classic CIs, as in the plots produced by the ShinyTST app.

The conclusion of the test: Interpret whether the relationship is practically significant and negative, practically equivalent to zero, or practically significant and positive, or whether the TST results are inconclusive. Any test region that lies entirely outside the TST CI can also be reported as rejected.

In addition to visualizing triple-banded CIs of the form in Figures 4 to 7, the superior TST results in Example 1: A Practically Significant Relationship could be reported as follows:

We conducted a three-sided test to assess whether ACT-I leads to at least 15 min more sleep per night than standard CBT-I care, which is the SESOI in this setting. Our experimental results imply that ACT-I yields significantly more than 15 min extra sleep on average compared with CBT-I ( $μ$ = 21, SE = 2.68, 95% TST CI = [15, 25.416]). The superiority test is the relevant test under TST, which yields a $p$ value of .026, implying that the difference is practically significant.

The practically negligible result in Example 2: A Relationship Practically Equal to Zero could instead be reported as follows:

Our experimental results imply that the difference in average nighttime sleep between patients treated with ACT-I and patients treated with CBT-I is significantly smaller than 15 min per night ( $μ$ = 10, SE = 2.68, 95% TST CI = [4.734, 15]). The equivalence test is the relevant test under TST, which yields a $p$ value of .032. This implies that compared with CBT-I, the difference in average nighttime sleep yielded by ACT-I is practically equal to zero.

The uncertain result reported first in Example 3: Inconclusive Results could be reported as follows:

Given our experimental results, the practical significance of sleep differences between ACT-I and CBT-I is inconclusive ( $μ$ = −20, SE = 2.68, 95% TST CI = [−24.416, −14.734]). The 95% TST CI falls completely below zero, so we conclude that ACT-I does not increase nighttime sleep relative to CBT-I. However, the inferiority test (the relevant test under TST) yields a $p$ value of .063. We therefore cannot conclude that ACT-I yields a practically significant decline in sleep compared with CBT-I. Additional research is required to determine whether ACT-I is a harmful or harmless replacement for CBT-I.

Power Analysis and Sample-Size Planning for TST

Best practices for computing necessary sample sizes for sufficient power in TST differ from those for computing power and sample sizes for equivalence testing or minimum-effect testing. Unlike these tests, the TST procedure tests several hypotheses at once. Consequently, having sufficient a priori power for conclusive TST results requires having sufficient power for all tests. Our recommendations for computing sample sizes for sufficient power thus differ depending on whether you expect to obtain an estimate larger or smaller in magnitude than the SESOI.

Appendix A in the Supplemental Material provides interested readers with a comprehensive guide on power analysis for TST. As in other power analyses, this requires researchers to have an informative a priori expectation of both the effect size they will observe and the standard deviation(s) of their outcome(s) of interest. Both quantities can be derived from either pilot studies or the results of similar previous studies. Standard deviations may also potentially be set using normative data on population-level variation in outcomes of interest; for an example, see Example 1: Statistical and Inferential Power for a Large Effect.

TST has more inferential power than both standard NHST and TOST. That is, researchers can reach more informative conclusions with TST than they can with standard NHST and TOST while simultaneously controlling error rates for these conclusions. The price one has to pay for this improved inferential power is lower statistical power to declare “significant” effects compared with standard NHST. To illustrate this, consider the following two examples.

Example 1: statistical and inferential power for a large effect

Returning to our insomnia treatment experiment from Example 1: Comparing Treatment Options for Insomnia, suppose that we expect that the difference in average nighttime sleep between ACT-I patients and CBT-I patients will be 30 min per night, double the SESOI. We expect from population-level data that nighttime sleep varies with a standard deviation of 65 min per night (Sivertsen et al., 2020). To calculate the required sample size for a superiority test in R, use the base power.t.test function with the delta parameter set to the difference between the effect size you want to power to and the upper SESOI bound:

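# Sample size for the superiority test: power to detect the anticipated
# effect relative to the upper SESOI bound. A two-sided test at 5% uses the
# same critical value as the one-sided superiority test at alpha / 2 = 2.5%.
sesoi = 15   # smallest effect size of interest (min per night)
effect = 30  # anticipated effect (min per night)
stdev = 65   # standard deviation of nighttime sleep (min per night)
power.t.test(
  delta = effect - sesoi,
  sd = stdev,
  sig.level = 0.05,
  power = 0.80,
  type = 'two.sample',
  alternative = 'two.sided')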

To have at least 80% power to detect a sleep effect of 30 min per night with an SESOI of 15 min per night, we would need 296 people in each group (power.t.test reports the sample size per group). Compare this with statistical power under standard NHST, in which our goal is simply to test whether the effect is different from zero:

# Sample size for standard NHST: power to detect the anticipated effect
# relative to zero
effect = 30  # anticipated effect (min per night)
stdev = 65   # standard deviation of nighttime sleep (min per night)
power.t.test(
  delta = effect,
  sd = stdev,
  sig.level = 0.05,
  power = 0.80,
  type = 'two.sample',
  alternative = 'two.sided')

For 80% power to detect this effect under standard NHST, we need only 75 people in each group. The larger the SESOI bounds are, the more underpowered TST is to detect “significant” effects relative to standard NHST (conditional on the SESOI being less than the anticipated effect size; superiority and inferiority tests naturally have zero power when the SESOI exceeds the anticipated effect size). However, with standard NHST, we can conclude only that the effect is larger than zero; we cannot formally conclude anything about the practical significance of the effect. In contrast, TST allows us to formally draw a conclusion about practical significance. Therefore, compared with standard NHST, TST sacrifices statistical power to achieve greater inferential power.

Note that here, the TOST procedure has zero statistical power because the effect size is presumed to fall outside of the SESOI bounds. The TOST procedure never yields statistically significant results for effect sizes larger than the SESOI; so in this example, TST yields no additional gains in inferential power to conclude that effects are practically negligible (although this is not a loss relative to standard NHST because standard NHST also does not allow researchers to formally test this conclusion). In what follows, we provide an example in which TST does yield additional inferential power in this way.
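To see how the sample size required for the superiority test grows as the SESOI approaches the anticipated effect, the power.t.test call from this example can be repeated over a grid of SESOI values. This is a sketch only: the grid of values is a hypothetical choice, and an SESOI of 0 corresponds to standard NHST.

```r
effect <- 30  # anticipated effect (min per night)
stdev  <- 65  # standard deviation of nighttime sleep (min per night)
sesoi_grid <- c(0, 5, 10, 15, 20)  # hypothetical SESOI values; 0 is standard NHST

# Per-group sample size for 80% power at each SESOI value
n_per_group <- sapply(sesoi_grid, function(sesoi) {
  power.t.test(delta = effect - sesoi, sd = stdev, sig.level = 0.05,
               power = 0.80, type = "two.sample",
               alternative = "two.sided")$n
})

# Round up to whole participants
data.frame(sesoi = sesoi_grid, n_per_group = ceiling(n_per_group))
```

The required sample size rises steeply as the gap between the anticipated effect and the SESOI shrinks, because the test must then detect a smaller distance with the same outcome variability.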

Example 2: statistical and inferential power for a practically negligible effect

Suppose instead that we expect no differences in nighttime sleep between ACT-I and CBT-I patients (keeping everything else the same as before). In this example, we expect to observe an effect bounded within the SESOI bounds. To calculate the required sample size for an equivalence test in R, we use the power_t_TOST function in the TOSTER package (see Lakens & Caldwell, 2025):

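# Sample size for the equivalence test (TOST) with symmetric bounds of
# plus or minus the SESOI around zero
library(TOSTER)
sesoi = 15   # smallest effect size of interest (min per night)
effect = 0   # anticipated effect (min per night)
stdev = 65   # standard deviation of nighttime sleep (min per night)
power_t_TOST(
  delta = effect,
  sd = stdev,
  eqb = sesoi,
  alpha = 0.05,
  power = 0.80,
  type = 'two.sample')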

Given an SESOI of 15 min of nighttime sleep, to have at least 80% power to conclude that an observed difference in average sleep times of 0 min per night is practically equivalent to 0 under TST, we would need 323 people in each group. Note that this is identical to the number of people we would need to conclude that such a relationship is practically negligible under standard equivalence testing. The equivalence-test component of TST is identical to the TOST procedure, so statistical power to conclude that an effect is practically negligible is also identical between TST and the TOST procedure. This is why we say that TST is a uniform improvement over the TOST procedure. It has the same statistical power as the TOST procedure, but it has increased inferential power because TST can also reach conclusions about practically meaningful effects through inferiority and superiority tests, which the TOST procedure alone cannot.

Note also that standard NHST has neither statistical nor inferential power when the true relationship is zero. Standard NHST cannot reach the correct conclusion when there is no relationship, which is why many researchers turn to equivalence testing in the first place. TST has greater inferential power than standard NHST for both null and nonzero relationships.
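For readers who want to see how equivalence-test power behaves as the sample size changes without installing additional packages, the following base R sketch approximates TOST power for a true difference of zero in a two-group design with equal group sizes. The function name tost_power_approx is hypothetical, and the formula is a simple lower-bound approximation built from the noncentral $t$ distribution, not the routine that TOSTER uses.

```r
# Approximate TOST power when the true difference is zero
# (tost_power_approx is a hypothetical name; n is the per-group sample size).
tost_power_approx <- function(n, sd, eqb, alpha = 0.05) {
  se <- sd * sqrt(2 / n)  # standard error of the difference in means
  df <- 2 * n - 2
  t_crit <- qt(1 - alpha, df)
  # Probability that a single one-sided equivalence test rejects
  p_one <- pt(t_crit, df, ncp = eqb / se, lower.tail = FALSE)
  # Both tests must reject; by symmetry this is at least 2 * p_one - 1
  max(0, 2 * p_one - 1)
}

# Hypothetical grid of per-group sample sizes, with the SD and SESOI used above
sapply(c(100, 200, 300, 400), tost_power_approx, sd = 65, eqb = 15)
```

The approximation is tight when the standard error is small relative to the SESOI and is conservative otherwise, so it is better suited to building intuition than to final sample-size planning.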

TST in Complex Statistical Designs

The examples above and in the supplementary files all demonstrate how to conduct TST for relatively simple models such as $t$ tests. However, it is straightforward to apply TST to more complicated designs as well. So long as the statistical model can meaningfully produce an estimate and a standard error for a parameter that can be reasonably assumed to be Student $t$ or normally distributed, TST can be conducted. This includes models such as linear regression with covariates, logistic regression, multilevel models, and structural equation models. For example, the lm() and glm() functions in base R and the sem() function in the lavaan R package (Rosseel, 2012) provide output estimates, standard errors, and degrees of freedom for all regression coefficients by default. Regardless of model complexity, all one needs to do is take the relevant statistics from the model output and plug them into ShinyTST, the tst function in the eqtesting R package, or the tsti Stata command. These functions and applications also work with standardized coefficients, such as correlation coefficients and Cohen’s $d$ values; simply plug the standardized estimate, standard error of the standardized estimate, and degrees of freedom into the app/function.
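As an illustration of pulling these inputs from a fitted model, the sketch below simulates data, fits a linear regression with a covariate using lm(), and extracts the estimate, standard error, and residual degrees of freedom for the treatment coefficient. The data, variable names, and SESOI of 15 are hypothetical and serve only to show the extraction step.

```r
set.seed(2025)  # simulated data, for illustration only
n_per_group <- 250
treated <- rep(c(0, 1), each = n_per_group)
age <- rnorm(2 * n_per_group, mean = 45, sd = 12)
sleep <- 400 + 10 * treated + 0.5 * age + rnorm(2 * n_per_group, sd = 65)

fit <- lm(sleep ~ treated + age)

# Inputs needed for TST: estimate, standard error, and degrees of freedom
coefs <- coef(summary(fit))
estimate <- coefs["treated", "Estimate"]
se <- coefs["treated", "Std. Error"]
df <- df.residual(fit)

# Classic CIs to compare with the SESOI bounds of -15 and 15
alpha <- 0.05
ci_95 <- estimate + c(-1, 1) * qt(1 - alpha / 2, df) * se
ci_90 <- estimate + c(-1, 1) * qt(1 - alpha, df) * se

c(estimate = estimate, se = se, df = df)
rbind(ci_90 = ci_90, ci_95 = ci_95)
```

The three extracted values are what ShinyTST, tst, or tsti take as input; the CIs can also be compared with the SESOI bounds by hand as described in TST Using Confidence Intervals.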

When conducting exploratory analyses in large data sets with many variables, one should keep in mind that TST requires correction for multiple comparisons in just the same way that standard NHST does. When testing one single relationship, there is no need to correct for multiple-hypothesis testing across the superiority, inferiority, and equivalence tests in TST because the size of the union of these tests is controlled at nominal levels (provided that one employs the significance thresholds discussed in TST Using One-Sided Tests and TST Using Confidence Intervals). However, across tests of multiple relationships, each execution of TST is independent, and correction for multiple comparisons across the TST executions is therefore required.
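One simple way to apply such a correction is to adjust the TST $p$ values from the separate relationships before comparing them with $α$. The sketch below uses a Bonferroni adjustment via the base R function p.adjust; the choice of Bonferroni is an assumption made here for illustration, and the four $p$ values and their names are hypothetical.

```r
# Hypothetical TST p values from four separate relationships
tst_p <- c(outcome_a = 0.004, outcome_b = 0.026,
           outcome_c = 0.032, outcome_d = 0.210)
alpha <- 0.05

# Bonferroni adjustment across the four TST executions
p_adjusted <- p.adjust(tst_p, method = "bonferroni")

# TRUE marks relationships whose TST conclusion survives the correction
p_adjusted < alpha
```

Comparing a Bonferroni-adjusted TST $p$ value with $α$ is equivalent to running each TST at level $α$ divided by the number of relationships tested.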

Combining TST With NHST

If we expect the SESOI to change in the future, then it may be worth combining TST with standard NHST. Returning to our experiment on anchoring sales tactics in Example 2: Behavioral Research on Sales Tactics, the company’s SESOI for branch-level sales may increase over time as the company grows in size. Likewise, the smallest practically meaningful change in branch-level sales may decline in periods when the company experiences financial distress. It thus may be useful to record whether the anchoring tactic has some nonzero effect on sales to inform future experiments and policies.

To combine TST and standard NHST, simply add a two-sided test against zero to the TST procedure outlined in this article. Figure 8 shows how this alteration augments the TST procedure. This effectively partitions the $H_{E q}$ region in Figure 2d into two parts: $H_{0 -}$ (effect sizes greater than $D_{L}$ but less than zero) and $H_{0 +}$ (effects greater than zero but less than $D_{U}$ ). As a result, combining TST and standard NHST to test a single relationship does not require additional multiple-testing corrections and can thus be performed with no loss to statistical power or error-rate control (see Goeman et al., 2010). Whether the additional nuance afforded by standard NHST is of any interest must be for the individual researcher to decide. If the justification for the SESOI is strong, then it is not as interesting to know if the effect is different from zero. In such cases, TST should replace standard NHST as the default frequentist testing procedure. Even if TST and standard NHST results are reported together, the TST results should take precedence in the results and discussion.
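The combined procedure can be sketched by adding a two-sided test against zero to the CI-based decision rule. In the base R sketch below, the function name tst_with_nhst and its output labels are hypothetical, and the SESOI bounds are assumed to lie on opposite sides of zero.

```r
# Sketch: TST conclusion plus a two-sided NHST against zero
# (tst_with_nhst is a hypothetical name; assumes d_l < 0 < d_u).
tst_with_nhst <- function(estimate, se, d_l, d_u, df, alpha = 0.05) {
  ci_95 <- estimate + c(-1, 1) * qt(1 - alpha / 2, df) * se
  ci_90 <- estimate + c(-1, 1) * qt(1 - alpha, df) * se

  # TST step: practical significance relative to the SESOI bounds
  practical <- if (ci_95[2] < d_l) {
    "negative"
  } else if (ci_95[1] > d_u) {
    "positive"
  } else if (ci_90[1] > d_l && ci_90[2] < d_u) {
    "equal to zero"
  } else {
    "inconclusive"
  }

  # Added NHST step: direction of the effect relative to zero
  direction <- if (ci_95[2] < 0) {
    "negative"
  } else if (ci_95[1] > 0) {
    "positive"
  } else {
    "not established"
  }

  c(practical_significance = practical, direction = direction)
}

# Estimate of -20 with standard error 2.68, SESOI of 15, and 498 df
tst_with_nhst(estimate = -20, se = 2.68, d_l = -15, d_u = 15, df = 498)
```

For this estimate the practical significance is inconclusive, but the direction is negative, which is the kind of additional nuance the combined procedure provides.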

Conclusion

Whenever researchers can specify a meaningful SESOI, TST is a superior testing procedure compared with both standard NHST and TOST procedures. TST allows one to detect significant evidence that a relationship is practically significant or practically negligible and to test more meaningful predictions than standard NHST (see also Lakens, 2022; Meehl, 1967). Researchers should therefore strongly consider TST as their default frequentist test procedure if they can specify a meaningful SESOI.

Researchers conducting quantitative analyses virtually always wish to make concrete statements about whether their estimates “matter.” Researchers have historically done this by leaning on the precision guarantees offered by standard NHST, labeling estimates that can be precisely bounded away from zero as “statistically significant.” However, this practice somewhat abuses the definition of the word “significant” and conflates estimates’ practical significance with their precision. This conflation between precision and practical significance means that null estimates become conflated with imprecise estimates, which is a key motivation behind publication bias against null results (Fitzgerald, 2025).

TST addresses this by separating estimates’ precision from their significance. In TST, the significance of estimates is determined by their relationship with the SESOI, and the precision of the estimate is judged by the formal hypothesis tests in the TST framework. In so doing, TST allows researchers to more credibly distinguish which estimates are practically meaningful and which ones are practically negligible.

Footnotes

Acknowledgements

We thank Jelle Goeman for helpful comments and feedback. All errors are our own. The anchoring-tactic example in this article was first suggested by ChatGPT.

Transparency

Action Editor: David A. Sbarra

Editor: David A. Sbarra

Author Contributions

Peder Mortvedt Isager: Conceptualization, Project administration, Resources, Validation, Visualization, Writing – original draft, Writing – review & editing.

Jack Fitzgerald: Conceptualization, Project administration, Resources, Software, Visualization, Writing – review & editing.

ORCID iD

Peder Mortvedt Isager

Supplemental Material

Additional supporting information can be found at

References

1. Aczel, Palfi, Szollosi, Kovacs, Szaszi, Szecsi, Zrubka, Gronau, Q. F., van den Bergh, Wagenmakers, E. J. (2018). Quantifying support for the null hypothesis in psychology: An empirical investigation. Advances in Methods and Practices in Psychological Science, 1(3), 357–366. https://doi.org/10.1177/2515245918773742

2. Anvari, Lakens (2021). Using anchor-based methods to determine the smallest effect size of interest. Journal of Experimental Social Psychology, 96, Article 104159. https://doi.org/10.1016/j.jesp.2021.104159

3. Bakker, Cai, English, Kaiser, Mesa, Van Dooren (2019). Beyond small, medium, or large: Points of consideration when interpreting effect sizes. Educational Studies in Mathematics, 102(1), 1–8. https://doi.org/10.1007/s10649-019-09908-4

4. Berger, R. L., Hsu, J. C. (1996). Bioequivalence trials, intersection-union tests and equivalence confidence sets. Statistical Science, 11(4), 283–319. https://doi.org/10.1214/ss/1032280304

5. Brañas-Garza, Estepa-Mohedano, Jorrat, Orozco, Rascón-Ramírez (2021). To pay or not to pay: Measuring risk preferences in lab and field. Judgment and Decision Making, 16(5), 1290–1313. https://doi.org/10.1017/s1930297500008433

6. Brodeur, Carrell, Figlio, Lusher (2023). Unpacking p-hacking and publication bias. American Economic Review, 113(11), 2974–3002. https://doi.org/10.1257/aer.20210795

7. Brodeur, Mikola, Cook, Brailey, Briggs, de Gendre, Dupraz, Fiala, Gabani, Gauriot, Haddad, McWay, Levin, Johannesson, Metson, Kinge, J. M., Tian, Wochner, Mishra, . . . McManus (2024). Mass reproducibility and replicability: A new hope (Discussion Paper Series No. 107). Institute for Replication. https://hdl.handle.net/10419/289437

8. Caldwell (2022). Exploring equivalence testing with the updated TOSTER R package. PsyArXiv. https://doi.org/10.31234/osf.io/ty8de

9. Camerer, C. F., Dreber, Forsell, Ho, T. H., Huber, Johannesson, Kirchler, Almenberg, Altmejd, Chan, Heikensten, Holzmeister, Imai, Isaksson, Nave, Pfeiffer, Razen (2016). Evaluating replicability of laboratory experiments in economics. Science, 351(6280), 1433–1436. https://doi.org/10.1126/science.aaf0918

10. Camerer, C. F., Dreber, Holzmeister, Ho, T. H., Huber, Johannesson, Kirchler, Nave, Nosek, B. A., Pfeiffer, Altmejd, Buttrick, Chan, Chen, Forsell, Gampa, Heikensten, Hummer, Imai, . . . Wu (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2(9), 637–644. https://doi.org/10.1038/s41562-018-0399-z

11. Campbell, Gustafson (2018). Conditional equivalence testing: An alternative remedy for publication bias. PLOS ONE, 13(4), Article e0195145. https://doi.org/10.1371/journal.pone.0195145

12. Cohen (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Routledge. https://doi.org/10.4324/9780203771587

13. Devji, Carrasco-Labra, Guyatt (2020). Mind the methods of determining minimal important differences: Three critical issues to consider. Evidence Based Mental Health, 24(2), 77–81. https://doi.org/10.1136/ebmental-2020-300164

14. Ferreira, M. L., Herbert, R. D., Ferreira, P. H., Latimer, Ostelo, R. W., Nascimento, D. P., Smeets, R. J. (2012). A critical review of methods used to determine the smallest worthwhile effect of interventions for low back pain. Journal of Clinical Epidemiology, 65(3), 253–261. https://doi.org/10.1016/j.jclinepi.2011.06.018

15. Fitzgerald (2025). The need for equivalence testing in economics. MetaArXiv. https://doi.org/10.31222/osf.io/d7sqr_v2

16. Franco, Malhotra, Simonovits (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345(6203), 1502–1505. https://doi.org/10.1126/science.1255484

17. Funder, D. C., Ozer, D. J. (2019). Evaluating effect size in psychological research: Sense and nonsense. Advances in Methods and Practices in Psychological Science, 2(2), 156–168. https://doi.org/10.1177/2515245919847202

18. Gates, Ealing (2019). Reporting and interpretation of results from clinical trials that did not claim a treatment difference: Survey of four general medical journals. BMJ Open, 9(9), Article e024785. https://doi.org/10.1136/bmjopen-2018-024785

19. Gelman, Carlin (2014). Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641–651. https://doi.org/10.1177/1745691614551642

20. Gignac, G. E., Szodorai, E. T. (2016). Effect size guidelines for individual differences researchers. Personality and Individual Differences, 102, 74–78. https://doi.org/10.1016/j.paid.2016.06.069

21. Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5(10), 3–8. https://doi.org/10.3102/0013189x005010003

22. Goeman, J. J., Solari, Stijnen (2010). Three-sided hypothesis testing: Simultaneous testing of superiority, equivalence and inferiority. Statistics in Medicine, 29(20), 2117–2125. https://doi.org/10.1002/sim.4002

23. Greenland, Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, Goodman, S. N., Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3

24. Hedges, L. V. (1981). Distribution theory for Glass’s estimator of effect size and related estimators. Journal of Educational Statistics, 6(2), 107–128. https://doi.org/10.2307/1164588

25. Jané, M. B., Xiao, Yeung, Ben-Shachar, M. S., Caldwell, Cousineau, Dunleavy, D. J., Elsherif, Johnson, B. T., Moreau, Riesthuis, Röseler, Steele, Vieira, F. F., Zloteanu, Feldman (2024). Guide to effect sizes and confidence intervals. OSF. https://doi.org/10.17605/OSF.IO/D8C4G

26. Kraft, M. A. (2020). Interpreting effect sizes of education interventions. Educational Researcher, 49(4), 241–253. https://doi.org/10.3102/0013189X20912798

27. Kroenke, Spitzer, R. L., Williams, J. B. (2001). The PHQ-9: Validity of a brief depression severity measure. Journal of General Internal Medicine, 16(9), 606–613.

28. Lakens (2017). Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. Social Psychological and Personality Science, 8(4), 355–362. https://doi.org/10.1177/1948550617697177

29. Lakens (2022). Improving your statistical inferences. Zenodo. https://doi.org/10.5281/ZENODO.6409077

30. Lakens, Caldwell (2025). TOSTER: Two One-Sided Tests (TOST) equivalence testing (Version 0.8.6) [Computer software]. https://cran.r-project.org/web/packages/TOSTER/index.html

31. Lakens, Scheel, A. M., Isager, P. M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259–269. https://doi.org/10.1177/2515245918770963

32. Lovakov, Agadullina, E. R. (2021). Empirically derived guidelines for effect size interpretation in social psychology. European Journal of Social Psychology, 51(3), 485–504. https://doi.org/10.1002/ejsp.2752

33. Lydick, Epstein, R. S. (1993). Interpretation of quality of life changes. Quality of Life Research, 2(3), 221–226. https://doi.org/10.1007/bf00435226

34. Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103–115. https://doi.org/10.1086/288135

35. Moniz, Druckman, J. N., Freese (2025). The file drawer problem in social science survey experiments. Proceedings of the National Academy of Sciences, 122(12), Article e2426937122. https://doi.org/10.1073/pnas.2426937122

36. Murphy, K. R., Myors (1999). Testing the hypothesis that treatments have negligible effects: Minimum-effect tests in the general linear model. Journal of Applied Psychology, 84(2), 234–248. https://doi.org/10.1037/0021-9010.84.2.234

37. Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), Article aac4716. https://doi.org/10.1126/science.aac4716

38. Paterson, T. A., Harms, P. D., Steel, Credé (2016). An assessment of the magnitude of effect sizes. Journal of Leadership & Organizational Studies, 23(1), 66–81. https://doi.org/10.1177/1548051815614321

39. Ragland, D. R. (1992). Dichotomizing continuous outcome variables: Dependence of the magnitude of association and statistical power on the cutpoint. Epidemiology, 3(5), 434–440. https://doi.org/10.1097/00001648-199209000-00009

40. Rosseel (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36. https://doi.org/10.18637/jss.v048.i02

41. Sivertsen, Pallesen, Friborg, Nilsen, K. B., Bakke, Ø. K., Goll, J. B., Hopstock, L. A. (2020). Sleep patterns and insomnia in a large population-based study of middle-aged and older adults: The Tromsø study 2015–2016. Journal of Sleep Research, 30(1), Article e13095. https://doi.org/10.1111/jsr.13095

42. Wilkinson, & American Psychological Association Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54(8), 594–604. https://doi.org/10.1037/0003-066x.54.8.594

43. Yang, Sánchez-Tójar, O’Dea, R. E., Noble, D. W., Koricheva, Jennions, M. D., Parker, T. H., Lagisz, Nakagawa (2023). Publication bias impacts on effect size, statistical power, and magnitude (Type M) and sign (Type S) errors in ecology and evolutionary biology. BMC Biology, 21(1), Article 71. https://doi.org/10.1186/s12915-022-01485-y
