Abstract

The success of preclinical research hinges on exploratory and confirmatory animal studies. Traditional null hypothesis significance testing is a common approach to eliminate the chaff from a collection of drugs, so that only the most promising treatments are funneled through to clinical research phases. Balancing the number of false discoveries and false omissions is an important aspect to consider during this process. In this paper, we compare several preclinical research pipelines, either based on null hypothesis significance testing or based on Bayesian statistical decision criteria. We build on a recently published large-scale meta-analysis of reported effect sizes in preclinical animal research and elicit a non-informative prior distribution under which both approaches are compared. After correcting for publication bias and shrinkage of effect sizes in replication studies, simulations show that (i) a shift towards statistical approaches which explicitly incorporate the minimum clinically important difference reduces the false discovery rate of frequentist approaches and (ii) a shift towards Bayesian statistical decision criteria can improve the reliability of preclinical animal research by reducing the number of false-positive findings. These benefits hold while the number of experimental units required for a confirmatory follow-up study is kept low. Results show that Bayesian statistical decision criteria can help in improving the reliability of preclinical animal research and should be considered more frequently in practice.

Keywords

Preclinical research, reliability of research, Bayesian inference, minimum effect test, null hypothesis significance testing

1. Introduction

Preclinical research is necessary to identify drugs and treatments for further investigation in clinical trials. According to Kimmelman et al.¹, there are two separate approaches which exist to identify promising treatments. Exploratory research is usually the core operating model in early preclinical research where the goal is to filter effective drugs and treatments from a large pool of (possibly ineffective) candidates. The primary target of exploratory research is to generate hypotheses which are then examined in a replication study in the second scientific operating model, which is confirmatory research.²

Testing the hypotheses generated in exploratory research in subsequent confirmatory research implies that the evidence about a new drug or treatment is more robust compared to when no confirmatory or replication attempt is undertaken.³ However, there are pitfalls to consider. First, there is always the risk of missing an effective treatment in exploratory research because it fails to meet a required statistical decision threshold. Second, there is the risk of running a confirmatory replication study although the exploratory research finding is a false-positive result.

A common approach to filter promising drugs and treatments in exploratory medical research for further investigation is the p-value which is the key statistic in null hypothesis significance testing (NHST).⁴ The merits and limitations of p-values have been debated widely for years,⁵ and some have argued to focus on effect size estimation instead of relying on the concept of statistical significance.^6,7 According to these approaches, the minimum clinically important difference (MCID) and the scientific or clinical relevance of research results should be placed in the center of a statistical analysis and trial design.⁸ Recently, Danziger et al.⁹ have shown that focussing on the smallest effect size of interest (SESOI)—which is equivalent to the MCID—instead of focusing on statistically significant results can improve the discovery rates of truly effective treatments in preclinical research.

The situation is complicated by the fact that the interplay between exploratory and confirmatory research influences the resulting characteristics of a preclinical animal research pipeline. Regarding the latter, we do not denote the entire process of in vitro and in vivo studies as a preclinical research pipeline. Instead, we denote by it solely an exploratory preclinical animal study and a subsequent confirmatory study. The latter could, for example, investigate toxicity, pharmacokinetics, or dose–response relationships.

While sensitive decision criteria in exploratory research should detect a large number of possibly effective candidates, confirmatory studies must aim at reducing the number of false-positive results to ensure that only truly effective treatments transition to clinical studies. The statistical decision criteria employed in this process therefore strongly influence the resulting performance of a preclinical research pipeline in terms of statistical quantities such as the associated positive and negative predictive value, false discovery rate and false omission rate.

While Danziger et al.⁹ follow a p-value-based approach to improve the reliability of preclinical research there are various authors who have advocated Bayesian decision criteria (BDCs) to address the reliability of biomedical research.^10,11 Examples include reverse-Bayes methods¹² and Bayesian alternatives to p-values.^13–15 Whether a Bayesian approach is taken or not, transparent reporting of statistical analyses as well as the refinement of experimental design is a key requirement to optimize the reliability of preclinical research,¹⁶ and in this article, we focus on two approaches to filter effective drugs and treatments in exploratory preclinical research for transitioning to a confirmatory replication study. The performance of two preclinical research pipelines is assessed and compared, showing that a methodological shift to methods which incorporate the smallest effect size of interest can improve the reliability of positive preclinical research findings.

Typically, preclinical studies are not very large. Still, these studies have to provide information on drug dosing and the associated toxicity levels. Once this information is obtained, researchers review their findings and decide whether the drug should be tested in humans.¹⁷ Here, we make the following assumptions:

We make no restrictions on whether a dose-finding preclinical study is conducted, a study which gathers information about the pharmacokinetics of the drug, or a study which studies the toxicity of the new agent. We label these preclinical animal studies exploratory studies henceforth and assume that to decide whether or not to transition to a phase I clinical trial, a replication study is carried out first.¹⁸

We do, however, make the assumption that the exploratory study is already a phase I in vivo study involving animals. This aligns our setup with the ones considered by Danziger et al.,⁹ Bonapersona et al.,¹¹ and Anderson et al.⁴

We assume that the primary endpoint of the study is continuous.

Although the last point is a somewhat simplifying assumption, it allows us to isolate the findings of this article without distractions due to a more complicated design with possibly multiple (and non-continuous) endpoints. Also, based on the meta-analysis of Carneiro et al.¹⁹ and Bonapersona et al.,¹¹ it focuses on one of the most common scenarios in preclinical animal studies.

An important further assumption is that we do not focus on adaptive designs such as group-sequential designs which allow for early stopping due to futility or efficacy based on interim analyses. While there are important advantages of adaptive designs such as yielding smaller sample sizes and solving ethical issues (e.g. it might not be feasible to conduct an independent replication study involving animal use after an exploratory study has already made use of animals), Majid et al.²⁰ note that “Approval to use animal subjects often requires the number of animals and methods to be specified beforehand. (…) Changes to animal use regulations may therefore be necessary in order to incorporate adaptive designs.” Furthermore, a more problematic aspect is the dearth of accessible software packages which implement adaptive designs, in particular, for Bayesian methods. Majid et al.²⁰ note that “Although a few packages exist that can handle the necessary calculations, most statistical software currently in use is inadequate for the task.” Nevertheless, adaptive designs, in particular group-sequential designs, can substantially increase the efficiency of a design (see, for example, Neumann et al.²¹) and should be used more often also in preclinical animal research. Here, however, we refrain from investigating adaptive designs due to the two reasons outlined above and stick to fixed sample size designs. This is also crucial as sample size calculations for the Bayesian replication studies build on the fixed sample size designs.

The plan of the article is as follows: The next section details the different approaches which are compared in this article. Section 3 then details the BDCs which are analyzed as possible alternatives to a p-value-based preclinical research pipeline. Section 4 provides information about the simulation study which was conducted to assess the statistical performance of the competing approaches, and Section 5 presents the results. The last section provides a discussion and conclusion.

2. Two competing preclinical research pipelines

2.1. Statistical significance pipeline

The first preclinical research pipeline is based on the p-value and statistical significance. Therefore, it is assumed that an exploratory study is performed at a research laboratory and is analyzed via NHST. According to the literature reviews of Carneiro et al.¹⁹ and Bonapersona et al.¹¹, the comparison of two groups by means of Welch’s two-sample t-test under the assumption that data in each group is normal is a widely established statistical method in preclinical animal research. Therefore, we assume that the primary study endpoint is measured in the treatment and control groups as

$$X_{i}^{\mathrm{exp}} \sim N(μ_{1}, σ_{1}^{2}) \quad \text{and} \quad Y_{i}^{\mathrm{exp}} \sim N(μ_{2}, σ_{2}^{2})$$

where $X_{i}^{\mathrm{exp}}$, $i = 1, \dots, n_{1}$, denote the observations in the treatment group and $Y_{i}^{\mathrm{exp}}$, $i = 1, \dots, n_{1}$, the observations in the control group. Thus, we solely treat balanced designs and the choice of $n_{1}$ will be based on the typical sample sizes in preclinical animal research, see the “Methods” section. This choice is motivated by the relationship to larger power in balanced designs when performing parametric tests such as Welch’s two-sample t-test, compare Matthews.²² We admit that this is a somewhat simplified assumption, and imbalanced group sizes can weaken the performance metrics further. As a consequence, the metrics reported later for both trajectories are ideal in the sense that in practice, imbalances in group randomization can lead to smaller power, for example.

The decision criterion of the first preclinical research pipeline is the p-value, and the test level $α = 0.025$ is employed for transitioning to a confirmatory study in this research trajectory, as a one-sided test is carried out. We thus assume that researchers have an expectation about the direction of the effect, which seems justified. If the p-value does not pass the threshold $p < α$, the exploratory study is recorded as a failure; otherwise, a sample size calculation is performed for the replication study. To this end, the effect size estimate $\hat{δ}(y_{\mathrm{exp}})$ between both groups, based on the exploratory data $y_{\mathrm{exp}} := (X_{\mathrm{exp}}, Y_{\mathrm{exp}})$ with $X_{\mathrm{exp}} := (X_{1}^{\mathrm{exp}}, \dots, X_{n_{1}}^{\mathrm{exp}})$ and $Y_{\mathrm{exp}} := (Y_{1}^{\mathrm{exp}}, \dots, Y_{n_{1}}^{\mathrm{exp}})$, is used to calculate the required sample size $n_{2}$ for the replication study. The replication study with sample size $n_{2}$ is then assessed again by Welch’s two-sample t-test at level $α := 0.025$. The primary endpoint is assumed to be measured again in the treatment and control groups as

$$X_{i}^{\mathrm{con}} \sim N(μ_{1}, σ_{1}^{2}) \quad \text{and} \quad Y_{i}^{\mathrm{con}} \sim N(μ_{2}, σ_{2}^{2})$$

where $X_{i}^{\mathrm{con}}$, $i = 1, \dots, n_{2}$, denote the observations in the treatment group and $Y_{i}^{\mathrm{con}}$, $i = 1, \dots, n_{2}$, the observations in the control group. The confirmatory replication study thus consists of data $y_{\mathrm{con}} := (X_{\mathrm{con}}, Y_{\mathrm{con}})$, where $X_{\mathrm{con}} := (X_{1}^{\mathrm{con}}, \dots, X_{n_{2}}^{\mathrm{con}})$ and $Y_{\mathrm{con}} := (Y_{1}^{\mathrm{con}}, \dots, Y_{n_{2}}^{\mathrm{con}})$.

If $p < α$ , the candidate drug is judged as effective and the research trajectory was successful. Else, if $p \geq α$ , the result of the exploratory study cannot be replicated and the trajectory counts as a failure. The p-value-based research pipeline thus makes use of the two-trials rule, which asserts replication success when two significant hypothesis tests are found both in the original and replication study.^23,24
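The step from the exploratory effect size estimate to the replication sample size $n_{2}$ can be sketched as follows. This is a minimal illustration and not necessarily the calculation used in the simulation study: it assumes the textbook normal-approximation formula for a one-sided two-sample comparison with equal group sizes, and a target power of 0.8, which is an assumed value. The function name `replication_sample_size` is hypothetical.

```python
import math
from statistics import NormalDist


def replication_sample_size(d_hat: float, alpha: float = 0.025, power: float = 0.8) -> int:
    """Per-group sample size for a one-sided two-sample test (normal approximation).

    d_hat : standardized effect size estimate from the exploratory study
    alpha : one-sided test level
    power : target power of the replication study (assumed value, not from the text)
    """
    if d_hat <= 0:
        raise ValueError("a positive effect size estimate is required")
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    # n per group = 2 * (z_{1-alpha} + z_{power})^2 / d^2, rounded up
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d_hat ** 2)


if __name__ == "__main__":
    for d in (0.5, 1.0, 1.5):
        print(f"d_hat = {d}: n2 per group = {replication_sample_size(d)}")
```

Because the estimate enters as $1 / \hat{δ}^{2}$, an exaggerated exploratory estimate translates directly into a smaller planned replication study.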

The p-value in Welch’s two-sample t-test does not aim at investigating the smallest effect size of interest, however. Therefore, the two one-sided tests (TOST) procedure is conducted as an alternative. Details about the TOST procedure are given by Lakens et al.,²⁵ Shieh,²⁶ and Kelter.¹⁵ Importantly, as the goal here is not to establish equivalence but to guarantee a minimum effect of interest, the terminology minimum effects test (MET) or superiority test is more appropriate.

Welch’s t-test and the MET used here differ in the hypothesis they test. Welch’s t-test can only reject the null hypothesis $H_{0} : δ = 0$ of zero effect size. The MET in this setting reduces to the test of the null hypothesis $H_{0} : δ \leq δ_{0}$ for a prespecified minimum effect size of interest $δ_{0} > 0$ against $H_{1} : δ > δ_{0}$ . Thus, the MET essentially is a shifted one-sided test. Further details are provided by Lakens et al.²⁵
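The difference between the two tests can be made concrete with a small sketch. It computes a one-sided Welch test and the same test shifted by a margin, using only the Python standard library. Several parts are assumptions for illustration: the helper names, the data, the Simpson-rule evaluation of the t distribution, and the conversion of the standardized margin $δ_{0}$ to the raw scale by multiplying with the pooled sample standard deviation.

```python
import math
import statistics


def t_sf(t: float, df: float, steps: int = 2000) -> float:
    """Upper-tail probability P(T > t) of Student's t, via Simpson's rule on [0, |t|]."""
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)

    def pdf(u: float) -> float:
        return const * (1 + u * u / df) ** (-(df + 1) / 2)

    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(a)
    for k in range(1, steps):
        area += (4 if k % 2 else 2) * pdf(k * h)
    area *= h / 3
    tail = 0.5 - area if t >= 0 else 0.5 + area
    return max(0.0, tail)


def pooled_sd(x, y) -> float:
    """Pooled sample standard deviation of two groups."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    return math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))


def welch_one_sided(x, y, shift: float = 0.0):
    """One-sided Welch test of H0: mu1 - mu2 <= shift against H1: mu1 - mu2 > shift."""
    n1, n2 = len(x), len(y)
    a, b = statistics.variance(x) / n1, statistics.variance(y) / n2
    t = (statistics.fmean(x) - statistics.fmean(y) - shift) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))  # Welch-Satterthwaite
    return t, df, t_sf(t, df)


if __name__ == "__main__":
    treatment = [5.1, 6.0, 5.5, 6.2, 5.8, 6.1]  # hypothetical data
    control = [4.0, 4.5, 4.2, 4.8, 4.1, 4.4]    # hypothetical data
    delta0 = 0.5
    print("NHST:", welch_one_sided(treatment, control))
    print("MET: ", welch_one_sided(treatment, control, shift=delta0 * pooled_sd(treatment, control)))
```

With a positive margin the test statistic can only decrease, so the MET p-value is never smaller than the corresponding Welch p-value for the same data.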

In both frequentist approaches, we take a Neyman-Pearsonian stance which decides in favor of $H_{0}$ or $H_{1}$ to control the long-term error rates instead of adopting a purely Fisherian perspective which aims at solely refuting the null hypothesis and which refrains from any statement if this task fails. This Neyman-Pearsonian view is motivated by the desire to achieve good operating characteristics in terms of the false-discovery rate and other metrics detailed below.

2.2. Choice of methods

We note that an alternative frequentist method is the dual-criterion design proposed by Rosenkranz²⁷ which combines significance and scientific relevance. However, we refrain from implementing this design because the setup of the simulation study is in conflict with the assumptions of the design.¹ Other frequentist approaches are given by Errington et al.²³ and range from the same-direction of effect sizes criterion to comparing the original effect size with the 95% confidence interval of the replication effect size. While these methods have their respective merit, we focus here on the two-trials rule, which relies on two significant p-values, and on the MET as a possible alternative. This is also the setup considered by Danziger et al.⁹ and arguably resembles current research practice most closely.

2.3. BDCs pipeline

The second preclinical research pipeline relies on BDCs which, like the MET, focus on the minimum clinically important effect size $δ_{0}$. Now the results of the exploratory study are analyzed based on a BDC, four options of which will be outlined in the next section. The result of the employed BDC is used to decide whether the exploratory study was a failure or whether the results warrant a confirmatory follow-up study. If the latter is the case, the sample size $n_{2}(C)$ to attain a prespecified power is calculated for each decision criterion $C$. Details on the calculation of the sample size for the BDCs are provided in the “Methods” section below, and in the Online Appendix. In brief, Bayesian sample size planning for the replication study is based on the hierarchical normal-normal model detailed by Pawel et al.²⁴ Based on the required sample size $n_{2}(C)$ for the employed criterion $C$, the replication study is conducted and the same BDC $C$ is used to analyze the replication study. Depending on the results, the treatment efficacy is either confirmed, leading to a success, or questioned, leading to a failure.

The performance of both research trajectories is analyzed based on a recently published distribution of effect sizes in preclinical animal research in neuroscience.¹¹ The primary performance measures of interest are the resulting positive predictive values (PPV), negative predictive values (NPV), false discovery proportion (FDP), and false omission rates (FOR) of each trajectory. As two of these values suffice to determine the remaining two, the focus will be on the false discovery proportion and the false omission rate.

The PPV is the percentage of true-positive results among all positive results. The NPV is the percentage of true-negative results among all negative results. The FDP is the percentage of false-positive results among all positive results, and the FOR is the percentage of false-negative results among all negative results. These quantities are formally defined in the “Methods” section. Importantly, the null hypothesis in the simulation study is a composite hypothesis and is given as $H_{0} : δ \leq δ_{0}$ for all research pipelines except the p-value-based pipeline, where $H_{0} : δ \leq 0$ . Details are provided below.
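The four measures can be computed directly from the counts of true and false positives and negatives of a pipeline. The following sketch uses hypothetical counts and a hypothetical function name, and reports proportions rather than percentages.

```python
def pipeline_metrics(true_pos: int, false_pos: int, true_neg: int, false_neg: int) -> dict:
    """PPV, NPV, FDP and FOR as proportions, from the four outcome counts."""
    positives = true_pos + false_pos   # all trajectories declared a success
    negatives = true_neg + false_neg   # all trajectories declared a failure
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative result")
    return {
        "PPV": true_pos / positives,
        "FDP": false_pos / positives,
        "NPV": true_neg / negatives,
        "FOR": false_neg / negatives,
    }


if __name__ == "__main__":
    # Hypothetical counts from 1000 simulated trajectories
    for name, value in pipeline_metrics(40, 10, 900, 50).items():
        print(f"{name}: {value:.3f}")
```

Since PPV + FDP = 1 and NPV + FOR = 1, reporting the FDP and the FOR determines the other two measures.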

3. Bayesian decision criteria

In this section, the four BDCs are detailed which are used in the second preclinical research pipeline. In contrast to the p-value-based trajectory, Bayesian analysis offers a variety of alternatives to the p-value for testing a hypothesis. For an overview, see Kelter²⁸ and Quatto et al.¹³ All approaches are united by the fact that the prior distribution for the parameter(s) of interest must be elicited.

In practice, researchers often have an expectation about the direction of an effect so that the test of $H_{0} : μ_{1} - μ_{2} \leq 0$ against $H_{1} : μ_{1} - μ_{2} > 0$ is carried out in the exploratory study. The one-sided test is equivalent to testing $H_{0} : δ \leq 0$ against $H_{1} : δ > 0$, where $δ \in \mathbb{R}$ denotes the effect size of Cohen,²⁹ $δ := \frac{μ_{1} - μ_{2}}{σ}$, and $σ$ denotes the pooled standard deviation of the sample. As a consequence, the prior distribution for Bayesian approaches is elicited for Cohen’s $δ$. The focus on Cohen’s $δ$ stems partially from the parameterization of most Bayesian two-sample t-tests, but also from the available large-scale meta-analysis by Bonapersona et al.¹¹ The latter allows us to elicit a non-informative meta-analytic prior distribution for $δ$ for preclinical animal studies, which is outlined in the next subsection.

Henceforth, it is assumed that in the exploratory study the direction of the effect is specified and one tests

$$H_{0} : δ \leq δ_{0} \quad \text{versus} \quad H_{1} : δ > δ_{0} \tag{1}$$

for a fixed smallest effect size of interest (SESOI) $δ_{0}$. The confirmatory replication study carries out the same test if a replication study is conducted.

3.1. Prior elicitation

As primary interest lies in the effect size $δ$ , prior elicitation should be based on available knowledge about effect sizes in biomedical research. Aarts et al. (2015) estimated the reproducibility of research and concluded that while there are also large effect sizes, small to moderate effects are the rule rather than the exception. Similar results were obtained by Button et al.³⁰ in the field of neuroscience. Bonapersona et al.¹¹ conducted a large-scale meta-analysis and reported an empirical effect size distribution for Hedges’ $g$ between two groups based on $n = 2738$ extracted empirical effect sizes from preclinical research in neuroscience.² Like Danziger et al.⁹ we eliminate studies which report a huge effect size with $| g | \geq 10$ , which leaves $n = 2729$ studies.³ The resulting empirical effect size distribution is shown in the left panel of Figure 1.

Figure 1.

Empirical effect size distribution of Hedges’ $g$ (left) and of Cohen’s $d$ (right) for preclinical animal research based on $n = 2729$ articles in biomedical research.

The red solid line shows a Cauchy $C(0, γ)$ prior density with $γ = 1/\sqrt{2}$, which is the recommended weakly-informative choice for a Bayesian two-sample t-test.³¹ Kelter¹⁴ showed that this prior in two-sample settings yields reasonable type I error probabilities and power under the alternative. Figure 1 shows that the Cauchy prior is a reasonable approximation to the meta-analytic prior based on the Gaussian kernel density approximation for $n = 2729$ reported preclinical studies and for Hedges’ $g$.⁴ Hedges’ $g$ differs from Cohen’s $d$ only by a scaling factor

$$g = d \cdot \frac{Γ(\mathrm{df}/2)}{\sqrt{\mathrm{df}/2} \; Γ((\mathrm{df} - 1)/2)} \tag{2}$$

where $\mathrm{df} := n_{1} + n_{2} - 2$ for an independent group design.³² For Cohen’s $d$ (right plot) a wider $C(0, \sqrt{2})$ prior seems more appropriate. However, as noted by Anderson and Kelley³³ a key problem with published effect sizes is publication bias. As a consequence, the effect sizes shown in Figure 2 are probably too large and the $C(0, \sqrt{2})$ prior is too wide. To account for publication bias, a more sceptical $C(0, \sqrt{2}/3)$ prior was elicited. The latter is more sceptical about the presence of an effect and amounts to stating that $\approx 50\%$ of effects do not exceed the SESOI of $δ_{0} = 0.5$. We decided to report results for the sceptical $C(0, \sqrt{2}/3)$ prior, which accounts for publication bias and is henceforth the default choice for all BDCs. Also, this prior can still yield large effect sizes which is in line with the current state that exploratory studies often lead to exaggerated results due to questionable research practices.^34,23
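The scaling factor in equation (2) is easy to evaluate numerically. The sketch below uses the log-gamma function to avoid overflow for larger group sizes; the function names are hypothetical, and the printed comparison with $1 - 3/(4\,\mathrm{df} - 1)$ uses the common closed-form approximation to the same factor.

```python
import math


def hedges_correction(df: float) -> float:
    """Scaling factor Gamma(df/2) / (sqrt(df/2) * Gamma((df-1)/2)) from equation (2)."""
    if df <= 1:
        raise ValueError("df must exceed 1")
    log_ratio = math.lgamma(df / 2) - math.lgamma((df - 1) / 2)
    return math.exp(log_ratio) / math.sqrt(df / 2)


def cohen_d_to_hedges_g(d: float, n1: int, n2: int) -> float:
    """Convert Cohen's d to Hedges' g for an independent two-group design."""
    return d * hedges_correction(n1 + n2 - 2)


if __name__ == "__main__":
    for n in (5, 10, 20):  # hypothetical per-group sample sizes
        df = 2 * n - 2
        print(n, round(hedges_correction(df), 4), round(1 - 3 / (4 * df - 1), 4))
```

The factor is below one and approaches one as the group sizes grow, so $g$ and $d$ differ noticeably only for the small samples typical of preclinical animal studies.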

3.2. Choice of methods

There is a variety of Bayesian methods to test a hypothesis.^14,35,36 In the context of an original and replication study, there are also frameworks which aim at quantification of replication success of a study, such as reverse-Bayes methods as proposed by Held^37,12 which include meta-analytic approaches, the sceptical p-value, sceptical Bayes factor (BF), and even frequentist methods such as the two-trials rule (which essentially reduces to the p-value-based research pipeline outlined above, where replication success is defined as two significant hypothesis tests). All of these methods have their merit, and most Bayesian methods which explicitly aim at quantifying replication success are currently located inside the hierarchical normal-normal model which is outlined in the Online Appendix. Another approach which is not located inside the hierarchical normal-normal model is the replication BF.³⁸

Here, we focus on Bayesian methods which aim at testing an interval hypothesis, see Linde³⁹ and Kelter.¹⁵ These methods naturally anchor to the task of incorporating a SESOI into a hypothesis test, and relevant methods include (1) posterior probabilities,^15,36 (2) interval BFs,^40,41 (3) the region of practical equivalence (ROPE),^42,7,39 and (4) the full Bayesian evidence test (FBET).^43,44 While these methods are regularly used in the literature to test an interval hypothesis, there is no current framework to use them for judging the success of a replication study. To meet this limitation, we anchor these methods to the hierarchical normal-normal model detailed in Pawel et al.²⁴ and develop sample size calculation formulas for the replication study which assert a predefined probability of replication success. These derivations are relegated to the Online Appendix, and details about the BDCs (1) to (4) follow next.

3.3. Posterior probabilities

The first Bayesian decision criterion is to decide based on posterior probabilities of hypotheses. In the p-value-based research trajectory, we test $H_{0} : δ \leq 0$ versus $H_{1} : δ > 0$ in the exploratory study. The posterior odds of $H_{0}$ are given as

$$\frac{P_{ϑ | Y}(H_{0} | y)}{P_{ϑ | Y}(H_{1} | y)} = \underbrace{\frac{f(y | H_{0})}{f(y | H_{1})}}_{= \mathrm{BF}_{01}(y)} \cdot \frac{P_{ϑ}(H_{0})}{P_{ϑ}(H_{1})} \tag{3}$$

where $P_{ϑ}$ denotes the prior distribution, $P_{ϑ | Y}$ denotes the posterior distribution, and $\frac{f(y | H_{0})}{f(y | H_{1})}$ is the Bayes factor $\mathrm{BF}_{01}(y)$ in favor of $H_{0}$.⁴⁵ Based on the posterior odds $c > 0$, the posterior probability of $H_{0}$ is given by

$$P_{ϑ | Y}(H_{0} | y) = \frac{c}{1 + c} \tag{4}$$

and analogously for $H_{1}$. A decision in favor of $H_{0} : δ \leq 0$ is made if $P_{ϑ | Y}(H_{0} | y) > p_{0}$ for a specified $p_{0} \in (0, 1]$ such as $p_{0} = 0.9$.⁵ However, we consider only BDCs which focus on the minimum SESOI. Therefore, we specify a $δ_{0} \in \mathbb{R}_{+}$ so that only effect sizes $|δ| > δ_{0}$ are considered as clinically relevant. As a consequence, when using posterior probabilities as the BDC in the second trajectory, the hypothesis $H_{0} : |δ| \leq δ_{0}$ is tested against $H_{1} : |δ| > δ_{0}$.

Furthermore, to align prior odds of the null and alternative hypotheses with the beliefs expressed by the $C(0, \sqrt{2}/3)$ prior, we assign $H_{0}$ the prior probability which is given by $F_{C(0, \sqrt{2}/3)}(δ_{0}) = F_{C(0, \sqrt{2}/3)}(0.5) \approx 0.76$, where $F_{C(0, \sqrt{2}/3)}$ denotes the cumulative distribution function of the $C(0, \sqrt{2}/3)$ distribution. Thus, the prior probability of an effect being scientifically irrelevant is $\approx 76\%$, which aligns with the skepticism provided by the $C(0, \sqrt{2}/3)$ prior.
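The prior probabilities quoted for the sceptical Cauchy prior, and the passage from a Bayes factor to a posterior probability via equations (3) and (4), can be checked with a few lines. The sketch relies only on the closed-form Cauchy distribution function; the function names and the Bayes factor used in the example are hypothetical.

```python
import math


def cauchy_cdf(x: float, scale: float) -> float:
    """Distribution function of the Cauchy C(0, scale) distribution."""
    return 0.5 + math.atan(x / scale) / math.pi


def posterior_prob_h0(bf01: float, prior_prob_h0: float) -> float:
    """Posterior probability of H0 from the Bayes factor BF01 and the prior probability of H0."""
    prior_odds = prior_prob_h0 / (1 - prior_prob_h0)
    c = bf01 * prior_odds        # posterior odds, equation (3)
    return c / (1 + c)           # equation (4)


SCALE = math.sqrt(2) / 3   # scale of the sceptical prior
DELTA0 = 0.5               # smallest effect size of interest

if __name__ == "__main__":
    prior_h0 = cauchy_cdf(DELTA0, SCALE)                        # P(delta <= 0.5)
    central = prior_h0 - cauchy_cdf(-DELTA0, SCALE)             # P(|delta| <= 0.5)
    print(f"P(delta <= {DELTA0})   = {prior_h0:.3f}")
    print(f"P(|delta| <= {DELTA0}) = {central:.3f}")
    # Hypothetical Bayes factor, for illustration only
    print(f"posterior P(H0 | y) for BF01 = 0.1: {posterior_prob_h0(0.1, prior_h0):.3f}")
```

The first printed value corresponds to the one-sided prior probability of roughly 0.76, the second to the roughly 50% prior mass on effects whose absolute value does not exceed the SESOI.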

3.4. Interval BFs

Concerning the two-sample t-test, Morey et al.⁴⁰ built on the standard Bayesian two-sample t-test model of Rouder et al.,³¹ where $X_{i} \sim N(μ - \frac{α}{2}, σ^{2})$ and $Y_{j} \sim N(μ + \frac{α}{2}, σ^{2})$, and $X_{i}$, $i = 1, \dots, N_{x}$, and $Y_{j}$, $j = 1, \dots, N_{y}$, denote the observations in the treatment and control groups. The parameter $μ$ denotes the so-called grand mean, $α$ the total effect, and $N_{x}$ and $N_{y}$ the corresponding group sizes. Morey et al.⁴⁰ followed Jeffreys,⁴⁶ who advocated performing inference on the normalized mean $δ := α / σ$, and the model uses a precise null hypothesis formulated by Morey et al.⁴⁰ in terms of the priors under the null and alternative hypotheses:

$$H_{0}^{\mathrm{JZS}} : δ = 0, \qquad H_{1}^{\mathrm{JZS}} : δ \sim C(0, 1)$$

where $C(0, 1)$ is a Cauchy distribution with scale parameter $γ = 1$ under the alternative $H_{1}^{\mathrm{JZS}}$. Thus, if $H_{0}^{\mathrm{JZS}}$ holds, then $α = 0$ and group means in both groups differ by zero and are equal to the grand mean $μ$. Choosing Jeffreys’ prior $p(σ^{2}) = 1/σ^{2}$ for the prior on $σ^{2}$ in both groups and a flat prior $p(μ) = 1$ on the grand mean, this model is known as the Jeffreys-Zellner-Siow (JZS) prior, compare Rouder et al.³¹ A detailed description is also available by Kelter.⁴⁷

Morey et al.⁴⁰ extended the approach of Rouder et al.³¹ and introduced the non-overlapping hypotheses (NOH) model where

$$δ \sim t_{ν_{0}}, \qquad p(σ^{2}) \propto 1/σ^{2}$$

The NOH model assigns the effect size $δ$ a $t_{ν_{0}}$ prior with $ν_{0}$ degrees of freedom instead of assigning different priors under $H_{0}$ and $H_{1}$. The choice $ν_{0} = 1$ yields the JZS Cauchy prior, as the $t_{1}$ distribution equals the $C(0, 1)$ distribution. The value $ν_{0} = \infty$ yields a standard normal prior. The recommended default value for $ν_{0}$ is $ν_{0} = 1$, because the Cauchy distribution allows for a realistic range of effect sizes for biomedical research.^15,40 This choice also is backed up by the empirical meta-analytic prior, compare Figure 1. However, to account for publication bias we use the $C(0, \sqrt{2}/3)$ prior elicited above. The hypotheses for the NOH model are defined as

$$H_{0}^{\mathrm{NOH}} : δ \in [-δ_{0}, δ_{0}], \qquad H_{1}^{\mathrm{NOH}} : δ \notin [-δ_{0}, δ_{0}] \tag{5}$$

Computation of the NOH Bayes factor $\mathrm{BF}_{01}^{\mathrm{NOH}}$ requires numerical integration to obtain the marginal likelihoods under both $H_{0}^{\mathrm{NOH}}$ and $H_{1}^{\mathrm{NOH}}$. Therefore, the smallest effect size of interest $δ_{0}$ must be specified. While Morey et al.⁴⁰ follow Cohen²⁹ in advocating $(-δ_{0}, δ_{0}) = (-0.1, 0.1)$ as the boundaries of the interval hypothesis, for preclinical animal research this is quite small. As discussed by Danziger et al.⁹ a more realistic choice is given by $δ_{0} = 0.5$. The decision for the interval BF is then made based on $\mathrm{BF}_{01}^{\mathrm{NOH}} < 1/3$, which is the moderate evidence threshold according to the scale of Jeffreys.⁴⁶ To align the interval BF test with the one-sided test of interest, we modify the hypotheses in (5) into

$$H_{0}^{\mathrm{NOH}} : δ \in (-\infty, δ_{0}], \qquad H_{1}^{\mathrm{NOH}} : δ \in (δ_{0}, \infty)$$
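The numerical integration behind a one-sided interval Bayes factor can be illustrated with a deliberately simplified sketch. It is not the NOH implementation of Morey et al.: instead of the full two-sample t-test likelihood, it assumes a normal approximation in which the observed standardized effect `d_hat` is normally distributed around $δ$ with known standard error `se`. The prior is the sceptical $C(0, \sqrt{2}/3)$ prior restricted to each hypothesis. All function names, the integration window, and the example inputs are hypothetical.

```python
import math


def cauchy_pdf(x: float, scale: float) -> float:
    return scale / (math.pi * (scale * scale + x * x))


def cauchy_cdf(x: float, scale: float) -> float:
    return 0.5 + math.atan(x / scale) / math.pi


def normal_pdf(x: float, mean: float, sd: float) -> float:
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def interval_bf01(d_hat: float, se: float, delta0: float = 0.5,
                  scale: float = math.sqrt(2) / 3,
                  half_width: float = 30.0, steps: int = 60000) -> float:
    """BF01 for H0: delta <= delta0 versus H1: delta > delta0 (normal approximation).

    Each marginal likelihood is the integral of likelihood times prior over the
    hypothesis region, divided by the prior mass of that region.
    """
    h = 2 * half_width / steps
    num0 = num1 = 0.0
    for k in range(steps):  # midpoint rule on [-half_width, half_width]
        delta = -half_width + (k + 0.5) * h
        weight = normal_pdf(d_hat, delta, se) * cauchy_pdf(delta, scale) * h
        if delta <= delta0:
            num0 += weight
        else:
            num1 += weight
    prior0 = cauchy_cdf(delta0, scale)
    if num1 == 0.0:
        return math.inf
    return (num0 / prior0) / (num1 / (1 - prior0))


if __name__ == "__main__":
    for d_hat in (0.0, 0.6, 1.2):  # hypothetical exploratory estimates, se = 0.3
        bf = interval_bf01(d_hat, 0.3)
        print(f"d_hat = {d_hat}: BF01 = {bf:.4f}, transition = {bf < 1 / 3}")
```

Dividing by the prior mass of each region is what separates the Bayes factor from the posterior odds: the sceptical prior puts more mass on $H_{0}$, and this imbalance is removed from the Bayes factor but retained in the posterior probabilities of the previous subsection.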

3.5. The region of practical equivalence

The third Bayesian decision criterion focuses on the location of a Bayesian interval estimate, such as an equal-tailed credible or highest-posterior-density (HPD) interval, relative to the region of practical equivalence, the ROPE. The concept of a ROPE was independently proposed in a wide range of scientific domains, compare Kruschke,⁷ Lakens et al.,²⁵ and Kelter.⁴⁸ Starting from the posterior distribution $P_{ϑ | Y}$ of the parameter of interest, parameter values inside the ROPE should be interpreted as equivalent for practical purposes to the value the ROPE is defined around. There are two versions of the ROPE, one in which the 95% highest-posterior-density (HPD) interval is used for the analysis (95% ROPE), and one in which the full posterior distribution is used (full ROPE). The decision is made as follows: If the 95% HPD interval is located entirely inside the ROPE $R := [δ_{0} - ε, δ_{0} + ε]$ for some margin $ε > 0$, then the unknown true parameter value is interpreted as practically equivalent to the specified value $δ_{0}$. For the effect size $δ$, Kruschke⁷ proposed to use $[-0.1, 0.1]$ as the ROPE for the null hypothesis $H_{0} : δ = 0$ of no effect, which is half of the effect size necessary for at least a small effect according to Cohen.²⁹ This is essentially the same proposal which was made by Morey et al.⁴⁰ but is quite small in terms of preclinical animal research where the goal is to identify promising treatments or drugs among a possibly large collection of candidates which do not pass the smallest effect size of interest. In the context of the exploratory study, when using the ROPE the 95% version is employed. However, here a conceptual distinction is made which follows the approaches by Makowski,³⁵ Kelter,^14,15 and Linde,³⁹ where the ROPE is interpreted more informally as an interval hypothesis. Thus, in the one-sided testing case we use the ROPE $R = (δ_{0}, \infty)$ and if the 95% HPD (respectively the 100% HPD) is located entirely inside $R$, the alternative hypothesis $H_{1} : δ > δ_{0}$ is accepted. Therefore, the same boundary $δ_{0}$ employed in the test based on the posterior probabilities or interval BFs is used for the ROPE, which allows a comparison of the approaches.
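The one-sided ROPE decision can be sketched from posterior draws. The example below assumes that draws from the posterior of $δ$ are available (here they are simulated from a hypothetical normal posterior) and that the posterior is unimodal, so that the shortest interval containing 95% of the sorted draws approximates the HPD interval. The function names are hypothetical.

```python
import math
import random


def hpd_interval(samples, mass: float = 0.95):
    """Shortest interval containing a fraction `mass` of the draws (unimodal posterior assumed)."""
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        raise ValueError("no posterior draws supplied")
    # number of draws the interval must contain; small tolerance guards against float error
    k = min(n, max(1, math.ceil(mass * n - 1e-9)))
    start = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[start], ordered[start + k - 1]


def rope_accepts_h1(samples, delta0: float = 0.5, mass: float = 0.95) -> bool:
    """Accept H1: delta > delta0 if the HPD interval lies entirely inside (delta0, infinity)."""
    lower, _ = hpd_interval(samples, mass)
    return lower > delta0


if __name__ == "__main__":
    rng = random.Random(1)
    draws = [rng.gauss(1.0, 0.2) for _ in range(5000)]  # hypothetical posterior draws
    print(hpd_interval(draws), rope_accepts_h1(draws))
```

Setting `mass=1.0` corresponds to the full ROPE variant, in which every posterior draw must exceed $δ_{0}$.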

3.6. The Bayesian evidence test and e-values

Interestingly, the decision rule of the ROPE is only a special case of a decision based on the Bayesian evidence values, which were recently proposed as a unification of Bayesian hypothesis testing and parameter estimation.⁴³ Kelter⁴³ proposed the full Bayesian evidence test (FBET), which is a generalization of the full Bayesian significance test (FBST). Details on the FBST are given by Pereira and Stern⁴⁴ and Kelter,^49,28 while the FBET was proposed by Kelter.⁴³ The FBET can be used with any standard parametric statistical model, where $θ \in Θ \subseteq \mathbb{R}^{p}$ is a (possibly vector-valued) parameter of interest, $p(y | θ)$ is the likelihood, $p(θ)$ is the density of the prior distribution $P_{ϑ}$ for the parameter $θ$, and $y \in Y$ denotes the observed sample data, $Y$ being the sample space.

3.7. The Bayesian evidence interval

The FBET uses the Bayesian evidence interval for quantifying the evidence in favor or against a hypothesis. Denote by

I (θ) := \frac{p (θ | y)}{r (θ)}

(6)

the Bayesian information function for a given reference function $r (θ)$. Equation (6) constitutes the probabilistic explication of information for a Bayesian according to Good,^45,50,51 and has a close connection to the Kullback-Leibler divergence between two probability measures. In fact, the latter is defined as

KL (P, Q) := \int_{- \infty}^{\infty} p (θ) \log [\frac{p (θ)}{q (θ)}] d θ

for two measures $P$ and $Q$ with probability densities $p$ and $q$. Substituting the prior density $p (\cdot) := d P_{ϑ} / d θ$ of the parameter for $q$ and the posterior density $p (\cdot | y) := d P_{ϑ | Y} / d θ$ for $p$, the above reads

KL (P_{ϑ | Y}, P_{ϑ}) := \int_{- \infty}^{\infty} p (θ | y) \log [\frac{p (θ | y)}{p (θ)}] d θ

The above display shows that the logarithm of $I (θ)$ can be interpreted as the point-wise information between prior and posterior, when selecting the reference function $r (θ)$ as the prior density $p (θ)$. Equation (6) is, therefore, exponential or multiplicative information in the sense of Kullback-Leibler divergence. Thus, the evidence interval measures (point-wise) the information between prior and posterior, which is gained by observation of the data $y$. Henceforth, we will make this our default choice. Alternatively, selecting a flat reference function $r (θ) := 1$ can be interpreted as selecting an improper flat prior density, so information is then measured without adopting any information expressed as subjective prior beliefs. The idea of $I (θ)$ is to order the parameter space $Θ$ into values $θ$ which are informative compared to a reference function $r (θ)$ (in this case, the prior density), and values which are uninformative.
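
To make the link between $I (θ)$ and the Kullback-Leibler divergence explicit, the following display restates the two formulas above for the choice $r (θ) = p (θ)$. It is only a rewriting of the definitions already given, not an additional result.

```latex
\log I(θ) = \log \frac{p(θ \mid y)}{p(θ)},
\qquad
KL(P_{ϑ \mid Y}, P_{ϑ})
  = \int p(θ \mid y) \, \log I(θ) \, dθ
  = E_{θ \mid y}\left[ \log I(θ) \right]
```

In words, the Kullback-Leibler divergence from prior to posterior is the posterior expectation of the point-wise log-information $\log I (θ)$.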

The Bayesian evidence interval ${EI}_{r} (ν)$ with reference function $r (θ)$ to level $ν$ includes all parameter values $θ$ for which there is at least a predefined amount $ν$ of information provided by the observed data $y$ .

{EI}_{r} (ν) := \{θ \in Θ \mid \frac{p (θ | y)}{r (θ)} \geq ν\}

(7)

It can be shown that the Bayesian evidence interval includes standard HPD intervals, support intervals, and credible intervals as special cases under suitable regularity conditions.⁴³ The Bayesian evidence value tests an interval hypothesis $H_{0}$ as follows: For a given evidence interval ${EI}_{r} (ν)$ with reference function $r (θ)$ to level $ν$ (also called the evidence threshold), the Bayesian evidence value ${Ev}_{{EI}_{r} (ν)} (H_{0})$ for the null hypothesis $H_{0}$ is defined as

{Ev}_{{EI}_{r} (ν)} (H_{0}) := \int_{{EI}_{r} (ν) \cap H_{0}} p (θ | y) d θ

(8)

The corresponding Bayesian evidence value ${Ev}_{{EI}_{r} (ν)} (H_{1})$ for the alternative hypothesis $H_{1}$ is defined analogously. The Bayesian evidence value ${Ev}_{{EI}_{r} (ν)} (H_{0})$ is the integral of the posterior density $p (θ | y)$ over the intersection of the evidence interval ${EI}_{r} (ν)$ with the interval hypothesis $H_{0}$. The FBET is implemented in the R package fbst, which is available on CRAN; details and illustrative examples can be found in the articles of Kelter.^49,43

In the context of the exploratory study the same hypothesis $H_{0} : δ \leq δ_{0}$ is tested against $H_{1} : δ > δ_{0}$ which was chosen when using posterior probabilities as the BDC.
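
Equations (7) and (8) can be illustrated with a small grid approximation. The sketch below is not the implementation of the fbst package. It assumes a normal approximation for the likelihood of the observed standardized mean difference `d_obs` (unit variance, `n` units per group), the $C (0, \sqrt{2} / 3)$ prior as reference function, and an illustrative evidence threshold $ν = 1$. The function name `evidence_values` and the grid settings are hypothetical.

```python
import math


def evidence_values(d_obs, n, delta_0=0.5, nu=1.0,
                    prior_scale=math.sqrt(2) / 3,
                    lo=-10.0, hi=10.0, points=20001):
    """Grid approximation of Ev(H0) and Ev(H1) with the prior as reference function."""
    se = math.sqrt(2.0 / n)  # sd of the mean difference for unit variance (assumption)
    step = (hi - lo) / (points - 1)
    grid = [lo + k * step for k in range(points)]
    prior = [1.0 / (math.pi * prior_scale * (1.0 + (t / prior_scale) ** 2)) for t in grid]
    lik = [math.exp(-0.5 * ((d_obs - t) / se) ** 2) for t in grid]
    unnorm = [p * l for p, l in zip(prior, lik)]
    norm = sum(unnorm) * step
    post = [u / norm for u in unnorm]  # posterior density on the grid

    ev_h0 = ev_h1 = 0.0
    for t, pr, po in zip(grid, prior, post):
        if po / pr >= nu:  # theta belongs to the evidence interval EI_r(nu)
            if t > delta_0:
                ev_h1 += po * step  # contributes to Ev(H1), H1: delta > delta_0
            else:
                ev_h0 += po * step  # contributes to Ev(H0), H0: delta <= delta_0
    return ev_h0, ev_h1


ev_h0, ev_h1 = evidence_values(d_obs=1.5, n=10)
print(ev_h0, ev_h1)
```

Setting `nu=0` includes every grid point in the evidence interval, so the two evidence values then reduce to the posterior probabilities of $H_{0}$ and $H_{1}$.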

4. Methods

Figure 2 shows the two competing preclinical research pipelines. The top part shows the frequentist research pipelines, and the lower part details the Bayesian research pipelines. Details regarding the methodology follow below.

4.1. SESOIs for exploratory and confirmatory research

To implement the preclinical research pipeline with BDCs which focus on the smallest effect size of interest, the latter must be specified. Danziger et al.⁹ used a similar approach based on the frequentist estimation of the SESOI and used Hedges’ $g = 0.5$ and $g = 1.0$ . Here, we use two distinct approaches:

$▸$ First, a double-testing strategy which employs a statistical test (NHST when using p-values, top part in Figure 2, and one of the four options outlined in the previous section when using BDCs, bottom part in Figure 2) both in the exploratory and confirmatory studies.
$▸$ Second, an approach similar to the one of Danziger et al.⁹ which checks, based on the exploratory study data $y_{e x p}$, whether the 95% CI or 95% HPD includes a value at least as large in magnitude as the SESOI. If $| \min \{δ \in R \mid δ \in {CI}_{95 %}\} | > δ_{0}$, the confirmatory study applies p-value-based NHST. In the Bayesian trajectories, if $| \min \{δ \in R \mid δ \in {HPD}_{95 %}\} | > δ_{0}$, a hypothesis test based on one of the four BDCs is conducted in the confirmatory study.
Thus, each of the two research pipelines in Figure 2 is split into two subcases, which are also shown in Figure 2. In all trajectories, $δ_{0} := 0.5$ is chosen as the SESOI, which equals the boundary between a small and a medium effect according to Cohen.²⁹ As preclinical animal research must filter the most promising treatments among a possibly large number of ineffective candidates, $δ_{0} := 0.5$ provides a reasonable threshold for further investigation.⁵² Although original studies often report much larger effect sizes, the Reproducibility Project: Cancer Biology found that the median and mean shrinkage of effect sizes between original and replication studies are $85 %$ and $90 %$, respectively, compare Errington et al.²³ Based on the available literature (compare Figure 1) about 70% of the reported effect sizes in preclinical animal research pass this threshold, so in the application context this means that only $\approx 70 %$ of the effects of the $C (0, 1 / \sqrt{2})$ prior are judged as clinically meaningful a priori. However, the modified $C (0, \sqrt{2} / 3)$ prior accounts for publication bias and reduces this percentage to $24 %$.
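
As a plausibility check of the $24 %$ figure, the following sketch computes the prior mass above the SESOI in closed form. It assumes that the percentage refers to the one-sided prior probability $P (δ > δ_{0})$ under a Cauchy prior centred at zero. The function name `cauchy_upper_tail` is hypothetical.

```python
import math


def cauchy_upper_tail(delta_0: float, scale: float) -> float:
    """P(delta > delta_0) for delta ~ Cauchy(0, scale), from the Cauchy CDF."""
    return 0.5 - math.atan(delta_0 / scale) / math.pi


# Publication-bias corrected prior C(0, sqrt(2)/3) and SESOI delta_0 = 0.5.
print(round(cauchy_upper_tail(0.5, math.sqrt(2) / 3), 3))
```

Under this reading, the corrected prior places roughly a quarter of its mass above $δ_{0} = 0.5$.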
4.2. Data-generating model and sample size calculations

As outlined previously, the data-generating model follows the assumptions of Welch’s two-sample t-test, and a typical number of experimental units per group in exploratory preclinical animal research is $n_{1} := 10$.^53,11,9 As mentioned by one reviewer, even smaller sample sizes are used in practice, for example, $n_{1} = 8$ or even $n_{1} = 6$, compare also Neumann et al.²¹ From a statistical point of view, even 10 animals per group constitute a small sample, so we do not use smaller sample sizes in the exploratory study. In both the NHST-based trajectory and the trajectory using BDCs, $n_{1} = 10$ experimental units are used in each group in the exploratory study. For the replication study, the required sample size $n_{2}$ in the NHST-based trajectory was determined by a power analysis for 50% power to reject $H_{0} : δ < δ_{0}$ for the estimated $\hat{δ} (y_{e x p})$ of the exploratory study. Although this power sounds quite small, there are four aspects to consider.

(i) First, the test concerns $H_{1} : δ > δ_{0}$, and a power of 50% is only the smallest power that can be attained. If the true $δ$ is larger than $δ_{0}$, there will be a higher power to detect an effect. For example, the power analyses in Figure 3 show that for larger power the number of animals quickly increases substantially. On the other hand, there is no reason to expect that the effect size increases in the replication study.^23,34
(ii) Second, selecting 50% power for $δ_{0} = 0.5$ reduces the number of animals required in the confirmatory study and thereby acknowledges the ethical constraints of preclinical research.
(iii) Third, for small to moderate effect sizes power analyses based on more than 50% power can result in unrealistically large sample sizes for confirmatory replication studies. Unrealistic here means that researchers would probably not pursue such a replication attempt when faced with such a sample size requirement due to economic constraints. Figure 3 shows some of the simulation results and demonstrates that even for 50% power some trajectories require a substantial number of animals (e.g. $\approx 25$ animals per group for p-values with 50% power vs. $\approx 50$ animals per group for p-values with $80 %$ power). Thus, designing a replication study with 50% power meets the economic constraints medical research faces in practice.
(iv) Fourth, when using $80 %$ power in the replication study, there is no guarantee that the final power of original and replication study together will meet a predefined threshold. Here, we follow the common practice in preclinical animal research of choosing $n_{1} = 10$ animals mostly by ad-hoc considerations rather than formal statistical arguments, see Majid et al.²⁰ If a formal statistical approach were used, then even 80% power in an original and replication study would result in a power of $0.8 \cdot 0.8 = 0.64$ to detect a true effect. The required number of animals to meet $80 %$ power in total would be huge, as $\approx 90 %$ power in each study would be necessary.
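
The trade-off between power and sample size discussed in (i) to (iv) can be sketched with the textbook normal approximation for a one-sided two-sample test. This is an assumption made for illustration only: it tests against a null of zero effect with one-sided $α = 0.05$ and is not the exact calculation behind Figure 3, which is based on the t-distribution. The function name `n_per_group` is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(delta: float, power: float, alpha: float = 0.05) -> int:
    """Approximate per-group sample size for a one-sided two-sample z-test."""
    z = NormalDist()
    n = 2.0 * (z.inv_cdf(1.0 - alpha) + z.inv_cdf(power)) ** 2 / delta ** 2
    return math.ceil(n)


# Sample size for delta = 0.5 at 50% versus 80% power.
print(n_per_group(0.5, 0.50), n_per_group(0.5, 0.80))

# Joint power of two independent studies, and the per-study power needed for 80% overall.
print(0.8 * 0.8, math.sqrt(0.8))
```

Under these assumptions, moving from 50% to 80% power roughly doubles the required number of animals per group, and about 90% power per study is needed for 80% power across both studies.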

However, there are constraints which should be considered to avoid unrealistic sample sizes. First, a lower bound for the sample size $n_{2}$ was set to $n_{2} = 5$, as fewer experimental units seem insufficient to obtain reliable results.^53,2,3 Second, the largest sample size $n_{2}$ was limited to $n_{2} = 100$, which is already very large. Such a resulting sample size $n_{2}$ is almost prohibitively expensive and therefore renders an approach inferior to competing approaches which require smaller $n_{2}$ and yield similar performance characteristics. Third, as the SESOI is set to $δ_{0} := 0.5$ in all cases, the power should be calculated for this $δ_{0}$. Thus, when $H_{1} : δ > δ_{0}$ is accepted in the exploratory study, the power calculations for $50 %$ power in the confirmatory study are based on $δ_{0} := 0.5$. If, however, the effect size estimate $\hat{δ} (y_{e x p})$ of the exploratory study is $> 0.5$, the power is calculated based on the estimate rounded down to the next multiple of $0.10$. For example, if $\hat{δ} (y_{e x p}) = 0.74$, the sample size calculation is performed for 50% power to detect $δ = 0.70$ in the confirmatory study. This extra cushion safeguards against overoptimism when performing sample size calculations based on the exploratory results, and acknowledges the small sample size on which $\hat{δ} (y_{e x p})$ is based (i.e. $n_{1} = 10$). The latter is a form of uncertainty correction which is important as stated by Anderson and Kelley.⁴ Likewise, if the estimate $\hat{δ} (y_{e x p})$ is $\geq 1$ in magnitude, the sample sizes are calculated for $δ = 1$ to avoid overoptimism in the sample size calculations for $n_{2}$. In most cases, the resulting $n_{2}$ would otherwise be very small, compare Figure 3.
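
The rounding and truncation rules above can be written down compactly. The following sketch is an illustration under two assumptions that are not stated in the text: the magnitude of the estimate is used throughout, and calculated sample sizes are rounded up to the next integer before the bounds are applied. The names `planning_effect` and `clamp_n2` are hypothetical.

```python
import math

DELTA_0 = 0.5             # SESOI
N2_MIN, N2_MAX = 5, 100   # bounds for the confirmatory sample size


def planning_effect(d_hat: float) -> float:
    """Effect size used in the sample size calculation for the confirmatory study."""
    d = abs(d_hat)  # assumption: work with the magnitude of the estimate
    if d >= 1.0:
        return 1.0          # cap at delta = 1 to avoid overoptimism
    if d <= DELTA_0:
        return DELTA_0      # never plan for less than the SESOI
    # Round down to the next multiple of 0.10 (guarding against float noise).
    return math.floor(round(d * 10, 8)) / 10


def clamp_n2(n2: float) -> int:
    """Restrict the calculated sample size to the interval [N2_MIN, N2_MAX]."""
    return int(min(N2_MAX, max(N2_MIN, math.ceil(n2))))


print(planning_effect(0.74), planning_effect(1.3), clamp_n2(3.2), clamp_n2(250))
```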
4.3. Monte Carlo simulations for Bayesian power

As BDCs require simulations to investigate frequentist operating characteristics such as the power, a Monte Carlo simulation can be carried out to obtain power estimates under various true values of $δ \in R$ for the trajectories using BDCs. Based on these simulation results, the sample size $n_{2} (C)$ to attain 50% power with the (Bayesian) decision criterion $C$ used in the confirmatory study can be quantified. Figure 3 shows the resulting power of the four BDCs specified above as well as the MET procedure and the p-value in Welch’s two-sample t-test.⁶ For the p-value in Welch’s t-test, the standard sample size calculations are available.⁵⁴ For the Bayesian trajectories and power analyses, the prior was set to the publication-bias corrected $δ \sim C (0, \sqrt{2} / 3)$ as elicited in Figure 1. For all BDCs, the posterior $p (δ | y)$ based on Rouder et al.’s NOH model for interval BFs is used, which can be sampled via standard MCMC. This ensures that BDCs are based on the same posterior distribution for every data set.
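
The Monte Carlo approach to Bayesian power can be sketched as follows for the posterior probability criterion. This is a deliberately simplified stand-in and not the model used in the paper: it assumes a normal likelihood for the observed mean difference with known unit variance, evaluates the posterior under the $C (0, \sqrt{2} / 3)$ prior on a grid instead of via MCMC, and accepts $H_{1}$ if $P (δ > δ_{0} | y) > 0.9$. The names `posterior_prob_h1` and `mc_power` are hypothetical.

```python
import math
import random

PRIOR_SCALE = math.sqrt(2) / 3  # publication-bias corrected Cauchy scale
DELTA_0 = 0.5                   # SESOI
THRESHOLD = 0.9                 # accept H1 if P(delta > delta_0 | data) > 0.9


def posterior_prob_h1(d_obs: float, n: int, grid_points: int = 400) -> float:
    """P(delta > DELTA_0 | d_obs) under a Cauchy prior and a normal likelihood."""
    se = math.sqrt(2.0 / n)  # sd of the mean difference when sigma = 1
    lo, hi = d_obs - 8 * se, d_obs + 8 * se
    step = (hi - lo) / (grid_points - 1)
    total = above = 0.0
    for k in range(grid_points):
        delta = lo + k * step
        prior = 1.0 / (1.0 + (delta / PRIOR_SCALE) ** 2)    # unnormalised Cauchy density
        lik = math.exp(-0.5 * ((d_obs - delta) / se) ** 2)  # unnormalised normal likelihood
        weight = prior * lik
        total += weight
        if delta > DELTA_0:
            above += weight
    return above / total


def mc_power(true_delta: float, n: int, sims: int = 400, seed: int = 1) -> float:
    """Share of simulated studies in which H1: delta > DELTA_0 is accepted."""
    rng = random.Random(seed)
    se = math.sqrt(2.0 / n)
    hits = 0
    for _ in range(sims):
        d_obs = rng.gauss(true_delta, se)  # simulated observed mean difference
        if posterior_prob_h1(d_obs, n) > THRESHOLD:
            hits += 1
    return hits / sims


for n in (10, 40):
    print(n, mc_power(0.5, n), mc_power(1.5, n))
```

Repeating the call over a range of `n` and picking the smallest value that reaches the target power gives a simulation-based sample size for the chosen criterion.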

A crucial aspect was raised by one reviewer, which is that BDCs should ideally be based on the Bayesian approaches to the sample size calculation for replication studies. We agree with this recommendation and, therefore, followed the hierarchical normal-normal model outlined by Pawel et al.²⁴ to calculate the required sample sizes for the replication study. The Online Appendix outlines the necessary theory of this approach, and we developed novel sample size calculation formulas for all four BDCs except interval BFs. The latter require extensive computational effort which renders the Bayesian sample size calculation inaccessible for this decision criterion. Therefore, we opted for the Bayesian sample size calculation for posterior probabilities when using interval BFs. This is motivated mainly by the fact that under prior odds of one, posterior odds and BFs are decision-theoretically equivalent, compare Robert.⁵⁵ We further note that the sample size calculations shown in Figure 3 are a special case of the more general formulas derived in the Online Appendix, but are rendered unrealistic from a Bayesian point of view because uncertainty about the true parameter value is essentially ignored. Thus, power calculations in Figure 3 are used only for p-values and MET-based research pipelines, while all Bayesian sample size calculations follow the approach detailed in the Online Appendix.

As a side note, Figure 3 shows that the power of various methods (e.g. the ROPE, posterior probabilities, the MET, or the FBET) remains close to zero when the true effect size is equal to $δ_{0}$. The reason is that when the true effect size lies on the boundary between $H_{0}$ and $H_{1}$, the power does not increase at any relevant rate as the sample size increases.

Figure 2.

Two preclinical research pipelines each comprising an exploratory and possibly a confirmatory study, depending on the results of the exploratory study; the traditional trajectory is based on frequentist methods (above the horizontal red line) and uses the NHST and statistical significance of the exploratory study results; in the confirmatory study, the p-value is used; the Bayesian trajectory (below the horizontal red line, a color version can be found in the electronic version of the article) instead employs BDCs in the exploratory study and confirmatory study. NHST: null hypothesis significance testing; BDCs: Bayesian decision criteria.

Figure 3.

Bayesian power analysis simulations for the different BDCs, the MET, and p-values in Welch’s two-sample t-test. These power analyses are a special case of the more general Bayesian sample size calculations in the hierarchical normal-normal model outlined in the Online Appendix and detailed in Pawel et al.²⁴ From a hybrid Bayesian-frequentist perspective, they are justified. From a fully Bayesian point of view, they ignore the uncertainty about the true effect size $δ$, because power is calculated under the assumption of a specific value $δ = 0.5, \dots, 1.0$. In the main simulations, power calculations in the Bayesian trajectories are, therefore, based on the formulas derived in the Online Appendix. BDCs: Bayesian decision criteria; MET: minimum effects test.

4.4. Simulation study design

The design of the simulation study is summarized in Figure 2. For the exploratory study, the true effect size $δ_{e x p}$ is simulated from $δ \sim C (0, \sqrt{2} / 3)$ , compare Figure 1. Data in both groups is simulated as

X_{i}^{e x p} \sim N (- δ_{e x p} / 2, 1) and Y_{i}^{e x p} \sim N (δ_{e x p} / 2, 1)

(9)

for $i = 1, \dots, n_{1}$ with $n_{1} := 10$. Thus, homoscedasticity is assumed for the exploratory data $y_{e x p} := (X_{e x p}, Y_{e x p})$, where $X_{e x p} := (X_{1}^{e x p}, \dots, X_{n_{1}}^{e x p})$ and $Y_{e x p} := (Y_{1}^{e x p}, \dots, Y_{n_{1}}^{e x p})$. Based on the exploratory study, either a frequentist (top) or Bayesian approach (bottom) is used for the analysis of the exploratory data. The frequentist pipeline either considers NHST or the SESOI + 95% CI approach to determine whether to run a confirmatory study or not. NHST either runs a standard Welch’s two-sample t-test or conducts the MET. Sample size calculations for $n_{2}$ are then based on whether Welch’s two-sample t-test or the MET is used in the confirmatory study.
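
The data-generating step of equation (9) can be sketched in a few lines. The sketch assumes the true effect is drawn from the $C (0, \sqrt{2} / 3)$ prior via the inverse-CDF method. The helper names `simulate_exploratory` and `cohens_d` are hypothetical, and the pooled-variance effect size is only one possible estimator of $δ$.

```python
import math
import random
from statistics import mean, stdev


def simulate_exploratory(rng, n1=10, prior_scale=math.sqrt(2) / 3):
    """Draw a true effect from the Cauchy prior and simulate both groups, equation (9)."""
    # Inverse-CDF draw from Cauchy(0, prior_scale).
    delta_exp = prior_scale * math.tan(math.pi * (rng.random() - 0.5))
    x = [rng.gauss(-delta_exp / 2, 1.0) for _ in range(n1)]
    y = [rng.gauss(delta_exp / 2, 1.0) for _ in range(n1)]
    return delta_exp, x, y


def cohens_d(x, y):
    """Standardized mean difference with pooled sd (equal group sizes assumed)."""
    pooled = math.sqrt((stdev(x) ** 2 + stdev(y) ** 2) / 2)
    return (mean(y) - mean(x)) / pooled


rng = random.Random(2024)
delta_exp, x, y = simulate_exploratory(rng)
print(delta_exp, cohens_d(x, y))
```

With only ten units per group, the estimate returned by `cohens_d` can differ noticeably from the true `delta_exp`, which is one motivation for the cautious sample size rules described earlier.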

The Bayesian trajectories either perform a hypothesis test based on one of the four BDCs outlined above, or check whether the 95% HPD includes values larger than the SESOI $δ_{0} = 0.5$. Depending on the results, the sample size calculations for $n_{2}$ are then performed based on the hierarchical normal-normal model and the formulas derived in the Online Appendix for each of the BDCs (except for interval BFs, see above). If the calculated replication sample size exceeds $100$ animals, it is replaced by $100$. Likewise, calculated sample sizes smaller than $5$ are replaced by $5$. An important difference to the frequentist sample size calculations is that all Bayesian sample size calculations for the replication studies are designed to achieve $90 %$ probability of replication success based on the predictive distribution of the replication study effect size. Details are provided in the Online Appendix.

The analysis of the replication study data $y_{c o n} := (X_{c o n}, Y_{c o n})$ with $X_{c o n} := (X_{1}^{c o n}, \dots, X_{n_{2}}^{c o n})$ and $Y_{c o n} := (Y_{1}^{c o n}, \dots, Y_{n_{2}}^{c o n})$ is then performed via one of the four BDCs. For the posterior probability, the threshold $P_{ϑ | Y} (H_{0}) < 0.1$ is used. For the interval BF, the threshold ${BF}_{01} (y_{e x p}) < 1 / 3$ is used which amounts to moderate evidence in favor of $H_{1} : δ \geq δ_{0}$ (analogously for ${BF}_{01} (y_{c o n})$ in the confirmatory study) according to the scale of Jeffreys.⁴⁶ The ROPE test is based on the 95% HPD, compare Kruschke,⁷ and the FBET uses the threshold ${Ev}_{{EI}_{r} (ν)} (H_{1}) > 0.9$ , see Kelter.⁴³

Importantly, in the confirmatory study, data in both groups is not simulated according to the same value $δ_{e x p}$ which was used in the preceding exploratory study. According to the results of Freuli, Held and Heyard,³⁴ questionable research practices (QRPs) can lead to a drastic shrinkage of effect sizes in replication studies, and we assume that this holds also for the effect sizes in Figure 2. Also, empirical evidence from the Reproducibility Project: Cancer Biology shows that median and mean shrinkage of effect sizes in animal studies are $85 %$ and $90 %$, respectively. Thus, we apply a random shrinkage factor which varies between $20 %$ and $80 %$ to the exploratory effect size. This accounts for the possibility of studies in which more or fewer QRPs are involved.
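
The shrinkage step can be illustrated as follows. The sketch assumes multiplicative shrinkage $δ_{c o n} = δ_{e x p} \cdot (1 - Z)$ with $Z$ uniform on $[0.2, 0.8]$, two groups with unit variance and means $\mp δ_{c o n} / 2$, and a hypothetical function name `simulate_confirmatory`.

```python
import random


def simulate_confirmatory(rng, delta_exp: float, n2: int):
    """Shrink the exploratory effect by a random factor and simulate both groups."""
    z = rng.uniform(0.2, 0.8)          # random shrinkage between 20% and 80%
    delta_con = delta_exp * (1.0 - z)  # shrunken true effect in the replication
    x = [rng.gauss(-delta_con / 2, 1.0) for _ in range(n2)]
    y = [rng.gauss(delta_con / 2, 1.0) for _ in range(n2)]
    return delta_con, x, y


rng = random.Random(11)
delta_con, x_con, y_con = simulate_confirmatory(rng, delta_exp=1.2, n2=20)
print(delta_con)
```

For a positive exploratory effect, the replication effect therefore always lies between 20% and 80% of its original size.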

Data in the confirmatory study is thus simulated according to

X_{i}^{c o n} \sim N (- δ_{c o n} / 2, 1) and Y_{i}^{c o n} \sim N (δ_{c o n} / 2, 1)

(10)

where $δ_{c o n} = δ_{e x p} \cdot (1 - Z)$ with $Z \sim U (0.2, 0.8)$, and $U (a, b)$ denotes the uniform distribution on $[a, b]$. The following primary endpoints were analyzed for each trajectory, compare Table 1: The positive predictive value (PPV)

PPV = 1 - FDP = \frac{t_{p}}{t_{p} + f_{p}}

(11)

which is the probability that a test which accepts $H_{1} : δ > δ_{0}$ corresponds to a true underlying effect of $| δ | > δ_{0}$.

Table 1.

Contingency table for $H_{0} : | δ | < δ_{0}$ and $H_{1} : δ > δ_{0}$ illustrating false-positive and false-negative results.

	$H_{1}$ accepted	$H_{0}$ accepted	$\sum$
$H_{1}$ true	True-positive ( $t_{p}$ )	False-negative ( $f_{n}$ )	$t_{p} + f_{n}$
$H_{0}$ true	False-positive ( $f_{p}$ )	True-negative ( $t_{n}$ )	$f_{p} + t_{n}$
$\sum$	$t_{p} + f_{p}$	$f_{n} + t_{n}$

The negative predictive value (NPV)

NPV = 1 - FOR = \frac{t_{n}}{f_{n} + t_{n}}

(12)

which is the complement of the false omission rate (FOR) defined below. The false discovery proportion (FDP)

FDP = 1 - PPV = \frac{f_{p}}{t_{p} + f_{p}}

(13)

which is the probability of a test accepting $H_{1} : δ > δ_{0}$ being a false-positive (i.e. $H_{0} : | δ | \leq δ_{0}$ is true), and the false omission rate (FOR)

FOR = 1 - NPV = \frac{f_{n}}{t_{n} + f_{n}}

(14)

which is the complement of the NPV. The quantity $t_{p}$ in Table 1 is formally defined as the probability $P (S_{M} | δ > δ_{0})$, where $S_{M}$ denotes the success region of method $M$. For example, for p-values $S_{M} := \{p < α\}$, while for posterior probabilities $S_{M} := \{P_{ϑ | Y} (δ > δ_{0}) > 0.9\}$. The quantities $f_{n}$, $f_{p}$, and $t_{n}$ are defined analogously. As the PPV and FDP are complementary, we solely report the FDP here. Likewise, as the NPV and FOR are complementary, we solely report the FOR. Thus, balancing the FDP and FOR is the challenge for all research pipelines.⁵⁶

Importantly, for the frequentist trajectory using Welch’s two-sample t-test, a significant test with $p < α$ for $δ < 0.5$ was counted as a false-positive, which is a deviation from the usual interpretation of a false-positive but is required to focus on the SESOI $δ_{0}$. Likewise, only tests with $δ > 0.5$ and $p \geq α$ are counted as false-negatives. Monte Carlo estimates of (11) to (14) were obtained based on $N = 25{,}000$ simulated research trajectories. For each setting (four frequentist and eight Bayesian research pipelines) analyses were performed on identical data sets and known true effect sizes.
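
How simulated trajectories translate into the entries of Table 1 and the endpoints (11) to (14) can be sketched as follows. The classification uses the known true effect and the SESOI, as described above. The function names and the toy trajectories are hypothetical and serve illustration only.

```python
from collections import Counter


def classify(true_delta: float, h1_accepted: bool, delta_0: float = 0.5) -> str:
    """Label a trajectory as tp, fp, fn or tn relative to the SESOI delta_0."""
    effect_present = true_delta > delta_0
    if h1_accepted:
        return "tp" if effect_present else "fp"
    return "fn" if effect_present else "tn"


def predictive_values(counts: Counter) -> dict:
    """PPV, FDP, NPV and FOR from the cell counts of the contingency table."""
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    npv = tn / (fn + tn) if fn + tn else float("nan")
    return {"PPV": ppv, "FDP": 1 - ppv, "NPV": npv, "FOR": 1 - npv}


# Toy trajectories: (true effect size, whether H1 was accepted in both studies).
trajectories = [(1.2, True), (0.3, True), (0.8, False),
                (0.1, False), (2.0, True), (0.0, False)]
counts = Counter(classify(d, accepted) for d, accepted in trajectories)
print(predictive_values(counts))
```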

5. Results

Figure 4 shows the distribution of effect sizes which yielded a successful exploratory and confirmatory study.

Figure 4.

Visualization of effect sizes $δ$ which lead to a success in the exploratory study and follow-up confirmatory study for the different trajectories.

From the $25{,}000$ simulated trajectories, $24{,}268$ remained after truncating the results to $| δ | \leq 10$. As outlined earlier, the meta-analysis of Bonapersona et al.¹¹ included only nine out of 2738 studies which yielded an effect size larger than 10, so evidence about such extreme effect sizes is limited. As the prior elicitation in Figure 1 excluded these studies because the reported effect sizes seem unrealistically large, the trajectories with $| δ | > 10$ were also excluded from the analysis.

5.1. Effect sizes which transition to the confirmatory study

Figure 4 shows that there is no substantial difference between double-testing and interval-based estimation plus testing approaches regarding which effect sizes lead to a successful trajectory. The results in the two top left panels for 95% CI + p-value show that, compared to the p-value + p-value-based trajectory shown in the panel to its right, more effect sizes lead to a success in both studies, but the distribution of successful effect sizes does not change substantially. The same phenomenon holds for the other approaches, too. A noticeable exception is the BF, where more effect sizes in the range of $δ \in [0, 2]$ lead to a successful trajectory when using 95% HPD + BF instead of double testing.

5.2. Performance metrics

Figure 5 shows the results for the FDP and the false omission rate (abbreviated FOM in Figure 5 and Table 2; denoted FOR above) for each research pipeline. Starting in the left plot, the two left results are the frequentist research pipelines, and the four right correspond to the BDCs. Blue triangles show the results for estimation and subsequent testing, while points show the results for double-testing.

Figure 5.

FDP (left) and FOM (right) for the different methods; the vertical black line separates frequentist from Bayesian trajectories, circles indicate the results for double-testing and triangles for estimation (95% CI or HPD) and testing.

First, all approaches lead to an FDP which is below 5%. However, there are substantial differences between the methods, and the double-testing trajectory which employs p-values yields the largest FDP of $\approx 0.042$. Shifting to estimation and subsequent testing reduces this rate to $\approx 0.027$. The corresponding FOM shown in the right plot in Figure 5 remains approximately the same, compare also Table 2. This shows that shifting to estimation and testing improves the reliability of the p-value-based research pipeline.

With the MET, the FDP vanishes entirely when a double-testing strategy is applied. Shifting to estimation and subsequent testing, the FDP increases to $\approx 0.001$, which is still much better than a purely p-value-based approach. Interestingly, however, after shifting to estimation and subsequent testing the FOM is comparable to the one of the p-value double-testing-based research pipeline, so shifting to estimation and a subsequent MET is highly recommended when a frequentist approach is taken. Also, this shows that slightly more truly existing effects lead to a successful trajectory when using estimation and subsequent testing.

For the Bayesian solutions, there are substantial differences between the methods. First, double-testing with the ROPE yields an FDP of zero, which is highly attractive. Figure 5 shows that the FBET follows with an FDP of $\approx 0.001$ irrespective of whether double-testing or estimation and subsequent testing is used. Posterior probability-based trajectories follow next with an FDP of $\approx 0.004$, and BFs yield the largest FDP. While for the FBET and posterior probabilities there is no noticeable difference between double-testing and estimation plus subsequent testing, the FDP of BFs is substantially improved when incorporating estimation based on the 95% HPD and SESOI in the exploratory study, compare Figure 5. On the other hand, BFs yield the smallest FOM of all Bayesian approaches.

The right plot in Figure 5 shows that the differences in the FOM of the methods are less pronounced, and all methods yield an FOM which is approximately bounded by $20 %$.

With respect to the FOM, the best approach is however the double-testing-based BF pipeline.

5.3. Why do BDCs yield a PPV close to one?

The reason why most BDCs yield a PPV close to 1 is the meta-analytic prior elicited in Figure 1. The sceptical Cauchy prior shrinks effect sizes towards zero, and thus when an exploratory study and confirmatory study yield an effect larger than the SESOI, the effect has to be large enough to overcome the shrinkage of the Cauchy prior towards zero. Thus, when a positive result is found, the underlying effect exists with near certainty and FDP $\approx 0$. An exception is the interval BF, where indifference between the null and alternative hypothesis yields a larger FDP. Still, shifting to estimation and testing substantially reduces the FDP in this case, compare Table 2. For posterior probability and the ROPE, it matters little whether double testing or estimation and subsequent testing are performed in terms of the resulting FDP. As a consequence, shifting towards estimation and subsequent testing is, in general, recommended.

An important consideration is that reducing the FDP also implies that fewer effect sizes of moderate magnitude transition to a successful trajectory, compare Figure 4. For example, shifting from BF + BF to 95% HPD + BF reduces the FDP substantially (compare Table 2), but at the same time the number of effects which are in $[0, 2]$ and which yield a success in exploratory and confirmatory studies is reduced. Figure 4 also shows that the distribution of effect sizes which lead to a success is more skewed for methods with a larger FDP. Methods which entirely eradicate the FDP, such as the MET or ROPE, only lead to a success if effect sizes are large (see the x-axes of the histograms of MET and ROPE in Figure 4). For example, the ROPE or MET pipelines only navigate effect sizes to a success which are approximately larger than $2$. In contrast, BDCs such as the BF, FBET, or posterior probability lead to a successful trajectory also when the effect size is in the interval $[0, 2]$. However, due to the shrinkage of the Cauchy prior (and the additional shrinkage by setting $μ_{θ} := 0$ in the hierarchical normal-normal model for the Bayesian sample size calculation, compare the Online Appendix) fewer effect sizes of moderate magnitude yield a success for Bayesian approaches, which is also reflected in the thinned out histograms of BDCs in the area $[0, 2]$, compare Figure 4.

5.4. Double-testing versus estimation and testing

The results in Figure 5 show that p-values and BFs yield much smaller FDPs when shifting to estimation and subsequent testing. All other approaches are influenced only slightly by this shift. Table 2 shows that the ROPE, the MET, and posterior probabilities even yield slightly increased FDPs when using estimation and testing compared to a double-testing strategy. As a consequence, it is recommended to use estimation and subsequent testing only when using p-values or interval BFs.

Table 2.
Simulation results for the PPV, NPV, FDP, and FOR of the different research pipelines and means and medians of the sample size $n_{2}$ for the confirmatory study.

	PPV	FDP	NPV	FOM	Mean $n_{2}$	Median $n_{2}$
Double-testing-based approaches
p-value	0.95768	0.04232	0.82323	0.17677	6.01	5.00
MET	1.0000	0.0000	0.81683	0.18317	32.21	32.21
${BF}_{01}$	0.97812	0.02188	0.83511	0.16489	21.86	8.00
$P_{ϑ | Y} (H_{1})$	0.99593	0.00407	0.79992	0.20008	8.60	6.00
ROPE	1.0000	0.0000	0.79514	0.20486	17.24	10.00
${Ev}_{E I_{r} (ν)} (H_{1})$	0.99886	0.00114	0.79636	0.20364	8.62	6.00
95% CI or 95% HPD + SESOI and testing
p-value	0.97302	0.02698	0.81979	0.18021	5.43	5.00
MET	0.99938	0.00062	0.82244	0.17756	34.52	32.21
${BF}_{01}$	0.99181	0.00819	0.8209	0.1791	8.78	6.00
$P_{ϑ | Y} (H_{1})$	0.99534	0.00466	0.80298	0.19702	8.78	6.00
ROPE	0.999	0.001	0.80053	0.19947	17.93	11.00
${Ev}_{E I_{r} (ν)} (H_{1})$	0.99894	0.00106	0.79871	0.20129	8.78	6.00

FDP: false-discovery proportion; FOM: false-omission rate; PPV: positive predictive value; NPV: negative predictive value; ROPE: region of practical equivalence; MET: minimum effects test; BF: Bayes factor.

Figure 6 also indicates that the required sample sizes for the replication study increase when using estimation and subsequent testing with the MET. The sample size is reduced only slightly for the p-value and is reduced substantially for interval BFs. The latter strengthens the argument to use estimation and testing solely for interval BFs and p-value-based research pipelines.

Figure 6.

Mean (top left) and median (top right) of sample size $n_{2}$ for the confirmatory study for different (Bayesian) decision criteria, ordered by decreasing sample size.

5.5. Mean and median sample sizes for the confirmatory study

The results in Figure 5 and Table 2 show that shifting to estimation and subsequent testing substantially reduces the FDP when using p-values. Also, when shifting to the MET, which incorporates the SESOI from the start, the FDP vanishes entirely. From a Bayesian perspective, the ROPE can reduce the FDP essentially to zero, and the FBET follows closely.

However, a decision which approach to use should incorporate further relevant quantities such as the required sample size $n_{2}$ for the confirmatory study.

The top left and right plots in Figure 6 show the mean and median sample sizes $n_{2}$ required for the confirmatory study.⁷ The horizontal blue line marks the boundary of $20$ experimental units per group, which is a reasonable threshold after which the ethical and economic costs to conduct a confirmatory follow-up study become substantial.^20,21

Results show that the MET requires the largest number of animals in a replication study. The top left plot shows that the number of animals also decreases when shifting from BF + BF to 95% HPD + BF. Thus, estimation and subsequent testing is highly recommended for BFs. The same phenomenon holds for the medians; compare the top right plot in Figure 6.

The differences for the ROPE are less pronounced when comparing double testing and estimation plus testing, and the ROPE requires nearly twice as many animals as posterior probabilities; compare also Table 2.

The differences for the FBET and p-values are also less pronounced when comparing double testing and estimation plus subsequent testing. An important result, however, is that 95% CI + p-values requires the smallest average number of animals. Still, Table 2 shows that testing based on posterior probabilities or on the FBET requires on average only $\approx 3$ more animals, which is the price of reducing the FDP to a much smaller value than with p-values.

The bottom plot in Figure 6 shows violin plots which visualize the distribution of replication study sample sizes $n_{2}$ for each research pipeline. A particularity is the resulting distribution for the MET, which essentially reduces to a single value. The reason is that the MET only leads to a replication study if the exploratory study effect size is already quite large; compare the x-axis scaling of the MET histogram in Figure 4. Unless the exploratory effect is quite large (e.g. $δ \geq 2$), no replication study is carried out with the MET. The resulting mean and median of $\approx 32$ animals thus stem from the power calculation under $δ = 1$, which safeguards against overoptimism and adjusts for uncertainty³³ (compare the section detailing the simulation study design). Once the 95% CI is used in the exploratory study, the panel to the left of the MET double-testing panel in the bottom plot of Figure 6 shows that smaller effect sizes now lead to a success, but as a consequence the required number of animals $n_{2}$ increases, which is also reflected in the top left plot in Figure 6.

6. Discussion

From a frequentist perspective, a shift towards statistical approaches which explicitly incorporate the SESOI yields a smaller number of false-positive results than a purely null hypothesis-based approach via Welch’s two-sample t-test. This is reflected in the improvement of the FDP for the p-value-based research pipeline when shifting to 95% CI + subsequent testing. Shifting to the MET, which explicitly aims at incorporating the SESOI from the start, yields an even smaller FDP and increases the FOM only marginally. Results thus indicate that from a frequentist point of view, estimation and testing is superior to double testing for p-values. When using the MET, double testing and estimation plus testing yield no substantial differences. However, a drawback of the MET is that the number of animals required for the replication study can be quite large, about 30 animals per group.

Shifting to Bayesian statistical decision criteria can also reduce the FDP substantially, in particular, when using the ROPE or FBET. However, the ROPE suffers from the same problem as the MET regarding the required number of animals for the replication study, although there only $\approx 18$ animals are required on average.

When balancing the FDP, FOM, and the number of required animals for the replication study, the FBET-based research pipeline yields the best performance. On average, $\approx 9$ animals are required for a replication study, the FDP is reduced to $0.001$ and the FOM is $\approx 0.20$. Unless the FOM must be kept substantially smaller (in this case, estimation with 95% HPD and subsequent interval BFs are more appropriate), the 95% HPD plus FBET pipeline is the recommended choice.

By comparison, among the frequentist research pipelines, the p-value-based pipeline requires on average only $\approx 6$ animals, but yields $\approx 2.7\%$ false positives, which is the largest false-positive rate of all estimation + testing-based research pipelines. The MET fares better with regard to this point, but yields the largest average number of required animals for the replication study. As a consequence, shifting to BDCs such as the FBET or posterior probability can improve the reliability of preclinical research. The ROPE or MET is an attractive option if only quite large effect sizes are of interest.

Thus, we conclude with the following general recommendations:

From a frequentist perspective, when using p-values, shifting to estimation and subsequent testing is strongly recommended. Shifting to the MET is only recommended if the required number of experimental units is relatively easy to acquire. If ethical or economic concerns require a smaller number of animals for a replication study, using 95% CI + subsequent testing with p-values is preferred over the MET.

From a Bayesian perspective, 95% HPD estimation and subsequent testing with the FBET yields the best tradeoff in terms of FDP, FOM, and the number of animals required for the replication study. Thus, unless there are specific reasons to use a different approach, this is the recommended default choice.

Both the MET and ROPE are questionable from a purely ethical perspective because they require the largest number of animals.

Furthermore, two points should be noted: First, while the FBET, posterior probability and the ROPE are influenced little by shifting to estimation plus testing, the FDP of the BF is reduced. Thus, estimation and subsequent testing is explicitly recommended when using interval BFs.

Secondly, the performance of the FBET may be due to the selected reference function $r$, which was chosen as the prior density in the simulations. The theory of the FBET implies that statistical evidence is therefore regularized with respect to this prior density. As a consequence, the resulting FDP may be close to that of the ROPE, while the FBET also allows moderate to medium effect sizes to funnel through to a successful trajectory. Regarding the FBET, more research is necessary. Concerning the last point, as the ROPE and posterior probabilities have been shown to be special cases of the FBET,⁴³ it is worth mentioning that for a Bayesian it might be appealing to adopt the FBET and use only different versions of it depending on the demands of the trial.

A further important result is that all Bayesian approaches yield a smaller FDP than the p-value-based research pipeline, no matter whether double testing or estimation and subsequent testing is used, compare Figure 5.

Limitations of the results include uncertainty in the elicited Cauchy $C (0, \sqrt{2} / 3)$ prior for $δ$ and the elicitation of the SESOI $δ_{0}$. Also, Bayesian sample size planning for the interval BFs was based on a posterior probability criterion, so interval BFs may reach similar sample sizes when a method-specific sample size calculation is applied. A further limitation is that the Bayesian sample size calculations aimed for a 90% probability of replication, which is quite subjective. An interesting aspect is, however, that despite the uncertainty about the to-be-observed effect size in the replication study, the sample sizes resulting from the Bayesian calculations turned out to be remarkably small, in particular compared to the fixed effect size-based sample size calculation of the frequentist research pipelines. In combination with the resulting FDPs and FOMs of the Bayesian research pipelines, this lends credibility to the hierarchical normal-normal model detailed by Pawel et al.²⁴ Regarding the comparability of frequentist and Bayesian sample size calculations, a further comment is necessary: While frequentist sample sizes were calculated for 50% power to reject the null hypothesis, the Bayesian sample size calculations are based on the idea of guaranteeing 90% replication success. Thus, these two sample size planning approaches are not directly comparable, but as noted in the “Methods” section, using larger power values for the frequentist methods increases replication sample sizes $n_{2}$ drastically. Also, asserting replication success further drives the necessary power values up, which then results in $\approx 90\%$ power in both the original and replication study (which amounts to approximately $70$ animals in both studies when using p-values, which is prohibitively large).
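How strongly the target power drives a frequentist sample size can be illustrated with a textbook normal approximation for a two-sided two-sample test, $n \approx 2 (z_{1 - α / 2} + z_{\text{power}})^{2} / δ^{2}$ per group. This is a simplified sketch and not the Welch-based calculation used in the simulations; the function name `n_per_group` and the effect sizes in the loop are hypothetical choices for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, power, alpha=0.05):
    """Approximate sample size per group (normal approximation, two-sided test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = z.inv_cdf(power)          # zero for power = 0.5
    return 2 * (z_alpha + z_power) ** 2 / delta ** 2


# Hypothetical standardized effect sizes, not values from the article
for delta in (0.5, 1.0):
    n50 = n_per_group(delta, power=0.5)
    n90 = n_per_group(delta, power=0.9)
    print(f"delta={delta}: n(50%)={ceil(n50)}, n(90%)={ceil(n90)}, "
          f"ratio={n90 / n50:.2f}")
```

Under this approximation, the ratio between the 90% and 50% power sample sizes does not depend on the effect size and is roughly 2.7, which is in line with the observation that larger power values increase $n_{2}$ drastically.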

Another limitation is that the thresholds for the BDCs were selected ad hoc, but simulation results show that these yield reasonable operating characteristics such as an FDP smaller than 5% and an FOM smaller than 25%. Also, we did not address the task of designing a research pipeline with prespecified boundaries on certain operating characteristics, which could be tackled in future research.
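The stated bounds can be checked against the reported operating characteristics with a few lines of code. The sketch below uses the FDP, FOM and mean $n_{2}$ values reported for estimation plus subsequent testing; the dictionary layout, the short row labels and the variable names are illustrative assumptions.

```python
# (FDP, FOM, mean n2) as reported for estimation plus subsequent testing
pipelines = {
    "p-value": (0.02698, 0.18021, 5.43),
    "MET": (0.00062, 0.17756, 34.52),
    "BF": (0.00819, 0.1791, 8.78),
    "posterior probability": (0.00466, 0.19702, 8.78),
    "ROPE": (0.001, 0.19947, 17.93),
    "FBET": (0.00106, 0.20129, 8.78),
}

FDP_MAX = 0.05  # bound on the false-discovery proportion
FOM_MAX = 0.25  # bound on the false-omission rate

# Pipelines that satisfy both bounds
admissible = {
    name for name, (fdp, fom, _) in pipelines.items()
    if fdp < FDP_MAX and fom < FOM_MAX
}

# List pipelines from smallest to largest FDP together with their cost in animals
for name, (fdp, fom, n2) in sorted(pipelines.items(), key=lambda kv: kv[1][0]):
    status = "ok" if name in admissible else "violates bounds"
    print(f"{name:22s} FDP={fdp:.5f} FOM={fom:.5f} mean n2={n2:5.2f} {status}")
```

All six pipelines satisfy both bounds, so the choice between them rests on the remaining tradeoff between FDP and the number of animals.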

6.1. Other reverse-Bayes methods

There are multiple avenues to extend the current results, for example, by explicitly considering replication measures such as reverse-Bayes methods.¹² The latter include, for example, the sceptical p-value,⁵⁷ replication BF,³⁸ and other measures.²⁴,³⁴,¹² However, these reverse-Bayes methods are even more specialized than the BDCs investigated in this article, and corresponding frequentist methods such as the dual-criterion design proposed by Rosenkranz²⁷ then would constitute reasonable competitors to judge replication success. Most of these approaches also require further modeling assumptions, and sample size calculations for replication studies are a topic of ongoing research.⁵⁸

6.2. Adaptive designs

An important aspect raised by one reviewer is the fact that the naive combination of two studies might fit a sample size planning approach such as the normal-normal model, but an adaptive design might be more appropriate in practice. Thus, shifting towards an adaptive design might be a more realistic option than to opt for an approach which conducts a confirmatory replication study in case the exploratory study was successful.

A further possibility for future research is thus to explicitly consider adaptive designs such as group-sequential trial designs, which allow stopping a trial early for futility or efficacy. The benefit of these approaches for preclinical animal research has recently been demonstrated by Neumann et al.;²¹ compare also Majid et al.²⁰

Regarding the choice of the SESOI $δ_{0}$ and the power of 50% for the confirmatory follow-up study, there are also two comments worth noting. First, the chosen SESOI in this article is quite small and researchers often are interested in much larger effect sizes. However, the meta-analysis of Bonapersona et al.¹¹ showed that in fact smaller effect sizes are more common in medical research. This is despite the wish of researchers to reveal large effects in an animal study.

Second, a power of 50% means the study is underpowered from a purely statistical point of view. However, the number of animals used in preclinical research is often small to moderate, and here the goal was to mirror current research practices, which are often not optimal.³⁴,²¹ Future research could shift to more conservative power calculations of 80% in the replication study, and investigate how results differ. This approach could be more in line with acknowledging the ethical constraints of animal trials.

6.3. Summary

In summary, using estimation with the 95% HPD and subsequent testing with the FBET obtains an appealing overall performance according to Table 2. The average total number of required animals for both the original and replication study is below $20$ when using the FBET or posterior probabilities, and the chance of a false-positive finding is reduced to $\approx 0.1\%$ for the FBET and $\approx 0.5\%$ for posterior probabilities. Both approaches yield a FOM of $\approx 20\%$.

Other Bayesian approaches such as posterior probabilities or the ROPE fare similarly but require either more animals or yield slightly larger FDPs. From a frequentist perspective, using estimation and subsequent testing is superior for p-value-based research pipelines. Shifting to the MET is feasible only when a large number of animals in the replication study is attainable and ethically defensible.

Thus, shifting to (Bayesian) SESOI-based approaches can reduce the false discovery rate of preclinical animal research while keeping the number of experimental units required for a confirmatory study at a moderate level. While more research is needed in this direction, the results provided in this article should motivate researchers to consider SESOI-based approaches, in particular BDCs, more explicitly in preclinical animal research.

Supplemental Material

Supplemental material, sj-pdf-1-smm-10.1177_09622802231184636 for Reducing the false discovery rate of preclinical animal research with Bayesian statistical decision criteria by Riko Kelter in Statistical Methods in Medical Research

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship and/or publication of this article.

ORCID iD

Riko Kelter

Supplemental material

Supplemental materials for this article are available online.

References

1. Kimmelman, Mogil, Dirnagl. Distinguishing between exploratory and confirmatory preclinical research will improve translation. PLoS Biol 2014; 12: e1001863.

2. Mogil, Macleod. No publication without confirmation. Nature 2017; 542: 409–411.

3. Drude, Gamboa, Danziger, et al. Science forum improving preclinical studies through replications. eLife 2021; 10: 1–10.

4. Anderson, Maxwell. Addressing the “Replication Crisis”: Using original studies to design replication studies with appropriate statistical power. Multivariate Behav Res 2017; 52: 305–324.

5. Wasserstein, Schirm, Lazar. Moving to a world beyond “p<0.05”. Am Stat 2019; 73: 1–19.

6. McShane, Gal, Gelman, et al. Abandon statistical significance. Am Stat 2019; 73: 235–245.

7. Kruschke. Rejecting or accepting parameter values in Bayesian estimation. Advances in Methods and Practices in Psychological Science 2018; 1: 270–280.

8. Chuang-Stein, Kirby, Hirsch, et al. The role of the minimum clinically important difference and its impact on designing a trial. Pharm Stat 2011; 10: 250–256.

9. Danziger, Collazo, Dirnagl, et al. Balancing sensitivity and specificity in preclinical research. bioRxiv 2022; 2022.01.17.476585. doi:10.1101/2022.01.17.476585. https://www.biorxiv.org/content/10.1101/2022.01.17.476585v3.

10. Matthews. Why should clinicians care about Bayesian methods? J Stat Plan Inference 2001; 94: 43–58.

11. Bonapersona, Hoijtink, Abbinck, et al. Increasing the statistical power of animal experiments with historical control data. Nature Neurosci 2021; 24: 470–477.

12. Held, Matthews, Ott, et al. Reverse-Bayes methods for evidence assessment and research synthesis. Res Synth Methods 2021; 13: 295–314. DOI: 10.1002/JRSM.1538.

13. Quatto, Ripamonti, Marasini. Beyond p <.05: A critical review of new Bayesian proposals for assessing the p-value. https://doi.org/10.1080/10543406.2021.2009497 2022; 32: 308–329.

14. Kelter. Analysis of Bayesian posterior significance and effect size indices for the two-sample t-test to support reproducible medical research. BMC Med Res Methodol 2020; 20: 88.

15. Kelter. Bayesian Hodges-Lehmann tests for statistical equivalence in the two-sample setting: Power analysis, type I error rates and equivalence boundary selection in biomedical research. BMC Med Res Methodol 2021; 21: 171.

16. Bailoo, Reichlin, Würbel. Refinement of experimental design and conduct in laboratory animal research. ILAR J 2014; 55: 383–391.

17. US Food and Drug Administration. Preclinical Research, 2018. https://www.fda.gov/patients/drug-development-process/step-2-preclinical-research.

18. Held, Schwab. Improving the reproducibility of science. Significance 2020; 17: 10–11.

19. Carneiro, Moulin, Macleod, et al. Effect size and statistical power in the rodent fear conditioning literature – a systematic review. PLoS ONE 2018; 13: e0196258.

20. Majid, Bae, Redgrave, et al. The potential of adaptive design in animal studies. Int J Mol Sci 2015; 16: 24048–24058.

21. Neumann, Grittner, Piper, et al. Increasing efficiency of preclinical research by group sequential designs. PLoS Biol 2017; 15: e2001307.

22. Matthews. Introduction to Randomized Controlled Clinical Trials. 2nd ed. Boca Raton, FL: CRC Press, 2006.

23. Errington, Mathur, Soderberg, et al. Investigating the replicability of preclinical cancer biology. eLife 2021; 10: e71601. DOI: 10.7554/ELIFE.71601.

24. Pawel, Consonni, Held. Bayesian approaches to designing replication studies. 2022; doi:10.48550/arxiv.2211.02552. https://arxiv.org/abs/2211.02552v1.

25. Lakens, Scheel, Isager. Equivalence testing for psychological research: A tutorial. Adv Methods Pract Psychol Sci 2018; 1: 259–269.

26. Shieh. Exact Power and Sample Size Calculations for the Two One-Sided Tests of Equivalence. 2016; doi:10.1371/journal.pone.0162093.

27. Rosenkranz. Replicability of studies following a dual-criterion design. Stat Med 2021; 40: 4068–4076.

28. Kelter. How to choose between different Bayesian posterior indices for hypothesis testing in practice. Multivariate Behav Res 2021; 1: 1–29.

29. Cohen. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Routledge, 1988.

30. Button, Ioannidis, Mokrysz, et al. Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience 2013; 14: 365–376.

31. Rouder, Speckman, Sun, et al. Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review 2009; 16: 225–237.

32. Hedges, Olkin. Statistical Methods for Meta-Analysis. San Diego, CA: Academic Press, 1985.

33. Anderson, Kelley. Sample size planning for replication studies: the devil is in the design. Psychol Methods 2022.

34. Freuli, Held, Heyard. Replication success under questionable research practices – a simulation study. MetaArXiv preprint 2023; doi:10.31222/OSF.IO/S4B65. https://osf.io/preprints/metaarxiv/s4b65/.

35. Makowski, Ben-Shachar, Chen SHA, et al. Indices of effect existence and significance in the Bayesian framework. Front Psychol 2019; 10: 2767.

36. Berger. Statistical Decision Theory and Bayesian Analysis. New York: Springer, 1985. ISBN 9781441930743.

37. Held. Bayesian Tail Probabilities for Decision Making. In Lesaffre E, Baio G and Boulanger B (eds.) Bayesian Methods in Pharmaceutical Research. Boca Raton, FL: CRC Press, 2020. pp. 53–73.

38. Etz, Marsman, et al. Replication Bayes factors from evidence updating. Behav Res Methods 2019; 51: 2498–2508.

39. Linde, Tendeiro, Selker, et al. Decisions About Equivalence: A Comparison of TOST, HDI-ROPE, and the Bayes Factor. psyarxiv preprint, https://psyarxiv.com/bh8vu 2020.

40. Morey, Rouder. Bayes factor approaches for testing interval null hypotheses. Psychol Methods 2011; 16: 406–419.

41. Van Ravenzwaaij, Monden, Tendeiro, et al. Bayes factors for superiority, non-inferiority, and equivalence designs. BMC Med Res Methodol 2019; 19: 1–12.

42. Kruschke, Vanpaemel. Bayesian estimation in hierarchical models. The Oxford Handbook of Computational and Mathematical Psychology 2015: 279–299.

43. Kelter. The evidence interval and the Bayesian evidence value – on a unified theory for Bayesian hypothesis testing and interval estimation. British Journal of Mathematical and Statistical Psychology 2022; 75: 550–592.

44. Pereira CAdB, Stern. The e-value: A fully Bayesian significance measure for precise statistical hypotheses and its research program. São Paulo Journal of Mathematical Sciences 2020; 16: 566–584. DOI: 10.1007/s40863-020-00171-7.

45. Good. Weight of evidence, corroboration, explanatory power, information and the utility of experiments. J R Stat Soc Ser B (Methodological) 1960; 22: 319–331.

46. Jeffreys. Theory of Probability. 3rd ed. Oxford: Oxford University Press, 1961. ISBN 0-19-850368-7.

47. Kelter. Bayesian and frequentist testing for differences between two groups with parametric and nonparametric two-sample tests. Wiley Interdisciplinary Reviews: Computational Statistics 2021; 13: e1523.

48. Kelter. bayest: An R package for effect-size targeted Bayesian two-sample t-tests. J Open Res Softw 2020; 8: 14.

49. Kelter. fbst: An R package for the full Bayesian significance test for testing a sharp null hypothesis against its alternative via the e value. Behav Res Methods 2021: 1–23.

50. Good. A derivation of the probabilistic explication of information. J R Stat Soc Ser B (Methodological) 1966; 28: 578–581.

51. Good. Corroboration, explanation, evolving probability, simplicity and a sharpened razor. British Journal for the Philosophy of Science 1968; 19: 123–143.

52. Berry. Bayesian Adaptive Methods for Clinical Trials. Boca Raton, FL: CRC Press, 2011. ISBN 9780429152429.

53. Howells, Sena, Macleod. Bringing rigour to translational medicine. Nature Reviews Neurology 2014; 10: 37–43.

54. Jan, Shieh. Optimal sample sizes for Welch’s test under various allocation and cost considerations. Behav Res Methods 2011; 43: 1014–1022.

55. Robert, Casella. Monte Carlo Statistical Methods. New York: Springer, 2004. ISBN 1441919392.

56. McElreath, Smaldino. Replication, communication, and the population dynamics of scientific discovery. PLoS ONE 2015; 10: 1–16.

57. Held. A new standard for the analysis and design of replication studies. Journal of the Royal Statistical Society Series A: Statistics in Society 2020; 183: 431–448.

58. Micheloud, Held. Power calculations for replication studies. Stat Sci 2022; 37: 369–379.

59. Labes, Schütz, Lang. CRAN – Package PowerTOST, 2022. https://cran.r-project.org/web/packages/PowerTOST/index.html.

60. Diletti, Hauschke, Steinijans. Sample size determination for bioequivalence assessment by means of confidence intervals. Int J Clin Pharmacol Ther Toxicol 1992; 30: S51–S58.

61. Efron. Jackknife-after-bootstrap standard errors and influence functions. J R Stat Soc Ser B (Methodological) 1992; 54: 83–127.

62. Koehler, Brown, Haneuse. On the assessment of Monte Carlo error in simulation-based statistical analyses. Am Stat 2009; 63: 155. DOI: 10.1198/TAST.2009.0030. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3337209/.

