
Abstract

Background

We aimed to determine the post-hoc power of randomized controlled trials (RCTs) in critical care, and describe the implications for long-term positive (PPV) and negative predictive value (NPV) of statistically significant and non-significant findings respectively in the research field.

Methods

We reviewed three cohorts of RCTs. “Adult-RCTs” were 216 multicenter RCTs with a mortality outcome from a published systematic review. “Pediatric-RCTs” were 120 RCTs with a mortality outcome, obtained by search of picutrials.net. “Consecutive-RCTs” were 90 recent RCTs obtained by screening publications in 6 journals. Post-hoc power for each study was calculated at α 0.05 and 0.005, for measures of small, medium, and large effect-size, using G*Power software. Long-run expected PPV and NPV of critical care research field findings were then calculated.

Results

With α 0.05, post-hoc power for small effect-size was very low in all RCT-cohorts (eg, median 24% in Adult-RCTs). For medium effect-size, post-hoc power was low, except for Adult-RCTs (eg, median 9% in Pediatric-RCTs). For large effect-size, post-hoc power for non-human-animal Consecutive-RCTs was low (median 32%). With α 0.005, post-hoc power was even lower. The corollary was that both PPV and NPV were poor for small effect-size, unless α 0.005 was used. Even with α 0.005, with realistic (vs. optimistic) prior probability of the alternative hypothesis, the PPV was low (eg, in Adult-RCTs 57.1% vs. 92.3%). Adding mild bias (0.1) reduced the PPV even further. For medium effect-size both PPV and NPV were better; nevertheless, with α 0.05 and realistic prior probability of the alternative hypothesis the PPV was poor, and with α 0.005 and mild bias (0.1) the PPV was very low (eg, Adult-RCTs median 44.1%).

Conclusions

To improve the predictive value of findings in the critical care research field, RCTs should be designed to have 80% power for realistic effect-size at α 0.005.

Keywords

critical care negative predictive value positive predictive value power randomized controlled trial

Study power is the long-run probability (Pr) that a null-hypothesis statistical test will correctly reject the null hypothesis (Ho) when the Ho is false (see Table 1 for definitions of terms used).¹ A low-powered study is one with a small sample size and/or small effects, and low power has several consequences. First, by definition, a reduced probability of finding a true effect when one exists (ie, lower negative predictive value, NPV; a statistically non-significant finding has a higher probability of having incorrectly rejected the alternative hypothesis H1, and thus a lower probability of having correctly accepted Ho).¹ Second, a reduced probability that an observed effect that reaches “statistical significance” reflects a true effect (ie, lower positive predictive value, PPV).^1–1 Third, exaggerated estimates of the magnitude of an effect (“winner’s curse” or “effect inflation”; only those small low-powered studies that, by chance, overestimate the magnitude of the effect will pass the “statistical significance” threshold).^1–1 Fourth, a higher incidence of “vibration of effects” (ie, different estimates of the magnitude of effect depending on the analytical options implemented), publication bias (ie, smaller low-powered studies with negative findings more easily disappear into the file drawer, unavailable to contribute to evidence synthesis using meta-analysis), “Proteus phenomenon” (ie, the first study obtains an extreme result, followed by replication studies finding smaller or no effect), and lower quality (ie, less funding and personnel to examine study conduct).^1–1

Table 1.
Glossary of Terms and Methods.

Alpha (α): The long-run probability that, across many studies, a null-hypothesis statistical test will incorrectly reject the null hypothesis (Ho) when the Ho is true.

Type I error rate (α): the conditional probability of obtaining a significant test result (S) across many studies given the Ho is true, Pr(S│Ho).

Alternative hypothesis (H1): The hypothesis that the null hypothesis is false.

Bayes Factor (BF): The extent to which, across many studies, the observation that p ≤ α changes the prior odds that H1 rather than Ho is true. Formally, across many studies, the probability of finding a significant test result (S) under H1 divided by the probability of finding a significant test result (S) under Ho, that is, Pr(S│H1)/Pr(S│Ho).

Bayes's Theorem: The theorem links the posterior probability of H1 given that a significant test result (S) is obtained (Pr[H1│S]) to the prior probability of H1 (Pr[H1]) using the Bayes Factor (BF). Formally, the Posterior Odds of H1 = BF × Prior Odds of H1, that is: Pr(H1│S)/Pr(Ho│S) = BF × Pr(H1)/Pr(Ho).

Bias: “Any kind of implicit or explicit technique, manipulation, or error which can result in the outcome that a certain proportion of results which would otherwise be reported as statistically non-significant will be reported as statistically significant”.⁵ Potential sources of bias include greater flexibility in designs, definitions, outcomes, and analytic modes in a study (eg, data dredging or p-hacking, multiple testing, post-hoc selection of variables, etc).

Effect Size (ES): The point estimate of the effect of an intervention.

-Relative Risk (RR): for a dichotomous outcome, the ratio of the two RCT arms’ measure of outcome frequency. Calculated as the frequency in intervention arm / frequency in control arm.

-Standardized mean difference (d): for a continuous outcome, the mean difference between the two RCT arms’ measure of outcome, contextualized using the amount of variation in scores (given by the standard deviation [SD] in the control group). Calculated as mean difference / SD.

-Small, Medium, and Large Effect Size: For dichotomous outcomes (eg, mortality), defined as RR of 0.81, 0.54, and 0.33 respectively. For continuous outcomes, defined as d of 0.2, 0.5, and 0.8 respectively.

Negative predictive value (NPV): Also called the true negative report probability. Across many studies, when a statistical test comes out negative (according to the chosen threshold alpha), the probability that you have a true negative (ie, that there is no real effect and the results have occurred by chance). Formally, across many studies, the conditional probability of Ho being true given a statistically non-significant test result (∼S), Pr(Ho│∼S). This depends on the design of studies in a research field, that is, on the chosen α, power, and (crucially) prior probability Pr(Ho).

False Negative Report Probability: 1 – NPV.

Null hypothesis (Ho): The hypothesis that there is no difference between groups.

p-value: The probability, under the assumption of no association (no effect; Ho), of obtaining a result equal to or more extreme than what was actually observed, Pr(data or more extreme data than that obtained│Ho).

Post-hoc power: Based on the RCT sample size and control group outcome frequency (for dichotomous outcomes), by specifying α (ie, 0.05 or 0.005) and the population effect size one desires to detect (ie, small, medium, or large), the power achieved by that RCT.

-G*Power: the software used to calculate post-hoc power (downloadable from: https://download.cnet.com/G-Power/3000-2054_4-10647044.html).

Positive predictive value (PPV): Also called the true positive report probability. Across many studies, when a statistical test comes out positive (according to the chosen threshold alpha), the probability that you have a true positive (ie, that there is a real effect and the results have not occurred by chance). Formally, across many studies, the conditional probability of H1 being true given a statistically significant test result (S), Pr(H1│S). This depends on the design of studies in a research field, that is, on the chosen α, power, and (crucially) prior probability Pr(Ho).

False Positive Report Probability: 1 – PPV.

Power (1-β): The long-run probability that, across many studies, a null-hypothesis statistical test will correctly reject the null hypothesis (Ho) when the Ho is false.

Type II error rate (β): the conditional probability of obtaining a non-significant test result (∼S) across many studies given the H1 is true, Pr(∼S│H1).

Prior probability of the null hypothesis (Pr[Ho]): Based on prior knowledge, evidence, and insight, what the probability of the null hypothesis is believed to be prior to the experiment. Optimistically, with equipoise, this would be 50%. Realistically, given empirical findings in many research fields, this is more likely to be closer to 90%.

Prior probability of the alternative hypothesis (Pr[H1]): Based on prior knowledge, evidence, and insight, what the probability of the alternative hypothesis is believed to be prior to the experiment. Optimistically, with equipoise, this would be 50%. Realistically, given empirical findings in many research fields, this is more likely closer to 10%.

Retrospective Power Analysis: “Observed power” calculated using the obtained effect size estimated from the RCT's sample data. This is determined solely by the observed p-value, and thus adds no further information to the p-value.^4,17-19 This was not calculated in our study.

Sensitivity power: The minimum effect size an RCT was able to detect with specified power of 80%, given the RCT's sample size and the desired α.

-G*Power: the software used to calculate sensitivity power (downloadable from: https://download.cnet.com/G-Power/3000-2054_4-10647044.html).

Power and α (the long-run probability that a null-hypothesis statistical test will incorrectly reject the Ho when the Ho is true) never compute the probability of a hypothesis.⁶ The obtained p-value is a random variable, and should not be confused with α.⁷ To compute the long-run probability of the alternative hypothesis given “statistical significance” (ie, PPV, Pr[H1│significant finding]), and of the null hypothesis given “statistical non-significance” (ie, NPV, Pr[Ho│non-significant finding]), across all studies run in a field, one must use Bayes's Theorem (ie, consider α, power, and the pre-study prior Pr[Ho]).^8,9
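As a worked illustration of this use of Bayes's Theorem (with no bias term), the two predictive values can be written directly in terms of α, power (1 − β), and the prior probabilities. The numbers in the example are illustrative assumptions chosen for arithmetic convenience (80% power, α 0.05, prior Pr[H1] of 10%); they are not results from the RCT cohorts.

```latex
\begin{align*}
\mathrm{PPV} &= \Pr(H_1 \mid S)
  = \frac{(1-\beta)\,\Pr(H_1)}{(1-\beta)\,\Pr(H_1) + \alpha\,\Pr(H_0)} \\[4pt]
\mathrm{NPV} &= \Pr(H_0 \mid {\sim}S)
  = \frac{(1-\alpha)\,\Pr(H_0)}{(1-\alpha)\,\Pr(H_0) + \beta\,\Pr(H_1)} \\[8pt]
% Illustrative values: 1-\beta = 0.80,\ \alpha = 0.05,\ \Pr(H_1) = 0.10,\ \Pr(H_0) = 0.90
\mathrm{PPV} &= \frac{0.80 \times 0.10}{0.80 \times 0.10 + 0.05 \times 0.90}
  = \frac{0.080}{0.125} = 0.64 \\[4pt]
\mathrm{NPV} &= \frac{0.95 \times 0.90}{0.95 \times 0.90 + 0.20 \times 0.10}
  = \frac{0.855}{0.875} \approx 0.977
\end{align*}
```

Even a conventionally well-powered study therefore yields a PPV of only 64% under these assumed inputs, because the prior Pr(Ho) weights the false positives heavily.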

We aimed to determine the power of RCTs in the critical care literature in order to describe the implications of that power for the PPV and NPV of findings in the field. In addition, we aimed to explore predictors of the power of RCTs that might suggest subgroups of RCTs that may require the most attention. Across three cohorts of representative critical care RCTs we found that the power for small and medium effects was low, producing surprisingly low PPV and NPV. In addition, non-human-animal (NHA) RCTs may be at particularly high risk of low power.

Materials and Methods

As only publicly available published data was recorded, this study did not require ethics board approval.

Included Randomized Trials

We examined three cohorts of RCTs in critical care, chosen to be relevant for clinicians and researchers of diverse backgrounds (eg, adult critical care, pediatric critical care, non-human-animal researchers), and to improve generalizability of findings. “Adult-RCTs” were the 216 multicenter RCTs that examined mortality as an outcome from a published systematic review.¹⁰ This was the largest list of systematically reviewed adult critical care RCTs available, which precluded a repeat exhaustive search of the literature for this study. “Pediatric-RCTs” were 120 RCTs that reported an obtained p-value for a mortality outcome; the list was developed by search of https://picutrials.net using the terms “mortality” or “multicenter”, followed by screening of the abstracts (and full text if necessary).¹¹ This already maintained database of PICU trials precluded a repeat exhaustive search of the literature for this study. “Consecutive-RCTs” were 90 recently published RCTs obtained by screening the title and abstract of all publications in 6 journals (NEJM, JAMA, Critical Care, Critical Care Medicine, Pediatric Critical Care Medicine, and Intensive Care Medicine) starting backwards from January 2019 until 15 publications were included from each journal. Eligibility was defined as: topic involves critically ill patients; and RCT comparing groups with respect to some interventional exposure to report an outcome effect size (ES) and p-value. We excluded studies if the primary outcome had a p-value ≥0.10 (because in a separate study we aimed to explore the reverse-Bayesian implications of obtained p-values, which are most relevant to studies obtaining lower p-values),¹² or the full text did not explicitly report an exact p-value. This cohort was intended to represent a contemporary recent cohort of critical care RCTs in high-impact journals that most critical care clinicians are likely to read. 
In addition, this cohort allowed us to examine predictors of post-hoc power in a cohort of studies that included all of adult, pediatric, and NHA RCTs.

Data Recorded

A study instruction manual was used, giving detailed explanations (with references) of all study variable definitions, calculations of missing RCT variables, calculations of post-hoc and sensitivity power, and calculations of long-run values of PPV and NPV in the critical care research field based on different chosen ES thresholds, α, prior odds of the Ho, and amount of study bias (Supplemental Material 1). A glossary of term definitions and methods used is shown in Table 1.

We obtained descriptive information from each RCT, including: category of primary outcome; study size; study outcomes with numbers, proportions, means, and standard deviations as appropriate; ES (for studies with obtained p-value ≤0.10, assuming studies with higher p-values had negligible ES) including as relative risk (RR) and standardized mean difference (d), with 95% CI; power calculation numbers if reported; and obtained p-value. In the Consecutive-RCTs we also recorded the main secondary outcome and associated information as above. When not reported, we calculated ES based on the values reported in the published study (Supplemental Material 1).

Outcomes

Power of RCTs

Power for each RCT was calculated based on sample size, and generally accepted (sample size independent and scale-free) measures for what are small, medium, and large ES.^13–13 For categorical outcomes (eg, mortality), small, medium, and large ES was defined as RR of 0.81, 0.54, and 0.33 respectively. For continuous outcomes, small, medium, and large ES was defined as d of 0.2, 0.5, and 0.8 respectively.

Post-hoc power was calculated, using G*Power, by entering the RCT sample size (assuming equal allocation to each group to maximize power), the desired two-sided α (0.05 or 0.005), and, for RR calculations, the obtained control group proportion and the expected proportion for the desired RR (Supplemental Material 1).^16,17 This was not a retrospective power analysis, which would have used the obtained ES to calculate the “observed power”, a value determined solely by the observed p-value and thus adding no further information to the p-value.^4,18–18 Sensitivity power was also calculated using G*Power; this determined the minimum ES a study was sufficiently sensitive to detect with power of 80% given the RCT sample size and desired two-sided α (0.05 or 0.005).^16,17
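The following is a minimal Python sketch of how post-hoc and sensitivity power of this kind can be approximated. It is not the G*Power procedure used in the study: it assumes a two-sided normal-approximation z-test with equal allocation, and the function names (`power_two_proportions`, `power_standardized_difference`, `min_detectable_rr`) are hypothetical. Results will differ somewhat from exact or t-based calculations, especially for very small groups.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def power_two_proportions(n_per_group, p_control, rr, alpha=0.05):
    """Approximate two-sided power to detect relative risk `rr`.

    Assumes 0 < p_control < 1, 0 <= rr <= 1, equal group sizes, and a
    normal approximation (an assumption of this sketch, not of the study).
    """
    p_treat = rr * p_control
    diff = abs(p_control - p_treat)
    p_bar = (p_control + p_treat) / 2
    # Standard error under Ho (pooled) and under H1 (unpooled).
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = sqrt((p_control * (1 - p_control)
                   + p_treat * (1 - p_treat)) / n_per_group)
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    upper = _STD_NORMAL.cdf((diff - z_crit * se_null) / se_alt)
    lower = _STD_NORMAL.cdf((-diff - z_crit * se_null) / se_alt)
    return upper + lower


def power_standardized_difference(n_per_group, d, alpha=0.05):
    """Approximate two-sided power to detect standardized mean difference d."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(d) * sqrt(n_per_group / 2)
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def min_detectable_rr(n_per_group, p_control, alpha=0.05, target=0.80):
    """Sensitivity analysis: RR closest to 1 that is detectable at `target` power.

    Returns None if even RR = 0 cannot be detected with the target power.
    Uses bisection, relying on power rising as RR moves away from 1.
    """
    if power_two_proportions(n_per_group, p_control, 0.0, alpha) < target:
        return None
    lo, hi = 0.0, 1.0  # power(lo) >= target, power(hi) = alpha < target
    for _ in range(60):
        mid = (lo + hi) / 2
        if power_two_proportions(n_per_group, p_control, mid, alpha) >= target:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    # Illustrative inputs: 201 patients per group, control mortality 31.5%.
    for rr in (0.81, 0.54, 0.33):  # small, medium, large ES as defined above
        print(rr, round(power_two_proportions(201, 0.315, rr), 3))
    print("minimum detectable RR:", min_detectable_rr(201, 0.315))
```

The inputs in the usage lines are single illustrative values; power computed at a cohort's median sample size need not equal the median of the per-study powers.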

Long-run Values for the Research Field

Based on median [interquartile range] post-hoc calculated power for small and medium ES, we calculated long-run values of interest for the critical care research field (Supplemental Material 1).^1,2,5,8,9 False positive report probability and true positive report probability (PPV), and false negative report probability and true negative report probability (NPV), were calculated as shown in Table 2. Values were calculated for α of 0.05 and 0.005, and for pre-study odds Ho:H1 that were optimistic (1:1) and more realistic (9:1).^1,2,5,8,9 Bayes Factor (BF), the extent to which the observation that p ≤ α changes the prior odds that H1 rather than Ho is true, was calculated as power/α.

Table 2.
Calculations of Predictive Values Based on Prior Odds of the Hypothesis, Power, Alpha, and Bias.^a

| Reality | Study p-value: statistically significant (positive H1) | Study p-value: statistically non-significant (negative H1) |
| --- | --- | --- |
| True H1 | TP = Power + (bias)(β) | FN = β – (bias)(β) = (1-bias)(β) |
| False H1 | FP = O(α + [bias][1-α]) | TN = O([1-α] – [bias][1-α]) = O(1-bias)(1-α) |

FN: false negative; FP: false positive; Ho: the null hypothesis; H1: the alternative hypothesis; O: the pre-study odds of Ho:H1 in a research field; TN: true negative; TP: true positive; α: Type I error (this is not equivalent to the p-value); β: Type II error (such that 1-β = Power).

^a
Calculations of outcomes are as follows: PPV = TP/(TP + FP); NPV = TN/(TN + FN). Note that NPV does not change with changes in bias [ie, the (1-bias) in numerator and denominator cancel out].

The effects of bias on PPV and BF were calculated as shown in Table 2. Bias was defined as “any kind of implicit or explicit technique, manipulation, or error which can result in the outcome that a certain proportion of results which would otherwise be reported as statistically non-significant will be reported as statistically significant”.⁵ Bias values of 0.1, 0.2, and 0.3 were used, as suggested by others.^2,5,8 The BF was calculated as (Power + bias[β])/(α + bias[1-α]).
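The Table 2 formulas can be sketched in a few lines of Python. This is an illustrative transcription only; the function name `predictive_values` and its argument names are hypothetical, and the study's own calculations are described in Supplemental Material 1. Using the Adult-RCT median post-hoc power for small ES (24% at α 0.05), the sketch gives a PPV of about 82.8% and NPV of about 55.6% at 1:1 odds, and a PPV of about 34.8% and NPV of about 91.8% at 9:1 odds, in line with the medians in Table 4.

```python
def predictive_values(power, alpha, odds_h0=1.0, bias=0.0):
    """Return (PPV, NPV, Bayes factor) from the Table 2 cell formulas.

    power, alpha: proportions, with alpha > 0.
    odds_h0: pre-study odds Ho:H1 (1.0 = optimistic 1:1, 9.0 = realistic 9:1).
    bias: proportion of otherwise non-significant results reported as
          significant; must be < 1 so that the NPV is defined.
    """
    beta = 1.0 - power
    tp = power + bias * beta                       # true positives
    fn = (1.0 - bias) * beta                       # false negatives
    fp = odds_h0 * (alpha + bias * (1.0 - alpha))  # false positives
    tn = odds_h0 * (1.0 - bias) * (1.0 - alpha)    # true negatives
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    # BF = (Power + bias*beta) / (alpha + bias*(1 - alpha)); independent of odds.
    bayes_factor = tp / (alpha + bias * (1.0 - alpha))
    return ppv, npv, bayes_factor


if __name__ == "__main__":
    # Median post-hoc power for small ES in Adult-RCTs at alpha 0.05 (Table 4).
    for odds in (1.0, 9.0):
        for bias in (0.0, 0.1, 0.2, 0.3):
            ppv, npv, bf = predictive_values(0.24, 0.05, odds, bias)
            print(f"odds {odds:.0f}:1 bias {bias:.1f} "
                  f"PPV {ppv:.3f} NPV {npv:.3f} BF {bf:.2f}")
```

Because the (1-bias) factor appears in both TN and FN, the NPV returned is the same for every bias value, as noted under Table 2.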

Statistics

Descriptive results are presented using counts and percentages, median, interquartile range [IQR], and range (minimum to maximum). We explored predictors of post-hoc power and sensitivity power at alpha 0.05 and 0.005 using univariate and (after excluding multicollinearity) multiple variable linear regressions for each RCT cohort. Although we initially planned to explore predictors of post-hoc power for small ES in all RCT-cohorts, because we found that post-hoc power for small ES was very low and with little variability for non-Adult-RCT cohorts, we decided to instead explore predictors of post-hoc power for medium ES in these RCT-cohorts. The three RCT cohorts were considered too different to allow pooling for this analysis; for example, Adult-RCTs did not include Pediatric or NHA studies, Pediatric-RCTs did not include Adult or NHA studies, and Consecutive-RCTs were not always based on mortality or binary outcomes. The possible predictors for univariate analyses were pre-specified: field of sepsis (the most common category for Adult-RCTs and Pediatric-RCTs), mortality as primary outcome, study year 2011 to 2019 (for Adult-RCTs and Pediatric-RCTs), multicenter study or number of centers (>20 for Adult-RCTs, >10 for Pediatric-RCTs), number of patients, mortality in control group (for Adult-RCTs and Pediatric-RCTs), p-value category, study RR, and higher mortality in intervention group (for Adult-RCTs with p ≤ 0.05). For Consecutive-RCTs we added study continent of Europe (47.8% of included studies), higher impact journal (NEJM or JAMA), species non-human-animal (NHA), power calculation reported, and d. In multiple regressions variables were included if p-value in the univariate regression was <0.10, while forcing “multicenter study” (for Pediatric-RCTs and Consecutive-RCTs), and NHA (for Consecutive-RCTs). We considered p ≤ 0.05 as statistically suggestive.

Results

Description of Included RCTs

The 216 Adult-RCTs, described in E-Table 1 (Supplemental Material 2), had a median control group mortality assumed in the power calculation of 40% [29, 50], a median obtained control group mortality of 31.5% [24.1, 41.7], and a median obtained intervention group mortality of 30.4% [21.1, 39.4]. For the 57 RCTs that obtained a p-value ≤0.10, the median obtained absolute risk difference (ARD) was 13.8% [8.6, 20.3]. The obtained ES was trivial in 10 (18%), small in 34 (60%), medium in 9 (16%), and large in 4 (7%).

We searched the 444 RCTs in the EPICC database; after exclusions, we included 120 Pediatric-RCTs described in E-Table 2 (Supplemental Material 2). For the 25 RCTs that obtained a p-value ≤0.10, the obtained median ARD was 19.1% [9.1, 30.3]. The obtained ES was trivial in 2 (8%), small in 7 (28%), medium in 3 (12%), and large in 13 (52%).

We screened 269 studies in 6 journals, and after exclusions (E-Table 3, Supplemental Material 2) included 90 Consecutive-RCTs described in E-Table 4 (Supplemental Material 2). Of these, 21 (23%) were RCTs in NHA, and 20 (22%) were in children or neonates. The obtained ES were trivial in 10 (11.1%, all Human-RCTs), small in 31 (34.4%, 29 [42%] of Human-RCTs and 2 [9.5%] of NHA-RCTs), medium in 14 (15.6%, all Human-RCTs), and large in 33 (36.7%, 16 [23.2%] of Human-RCTs and 17 [81%] of NHA-RCTs). The included Consecutive-RCTs are listed in Supplemental Material 3.

Power of RCTs

Post-hoc and sensitivity power of Adult-RCTs, Pediatric-RCTs, and Consecutive-RCTs (for primary and secondary outcomes) are shown in Table 3.

Table 3.
Post-hoc and sensitivity power of Adult-RCTs, Pediatric-RCTs, and Consecutive-RCTs using α of 0.05 and 0.005.

| Study result | Subgroup | Adult RCTs (n=216) | Pediatric RCTs (n=120) | Consecutive RCTs, primary outcome (n=90) | Consecutive RCTs, secondary outcome (n=90) |
| --- | --- | --- | --- | --- | --- |
| Study n/group | All studies | 201 [95, 453] (17–3467) | 40 [22, 80] (5–2474) | 58 [18, 195] (3–7942) | 60 [18, 194] (3–7942) |
| | Multicenter | | 57 [27, 148] | | |
| | NHA | | | 8 [6, 13] | 8 [6, 14] |
| | Human | | | 106 [42, 243] | 97 [41, 244] |
| Post-hoc power at α = 0.05 | | | | | |
| For Small ES | All studies | 24 [9, 52] (3–99) | 3 [1, 7] (1–65) | 12 [7, 29] (2–100) | 13 [7, 23] (2–100) |
| | Multicenter | | 5 [2, 11] | | |
| | NHA | | | 6 [6, 8] | 6 [6, 8] |
| | Human | | | 16 [10, 37] | 17 [10, 28] |
| For Medium ES | All studies | 92 [46, 99.9] (6–100) | 9 [2, 32] (1–100) | 59 [31, 97] (4–100) | 62 [25, 89] (6–100) |
| | Multicenter | | 18 [4, 55] | | |
| | NHA | | | 15 [12, 34] | 14 [12, 24] |
| | Human | | | 70 [46, 98] | 74 [45, 96] |
| For Large ES | All studies | 100 [86, 100] (13–100) | 20 [3, 68] (1–100) | 94 [64, 100] (7–100) | 95 [54, 99] (12–100) |
| | Multicenter | | 43 [8, 90] | | |
| | NHA | | | 32 [22, 68] | 28 [22, 50] |
| | Human | | | 97 [85, 100] | 99 [85, 100] |
| Sensitivity power at α = 0.05: ES detected at power of 80% | | | | | |
| For RR | All studies | 0.61 [0.37, 0.74] (0.01–0.88) | 0.01 [0.01, 0.26] (0.01–0.77) | 0.5 [0.32, 0.68] (0.01–0.99) (n=55) ^a | 0.54 [0.39, 0.64] (0.01–0.99) (n=47) ^a |
| | Multicenter | | 0.13 [0.01, 0.41] | | |
| | NHA | | | 0.33 [0.26, 0.56] | 0.27 ^b |
| | Human | | | 0.60 [0.37, 0.73] | 0.55 [0.41, 0.65] |
| For ARD | All studies | 11.3 [8.0, 17.6] (2.5–42.9) | 12.8 [6.8, 23.3] (1.0–68.5) | 16.8 [8.9, 29.9] (1.1–62.0) (n=55) ^a | 12.1 [7.9, 23.5] (1.0–58.1) (n=47) ^a |
| | Multicenter | | 8.9 [4.7, 17.0] | | |
| | NHA | | | 53.0 [38.8, 58.4] | 57.2 ^b |
| | Human | | | 15.0 [8.2, 21.0] | 12.0 [7.9, 22.8] |
| For d | All studies | - | - | 0.96 [0.57, 1.63] (0.20–3.07) (n=35) ^a | 0.91 [0.54, 1.63] (0.20–3.07) (n=43) ^a |
| | NHA | | | 1.80 [1.41, 2.02] | 1.63 [1.22, 1.97] |
| | Human | | | 0.62 [0.45, 0.81] | 0.57 [0.41, 0.84] |
| Post-hoc power at α = 0.005 | | | | | |
| For Small ES | All studies | 6 [1, 21] (1–94) | 1 [1, 1] (1–33) | 2 [1, 8] (1–100) | 2 [1, 6] (1–100) |
| | Multicenter | | 1 [1, 2] | | |
| | NHA | | | 1 [1, 1] | 1 [1, 1] |
| | Human | | | 3 [2, 12] | 4 [2, 8] |
| For Medium ES | All studies | 72 [17, 99] (1–100) | 1 [1, 10] (1–99.8) | 26 [8, 83] (1–100) | 29 [6, 65] (1–100) |
| | Multicenter | | 4 [1, 20] | | |
| | NHA | | | 3 [2, 9] | 2 [2, 6] |
| | Human | | | 36 [17, 91] | 43 [14, 81] |
| For Large ES | All studies | 99 [58, 100] (1–100) | 4 [1, 36] (1–100) | 74 [29, 99.9] (1–100) | 78 [22, 98] (1–100) |
| | Multicenter | | 14 [1, 63] | | |
| | NHA | | | 8 [4, 33] | 6 [4, 18] |
| | Human | | | 86 [55, 100] | 92 [53, 99.9] |
| Sensitivity power at α = 0.005: ES detected at power of 80% | | | | | |
| For RR | All studies | 0.51 [0.23, 0.66] (0.01–0.84) | 0.01 [0.01, 0.13] (0.01–0.71) | 0.39 [0.18, 0.59] (0.01–0.99) | 0.43 [0.24, 0.55] (0.01–0.99) |
| | Multicenter | | 0.01 [0.01, 0.25] | | |
| | NHA | | | 0.17 [0.11, 0.41] | 0.12 ^b |
| | Human | | | 0.47 [0.24, 0.65] | 0.43 [0.26, 0.56] |
| For ARD | All studies | 14.1 [10.1, 21.6] (3.2–52.0) | 14.4 [6.9, 25.9] (1.0–68.5) | 21.5 [10.6, 37.3] (1.4–78.0) | 15.8 [10.2, 28.3] (1.3–76.1) |
| | Multicenter | | 9.1 [4.8, 20.0] | | |
| | NHA | | | 64.2 [52.2, 72.5] | 69.3 ^b |
| | Human | | | 19.6 [10.4, 24.9] | 15.2 [10.2, 28.2] |
| For d | All studies | | | 1.29 [0.75, 2.34] (0.27–5.76) | 1.22 [0.71, 2.34] (0.27–5.76) |
| | NHA | | | 2.63 [3.08, 1.97] | 2.34 [1.66, 2.97] |
| | Human | | | 0.83 [0.59, 1.08] | 0.74 [0.54, 1.13] |

^a.
The “n=x” refers to the number of Consecutive-RCTs that had either a categorical outcome (for RR and ARD) or a continuous outcome (for d).

^b.
Only one of the 21 NHA-Consecutive-RCTs had a secondary outcome that was categorical, so no IQR is provided.

Given as Median [IQR] (range). ARD: absolute risk difference; d: standardized mean difference; ES: effect size; NHA: non-human animal; RCTs: randomized controlled trials; RR: relative risk. Of note, post-hoc power for small, medium, and large ES in the 25 Pediatric-RCTs that obtained a p-value ≤0.10 were as follows: for α 0.05 were 5 [3, 10], 26 [8, 54], and 61 [19, 91], and for α 0.005 were 1 [1, 2], 7 [2, 22], and 28 [4, 68]. This supports the claim that the very low post-hoc power of Pediatric-RCTs obtaining a p-value ≤0.10 may explain a “winner’s curse” of these studies obtaining a large ES result.

Post-hoc power for small ES was low in all RCT cohorts using α 0.05: 24 [9, 52], 3 [1, 7], 12 [7, 29], and 13 [7, 23] for Adult-RCTs, Pediatric-RCTs, and Consecutive-RCT primary and secondary outcomes respectively. Post-hoc power was better for medium ES at α 0.05, at 92 [46, 99.9] for Adult-RCTs, but still low for Pediatric-RCTs and Consecutive-RCTs at 9 [2, 32], 59 [31, 97], and 62 [25, 89] respectively. Post-hoc power was far lower at α 0.005, including for medium ES, where even for Adult-RCTs post-hoc power was 72 [17, 99]. Post-hoc power for even large ES was low at α 0.05 for Pediatric-RCTs and Consecutive-NHA-RCTs, at 20 [3, 68] and 32 [22, 68] respectively.

Sensitivity power (ie, the ES detectable at power 80%) was for a medium or greater ES in Adult-RCTs and Consecutive-Human-RCTs; even for the best cohort of studies, Adult-RCTs, it was a median RR 0.61 (corresponding to an ARD 11.3%) at α 0.05, and a median RR 0.51 (corresponding to an ARD 14.1%) at α 0.005. Sensitivity power was markedly worse for Pediatric-RCTs, at α 0.05 a median RR 0.01 and ARD 12.8%. Sensitivity power was also poor for Consecutive-NHA-RCTs, at α 0.05 a median RR 0.33, ARD 53%, and d 1.80.

We explored predictors of post-hoc power for small ES in Adult-RCTs, and for medium ES in Pediatric-RCTs and Consecutive-RCTs for primary outcome (E-Tables 5–8, Supplemental Material 2). On multiple variable linear regressions, consistent predictors of higher post-hoc power in Adult-RCTs and Pediatric-RCTs were higher number of centers, number of patients, and control group mortality. In Consecutive-RCTs, predictors were NHA-RCT (at α 0.05, coefficient −39.4 [95% CI −52.3, −26.5]; p < 0.001) and mortality as the primary outcome (although for Human-RCTs alone this was no longer predictive). As NHA-RCT was such a strong predictor of lower post-hoc power in the Consecutive-RCTs (E-Table 7, Supplemental Material 2), and these NHA-RCTs had many differences from Human-RCTs (E-Table 8, Supplemental Material 2), below we report long-run implications of post-hoc power separately for the subgroups Consecutive-Human-RCTs and Consecutive-NHA-RCTs.

Long-run Values for the Critical Care Research Field

Small ES: Values are given for Adult-RCTs in Table 4, and Pediatric-RCTs and Consecutive-RCTs primary outcomes in E-Tables 9 and 10 (Supplemental Material 2; we did not calculate these for secondary outcomes as post-hoc power was similar to that for primary outcomes). Adult-RCTs had the best values, at α 0.05 with PPV and NPV median 82.8% and 55.6% respectively. Pediatric-RCTs had the worst values, at α 0.05 with PPV and NPV median 37.5% and 49.5% respectively (somewhat better for multicenter studies, with PPV and NPV median 50%; in what follows we focus on these better values for multicenter Pediatric-RCTs). Consecutive-Human-RCTs had values similar to Adult-RCTs, while Consecutive-NHA-RCTs had lower values, at α 0.05 a median PPV 54.5% and NPV 50.3%. The PPV was higher with α 0.005 (reaching median 92.3% for Adult-RCTs and 66.7% for multicenter Pediatric-RCTs and NHA-Consecutive-RCTs), without much change in NPV.

Table 4.
Long-run Values for the Critical Care Research Field for Small Effects in 216 Multicenter Adult-RCTs with a Mortality Outcome.

| Values of interest | Optimistic pre-study odds of Ho 1:1; α = 0.05 | Optimistic pre-study odds of Ho 1:1; α = 0.005 | Realistic pre-study odds of Ho 9:1; α = 0.05 | Realistic pre-study odds of Ho 9:1; α = 0.005 |
| --- | --- | --- | --- | --- |
| Post-hoc power for small effects | 24 [9, 52] | 6 [1, 21] | 24 [9, 52] | 6 [1, 21] |
| False Positive Report Probability | 17.2 [35.7, 8.8]% | 7.7 [33.3, 2.3]% | 65.2 [83.3, 46.4]% | 42.9 [81.8, 17.6]% |
| True Positive Report Probability (PPV) | 82.8 [64.3, 91.2]% | 92.3 [66.7, 97.7]% | 34.8 [16.7, 53.6]% | 57.1 [18.2, 82.4]% |
| False Negative Report Probability | 44.4 [48.9, 33.6]% | 48.6 [49.9, 44.3]% | 8.2 [9.6, 5.3]% | 9.5 [10.0, 8.1]% |
| True Negative Report Probability (NPV) | 55.6 [51.1, 66.4]% | 51.4 [50.1, 55.7]% | 91.8 [90.4, 94.7]% | 90.5 [90.0, 91.9]% |
| Bayes Factor | 4.8 [1.8, 10.4] | 12 [2, 42] | 4.8 [1.8, 10.4] | 12 [2, 42] |
| With bias of 0.1 | | | | |
| PPV | 68.6 [55.5, 79.7]% | 59.6 [51.1, 73.4]% | 19.3 [12.1, 30.1]% | 13.9 [10.3, 23.3]% |
| Bayes Factor | 2.18 [1.25, 3.92] | 1.47 [1.04, 2.77] | 2.18 [1.25, 3.92] | 1.47 [1.04, 2.77] |
| With bias of 0.2 | | | | |
| PPV | 62.0 [53.1, 72.0]% | 55.0 [50.5, 64.3]% | 15.2 [11.1, 22.0]% | 11.8 [10.1, 16.6]% |
| Bayes Factor | 1.63 [1.13, 2.57] | 1.22 [1.02, 1.80] | 1.63 [1.13, 2.57] | 1.22 [1.02, 1.80] |
| With bias of 0.3 | | | | |
| PPV | 58.3 [52.0, 66.5]% | 53.0 [50.3, 59.6]% | 13.3 [10.6, 17.9]% | 11.0 [10.0, 13.9]% |
| Bayes Factor | 1.40 [1.08, 1.98] | 1.13 [1.01, 1.47] | 1.40 [1.08, 1.98] | 1.13 [1.01, 1.47] |

Given as Median [IQR]. NPV: negative predictive value; PPV: positive predictive value. Bayes Factor (BF) is the extent to which the observation that p ≤ α changes the prior odds that H1 rather than Ho is true. Bias is “any kind of implicit or explicit technique, manipulation, or error which can result in the outcome that a certain proportion of results which would otherwise be reported as statistically non-significant will be reported as statistically significant”.⁵

The PPV was markedly lower, and NPV much higher, when using a realistic prior Pr(H1). For Adult-RCTs at α 0.05 and 0.005, the PPV was median 34.8% and 57.1%, and the NPV was 91.8% and 90.5% respectively. For multicenter Pediatric-RCTs the respective values were, for PPV median 10% and 18.2%, and NPV 90% and 90%. Similarly, for Consecutive-Human-RCTs the PPV was median 26.2% and 40%. Adding mild bias (of 0.1) reduced the PPV and BF markedly; in Adult-RCTs the PPV at α 0.05 and 0.005 reduced to median 68.6% and 59.6%, and with a realistic prior Pr(H1) to 19.3% and 13.9% respectively.

Medium ES: Values are given in Tables 5–7 for respective RCT-cohorts (and E-Table 11 for Consecutive-RCT secondary outcomes, Supplemental Material 2). Overall, the PPV and NPV were higher than for small ES. Using realistic prior Pr(H1) for Adult-RCTs the PPV at α 0.05 and 0.005 were median 67.2% and 94.1% respectively. In multicenter Pediatric-RCTs these values were median 28.6% and 47.1%. In Consecutive-Human-RCTs these values were 60.9% and 88.9% for primary outcome and 62.2% and 90.5% for secondary outcomes. Again, PPV was higher with α 0.005 and lower when using a realistic prior Pr(H1), and NPV was higher with realistic Pr(H1), always >90%. Adding mild bias (0.1) reduced the PPV and BF markedly; in Adult-RCTs, with realistic Pr(H1), median PPV at α 0.05 and 0.005 were 41.3% and 44.1% respectively.

Table 5.
Long-run Values for the Critical Care Research Field for Medium Effects in 216 Multicenter Adult-RCTs with a Mortality Outcome.

| Values of interest | Optimistic pre-study odds of Ho 1:1; α = 0.05 | Optimistic pre-study odds of Ho 1:1; α = 0.005 | Realistic pre-study odds of Ho 9:1; α = 0.05 | Realistic pre-study odds of Ho 9:1; α = 0.005 |
| --- | --- | --- | --- | --- |
| Post-hoc power for medium effects | 92 [46, 99.9] | 72 [17, 99] | 92 [46, 99.9] | 72 [17, 99] |
| False Positive Report Probability | 5.2 [9.8, 4.8]% | 0.7 [2.9, 0.5]% | 32.8 [49.5, 31.1]% | 5.9 [20.9, 4.3]% |
| True Positive Report Probability (PPV) | 94.8 [90.2, 95.2]% | 99.3 [97.1, 99.5]% | 67.2 [50.5, 68.9]% | 94.1 [79.1, 95.7]% |
| False Negative Report Probability | 7.8 [36.2, 0.1]% | 22.0 [45.5, 1.0]% | 0.9 [5.9, 0.01]% | 3.0 [8.5, 0.1]% |
| True Negative Report Probability (NPV) | 92.2 [63.8, 99.9]% | 78.0 [54.5, 99.0]% | 99.1 [94.1, 99.9]% | 97.0 [91.5, 99.9]% |
| Bayes Factor | 18.4 [9.2, 20.0] | 144 [34, 198] | 18.4 [9.2, 20.0] | 144 [34, 198] |
| With bias of 0.1 | | | | |
| PPV | 86.5 [78.0, 87.3]% | 87.7 [70.8, 90.5]% | 41.3 [28.1, 43.1]% | 44.1 [21.0, 51.0]% |
| Bayes Factor | 6.4 [3.54, 6.89] | 7.16 [2.42, 9.48] | 6.4 [3.54, 6.89] | 7.16 [2.42, 9.48] |
| With bias of 0.2 | | | | |
| PPV | 79.6 [70.3, 80.6]% | 79.2 [62.2, 82.9]% | 30.0 [20.7, 31.4]% | 29.5 [15.3, 34.8]% |
| Bayes Factor | 3.9 [2.37, 4.16] | 3.80 [1.65, 4.86] | 3.9 [2.37, 4.16] | 3.80 [1.65, 4.86] |
| With bias of 0.3 | | | | |
| PPV | 73.8 [65.0, 74.9]% | 72.6 [58.0, 76.6]% | 23.7 [17.0, 24.7]% | 22.6 [13.2, 26.5]% |
| Bayes Factor | 2.82 [1.86, 2.98] | 2.65 [1.38, 3.27] | 2.82 [1.86, 2.98] | 2.65 [1.38, 3.27] |

Given as Median [IQR]. NPV: negative predictive value; PPV: positive predictive value. Bayes Factor is the extent to which the observation that p ≤ α changes the prior odds that H1 rather than Ho is true. Bias is “any kind of implicit or explicit technique, manipulation, or error which can result in the outcome that a certain proportion of results which would otherwise be reported as statistically non-significant will be reported as statistically significant”.⁵
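As a worked illustration of the Bayes Factor definition in the footnote, the following sketch uses the median Adult-RCT power for medium ES at α 0.05 (92%) and the cell formulas listed with the tables. Treating the cohort median as a single representative study is an assumption, so the final figure only approximates the tabulated median.

```latex
\[
\mathrm{BF} \;=\; \frac{\Pr(p \le \alpha \mid H_1)}{\Pr(p \le \alpha \mid H_0)}
          \;=\; \frac{1-\beta}{\alpha}
          \;=\; \frac{0.92}{0.05} \;=\; 18.4
\]
\[
\mathrm{BF}_{\text{bias}=0.1}
  \;=\; \frac{(1-\beta) + \text{bias}\cdot\beta}{\alpha + \text{bias}\cdot(1-\alpha)}
  \;=\; \frac{0.92 + 0.1(0.08)}{0.05 + 0.1(0.95)}
  \;=\; \frac{0.928}{0.145} \;=\; 6.4
\]
\[
\text{posterior odds of } H_1 \;=\; \mathrm{BF} \times \text{prior odds of } H_1
  \;=\; 6.4 \times \tfrac{1}{9} \approx 0.71,
\qquad
\mathrm{PPV} \;=\; \frac{0.71}{1 + 0.71} \approx 0.42
\]
```

Both Bayes Factors match the tabulated medians (18.4 and 6.4), and the resulting PPV of roughly 42% is close to the reported median of 41.3% under realistic prior odds.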

Table 6.
Long-run Values for the Pediatric Critical Care Research Field for medium Effects in Pediatric-RCTs with a Mortality Outcome.

| Values of interest | Optimistic odds of Ho 1:1, α = 0.05, all studies | Optimistic odds of Ho 1:1, α = 0.05, multicenter | Optimistic odds of Ho 1:1, α = 0.005, all studies | Optimistic odds of Ho 1:1, α = 0.005, multicenter | Realistic odds of Ho 9:1, α = 0.05, multicenter | Realistic odds of Ho 9:1, α = 0.005, multicenter |
| --- | --- | --- | --- | --- | --- | --- |
| Post-hoc power for medium effects | 9 [2, 32] | 18 [4, 55] | 1 [1, 10] | 4 [1, 20] | 18 [4, 55] | 4 [1, 20] |
| False Positive Report Probability | 35.7 [71.4, 13.5]% | 21.7 [55.6, 8.3]% | 33.3 [33.3, 4.8]% | 11.1 [33.3, 2.4]% | 71.4 [91.8, 45.0]% | 52.9 [81.8, 18.4]% |
| True Positive Report Probability (PPV) | 64.3 [28.6, 86.5]% | 78.3 [44.4, 91.7]% | 66.7 [66.7, 95.2]% | 88.9 [66.7, 97.6]% | 28.6 [8.2, 55.0]% | 47.1 [18.2, 81.6]% |
| False Negative Report Probability | 48.9 [50.8, 41.7]% | 46.3 [50.3, 32.1]% | 49.9 [49.9, 47.5]% | 49.1 [49.9, 44.6]% | 8.8 [10.1, 5.0]% | 9.7 [10.0, 8.2]% |
| True Negative Report Probability (NPV) | 51.1 [49.2, 58.3]% | 53.7 [49.7, 67.9]% | 50.1 [50.1, 52.5]% | 50.9 [50.1, 55.4]% | 91.2 [89.9, 95.0]% | 90.3 [90.0, 91.8]% |
| Bayes Factor | 1.8 [0.4, 6.4] | 3.6 [0.8, 11.0] | 2.0 [2.0, 20.0] | 8.0 [2.0, 40.0] | 3.6 [0.8, 11.0] | 8.0 [2.0, 40.0] |
| PPV with bias of 0.1 | 55.6 [44.8, 72.8]% | 64.4 [48.5, 80.4]% | 51.0 [51.0, 64.5]% | 56.5 [51.0, 72.8]% | 16.7 [9.5, 31.3]% | 12.6 [10.4, 22.9]% |
| Bayes Factor with bias of 0.1 | 1.25 [0.81, 2.68] | 1.81 [0.94, 4.10] | 1.04 [1.04, 1.82] | 1.30 [1.04, 2.68] | 1.81 [0.94, 4.10] | 1.30 [1.04, 2.68] |
| PPV with bias of 0.2 | 53.1 [47.4, 65.5]% | 58.8 [49.2, 72.8]% | 50.5 [50.5, 57.8]% | 53.3 [50.5, 63.8]% | 13.7 [9.7, 22.9]% | 11.2 [10.2, 16.4]% |
| Bayes Factor with bias of 0.2 | 1.13 [0.90, 1.90] | 1.43 [0.97, 2.67] | 1.02 [1.02, 1.37] | 1.14 [1.02, 1.76] | 1.43 [0.97, 2.67] | 1.14 [1.02, 1.76] |
| PPV with bias of 0.3 | 51.9 [48.5, 60.9]% | 55.9 [49.5, 67.1]% | 50.2 [50.2, 55.0]% | 51.9 [50.2, 59.2]% | 12.4 [9.8, 18.5]% | 10.7 [10.1, 13.9]% |
| Bayes Factor with bias of 0.3 | 1.08 [0.94, 1.56] | 1.27 [0.98, 2.04] | 1.01 [1.01, 1.22] | 1.08 [1.01, 1.45] | 1.27 [0.98, 2.04] | 1.08 [1.01, 1.45] |

Given as Median [IQR]. NPV: negative predictive value; PPV: positive predictive value. Bayes Factor is the extent to which the observation that p ≤ α changes the prior odds that H1 rather than Ho is true. Bias is “any kind of implicit or explicit technique, manipulation, or error which can result in the outcome that a certain proportion of results which would otherwise be reported as statistically non-significant will be reported as statistically significant”.⁵

Table 7.
Long-run Values for the Critical Care Research Field for medium Effects in 90 Recently Published Consecutive-RCTs with a Primary Outcome.

| Values of interest | Optimistic odds of Ho 1:1, α = 0.05, Human | Optimistic odds of Ho 1:1, α = 0.05, NHA | Optimistic odds of Ho 1:1, α = 0.005, Human | Optimistic odds of Ho 1:1, α = 0.005, NHA | Realistic odds of Ho 9:1, α = 0.05, Human | Realistic odds of Ho 9:1, α = 0.005, Human |
| --- | --- | --- | --- | --- | --- | --- |
| Post-hoc power for medium effects | 70 [46, 98] | 15 [12, 34] | 36 [17, 91] | 3 [2, 9] | 70 [46, 98] | 36 [17, 91] |
| False Positive Report Probability | 6.7 [9.8, 4.9]% | 25.0 [29.4, 12.8]% | 1.4 [2.9, 0.5]% | 14.3 [20.0, 5.3]% | 39.1 [49.5, 31.5]% | 11.1 [20.9, 4.7]% |
| True Positive Report Probability (PPV) | 93.3 [90.2, 95.1]% | 75.0 [70.6, 87.2]% | 98.6 [97.1, 99.5]% | 85.7 [80.0, 94.7]% | 60.9 [50.5, 68.5]% | 88.9 [79.1, 95.3]% |
| False Negative Report Probability | 24.0 [36.2, 2.1]% | 47.2 [48.1, 41.0]% | 39.1 [45.4, 8.3]% | 49.4 [49.6, 47.8]% | 3.4 [5.9, 0.2]% | 6.7 [8.5, 1.0]% |
| True Negative Report Probability (NPV) | 76.0 [63.8, 97.9]% | 52.8 [51.9, 59.0]% | 60.9 [54.5, 91.7]% | 50.6 [50.4, 52.2]% | 96.6 [94.1, 99.8]% | 93.3 [91.5, 99.0]% |
| Bayes Factor | 14.0 [9.2, 19.6] | 3.0 [2.4, 6.8] | 72.0 [34.0, 182.0] | 6.0 [4.0, 18.0] | 14.0 [9.2, 19.6] | 72.0 [34.0, 182.0] |
| PPV with bias of 0.1 | 83.4 [78.0, 87.1]% | 61.8 [58.9, 73.7]% | 80.2 [70.8, 89.8]% | 54.9 [53.0, 63.4]% | 35.6 [28.1, 42.7]% | 30.9 [21.0, 49.2]% |
| Bayes Factor with bias of 0.1 | 5.03 [3.54, 6.77] | 1.62 [1.43, 2.8] | 4.06 [2.42, 8.79] | 1.22 [1.13, 1.73] | 5.03 [3.54, 6.77] | 4.06 [2.42, 8.79] |
| PPV with bias of 0.2 | 76.0 [70.3, 80.4]% | 57.1 [55.2, 66.3]% | 70.5 [61.1, 82.0]% | 52.3 [51.4, 57.1]% | 25.8 [20.7, 31.1]% | 20.8 [15.3, 33.4]% |
| Bayes Factor with bias of 0.2 | 3.17 [2.37, 4.1] | 1.33 [1.23, 1.97] | 2.39 [1.65, 4.55] | 1.10 [1.06, 1.33] | 3.17 [2.37, 4.1] | 2.39 [1.65, 4.55] |
| PPV with bias of 0.3 | 70.2 [65.0, 74.6]% | 54.7 [53.4, 61.6]% | 64.5 [58.0, 75.5]% | 51.4 [50.9, 54.5]% | 20.6 [17.0, 24.5]% | 16.7 [13.2, 25.4]% |
| Bayes Factor with bias of 0.3 | 2.36 [1.86, 2.94] | 1.21 [1.15, 1.61] | 1.82 [1.38, 3.09] | 1.06 [1.03, 1.20] | 2.36 [1.86, 2.94] | 1.82 [1.38, 3.09] |

Given as Median [IQR]. NPV: negative predictive value; PPV: positive predictive value. Bayes Factor is the extent to which the observation that p ≤ α changes the prior odds that H1 rather than Ho is true. Bias is “any kind of implicit or explicit technique, manipulation, or error which can result in the outcome that a certain proportion of results which would otherwise be reported as statistically non-significant will be reported as statistically significant”.⁵

Discussion

We examined three cohorts of critical care RCTs in order to demonstrate the implications of low-powered studies. The Adult-RCTs (multicenter, with mortality outcome) and Consecutive-RCTs (published in relatively high-impact journals) may represent some of the best RCTs in the critical care research field. The Pediatric-RCTs (published in many different journals, and often not having mortality as the primary outcome) may be more representative of RCTs in the research field. Our main findings include the following.

First, RCTs often overestimated control group mortality, and most obtained high ARD and trivial or small ES. Exceptions were the Pediatric-RCTs and Consecutive-NHA-RCTs, which often obtained large ES (likely due to the Winner's Curse, as these RCTs had the lowest post-hoc power).^1,5,9 Second, with α 0.05, post-hoc power for small ES was low in all RCT-cohorts (eg, the highest was median 24% for Adult-RCTs), for medium ES was low except for Adult-RCTs (eg, median 9% in Pediatric-RCTs), and for large ES was low in Consecutive-NHA-RCTs (median 32%) and Pediatric-RCTs (median 20%). With α 0.005 the post-hoc power was even lower. An exploratory analysis suggests that low power was a general phenomenon in the field, as multivariate analysis did not find consistent predictors of post-hoc power.
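To give a feel for why small trials have such low post-hoc power, the sketch below approximates two-sided power for a standardized mean difference d between two equal groups. It is a normal approximation, not the G*Power calculation used in the study, and it assumes Cohen's conventional d = 0.5 as the "medium" ES; the function name is hypothetical.

```python
from statistics import NormalDist


def approx_power_two_groups(d, n_per_group, alpha=0.05):
    """Approximate two-sided power for a two-group comparison of means.

    Uses the normal approximation to the t test, so it slightly
    overstates power when groups are very small.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)       # critical value for two-sided alpha
    shift = d * (n_per_group / 2.0) ** 0.5      # expected test statistic under H1
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Eight subjects per group (the median for the NHA studies) and d = 0.5
for alpha in (0.05, 0.005):
    power = approx_power_two_groups(0.5, 8, alpha)
    print(f"alpha {alpha}: approximate power {100 * power:.0f}%")
```

Under these assumptions the approximation gives roughly 17% power at α 0.05 and under 5% at α 0.005, the same order as the medians reported for Consecutive-NHA-RCTs.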

Third, the corollary of low post-hoc power was that the PPV and NPV were poor for small ES. The PPV improved when α 0.005 was used; nevertheless, even with α 0.005, when using a realistic prior Pr(H1) the PPV was low, while the NPV improved. The PPV and NPV were better for medium ES; nevertheless, with α 0.05 and realistic prior Pr(H1) the PPV was poor, and with α 0.005 and little bias (0.1) the PPV was also low (eg, Adult-RCTs median 44.1%). Adult-RCTs and Consecutive-Human-RCTs most often found small ES; with realistic Pr(H1) and small ES, at α 0.05 and 0.005 these RCTs had PPV median 34.8% and 57.1%, and 26.2% and 40.0%, respectively. Pediatric-RCTs, with realistic Pr(H1) and medium ES, had PPV median 28.6% and 47.1%, respectively. Adding even small amounts of bias (0.1) markedly reduced these PPV values, without affecting NPV.

Others have reported similar findings in different research fields. Median power in neuroscience papers was 21% to detect medium to large ES, and in recent cognitive neuroscience and psychology literature 12% and 44% to detect small and medium ES, with no sign of increase in power over six decades.^1,2,21 The authors of those papers discussed implications of low power on study PPV (or its complement, the false positive risk).^1,2 In critical care RCTs, overestimation of control group mortality and “delta inflation” (defined as biased overestimates of predicted treatment ES during trial design, with delta-gaps averaging up to 8.7%) are common; this results in RCTs with sample sizes that are too small, and hence low power.^22–24 The authors of those papers did not report the exact power for different ES obtained in the RCTs, and only suggested that inadequate power may account for falsely negative RCTs (ie, low NPV).^22–24 In critical care RCTs a low fragility index, defined as the minimum number of reversals in outcome that need to occur for the result to no longer be statistically significant, is common; this reflects the relatively small sample sizes and unrealistic treatment ES.^25–28 The fragility index is based on the obtained p-value and can be said to simply be “repackaging of the p-value”;²⁹ it reflects the instability of obtained p-values between 0.05 and 0.005.

In contrast, our method considered the long-term reliability of studies done in a research field with a certain empirical power and, crucially, designed with a certain α level and a specified prior Pr(Ho). This reflects what to expect in the long run from studies done in a research field that uses that specified design. Importantly, after the data are in and an exact p-value is obtained, the credibility of that individual study requires a different analysis; we have reported these reverse-Bayesian implications for the same cohorts of RCTs elsewhere.^12,30

Modeling of incentive structures in science determined that to maximize scientists’ fitness they should “spend most of their effort seeking novel results and conduct small studies that have only 10-40% statistical power [such that at least] half of the studies they publish will report erroneous conclusions”.³¹ To our knowledge, the current study is the first to present detailed implications of low power for a research field, and particularly for the field of critical care. By determining the surprisingly low empirical post-hoc and sensitivity power of representative cohorts of RCTs in critical care, we detail many implications, including low NPV (of a non-statistically significant finding), low PPV (of a statistically significant finding), and how these vary with the chosen α, the ES to be detected, a defensible Pr(Ho), and even small amounts of bias. Based on these findings, we suggest that, to improve the credibility of findings in the critical care research field, RCTs be designed to have at least 80% power for realistic (likely small) ES at α 0.005. Importantly, this will guard against overly optimistic estimates of Pr(H1).
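The fragility index defined above can be sketched in a few lines. This is a minimal illustration, assuming the commonly described procedure of converting non-events to events in the group with fewer events and re-running a two-sided Fisher's exact test until p is no longer below α; the function names are hypothetical and the cited papers may differ in detail.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first row
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Number of outcome reversals needed to lose significance (0 if not significant)."""
    if events_a > events_b:  # always modify the group with fewer events
        events_a, n_a, events_b, n_b = events_b, n_b, events_a, n_a
    flips = 0
    while flips < n_a - events_a and fisher_exact_two_sided(
            events_a + flips, n_a - events_a - flips,
            events_b, n_b - events_b) < alpha:
        flips += 1
    return flips


# Hypothetical trial: 0/10 deaths versus 5/10 deaths
print(fragility_index(0, 10, 5, 10))  # 1
```

In this made-up example a single changed outcome moves the p-value from about 0.03 to about 0.14, which illustrates why the index tracks how close the obtained p-value sits to the threshold.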

The optimistic Pr(H1) we used was 50%, reflecting clinical equipoise that is considered to justify blinding and randomization.³² The realistic Pr(H1) we used was 10%, suggesting that only 10% of all interventions studied in critical care RCTs are found to be useful. This choice can be defended. First, 10% (and often lower) has been suggested by others as a realistic estimate of the proportion of interventions tested in a field that prove successful.^5,33 Second, systematic reviews of adult and pediatric critical care RCTs consistently find that <10% of tested interventions prove successful with wide implementation in practice.^10,11,24 Third, reviews of translation from NHA studies to human clinical practice consistently find that <10% (and often closer to 0%) of interventions succeed.^34,35 Fourth, even interventions thought useful in human critical care RCTs often turn out to have been false-positive findings, with few proven interventions that improve outcomes.^10,36–39

The suggestion that RCTs be designed to have at least 80% power for realistic ES at α 0.005 will require attention to realistic control group outcome rates (to avoid overestimation), realistic expected treatment ES (to avoid delta inflation), and design based on a more stringent α of 0.005.^22–24,33 To achieve this goal, the sample size of many RCTs (especially Pediatric-RCTs) would need to increase, often requiring larger multicenter studies; using α 0.005 instead of 0.05 while maintaining 80% power would require an increase in sample sizes of about 70%.³³ This may result in fewer and costlier RCTs being performed. We believe the many benefits of this approach will outweigh these costs. First, low-powered RCTs have both poor PPV and NPV (as shown here), yield inflated ES estimates, and are more prone to publication and other biases, as shown by others.^1–1 This makes results less credible individually and in meta-analyses, and may lead to premature abandonment of promising therapies, to premature adoption of false-positive or overly optimistic findings, and to falsely optimistic future studies based on misleading ES estimates. Second, low-powered RCTs expose research participants to risk without sufficient benefit of expanding knowledge and improving care of future patients. Third, low-powered RCTs may consume scarce research resources without resulting in sufficient scientific or clinical value.
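The figure of about 70% can be checked with a short calculation. The sketch below assumes a two-sided test whose required sample size scales with the square of the sum of the normal quantiles for α and power, the usual normal-approximation formula; the function name is hypothetical.

```python
from statistics import NormalDist


def sample_size_ratio(alpha_new, alpha_old=0.05, power=0.80):
    """Ratio of required sample sizes when changing alpha at fixed power and ES."""
    z = NormalDist()

    def scale(alpha):
        # Sample size is proportional to (z_{1-alpha/2} + z_{power})^2
        return (z.inv_cdf(1.0 - alpha / 2.0) + z.inv_cdf(power)) ** 2

    return scale(alpha_new) / scale(alpha_old)


ratio = sample_size_ratio(0.005)
print(f"Sample size increase: {100 * (ratio - 1):.0f}%")  # 70%
```

The same function shows that the penalty depends on the chosen power: at 90% power the approximation gives an increase of roughly 60% rather than 70%.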

This study has limitations. First, we did not consider all RCTs in critical care, limiting generalizability of findings. Using three cohorts of RCTs to represent the field somewhat mitigated this concern. Second, we did not assess bias in the included RCTs, and thus do not know if the bias factors considered were accurate. A bias factor of 0.1 is thought to be small, and when assessing an individual RCT determining potential biases can inform the relevance of this estimate.^5,33 Third, we only included RCTs, and only mortality outcomes for the Adult-RCTs and Pediatric-RCTs; our findings may not generalize to observational studies and non-mortality outcomes. Observational studies provide more opportunity for bias to influence results.^40–43 Non-mortality outcomes sometimes are more subjective, also providing opportunity for bias; non-mortality outcomes were included in the Consecutive-RCTs with similar results.^9,43 Fourth, mortality was not the primary outcome for 26.9% of Adult-RCTs, and 84.2% of Pediatric-RCTs, potentially reducing the calculated post-hoc power. Mortality as the primary outcome was not an independent predictor of post-hoc power for small ES in Adult-RCTs, nor of post-hoc power for medium ES in Pediatric-RCTs or Consecutive-Human-RCTs. Fifth, excluding Consecutive-RCTs obtaining a p-value ≥0.10 may have given non-generalizable results for this cohort of RCTs. Findings for Consecutive-Human-RCTs were similar to those for Adult-RCTs, and better than for Pediatric-RCTs, where we did not exclude studies based on p-value. In addition, p-value category was not a predictor of post-hoc power in Adult-RCTs or Pediatric-RCTs.

This study had several strengths. First, we examined three large cohorts of RCTs, representing some of the best critical care RCTs. Second, we used a detailed instruction manual to guide data recording and outcome calculation. Third, we calculated not only post-hoc and sensitivity power, but also determined the implications of post-hoc power for the PPV and NPV of findings in the critical care research field.

Conclusions

Post-hoc power was low, particularly for small ES, at α 0.005, in NHA-RCTs and Pediatric-RCTs, or when RCTs may have mild bias. This translated into low PPV and often low NPV of RCT findings. We suggest that, to improve the reliability of findings in the critical care research field, RCTs be designed to have as little bias as possible, and 80% power for realistic (likely small) ES at α 0.005. Otherwise, the field is likely to find that “most published research findings are false”.⁵

Supplemental Material

Supplemental material (files sj-docx-1-jic-10.1177_08850666221077203, sj-docx-2-jic-10.1177_08850666221077203, and sj-docx-3-jic-10.1177_08850666221077203) for Critical Care Randomized Trials Demonstrate Power Failure: A Low Positive Predictive Value of Findings in the Critical Care Research Field by Sarah Nostedt and Ari R Joffe in Journal of Intensive Care Medicine

	Study p-value
True H1	TP = Power + (bias)(β)	FN = β – (bias)(β) = (1-bias)(β)
False H1	FP = O(α + [bias][1-α])	TN = O ([1-α]-[bias][1-α]) = O (1-bias)(1-α)

Study result	Adult RCTs(n=216)	Pediatric RCTs(n=120)	Consecutive RCTsPrimary outcome(n=90)	Consecutive RCTsSecondary outcome(n=90)
Study n/group Multicenter NHA Human	201 [95, 453] (17-3467)	40 [22, 80] (5-2474) 57 [27, 148]	58 [18, 195] (3 – 7942) 8 [6, 13] 106 [42, 243]	60 [18, 194] (3 – 7942) 8 [6, 14] 97 [41, 244]
Post-hoc power at α = 0.05
For Small ES Multicenter NHA Human	24 [9, 52] (3-99)	3 [1, 7] (1-65) 5 [2, 11]	12 [7, 29] (2 – 100) 6 [6, 8] 16 [10, 37]	13 [7, 23] (2 – 100) 6 [6, 8] 17 [10, 28]
For Medium ES Multicenter NHA Human	92 [46, 99.9] (6-100)	9 [2, 32] (1-100) 18 [4, 55]	59 [31, 97] (4 – 100) 15 [12, 34] 70 [46, 98]	62 [25, 89] (6 – 100) 14 [12, 24] 74 [45, 96]
For Large ES Multicenter NHA Human	100 [86, 100] (13-100)	20 [3, 68] (1-100) 43 [8, 90]	94 [64, 100] (7 – 100) 32 [22, 68] 97 [85, 100]	95 [54, 99] (12 – 100) 28 [22, 50] 99 [85, 100]
Sensitivity power: ES detected at power of 80%
For RR Multicenter NHA Human	0.61 [0.37, 0.74] (0.01-0.88)	0.01 [0.01, 0.26] (0.01-0.77) 0.13 [0.01, 0.41]	0.5 [0.32, 0.68] (0.01 – 0.99) (n=55) ^a 0.33 [0.26, 0.56] 0.60 [0.37, 0.73]	0.54 [0.39, 0.64] (0.01 – 0.99) (n=47) ^a 0.27 ^b 0.55 [0.41, 0.65]
For ARD Multicenter NHA Human	11.3 [8.0, 17.6] (2.5-42.9)	12.8 [6.8, 23.3] (1.0-68.5) 8.9 [4.7, 17.0]	16.8 [8.9, 29.9] (1.1 – 62.0) (n=55) ^a 53.0 [38.8, 58.4] 15.0 [8.2, 21.0]	12.1 [7.9, 23.5] (1.0 – 58.1) (n=47) ^a 57.2 ^b 12.0 [7.9, 22.8]
For d NHA Human	-	-	0.96 [0.57, 1.63] (0.20, 3.07) (n=35) ^a 1.80 [1.41, 2.02] 0.62 [0.45, 0.81]	0.91 [0.54, 1.63] (0.20 – 3.07) (n=43) ^a 1.63 [1.22, 1.97] 0.57 [0.41, 0.84]
Post-hoc power at α = 0.005
For Small ES Multicenter NHA Human	6 [1, 21] (1-94)	1 [1, 1] (1-33) 1 [1, 2]	2 [1, 8] (1 – 100) 1 [1, 1] 3 [2, 12]	2 [1, 6] (1 – 100) 1 [1, 1] 4 [2, 8]
For Medium ES Multicenter NHA Human	72 [17, 99] (1-100)	1 [1, 10] (1-99.8) 4 [1, 20]	26 [8, 83] (1 – 100) 3 [2, 9] 36 [17, 91]	29 [6, 65] (1 – 100) 2 [2, 6] 43 [14, 81]
For Large ES Multicenter NHA Human	99 [58, 100] (1-100)	4 [1, 36] (1-100) 14 [1, 63]	74 [29, 99.9] (1 – 100) 8 [4, 33] 86 [55, 100]	78 [22, 98] (1 – 100) 6 [4, 18] 92 [53, 99.9]
Sensitivity power: ES detected at power of 80%
For RR Multicenter NHA Human	0.51 [0.23, 0.66] (0.01-0.84)	0.01 [0.01, 0.13] (0.01-0.71) 0.01 [0.01, 0.25]	0.39 [0.18, 0.59] (0.01 – 0.99) 0.17 [0.11, 0.41] 0.47 [0.24, 0.65]	0.43 [0.24, 0.55] (0.01 – 0.99) 0.12 ^b 0.43 [0.26, 0.56]
For ARD Multicenter NHA Human	14.1 [10.1, 21.6] (3.2-52.0)	14.4 [6.9, 25.9] (1.0-68.5) 9.1 [4.8, 20.0]	21.5 [10.6, 37.3] (1.4 – 78.0) 64.2 [52.2, 72.5] 19.6 [10.4, 24.9]	15.8 [10.2, 28.3] (1.3 – 76.1) 69.3 ^b 15.2 [10.2, 28.2]
For d NHA Human			1.29 [0.75, 2.34](0.27 – 5.76) 2.63 [3.08, 1.97] 0.83 [0.59, 1.08]	1.22 [0.71, 2.34](0.27 – 5.76) 2.34 [1.66, 2.97] 0.74 [0.54, 1.13]

Values of interest	Optimistic pre-study odds of Ho 1:1	Realistic pre-study odds of Ho 9:1
Post-hoc power for small effects	24 [9, 52]	6 [1, 21]	24 [9, 52]	6 [1, 21]
False Positive Report Probability	17.2 [35.7, 8.8]%	7.7 [33.3, 2.3]%	65.2 [83.3, 46.4]%	42.9 [81.8, 17.6]%
True Positive Report Probability (PPV)	82.8 [64.3, 91.2]%	92.3 [66.7, 97.7]%	34.8 [16.7, 53.6]%	57.1 [18.2, 82.4]%
False Negative Report Probability	44.4 [48.9, 33.6]%	48.6 [49.9, 44.3]%	8.2 [9.6, 5.3]%	9.5 [10.0, 8.1]%
True Negative Report Probability (NPV)	55.6 [51.1, 66.4]%	51.4 [50.1, 55.7]%	91.8 [90.4, 94.7]%	90.5 [90.0, 91.9]%
Bayes Factor	4.8 [1.8, 10.4]	12 [2, 42]	4.8 [1.8, 10.4]	12 [2, 42]
With Bias of 0.1
PPV	68.6 [55.5, 79.7]%	59.6 [51.1, 73.4]%	19.3 [12.1, 30.1]%	13.9 [10.3, 23.3]%
Bayes Factor	2.18 [1.25, 3.92]	1.47 [1.04, 2.77]	2.18 [1.25, 3.92]	1.47 [1.04, 2.77]
With bias of 0.2
PPV	62.0 [53.1, 72.0]%	55.0 [50.5, 64.3]%	15.2 [11.1, 22.0]%	11.8 [10.1, 16.6]%
Bayes Factor	1.63 [1.13, 2.57]	1.22 [1.02, 1.80]	1.63 [1.13, 2.57]	1.22 [1.02, 1.80]
With bias of 0.3
PPV	58.3 [52.0, 66.5]%	53.0 [50.3, 59.6]%	13.3 [10.6, 17.9]%	11.0 [10.0, 13.9]%
Bayes Factor	1.40 [1.08, 1.98]	1.13 [1.01, 1.47]	1.40 [1.08, 1.98]	1.13 [1.01, 1.47]

Footnotes

Funding

This work was supported by a University of Alberta, Department of Pediatrics Resident Research Grant awarded to Dr Sarah Nostedt. The funding agency had no role in design and conduct of the study; collection, analysis or interpretation of the data; preparation, writing, review, or approval of the manuscript; or the decision to submit the manuscript for publication.

Author's Contributions

SN and ARJ contributed to conception and design of the work; acquisition, analysis, and interpretation of the data; and substantial critical revisions of the manuscript for important intellectual content; have approved the submitted version; and have agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. ARJ wrote the first draft of the article.

Ethics Approval and Consent to Participate

Not applicable. This study used only data from published studies, and thus was exempt from requirements for ethics board approval.

Availability of Data and Material

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental material

Supplemental material for this article is available online.

References

Button

Ioannidis

JPA

Mokrysz

, et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci. 2013;14(5):365–376.

Szucs

Ioannidis

JPA

. Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLoS Biol. 2017;15(3):e2000797.

Colquhoun

. An investigation of the false discovery rate and the misinterpretation of p-values. R Soc Open Sci. 2014;1(3):140216.

Wagenmakers

Verhagen

, et al. A power fallacy. Behav Res. 2015;47(4):913–917.

Ioannidis

JPA

. Why most published research findings are false. PLoS Med. 2005;2(8):e124.

Greenland

Senn

Rothman

, et al. Statistical tests, P-values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31(4):337–350.

Hubbard

Bayarri

. Confusion over measures of evidence (p's) versus errors (α's) in classical statistical testing. Am Stat. 2003;57(3):171–178.

Szucs

Ioannidis

JPA

. When null hypothesis significance testing is unsuitable for research: a reassessment. Front Hum Neurosci. 2017;11:390.

Forstmeier

Wagenmakers

E-J

Parker

. Detecting and avoiding likely false-positive findings - a practical guide. Biol Rev. 2017;92(4):1941–1968.

10.

Santacruz

Pereira

Celis

Vincent

. Which multicenter randomized controlled trials in critical care medicine have shown reduced mortality? A systematic review. Crit Care Med. 2019;47(12):1680–1691.

11.

Duffett

Choong

Hartling

Menon

Thabane

Cook

. Randomized controlled trials in pediatric critical care: a scoping review. Crit Care. 2013;17(5):R256.

12.

Nostedt

Joffe

. Reverse Bayesian implications of p-values reported in critical care randomized trials. J Intensive Care Med. 2021. DOI: 10.1177/08850666211053793.

13.

Olivier

May

Bell

. Relative effect sizes for measures of risk. Communications in Statistics – Theory and Methods. 2017;46(14):6774–6781.

14.

Maher

Markey

Ebert-May

. The other half of the story: effect size analysis in quantitative research. CBE Life Sci Educ. 2013;12(3):345–351.

15.

Hojat

. A visitor's Guide to effect sizes: statistical significance versus practical (clinical) importance of research findings. Adv Health Sci Educ Theory Pract. 2004;9(3):241–249.

16.

Faul

Erdfelder

Lang

Buchner

. G*power 3: a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav Res Methods. 2007;39(2):175–191.

17.

Exact. Proportions – inequality of two independent groups (Fisher's exact-test), page 17; and t test: Means – difference between two independent means (two groups), page 49. In: G*Power 3.1 Manual. October 15, 2020. https://www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psychologie-und-arbeitspsychologie/gpower. Accessed 24 April 2021.

18.

Goodman

Berlin

. The use of predicted CIs when planning experiments and the misuse of power when interpreting results. Ann Intern Med. 1994;121(3):200–206.

19.

Hoenig

Heisey

. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Stat. 2001;55(1):19–24.

20.

Greenland

. Nonsignificance plus high power does not imply support for the null over the alternative. Ann Epidemiol. 2012;22(5):364–368.

21.

Smaldino

McElreath

. The natural selection of bad science. R Soc Open Sci. 2016;3(9):160384.

22.

Harhay

Wagner

Ratcliffe

, et al. Outcomes and statistical power in adult critical care randomized trials. Am J Resp Crit Care Med. 2014;189(12):1469–1478.

23.

Aberegg

Richards

O’Brien

. Delta inflation: a bias in the design of randomized controlled trials in critical care medicine. Crit Care. 2010;14(2):R77.

24.

Abrams

Montesi

Moore

SKL

, et al. Powering bias and clinically important treatment effects in randomized trials of critical illness. Crit Care Med. 2020;48(12):1710–1719.

25.

Ridgeon

Young

Bellomo

Muchetti

Lembo

Landoni

. The fragility index in multicenter randomized controlled critical care trials. Crit Care Med. 2016;44(7):1278–1284.

26.

Grolleau

Collins

Smarandache

, et al. The fragility and reliability of conclusions of anesthesia and critical care randomized trials with statistically significant findings: a systematic review. Crit Care Med. 2019;47(3):456–462.

27.

Vargas

Buonano

Marra

Iacovazzo

Servillo

. Fragility index in multicenter randomized controlled trials in critical care medicine that have shown reduced mortality. Crit Care Med. 2020;48(3):e250–e251.

28. Matics, Khan, Jani, Kane. The fragility of statistically significant findings in pediatric critical care randomized controlled trials. Pediatr Crit Care Med. 2019;20(6):e258–e262.

29. Carter, McKie, Storlie. The fragility index: a P-value in sheep's clothing? Eur Heart J. 2017;38(5):346–348.

30. Held, Matthews, Ott, Pawel. Reverse-Bayes methods for evidence assessment and research synthesis. arXiv.org. 2021. (preprint). arXiv:2102.13443v2. Available at: https://arxiv.org/abs/2102.13443 (accessed July 30, 2021).

31. Higginson, Munafo. Current incentives for scientists lead to underpowered studies with erroneous conclusions. PLoS Biol. 2016;14(11):e2000995.

32. Johnson, Lilford, Brazier. At what level of collective equipoise does a clinical trial become ethical? J Med Ethics. 1991;17(1):30–34.

33. Benjamin, Berger, Johannesson, et al. Redefine statistical significance. Nat Hum Behav. 2018;2(1):6–10.

34. Joffe, Bara, Anton, Nobis. Expectations for the methodology and translation of animal research: a survey of the general public, medical students and animal researchers in North America. Altern Lab Anim. 2016;44(4):361–381.

35. Pippin. Animal research in medical sciences: seeking a convergence of science, medicine, and animal law. S Tex L Rev. 2012;54:469.

36. Ranieri, Thompson, Barie, et al. for the PROWESS-SHOCK Study Group. Drotrecogin alfa (activated) in adults with septic shock. NEJM. 2012;366(22):2055–2064.

37. National Heart, Lung, and Blood Institute PETAL Clinical Trials Network; Moss, Huang, Brower, et al. Early neuromuscular blockade in the acute respiratory distress syndrome. NEJM. 2019;380(21):1997–2008.

38. Mouncey, Osborn, Power, et al. for the ProMISe Trial Investigators. Trial of early, goal-directed resuscitation for septic shock. NEJM. 2015;372(14):1301–1311.

39. NICE-SUGAR Study Investigators. Intensive versus conventional glucose control in critically ill patients. NEJM. 2009;360(13):1283–1297.

40. Munafo, Nosek, Bishop DVM, et al. A manifesto for reproducible science. Nat Hum Behav. 2017;1:0021.

41. Simmons, Nelson, Simonsohn. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psych Sci. 2011;22(11):1359–1366.

42. Szucs. A tutorial on hunting statistical significance by chasing n. Front Psychol. 2016;7:1444.

43. Ioannidis JPA. What have we (not) learnt from millions of scientific papers with P values? Am Stat. 2019;73(Suppl1):20–25.


	Study p-value
Reality	Statistically significant (positive H1)	Statistically non-significant (negative H1)
True H1	TP = Power + (bias)(β)	FN = β – (bias)(β) = (1-bias)(β)
False H1	FP = O(α + [bias][1-α])	TN = O([1-α] – [bias][1-α]) = O(1-bias)(1-α)
Here β = 1 – power and O is the pre-study odds of H0 relative to H1.
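
Reading the four cells above as relative frequencies, the predictive values follow as simple ratios. The expressions below are a sketch in that notation, assuming β = 1 − power and that O is the pre-study odds of H0 relative to H1. The symbol u for the bias term is introduced here for brevity. The Bayes factor expression is an assumption, chosen because it is consistent with the tabulated medians (for example 0.24/0.05 = 4.8); it is not a definition quoted from the article.

```latex
\begin{align}
\mathrm{PPV} &= \frac{TP}{TP + FP}
  = \frac{(1-\beta) + u\beta}{(1-\beta) + u\beta + O\,[\alpha + u(1-\alpha)]} \\
\mathrm{NPV} &= \frac{TN}{TN + FN}
  = \frac{O\,(1-u)(1-\alpha)}{O\,(1-u)(1-\alpha) + (1-u)\beta}
  = \frac{O\,(1-\alpha)}{O\,(1-\alpha) + \beta} \\
\mathrm{BF} &= \frac{(1-\beta) + u\beta}{\alpha + u(1-\alpha)}
\end{align}
```

The factor (1 − u) cancels in the NPV, so under these cell definitions the NPV does not depend on the bias term. This is consistent with the bias rows of the table reporting only PPV and Bayes factor.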

Values of interest	Optimistic pre-study odds of H0, 1:1		Realistic pre-study odds of H0, 9:1	
	α = 0.05	α = 0.005	α = 0.05	α = 0.005
Post-hoc power for small effects	24 [9, 52]%	6 [1, 21]%	24 [9, 52]%	6 [1, 21]%
False Positive Report Probability	17.2 [35.7, 8.8]%	7.7 [33.3, 2.3]%	65.2 [83.3, 46.4]%	42.9 [81.8, 17.6]%
True Positive Report Probability (PPV)	82.8 [64.3, 91.2]%	92.3 [66.7, 97.7]%	34.8 [16.7, 53.6]%	57.1 [18.2, 82.4]%
False Negative Report Probability	44.4 [48.9, 33.6]%	48.6 [49.9, 44.3]%	8.2 [9.6, 5.3]%	9.5 [10.0, 8.1]%
True Negative Report Probability (NPV)	55.6 [51.1, 66.4]%	51.4 [50.1, 55.7]%	91.8 [90.4, 94.7]%	90.5 [90.0, 91.9]%
Bayes Factor	4.8 [1.8, 10.4]	12 [2, 42]	4.8 [1.8, 10.4]	12 [2, 42]
With bias of 0.1
PPV	68.6 [55.5, 79.7]%	59.6 [51.1, 73.4]%	19.3 [12.1, 30.1]%	13.9 [10.3, 23.3]%
Bayes Factor	2.18 [1.25, 3.92]	1.47 [1.04, 2.77]	2.18 [1.25, 3.92]	1.47 [1.04, 2.77]
With bias of 0.2
PPV	62.0 [53.1, 72.0]%	55.0 [50.5, 64.3]%	15.2 [11.1, 22.0]%	11.8 [10.1, 16.6]%
Bayes Factor	1.63 [1.13, 2.57]	1.22 [1.02, 1.80]	1.63 [1.13, 2.57]	1.22 [1.02, 1.80]
With bias of 0.3
PPV	58.3 [52.0, 66.5]%	53.0 [50.3, 59.6]%	13.3 [10.6, 17.9]%	11.0 [10.0, 13.9]%
Bayes Factor	1.40 [1.08, 1.98]	1.13 [1.01, 1.47]	1.40 [1.08, 1.98]	1.13 [1.01, 1.47]
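
The tabulated medians can be approximately recomputed from the 2×2 cell formulas (TP, FN, FP, TN). The Python sketch below is illustrative only and is not the authors' code. The function name `predictive_values` and its parameter names are hypothetical, `odds_h0` is assumed to be the pre-study odds of H0 relative to H1 (1 for 1:1, 9 for 9:1), and the Bayes factor is assumed to be the ratio of the true-positive rate to the false-positive rate before weighting by the odds.

```python
def predictive_values(power, alpha, odds_h0, bias=0.0):
    """Return (PPV, NPV, Bayes factor) from the 2x2 cell formulas.

    power, alpha and bias are proportions in [0, 1]; odds_h0 is the
    assumed pre-study odds of H0 relative to H1.
    """
    beta = 1.0 - power
    # Cells of the 2x2 table, expressed per unit of true-H1 studies.
    tp = power + bias * beta
    fn = (1.0 - bias) * beta
    fp = odds_h0 * (alpha + bias * (1.0 - alpha))
    tn = odds_h0 * (1.0 - bias) * (1.0 - alpha)

    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    # Assumed definition: evidence ratio of a significant result, odds removed.
    bayes_factor = tp / (alpha + bias * (1.0 - alpha))
    return ppv, npv, bayes_factor


if __name__ == "__main__":
    # Median post-hoc power for small effects taken from the table:
    # 24% at alpha 0.05 and 6% at alpha 0.005.
    scenarios = [(0.24, 0.05), (0.06, 0.005)]
    for odds_h0 in (1, 9):
        for power, alpha in scenarios:
            for bias in (0.0, 0.1, 0.2, 0.3):
                ppv, npv, bf = predictive_values(power, alpha, odds_h0, bias)
                print(
                    f"odds {odds_h0}:1 alpha {alpha} bias {bias}: "
                    f"PPV {100 * ppv:.1f}% NPV {100 * npv:.1f}% BF {bf:.2f}"
                )
```

With no bias, plugging in the median power reproduces the tabulated PPV, NPV and Bayes factor (for example 82.8%, 55.6% and 4.8 at α 0.05 with 1:1 odds). The bias rows agree only to within a few tenths of a percentage point, possibly because the published medians were computed from unrounded per-study power.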