Abstract

In this article, I introduce two commands for computing the fragility index (FI): fragility, which is used for individual randomized controlled trials, and metafrag, which is used for meta-analyses. The FI for individual studies is defined as the minimum number of patients whose status would have to change from a nonevent to an event to nullify a statistically significant result. Correspondingly, the FI for meta-analyses is defined as the minimum number of patients from one or more trials included in the meta-analysis for which a modification of the event status (that is, changing events to nonevents or nonevents to events) would change the statistical significance of the pooled treatment effect to nonsignificant. Whether for an individual study or for a meta-analysis, a low FI indicates a more “fragile” study result, and a larger FI indicates a more robust result.

Keywords

st0664, fragility, metafrag, fragility index, meta-analysis, randomized controlled trials, research methodology, statistical significance

1 Introduction

When considering the results of a randomized controlled trial (RCT), scientists and those who rely on scientific evidence often conclude that a treatment is effective solely based on a p-value threshold (that is, p < 0.05). However, the use of a p-value threshold to declare statistical significance has been widely criticized for being overly simplistic, frequently misunderstood, and inappropriately interpreted (see, for example, Amrhein, Greenland, and McShane [2019]; Colquhoun [2017]; Feinstein [1998]; Ioannidis [2005, 2018]; Sterne and Davey Smith [2001]; Wasserstein and Lazar [2016]).

As an upshot of this discourse, several supplementary measures to the p-value have been proposed to provide more focus on the robustness of statistically significant results from RCTs. Among these are Bayesian analyses (Quatto, Ripamonti, and Marasini 2020); the type S (“sign”) error risk and exaggeration ratio (Gelman and Tuerlinckx 2000; Gelman and Carlin 2014); S-values (Greenland 2019); second-generation p-values (Blume et al. 2019); and the fragility index (FI) (Walsh et al. 2014).

In this article, I introduce two commands for computing the FI: the fragility command, which is used for individual RCTs with a binary outcome (Walsh et al. 2014), and the metafrag command for meta-analysis with a binary outcome (Atal et al. 2019). For single studies, the FI is defined as the minimum number of patients whose status would have to change from a nonevent to an event to nullify a statistically significant result. A smaller FI indicates that the statistical significance is contingent on only a small number of events, whereas a larger FI indicates a more robust result. The FI for meta-analysis is defined as the minimum number of patients from one or more trials included in the meta-analysis for which a modification of the event status (that is, changing events to nonevents or nonevents to events) would change the statistical significance of the pooled treatment effect to nonsignificant (Atal et al. 2019). As such, an FI of zero indicates that no modification of the event status is necessary to elicit a statistically nonsignificant pooled treatment effect. Conversely, a large FI score indicates that many modifications to the event status are required to change a statistically significant pooled effect to nonsignificant (and thus, the results may be considered more robust).

2 Methods

2.1 Computing the FI for individual RCTs

The FI represents the absolute number of additional events (primary endpoints) required to obtain a p-value greater than or equal to a predetermined statistical significance threshold (typically set to 0.05). The FI for individual RCTs is computed by adding an event to the study group with the smaller number of events (and subtracting a nonevent from the same group to keep the total number of patients within that group constant) and recomputing the two-sided significance. Events are iteratively added until the first time the computed p-value becomes statistically nonsignificant (Walsh et al. 2014).
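The iterative procedure above can be sketched in a few lines of Python. This is a minimal illustration and not the code behind the fragility command: the function names are hypothetical, the two-sided Fisher exact p-value is computed by summing all tables that are no more probable than the observed one, and a tie in the number of events is resolved in favor of group 1, which is an assumption of this sketch.

```python
from math import comb

def fisher_two_sided_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    # Unnormalized hypergeometric weights, kept as exact integers.
    weights = [comb(row1, k) * comb(row2, col1 - k) for k in range(k_min, k_max + 1)]
    observed = weights[a - k_min]
    # Sum every table that is no more probable than the observed one.
    extreme = sum(w for w in weights if w <= observed)
    return extreme / comb(row1 + row2, col1)


def fragility_index(events1, nonevents1, events2, nonevents2, alpha=0.05):
    """Count the nonevent-to-event switches needed before p >= alpha."""
    groups = [[events1, nonevents1], [events2, nonevents2]]
    # Events are added to the group that starts with fewer events
    # (a tie goes to group 1, an arbitrary choice made for this sketch).
    target = groups[0] if events1 <= events2 else groups[1]
    fi = 0
    while fisher_two_sided_p(*groups[0], *groups[1]) < alpha and target[1] > 0:
        target[0] += 1   # one more event ...
        target[1] -= 1   # ... and one fewer nonevent, so the group size is unchanged
        fi += 1
    return fi


print(fragility_index(1, 99, 9, 91))  # 1, the FI reported for the example in section 5.1
```

Keeping the hypergeometric weights as integers avoids the floating-point tolerance that is otherwise needed when deciding which tables are "as extreme" as the observed one.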

fragility also computes the fragility quotient as proposed by Ahmed, Fowler, and McCredie (2016). The fragility quotient is a relative measure of fragility that simply divides the absolute FI by the total sample size (Ahmed, Fowler, and McCredie 2016).
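As a worked illustration of that ratio, the short Python sketch below divides an FI by the total sample size. The function name is hypothetical; the numbers are those of the example in section 5.1 (an FI of 1 in a trial of 200 patients).

```python
def fragility_quotient(fi, total_sample_size):
    """Fragility quotient: the absolute FI divided by the total sample size."""
    if total_sample_size <= 0:
        raise ValueError("total sample size must be positive")
    return fi / total_sample_size


# FI of 1 with 100 patients per group (200 in total).
print(fragility_quotient(1, 200))  # 0.005
```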

2.2 Computing the fragility index for meta-analyses

To evaluate the FI of a meta-analysis, one sequentially recalculates the 95% confidence interval (CI) of the pooled estimate after performing all single event-status modifications that increase the estimate (or decrease it, depending on whether the treatment is expected to increase or decrease the risk of the outcome) by 1) changing a nonevent to an event for patients receiving treatment A for each single trial or 2) changing an event to a nonevent for patients receiving treatment B for each trial (Atal et al. 2019).

This process leads to 2N newly calculated 95% CIs for the pooled estimate (where N is the total number of studies in the meta-analysis). If one of the newly calculated CIs overlaps 1.0, the FI of the meta-analysis is 1 because one unique event-status modification (that is, changing a nonevent to an event in arm A or an event to a nonevent in arm B) in one specific trial changed the statistical significance of the meta-analysis. If all the newly calculated 95% CIs for the pooled estimate remain < 1.0 (in the case of a treatment that lowers the risk of the outcome; or > 1.0 if the treatment is expected to increase the probability of the outcome), the specific trial and event-status modification that bring the 95% CI for the pooled estimate closest to 1.0 are selected as the starting point for the next iteration (Atal et al. 2019).

This process is then repeated by performing a new single event-status modification in each arm of each trial in turn on top of the first selected modification. Similarly, if one of these 2N event-status modifications leads to a newly calculated 95% CI for the pooled estimate overlapping 1.0, the FI of the meta-analysis is then equal to 2. This process is iterated until one event-status modification leads to a newly calculated 95% CI for the pooled estimate overlapping 1.0. The number of iterations needed to find a combination of event-status modifications in specific arms and trials leading to a modified meta-analysis with 95% CI for the pooled estimate overlapping 1.0 is thus the FI for the meta-analysis (Atal et al. 2019).
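The greedy search described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the metafrag or fragility_ma implementation: the function names are hypothetical, pooling is assumed to be a fixed-effects Mantel–Haenszel risk ratio with the Greenland–Robins variance, every arm is assumed to contain at least one event overall, and ties between equally good modifications are resolved by taking the first one found. When the pooled RR is significantly above 1.0, the sketch relabels the arms so that a single code path covers both directions.

```python
import math

def pooled_rr_ci(trials, z=1.959964):
    """Mantel-Haenszel fixed-effects pooled risk ratio with a 95% CI.

    Each trial is a tuple (events_a, total_a, events_b, total_b).
    The variance of log(RR) uses the Greenland-Robins estimator.
    """
    num = sum(a * nb / (na + nb) for a, na, b, nb in trials)
    den = sum(b * na / (na + nb) for a, na, b, nb in trials)
    p = sum((na * nb * (a + b) - a * b * (na + nb)) / (na + nb) ** 2
            for a, na, b, nb in trials)
    log_rr = math.log(num / den)
    se = math.sqrt(p / (num * den))
    return math.exp(log_rr), math.exp(log_rr - z * se), math.exp(log_rr + z * se)


def meta_fragility_index(trials):
    """Return (FI, modifications); each modification is (trial index, arm, +1 or -1)."""
    trials = [tuple(t) for t in trials]
    _, lower, upper = pooled_rr_ci(trials)
    if lower <= 1.0 <= upper:
        return 0, []                       # already nonsignificant
    swapped = lower > 1.0
    if swapped:                            # relabel arms so the pooled RR is below 1
        trials = [(b, nb, a, na) for a, na, b, nb in trials]
    mods = []
    while True:
        best = None                        # (upper bound, index, arm, delta, new trial)
        for i, (a, na, b, nb) in enumerate(trials):
            candidates = []
            if a < na:
                candidates.append(("A", +1, (a + 1, na, b, nb)))  # nonevent -> event
            if b > 0:
                candidates.append(("B", -1, (a, na, b - 1, nb)))  # event -> nonevent
            for arm, delta, new in candidates:
                trial_set = trials[:i] + [new] + trials[i + 1:]
                if sum(t[2] for t in trial_set) == 0:
                    continue               # pooled RR undefined without arm-B events
                new_upper = pooled_rr_ci(trial_set)[2]
                if best is None or new_upper > best[0]:
                    best = (new_upper, i, arm, delta, new)
        if best is None:
            raise ValueError("no admissible event-status modification left")
        new_upper, i, arm, delta, new = best
        trials[i] = new
        if swapped:                        # report the arm under its original label
            arm = "B" if arm == "A" else "A"
        mods.append((i, arm, delta))
        if new_upper >= 1.0:               # the CI now overlaps 1.0
            return len(mods), mods
```

Calling `meta_fragility_index([(10, 100, 20, 100), (15, 150, 30, 150)])` on these made-up trials returns the FI together with the list of modifications, so the modified data set can be rebuilt and inspected.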

2.3 Differences between metafrag and the R package fragility_ma

metafrag produces results consistent with those of the R package fragility_ma and its related website http://www.clinicalepidemio.fr/fragility_ma/. However, there are some differences between the software programs: 1) Stata’s meta esize command does not support the combination of random effects with the Mantel–Haenszel method (see help meta_esize##remethod), whereas fragility_ma, which uses the metabin function from the R package meta for computing pooled treatment effects, does support this combination; 2) Stata’s meta esize handles zero cells somewhat differently from metabin, possibly leading to slightly different results between software packages when some individual studies have zero cells; and 3) when there are ties between studies in the computed maximum (minimum) confidence level at any iteration, fragility_ma reports the FI that includes the modifications to all tied studies, whereas metafrag reports both the FI for each iteration in the loop where any event modification occurs and the total number of modifications if there are ties.

3 The fragility command

This section describes the syntax of the fragility command and available options. fragility is an immediate command (see [U] 19 Immediate commands).

3.1 Syntax

fragility #n11 #n12 #n21 #n22 [, level(#) chi2 detail]

In the syntax, the arguments #n11 and #n12 are the respective numbers of events and nonevents among individuals in group 1 (treatment), and #n21 and #n22 are the respective numbers for group 2 (control). Because fragility is an immediate command, these are typed as numbers rather than read from variables in a dataset.

3.2 Options

level(#) specifies the desired p-value threshold level at which to test statistical significance. Most disciplines tend to use the p-value threshold of 0.05 to imply that the observed result is unlikely to occur by chance. However, some disciplines set the threshold for statistical significance more liberally to 0.10, while others may set the threshold more conservatively, such as to 0.01. level(#) allows users to set their own threshold. The default is level(0.05).

chi2 calculates and displays Pearson’s χ² for the hypothesis that the rows and columns in a two-way table are independent. The default is Fisher’s exact test, which generally produces more conservative estimates.
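To make the alternative test concrete, the following Python sketch computes Pearson’s χ² statistic and its p-value for a 2×2 table. It is an illustration only: the function name is hypothetical, and the sketch assumes one degree of freedom and no continuity correction.

```python
import math

def pearson_chi2_p(a, b, c, d):
    """Pearson chi-squared statistic and p-value for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


chi2, p = pearson_chi2_p(1, 99, 9, 91)
print(round(chi2, 2), p)
```

For the table used in section 5.1 (1 event and 99 nonevents versus 9 events and 91 nonevents), this gives a χ² of about 6.74 and a p-value just under 0.01.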

detail displays all the 2×2 tables produced during the iterative process of adding events to the group with the lowest actual number of events until the p-value threshold is met or surpassed.

3.3 Stored results

fragility stores the following in r():

4 The metafrag command

This section describes the syntax of the metafrag command and available options. metafrag is a postestimation command for meta esize (see [META] meta esize), thereby capitalizing on the comprehensive list of options available in official Stata’s meta suite for computing effect sizes for binary outcomes.

4.1 Syntax

metafrag [, eform forest[(forestplot)]]

4.2 Options

eform reports exponentiated effect sizes and transforms their respective CIs whenever applicable. By default, the results are displayed in the metric declared with meta esize such as log odds-ratios and log risk-ratios (RRs). eform uses odds ratios when used with log odds-ratios declared with meta esize or RRs when used with the declared log RRs. eform affects how results are displayed, not how they are estimated and stored.
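The transformation behind this display option is a simple exponentiation of the estimate and of both CI limits. The Python sketch below illustrates it with a hypothetical helper; the input values are the logs of the pooled RR and CI quoted in section 5.4 and are used only as an example.

```python
import math

def eform(log_estimate, log_lower, log_upper):
    """Back-transform a log effect size and its CI limits to the ratio scale."""
    return tuple(math.exp(v) for v in (log_estimate, log_lower, log_upper))


# Log RR and log CI limits corresponding to RR 0.75, 95% CI [0.68 to 0.83].
rr, lower, upper = eform(math.log(0.75), math.log(0.68), math.log(0.83))
print(round(rr, 2), round(lower, 2), round(upper, 2))  # 0.75 0.68 0.83
```

Because the exponential function is monotonic, the ordering of the CI limits is preserved, and a CI that contains 0 on the log scale contains 1.0 on the ratio scale.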

forest[(forestplot)] displays a forest plot of the studies after modification to the events and nonevents of included studies to move the pooled effect from statistically significant to nonsignificant (the user can set the level that “significance” represents using the level() option in meta esize). Specifying forest without options uses the default forest plot settings (with only the column headers modified). Studies that have event modifications are highlighted in blue (when events are added) and red (when events are subtracted).

4.3 Stored results

metafrag stores the following in r():

5 Examples

In this section, we demonstrate the use of fragility with two artificial examples and the use of metafrag with two empirical examples. For both commands, the first example illustrates the case of a fragile study result, and the second illustrates a more robust result. For the metafrag examples, the presented data correspond with real meta-analyses from Cochrane Systematic Reviews. The measures used for evaluating the treatment effect and for deriving the pooled treatment effects were the same as those used in the original Cochrane Systematic Reviews.

5.1 A fragile RCT

This example from Walsh et al. (2014) specifies that group 1 has 1 event and 99 nonevents and group 2 has 9 events and 91 nonevents. Following the syntax in section 3.1, this corresponds to typing fragility 1 99 9 91.

As shown in the output, the resulting FI of 1 suggests that the inference of a treatment effect is “fragile.” That is, only one additional event is needed to flip the results from being statistically significant to nonsignificant at the 0.05 level.

5.2 A more robust RCT

In example 2 from Walsh et al. (2014), group 1 has 200 events and 3,800 nonevents, and group 2 has 250 events and 3,750 nonevents. Following the syntax in section 3.1, this corresponds to typing fragility 200 3800 250 3750.

As shown in the output, the resulting FI of 9 suggests that the inference of a treatment effect is more robust than that of example 1.

5.3 A fragile meta-analysis

This meta-analysis includes 7 individual studies, with a total of 448 patients. We first load the data and then use meta esize to compute and declare effect sizes for a two-group comparison of binary outcomes. The log RR is specified as the effect size, and the fixed-effects meta-analysis is specified using the Mantel–Haenszel method.

Next, we plot a forest plot of these data, specifying that the results be presented as exponentiated values, and modify some elements of the display (see [META] meta forestplot):

As shown in the forest plot, the treatment was associated with a statistically significant increase in the risk of the outcome (RR 1.23, 95% CI [1.00 to 1.51]). Next we use metafrag to compute the FI and specify the options forest and eform:

As shown in the output, the FI is 1, indicating that the pooled treatment effect turns statistically nonsignificant after only one event-status modification. In this meta-analysis, the one event modification was made by subtracting one event from group 1 in study 3. In the forest plot, this subtraction corresponds with the value highlighted in gray (red on actual screen) under group 1 in study 3. The RR for the pooled effect is now statistically nonsignificant (RR 1.22, 95% CI [0.99 to 1.49]).

5.4 A more robust meta-analysis

This meta-analysis includes 8 individual studies, with a total of 1,344 patients. As before, we first load the data, then use meta esize to compute and declare effect sizes for a two-group comparison of binary outcomes, and then plot the forest plot:

As shown in the forest plot, the treatment was associated with a statistically significant reduction in the risk of the outcome (RR 0.75, 95% CI [0.68 to 0.83]). Next, we use metafrag to compute the FI and specify the options forest and eform:

As shown in the output, the FI is 65, indicating that the pooled treatment effect turns statistically nonsignificant after 65 event-status modifications, with the event modifications occurring in 4 studies. In the forest plot, event additions correspond with values highlighted in bold (blue on actual screen), and event subtractions correspond with values highlighted in gray (red on actual screen). The RR is now statistically nonsignificant (RR 0.91, 95% CI [0.83 to 1.00]). The FI suggests that the pooled estimate from this meta-analysis is more robust than that in the previous example, where only one event modification was necessary to nullify the statistical significance of the pooled estimate.

6 Discussion

In this article, I introduced the fragility and metafrag commands, which compute the FI for individual randomized trials and meta-analyses with binary outcomes, respectively.

While the FI offers an intuitive supplemental measure to the p-value in interpreting the reliability of study findings, it has its critics. In particular, Carter, McKie, and Storlie (2017) illustrated a strong inverse relationship between the FI and the log10 of the p-value because both operate by decreasing the differences in response rates, resulting in a quantification of how extreme the observed trial results are relative to the null condition. Thus, as is true with p-values, the FI should not be misinterpreted as a measure of clinical effect. In other words, a higher FI should not be interpreted to imply greater clinical effect than a lower FI; rather, it simply illustrates the strength of the statistical significance itself (Brown et al. 2019; Narayan et al. 2018).

In conclusion, the fragility and metafrag commands provide a convenient method for evaluating the reliability of “statistical significance” in RCTs and meta-analyses. I advocate the reporting of the FI in conjunction with p-values and CIs to assist investigators and others in weighing the evidence for study robustness.

7 Acknowledgments

I thank John Moran for advocating that I write both of these commands and for providing me with many of the references used in the introduction. I also thank Ignacio Atal for his support in testing the results reported by metafrag to assess their consistency with his R package fragility_ma and website, and I thank Houssein Assaad at StataCorp for providing details of how Stata and R differ in their respective computations for meta-analyses. Finally, I thank the chief editor and anonymous reviewer for providing helpful comments to improve the article and commands.

8 Programs and supplemental materials

Supplemental Material, sj-zip-1-stj-10.1177_1536867X221083856 for Computing the fragility index for randomized trials and meta-analyses using Stata by Ariel Linden in The Stata Journal

To install a snapshot of the corresponding software files as they existed at the time of publication of this article, type

References

Ahmed, Fowler R. A., McCredie V. A. 2016. Does sample size matter when interpreting the fragility index? Critical Care Medicine 44: e1142–e1143. https://doi.org/10.1097/CCM.0000000000001976.

Amrhein, Greenland, McShane. 2019. Scientists rise up against statistical significance. Nature 567: 305–307. https://doi.org/10.1038/d41586-019-00857-9.

Atal, Porcher, Boutron, Ravaud P. I. 2019. The statistical significance of meta-analyses is frequently fragile: Definition of a fragility index for meta-analyses. Journal of Clinical Epidemiology 111: 32–40. https://doi.org/10.1016/j.jclinepi.2019.03.012.

Blume J. D., Greevy R. A., Welty V. F., Smith J. R., Dupont W. D. 2019. An introduction to second-generation p-values. American Statistician 73: 157–167. https://doi.org/10.1080/00031305.2018.1537893.

Brown, Lane, Cooper, Vassar. 2019. The results of randomized controlled trials in emergency medicine are frequently fragile. Annals of Emergency Medicine 73: 565–576. https://doi.org/10.1016/j.annemergmed.2018.10.037.

Carter R. E., McKie P. M., Storlie C. B. 2017. The fragility index: A p-value in sheep’s clothing? European Heart Journal 38: 346–348. https://doi.org/10.1093/eurheartj/ehw495.

Colquhoun. 2017. The reproducibility of research and the misinterpretation of p-values. Royal Society Open Science 4: 171085. https://doi.org/10.1098/rsos.171085.

Feinstein A. R. 1998. P-values and confidence intervals: Two sides of the same unsatisfactory coin. Journal of Clinical Epidemiology 51: 355–360. https://doi.org/10.1016/s0895-4356(97)00295-3.

Gelman, Carlin. 2014. Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science 9: 641–651. https://doi.org/10.1177/1745691614551642.

Gelman, Tuerlinckx. 2000. Type S error rates for classical and Bayesian single and multiple comparison procedures. Computational Statistics 15: 373–390. https://doi.org/10.1007/s001800000040.

Greenland. 2019. Valid p-values behave exactly as they should: Some misleading criticisms of p-values and their resolution with s-values. American Statistician 73: 106–114. https://doi.org/10.1080/00031305.2018.1529625.

Ioannidis J. P. A. 2005. Why most published research findings are false. PLOS Medicine 2: e124. https://doi.org/10.1371/journal.pmed.0020124.

Ioannidis J. P. A. 2018. The proposal to lower p value thresholds to .005. Journal of the American Medical Association 319: 1429–1430. https://doi.org/10.1001/jama.2018.1536.

Narayan V. M., Gandhi, Chrouser, Evaniew, Dahm. 2018. The fragility of statistically significant findings from randomised controlled trials in the urological literature. BJU International 122: 160–166. https://doi.org/10.1111/bju.14210.

Quatto, Ripamonti, Marasini. 2020. Best uses of p-values and complementary measures in medical research: Recent developments in the frequentist and Bayesian frameworks. Journal of Biopharmaceutical Statistics 30: 121–142. https://doi.org/10.1080/10543406.2019.1632874.

Sterne J. A. C., Davey Smith. 2001. Sifting the evidence—What’s wrong with significance tests? British Medical Journal 322: 226–231. https://doi.org/10.1136/bmj.322.7280.226.

Walsh, Srinathan S. K., McAuley D. F., Mrkobrada, Levine, Ribic, Molnar A. O., Dattani N. D., Burke, Guyatt, Thabane, Walter S. D., Pogue, Devereaux P. J. 2014. The statistical significance of randomized controlled trial results is frequently fragile: A case for a fragility index. Journal of Clinical Epidemiology 67: 622–628. https://doi.org/10.1016/j.jclinepi.2013.10.019.

Wasserstein R. L., Lazar N. A. 2016. The ASA statement on p-values: Context, process, and purpose. American Statistician 70: 129–133. https://doi.org/10.1080/00031305.2016.1154108.
