Abstract
Background:
Randomized controlled trials (RCTs) are the gold standard for treatment efficacy, but foot and ankle RCTs are often small or inconsistent. The Fragility Index (FI) evaluates the stability of significant findings. This study assessed the fragility of RCT outcomes for Achilles tendon pathology (ATP) interventions.
Methods:
This systematic review queried PubMed up to May 14, 2024, for RCTs on ATP interventions. RCTs with significant binary outcomes were included. Two reviewers assessed eligibility, extracted data, calculated FIs, and evaluated risk of bias. Frequency-weighted means were used for narrative synthesis.
Results:
Eleven RCTs with 4506 patients (mean cohort size: 409.64 ± 160.54) and a mean age of 36.97 ± 13.51 years (n = 4356; 96.67%) were included, covering 24 binary outcomes. The median FI across all outcomes was 3 (interquartile range 1-4; mean 3.92), indicating that changing the outcome of just a few patients could shift a study’s results from statistically significant to nonsignificant. Outcomes with an FI ≤3 comprised 58.33% of the total. Three outcomes (12.5%) had an FI of zero after recalculating P values using the two-sided Fisher exact test. Half of the outcomes were robust. No RCT reported FIs or adjusted significance for multiple testing. Most studies (81.82%) performed 2 or more statistical tests, with an average of 30.81 ± 41.28 P values reported per study. The overall risk of bias was low in 1 study (9.09%) and moderate in 7 (63.64%). Most studies had low risk of bias in randomization (72.73%) and missing outcome data (90.91%).
Conclusion:
The FI assesses the fragility of statistically significant binary results, revealing that many ATP RCTs have fragile outcomes due to small sample sizes. A median FI of 3 means that changing the outcome of 3 patients could shift a study’s results from statistically significant to nonsignificant.
Introduction
The most reliable treatment evaluations and causal determinations come from well-powered randomized controlled trials (RCTs), yet orthopaedic surgery RCTs often yield inconsistent results.2,4,9,16,17,26,28 Analysis of these RCTs has shown that the P value and effect size have largely been utilized as the primary forms of comparing the outcomes from different treatment arms.28,44 However, relying solely on these 2 metrics can be misleading, as P values are often overemphasized and should be used alongside other tools for interpreting results.7,8,44 In foot and ankle surgery, RCTs often have smaller sample sizes compared with other orthopaedic conditions. This raises concerns about the validity of findings, as altering the outcomes of just a few patients in a treatment arm could significantly impact or even reverse the trial’s conclusions by nullifying the significance.3,33,35,44
The Fragility Index (FI) is a metric that quantifies this phenomenon by assessing the robustness of statistically significant results. The FI is designed to be used in conjunction with P values to aid in a more comprehensive interpretation of RCTs.7,14,51 The FI of a study is defined as the smallest number of patients in the trial group with fewer outcome events whose status must change from a “non-event” to an “event” to alter a statistically significant result to a nonsignificant one.7,43 A small FI indicates statistical fragility, relying on few events for significance, whereas a large FI raises confidence in treatment impact.10
Given the small sample sizes and few events in foot and ankle surgery trials, our objective was to assess the robustness of significant RCT results in Achilles tendon pathology (ATP). Achilles tendon ruptures, the most common tendon ruptures in the lower extremity, occur at an annual rate of up to 40 per 100 000.13,20,27 These injuries, including tendinitis, are often seen in athletes and overuse cases.29,52 Treatments range from nonsurgical options (cast, boot, brace) to surgical procedures (reattachment, tendon transfer).29,36,52,53 Given the prevalence of ATP, high-quality evidence is crucial for comparing surgical and conservative management.
Recently, a review by Fackler et al 10 sought to examine the statistical stability of studies comparing operative vs nonoperative management for Achilles tendon rupture. However, this review was limited to Achilles tendon ruptures and only included a search of the top 10 orthopaedic journals, limiting its impact on the broader ATP literature. Additionally, it included cohort studies, not just RCTs. In contrast, we defined ATP as a broad range of Achilles tendon conditions, including both ruptures and tendinopathy (insertional and noninsertional) to provide a comprehensive assessment of treatment outcomes and fragility, avoiding the limited scope of previous studies that focused solely on specific pathologies like ruptures. This study expands on the work of Fackler et al by examining the fragility of significant findings from all RCTs on ATP interventions, applying the FI, and assessing statistical corrections. Understanding this fragility is crucial for clinicians, as fragile findings may not be robust enough to guide ATP management confidently.
Methods
Study Creation and Initial Search
This study is a systematic review of the literature examining the fragility of significant binary outcomes of RCTs. All RCTs regarding ATP were searched in PubMed from database inception until May 14, 2024. Search terms used in each database were (“Achilles Tendon”[Mesh] OR Achilles OR “Achilles tendon” OR “calcaneal tendon”) AND (“Randomized Controlled Trial”[Publication Type] OR “randomized controlled trial”). This study was performed under the guidelines of the most recent Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) for proper data reporting. The study registration can be found within the Open Science Framework registry at osf.io/96qja.
Study Definitions
ATP was defined broadly in this study to include a range of conditions affecting the Achilles tendon, such as tendinopathy, partial and complete ruptures, and other related disorders. This inclusive definition was chosen to ensure a comprehensive search and assessment of RCTs in foot and ankle surgery. By encompassing both acute and chronic conditions, our goal was to capture the full spectrum of interventions and their respective outcomes, offering a more complete evaluation of statistical robustness in the literature.
In FI terms, “robustness” means stable, reliable RCT results, whereas “frailty” indicates vulnerability and low reliability. A high FI signals robustness; a low FI signals frailty. A study was “robust” if FI exceeded dropouts, or “fragile” if FI was less than dropouts.
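The classification rule above can be sketched in a few lines of Python. This is only an illustration of the stated comparison between the FI and the number of patients lost to follow-up. The function name `classify_outcome` and the example values are hypothetical, and because the text does not say how a tie (FI equal to dropouts) is handled, the sketch returns a separate label for that case rather than guessing.

```python
def classify_outcome(fragility_index: int, lost_to_follow_up: int) -> str:
    """Label an outcome by comparing its FI with the number of dropouts."""
    if fragility_index > lost_to_follow_up:
        return "robust"
    if fragility_index < lost_to_follow_up:
        return "fragile"
    # Assumption: the text does not define the tie case, so it is flagged.
    return "unspecified"


# Hypothetical (FI, dropouts) pairs, not taken from the included trials.
for fi, lost in [(8, 3), (2, 6), (4, 4)]:
    print(f"FI={fi}, lost to follow-up={lost}: {classify_outcome(fi, lost)}")
```

The comparison uses dropout counts rather than a fixed FI cutoff, which mirrors the rationale that an outcome is only as stable as the number of patients whose results are unknown.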
Inclusion and Exclusion Criteria
Inclusion criteria were RCTs that examined patients who sustained any ATP and reported at least 1 significant binary outcome (as defined by the individual study) comparing either treatment groups or pre- and posttreatment change. Exclusion criteria were nonrandomized controlled studies, studies without ATP, and studies without statistically significant binary outcomes.
Article Screening Process
After the search algorithm was executed in each of the 4 databases for the initial search, all articles were uploaded into Rayyan, a public website used for systematic reviews. 39 One individual screener performed a manual deduplication of articles. Two independent reviewers performed article screening based on title and abstract, followed by full-text screening based on inclusion and exclusion criteria. Lastly, the references of each included article were manually searched for articles not initially captured. Any conflicts during the article screening process were resolved by the first author.
Data Extraction
Two authors extracted data on all significant binary outcomes, including journal name, publication year, sample size, follow-up losses, events per arm, P values, correction use, FI reporting, and relevant significant outcomes.
Article Risk Assessment
Risk of bias was assessed using the Cochrane Risk of Bias for Randomized Trials ROB-2 tool, which examines bias under the following categories: randomization process, deviations from intended intervention, missing data, measurement of the outcome, selection of the reported result.10,46 Each article is assessed and assigned a score of low risk, some concerns, or high risk of bias for each domain.10,46
Statistical Analysis
This study used the Statistical Package for the Social Sciences (SPSS) version 29.0 (IBM Corp, Armonk, NY) for statistical analysis. Frequency-weighted means and other descriptive statistics were used to describe the data where no statistical significance could be calculated. We calculated the FI for each outcome using the Fragility Index Calculator by ClinCalc statistics.21 The FI is a recognized and validated metric that quantifies the robustness of statistically significant results by determining how many nonevent-to-event outcome changes are required to shift the P value above the significance threshold.7,21 The ClinCalc Fragility Index Calculator automates nonevent-to-event switching and recalculates the 2-sided Fisher exact test until the P value exceeds .05, determining the FI. FIs were calculated for reported significant binary outcomes. Additionally, raw binary outcomes without significance tests were analyzed with the Fisher exact test to identify unreported significant outcomes.
Results
Initial Search Results
Our database query yielded 531 potential studies. After title and abstract screening, 39 articles were retrieved for full-text analysis. Thirty articles were excluded because they did not report any significant binary outcomes. Only 9 RCTs reported at least 1 significant binary outcome and were ultimately included. An additional 2 articles were included by citation search, for a total of 11 articles carried forward for data extraction (Figure 1).5,11,22,25,30,31,34,38,40,45,54

Figure 1. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) diagram outlining the entire search progress, from initial search in 4 databases to final article inclusion.
Characteristics of Trials and Outcomes
Eleven studies were included that reported at least 1 statistically significant binary variable. A total of 24 significant binary outcomes were reported across the 11 studies, and FIs were calculated for each outcome. We found that 3 studies only reported conversion of both groups to a binary endpoint without performing any statistical testing. On calculating the 2-sided Fisher exact test for the 4 binary outcomes found from the 3 studies, all 4 outcomes were found to be significant (Table 1, asterisked outcomes). A total of 4506 patients were treated among all 11 studies, and the frequency-weighted mean age was 36.97 ± 13.51 years (n = 4356 patients, 96.67%). The mean sample size of the included trials was 409.64 ± 160.54 patients, and the mean loss to follow-up was 6.18 ± 3.25 patients (ie, 1.51% of the patients were lost to follow-up across trials). Among the included trials, overall risk of bias was low in 1 study (9.09%) and moderate in 7 studies (63.64%). Eight studies (72.73%) had low risk of bias in the randomization process, and 10 studies (90.91%) had low risk of bias from missing outcome data (Figure 2). Trends emerged in which clinically important outcomes, such as rerupture rates and tendon healing, demonstrated greater robustness, with FI values of 4 or higher. Conversely, patient-reported outcomes related to satisfaction and less critical endpoints, such as mild discomfort, exhibited greater fragility, with FI values often at or below 1.
Table 1. Study Demographics Table With Relevant Study Characteristics Such as Intervention Type, Procedure Type, Mean Patient Age (Unless Otherwise Reported as Median) and Lost-to-Follow-up.
Abbreviations: ADL, activities of daily living; VTE, venous thromboembolism.
Statistically significant (P < .05).

Figure 2. Outcomes of the Cochrane Risk of Bias 2.0 tool for randomized controlled trials; n = 11. The plus sign marks a low risk of bias, and the question mark indicates that there is some concern for bias.
Fragility Index
The median FI for the 24 evaluated outcomes was 3 events (interquartile range [IQR] 1-4, mean 3.92), which means that adding 3 events to one of a trial’s treatment arms would typically eliminate its statistical significance. Three outcomes (12.50%) had an FI of zero because they lost their statistical significance when the FI calculator recalculated their P values using the 2-sided Fisher exact test.21 In total, 12 outcomes were found to be robust and 12 to be fragile. Of the 12 robust outcomes, 1 outcome was calculated from a study that did not provide statistical analysis. No RCT reported FIs as part of its own statistical analysis, and none adjusted the significance threshold (eg, Bonferroni correction) to reduce the risk of type I errors. The mean number of P values reported per study was 30.81 ± 41.28 (range 1-136). Overall, 81.82% of studies (n = 9) performed 2 or more significance tests as part of their analysis. Table 2 depicts the FI values according to subgroups based on outcome type, sample size for each arm, number of events, and losses to follow-up. Figure 3 depicts the distribution of FIs across the study.
Table 2. Fragility Indexes by Outcome Category.
Abbreviations: ADL, activities of daily living; IQR, interquartile range; VTE, venous thromboembolism.

Figure 3. Frequency distribution of FI values from 11 trials showing 24 outcomes. The median number of patients whose status would have to change from a nonevent to an event to change a statistically significant result to a nonsignificant result was 3 (IQR 1-4). Overall, 50% of the FIs were deemed fragile and 50% were found to be robust. FI, Fragility Index; IQR, interquartile range.
Discussion
In this systematic review, we evaluated the fragility of statistically significant results in RCTs on ATP in foot and ankle surgery. By applying the FI, we assessed the stability of findings across various interventions for ATP. Across the 24 outcomes from the 11 RCTs reviewed, a median of 3 outcome events was enough to eliminate statistical significance. We encourage the incorporation of the FI into the foot and ankle literature, as it could help readers judge the stability of research findings and thereby influence clinical practice.
This study expands on the findings of recent fragility analyses conducted by Parisien et al 41 and Fackler et al. 10 The initial review by Parisien et al 41 focused on comparative studies of Achilles tendon injuries and revealed that the outcomes were less statistically stable than previously thought, warranting cautious interpretation. Fackler et al’s follow-up study on Achilles tendon ruptures also raised concerns about outcome stability. Both reviews included cohort and RCT studies, potentially confounding results and limiting clarity. Their search was restricted to the top 10 orthopaedic journals, further narrowing conclusions. In contrast, our study focused solely on RCTs for Achilles tendinopathy, without limiting the search to specific journals, providing a more comprehensive analysis.
The FI can be clinically relevant in ATP as it highlights the reliability of RCT outcomes. A low FI indicates that results are unstable and easily reversed by a few additional events, suggesting the findings may not be robust enough for confident clinical decisions. In foot and ankle care, particularly with ATP, the FI exposes the vulnerability of conclusions from small or underpowered studies. For instance, a significant result with a low FI might suggest that a treatment is effective, yet minor changes in outcomes could negate its significance.10 This may imply that clinicians should be cautious in interpreting these results and might need to consider additional factors or seek further evidence before altering their practice.47,48 Variability in Achilles tendinopathy treatments increases outcome fragility, which a low FI can highlight. Using the FI with P values and CIs helps identify robust treatments, leading to better-informed decisions and more stable outcomes.10,48 The Bonferroni correction in Achilles tendinopathy studies reduces type I errors but may increase type II errors, sparking debate as it can overly penalize studies with multiple hypotheses, limiting true effect detection.1,42 Critics argue that although it controls for false positives, it may result in the dismissal of genuinely significant findings, suggesting that a balance is necessary when interpreting results from Achilles tendinopathy research.1,42,49
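The trade-off behind the Bonferroni correction can be made concrete with a short Python sketch. It shows the per-test threshold after correction and the chance of at least one false positive when no correction is applied. The choice of 31 tests is an illustrative assumption (roughly the mean number of P values reported per study in this review), and the family-wise error formula assumes independent tests, which real trial outcomes rarely are.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least 1 false positive across n_tests uncorrected tests.

    Assumption: the tests are independent and every null hypothesis is true.
    """
    return 1 - (1 - alpha) ** n_tests


# Illustrative only: 31 tests approximates the mean number of P values per study.
n_tests = 31
print(f"Corrected threshold: {bonferroni_threshold(0.05, n_tests):.4f}")
print(f"Uncorrected family-wise error rate: {familywise_error_rate(0.05, n_tests):.2f}")
```

Under these assumptions the corrected threshold falls to roughly .0016, while the uncorrected chance of at least one spurious finding is close to 80%, which illustrates both the cost and the motivation of the correction.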
Overall, our findings align with those of Fackler et al and Parisien et al. The median FI for all reported outcomes (n = 24) in our study was 3 events, comparable to the 4 events reported by Fackler et al 10 and the average of 2.9 events reported by Parisien et al. 41 Categories like postoperative complications and prescribed medications showed the greatest variation in FI, although all categories had median values within a narrow range of 1 to 4. We also identified a mean sample size of 409 patients and a median of 30 events per outcome, smaller than the median sample size of 682 patients and 112 events per outcome reported by Walsh et al 51 in their analysis of 399 RCTs from high-impact medical journals. Our median FI of 3 events was lower than the median FI of 8 events (range 3-18) reported by Walsh et al. Only 50% of reported dichotomous outcomes in Achilles tendinopathy trials were robust, raising concerns about the validity of outcomes in up to half of the ATP RCTs. These findings suggest that ATP RCTs, compared to those in other specialties, have smaller sample sizes, higher statistical frailty, and overall poorer quality.
Our study supports using the FI in evaluating ATP management. Although some critics view the FI as a “P value in disguise,” others argue that RCTs with a priori power analysis are inherently fragile. 6 Evaluating the robustness or fragility of RCTs necessitates assessing uncertainty rather than solely focusing on statistical significance of dichotomous outcomes. 6 However, the growing body of literature in orthopaedic subspecialties that uses the FI to evaluate the validity of dichotomous outcomes cannot be ignored.9,10,12,15,17,28,32,41,44,50 Previous systematic reviews within adult reconstruction, shoulder, spine, foot/ankle, and hand surgery have all found median FIs ranging from 2 to 4, with a shoulder arthroplasty study reporting the highest FI within the orthopaedic literature (FI = 6).9,10,15,28,41,44 The orthopaedic literature overall shows much lower FIs (FI = 2) and tends to have smaller cohort sizes when compared to high-impact medical journals (FI = 8), with otolaryngology coming in at the second lowest (FI = 3).7,9,37,51 Without threshold cutoffs, FIs must be contextualized within similar studies, so larger FIs cannot be evaluated in isolation. Given smaller sample sizes in orthopaedics, we redefined “fragile” and “robust” by comparing the FI with the dropout rate of the study group for better context as suggested by the literature.18,19 We believe incorporating the FI with P values would provide a more comprehensive view of outcomes, leading to improved patient care. Alternatively, a focus on CIs would avoid relying solely on P values and their clear limitations. CIs serve as a valuable tool for assessing the precision of results, evaluating data compatibility with multiple hypotheses, and gaining deeper insights. CIs present a range of values consistent with the data, with the width indicating result precision and the spectrum of potential true outcomes.
It is worth noting that many of the fragile outcomes identified in this review were secondary, rather than primary, outcome measures. This is likely reflective of the broader state of ATP literature, where secondary measures are frequently used but may not receive the same level of rigorous validation as primary outcomes, nor are the sample size powered for the same level of confidence. The diminished robustness of these secondary measures underscores the need for further refinement in the design and reporting of RCTs in this field, particularly when it comes to defining and validating clinically meaningful primary endpoints. Although the FI highlights statistical vulnerability in many outcomes, its clinical applicability varies depending on the importance of the outcome itself. For example, outcomes like pain at 6 weeks, which are often fragile, may be less critical for guiding long-term treatment decisions. These fragile results, although valuable for patient comfort, may not necessarily indicate long-term treatment efficacy. In contrast, more robust outcomes, such as rerupture rates and tendon healing, have greater clinical relevance and should be prioritized in decision making. By distinguishing between these fragile and robust outcomes, clinicians can apply the FI more effectively, using it to focus on the most reliable endpoints when making treatment decisions for ATP.
The limitations of this study must be considered. Three outcomes had an FI of zero because their statistical significance was lost when recalculating P values using the 2-sided Fisher exact test. The Fisher exact test, a more conservative alternative to the Pearson χ² test for comparing proportions in a 2 × 2 contingency table, was used.9,23,24 The Fisher exact test is suitable for all sample sizes and is the preferred method when sample sizes are small or outcome events are uncommon.7,9 However, because the FI is calculated using the Fisher exact test, results may differ from those obtained with methods like the χ² test. The χ² test relies on an approximation suitable for large samples, whereas the Fisher exact test is precise, particularly for small samples.7 When an outcome in a small trial was originally tested with another statistical method, recalculating it with the Fisher exact test can result in a nonsignificant P value and an FI of 0, highlighting study fragility. Although only 12.5% of values were 0, this underscores the importance of consistent testing. Our reliance on a single database might miss relevant studies, potentially underestimating associations between FI and RCT outcomes, but our findings are consistent with existing literature.7,9,10,15,17,28,41,44,50 The FI is only applicable to dichotomous outcomes, limiting its use in analyzing continuous or time-to-event outcomes in RCTs. This underscores the need to use the FI alongside other statistical methods and evaluate continuous outcomes separately. Our analysis included both primary and secondary outcome measures, which may vary in their clinical significance. Primary outcomes, such as rerupture rates, likely will carry greater weight, whereas secondary outcomes, such as mild discomfort, may contribute less to clinical decision making.
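How an outcome can end up with an FI of 0 is easier to see with numbers. The Python sketch below compares the uncorrected Pearson χ² test with the 2-sided Fisher exact test on one small 2 × 2 table. The table (1/10 events versus 6/10 events) is hypothetical and not drawn from any included trial, and the function names are illustrative.

```python
from math import comb, erfc, sqrt


def pearson_chi2_p(a: int, b: int, c: int, d: int) -> float:
    """P value of the uncorrected Pearson chi-square test (1 degree of freedom)."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Hypothetical small trial: 1/10 events in one arm, 6/10 in the other.
table = (1, 9, 6, 4)
print(f"Pearson chi-square P = {pearson_chi2_p(*table):.3f}")
print(f"Fisher exact P       = {fisher_exact_two_sided(*table):.3f}")
```

For this table the χ² test gives a P value of about .02, whereas the Fisher exact test gives about .06. A trial that reported the former as significant would therefore receive an FI of 0 once the Fisher exact test is applied.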
We also recognize that the inclusion of certain studies, such as Silbernagel et al, 45 may contribute more than other smaller studies to the overall number of FI calculations. This may impact the interpretation of the results, as a portion of the fragile outcomes in this review stem from secondary measures within these studies. Future analyses could benefit from a more detailed categorization of outcomes to assess whether primary measures consistently show greater or lesser fragility than secondary measures.
Conclusion
We found that studies on ATP management generally had low FI scores, with half of the outcomes still classified as fragile after adjusting for patient dropout rates. Outcomes from low-risk bias studies had FIs similar to those with some bias concerns.
Like the P value, the FI has limitations, and clinicians should be cautious when interpreting trials with low FI or P values for patient care. However, using the FI alongside other metrics can improve the evaluation of ATP trials by identifying studies with more robust outcomes.
Supplemental Material
sj-pdf-1-fao-10.1177_24730114241300160 – Supplemental material for The Fragility of Statistically Significant Binary Outcomes for Treating Achilles Tendinopathy: A Systematic Review of Randomized Trials
Supplemental material, sj-pdf-1-fao-10.1177_24730114241300160 for The Fragility of Statistically Significant Binary Outcomes for Treating Achilles Tendinopathy: A Systematic Review of Randomized Trials by Omkar S. Anaspure, Shiv Patel, Anthony N. Baumann, Andrew Newsom, Albert T. Anastasio and Annunziato Amendola in Foot & Ankle Orthopaedics
Footnotes
Ethical Approval
Ethical approval was not sought for the present study.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. Disclosure forms for all authors are available online.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
References
