Abstract
Study Design
Systematic Review of Randomized Controlled Trials.
Objectives
To assess the statistical fragility of randomized controlled trials (RCTs) comparing lumbar fusion and lumbar disc arthroplasty (LDA).
Methods
Following PRISMA guidelines, PubMed, Embase, and Medline databases were searched for RCTs on lumbar fusion and LDA published between January 1, 2000 and August 1, 2023. Eligible studies reported dichotomous, categorical outcomes. A two-tailed Fisher’s exact test was used to confirm reported
Results
Eighteen RCTs met the inclusion criteria for analysis. Across the 18 studies, 146 dichotomous outcomes were identified. The median fragility index (FI) for all outcomes was 5 (IQR: 2.0-9.0), while the median fragility quotient (FQ) was 0.022 (IQR: 0.008-0.046). Subgroup analysis revealed that adjacent segment disease outcomes had a median FI of 3 (IQR: 2.75-3) and a median FQ of 0.013 (IQR: 0.013-0.014). Industry-funded studies had a significantly higher FI (6 vs 4,
Conclusions
This systematic review demonstrates that the statistically significant findings from RCTs comparing lumbar fusion to LDA are susceptible to small changes in event outcomes. Of note, industry-funded studies were found to be significantly more statistically fragile than non-industry-funded studies.
Keywords
Introduction
Lumbar fusion has long been the procedural paradigm for treating lumbar spine pathology and degenerative disc disease.1,2 However, the introduction of lumbar disc arthroplasty (LDA) offered a novel approach aimed at preserving segmental mobility and restoring normal biomechanics, with the proposed benefit of reducing the risk of adjacent segment disease (ASD).1-5 Some studies have shown a lower incidence of ASD with LDA than with lumbar fusion, reporting rates of 2-2.8% and 7-18%, respectively; however, these studies may carry a high risk of bias due to financial conflicts of interest, publication bias, and study design manipulation.1,6,7 Despite these proposed advantages, adoption of LDA has remained slow and has even decreased, with the majority of surgeons opting for fusion in most cases.8,9 Over the same period, the proportional utilization of cervical disc replacement relative to anterior cervical discectomy and fusion increased from 4.00% in 2010 to 14.47% in 2021.10 Thus, greater examination of the randomized controlled trials (RCTs) comparing LDA vs lumbar fusion is warranted.
The inconsistency between research supporting the use of LDA and its declining clinical usage calls into question the statistical analyses of the clinical trials. In particular, the
In the context of disc replacements, a fragility analysis comparing cervical disc arthroplasty and anterior cervical discectomy and fusion demonstrated that RCTs comparing the two procedures had moderate statistical robustness (median FI = 7, FQ = 0.043) and did not suffer from fragile results.11 However, no such comparison has been performed for lumbar fusion and LDA. In this study, we aim to assess the statistical fragility of RCTs comparing lumbar fusion and LDA. As a secondary objective, we aim to compare statistical fragility between industry-funded and non-industry-funded studies, and between studies conducted in 2000-2010 and those conducted in 2011-2018. We hypothesized that statistical outcomes reported in the LDA vs fusion literature would be fragile, with only a few outcome reversals required to change statistical significance. Additionally, we hypothesized that industry-funded studies would show more fragile results due to the higher risk of bias, and that the fragility of LDA vs fusion trials would be similar for recent and older trials.
Methods
Inclusion Criteria
This systematic review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.13
The PubMed, Embase, and Medline databases were searched to identify RCTs published between January 1, 2000, and August 1, 2023 related to lumbar fusion and LDA (Figure 1). The search keywords used across all databases were “total disc replacement,” “intervertebral disc replacement,” “artificial disc replacement,” “fusion,” “lumbar degenerative diseases,” “lumbar degeneration,” “spondylolisthesis,” “lumbar disc herniation,” “lumbar disc protrusion,” “lumbar spinal stenosis,” (Posterior Lumbar Interbody Fusion), (Transforaminal Lumbar Interbody Fusion), (Anterior Lumbar Interbody Fusion), “PLIF,” “TLIF,” “ALIF.” Studies were included if they were randomized controlled trials that reported dichotomous, categorical outcomes and had LDA and fusion as the two treatment arms. The minimum follow-up period for included studies was 12 months postoperatively. Non-English language, biomechanical, cadaveric, animal, in vitro, and non-RCT studies were excluded. Studies were included only if the full text was available online for review. Two independent reviewers performed title/abstract screening and full-text review, and a third independent reviewer resolved conflicts. This systematic review analyzed statistical reporting and significance rather than direct outcomes and therefore did not qualify for registration in the International Prospective Register of Systematic Reviews (PROSPERO). Because all data analyzed are publicly available, Institutional Review Board approval was not required.
Figure 1. Preferred Reporting Items for Systematic Reviews and Meta-Analyses Flow Diagram Showing Identification, Screening, and Inclusion of Eligible Articles From PubMed, Embase, and Cochrane. RCT, Randomized Controlled Trial.
Risk of Bias Assessment
To ensure that RCT design bias did not confound fragility results, a risk of bias assessment was performed using the Cochrane Risk of Bias 2 (RoB 2) tool.14 A secondary analysis was also performed in which outcomes with losses to follow-up greater than their fragility indices were excluded. The included studies demonstrated an overall low risk of bias across the evaluated domains, including selection, performance, detection, and attrition biases (Appendix Table 1).
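The exclusion rule used in the secondary analysis can be illustrated with a short Python sketch. The record layout, field names, and values below are hypothetical and chosen only for illustration; they are not drawn from the included trials. The sketch assumes that an outcome is excluded only when its loss to follow-up is strictly greater than its FI.

```python
def exclude_when_ltf_exceeds_fi(outcomes):
    """Keep outcomes whose loss to follow-up does not exceed their FI.

    Each outcome is a dict with hypothetical keys "fi" and
    "lost_to_follow_up"; outcomes with more patients lost than the FI
    are dropped, mirroring the secondary analysis described above.
    """
    return [o for o in outcomes if o["lost_to_follow_up"] <= o["fi"]]


# Hypothetical outcomes (illustrative values only).
outcomes = [
    {"label": "outcome A", "fi": 5, "lost_to_follow_up": 3},
    {"label": "outcome B", "fi": 2, "lost_to_follow_up": 9},
    {"label": "outcome C", "fi": 4, "lost_to_follow_up": 4},
]

retained = exclude_when_ltf_exceeds_fi(outcomes)
retained_labels = [o["label"] for o in retained]
```

In this illustration, outcome B is excluded because nine patients were lost while only two outcome reversals would be needed to change significance.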
Data Extraction
The first author, year of publication, and journal of publication were extracted for trial identification during the extraction process. Industry funding status was also extracted for each included study. The number of outcome events in each treatment arm and the number of patients lost to follow-up were recorded. After extraction, outcome categories were established by two reviewers for subgroup analysis based on clinical relevance and outcome sample size. Outcome categories included adjacent segment disease, anatomical change, composite endpoint, pain, patient satisfaction, return to functionality, and adverse event. Anatomical change included outcomes related to disc height success and fusion status, while adjacent segment disease included only outcomes on adjacent segment disease. Composite endpoints were the outcomes used to define the primary endpoints for the RCT and included outcomes such as neurological status, patient-reported outcome measures (PROMs), and overall success of the procedure.
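One possible way to organize the extracted fields and summarize them by outcome category is sketched below in Python. The class name, field names, and helper function are assumptions introduced for illustration, and the FI of each outcome is assumed to have been computed separately. The FQ is taken as the FI divided by the total sample size of the two arms.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass
class OutcomeRecord:
    """One dichotomous outcome extracted from one trial (hypothetical layout)."""
    first_author: str
    year: int
    industry_funded: bool
    category: str          # e.g. "pain", "adverse event"
    events_lda: int
    n_lda: int
    events_fusion: int
    n_fusion: int
    lost_to_follow_up: int
    fragility_index: int   # assumed to be computed elsewhere

    @property
    def fragility_quotient(self):
        # FQ = FI divided by the total number of patients in both arms.
        return self.fragility_index / (self.n_lda + self.n_fusion)


def summarize_by_category(records):
    """Return the outcome count, median FI, and median FQ per category."""
    groups = defaultdict(list)
    for record in records:
        groups[record.category].append(record)
    return {
        category: {
            "n_outcomes": len(rows),
            "median_fi": median(r.fragility_index for r in rows),
            "median_fq": median(r.fragility_quotient for r in rows),
        }
        for category, rows in groups.items()
    }
```

The same grouping could be keyed on `industry_funded` or on publication period to mirror the other subgroup comparisons.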
Fragility Analysis
A two-tailed Fisher’s exact test was used to confirm reported
Illustration of the Concept of Statistical Significance Reversal Using a 2 × 2 Contingency Table, Demonstrating How a Fragility Index (FI) of 1 is Calculated, as Reported by Skold and Colleagues. + Indicates Patients With the Outcome of Interest While – Indicates Patients Without the Outcome of Interest. The Contingency Table on the Left has 17 Patients With the Outcome of Interest and 80 Patients Without in the LDA Group. The
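The concept of significance reversal can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the significance threshold is taken as 0.05, outcome reversals are applied one patient at a time to the arm with fewer events, and nonsignificant outcomes are handled by removing events from that arm (a reverse FI). The function names and the example counts are hypothetical.

```python
from math import comb

ALPHA = 0.05  # assumed significance threshold


def fisher_exact_two_sided(a, b, c, d):
    """Two-tailed Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are treatment arms; columns are (events, non-events).
    """
    row1, row2 = a + b, c + d
    events, total = a + c, a + b + c + d

    def weight(k):
        # Unnormalized hypergeometric probability of k events in row 1.
        return comb(events, k) * comb(total - events, row1 - k)

    lowest, highest = max(0, events - row2), min(events, row1)
    observed = weight(a)
    # Sum every table that is no more likely than the observed one.
    extreme = sum(weight(k) for k in range(lowest, highest + 1)
                  if weight(k) <= observed)
    return extreme / comb(total, row1)


def fragility_index(events_1, n_1, events_2, n_2, alpha=ALPHA):
    """Number of single-patient reversals needed to flip significance.

    Significant results: non-events become events in the arm with fewer
    events until p >= alpha. Nonsignificant results: events become
    non-events in that arm until p < alpha. Returns None if significance
    cannot be flipped this way.
    """
    arms = [[events_1, n_1], [events_2, n_2]]
    low = 0 if events_1 <= events_2 else 1  # arm that gets modified

    def p_value():
        (a, na), (c, nc) = arms
        return fisher_exact_two_sided(a, na - a, c, nc - c)

    significant = p_value() < alpha
    step = 1 if significant else -1
    flips = 0
    while (p_value() < alpha) == significant:
        next_events = arms[low][0] + step
        if not 0 <= next_events <= arms[low][1]:
            return None
        arms[low][0] = next_events
        flips += 1
    return flips


def fragility_quotient(fi, n_1, n_2):
    """FI divided by the total sample size of both arms."""
    return fi / (n_1 + n_2)


# Hypothetical trial: 1/20 events in one arm vs 9/20 in the other.
fi = fragility_index(1, 20, 9, 20)   # 2 reversals needed
fq = fragility_quotient(fi, 20, 20)  # 2 / 40 = 0.05
```

In this hypothetical example the initial result is significant, and moving two patients in the first arm from non-event to event raises the p-value above 0.05, giving an FI of 2 and an FQ of 0.05.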
Results
Characteristics of Included Studies: Year, Journal of Publication, Total Sample Size
FI = Fragility Index; FQ = Fragility Quotient; IQR = Interquartile Range.
Overall Fragility Data Based on Trial and Outcome Characteristics
Subgroup Analysis Based on Outcome Category (Adjacent Segment Disease, Adverse Event, Anatomical Change, Composite Endpoint, Pain, Patient Satisfaction, Return to Functionality)
Fragility Data Stratified by Year Published (2000-2010, 2011-2023)
Fragility Data Based on Industry Funding Status
Discussion
The main finding of this study is that only ∼2% of patients in studies comparing LDA and fusion would need to have a different outcome to alter the statistical significance of the RCT findings. The contrast between the declining clinical usage of LDA and the clinical trial results supporting its use motivated this study of the statistical fragility of RCTs comparing LDA and fusion. Results demonstrated that the RCTs comparing LDA and fusion report significant differences that are statistically fragile, with a median FI of 5 and FQ of 0.022 across 146 outcomes. The FQ, which reflects the proportion of patients whose outcomes would need to change to alter the statistical significance of a study, ranged from 0.017 for significant outcomes to 0.023 for nonsignificant outcomes. We also found that industry-funded studies have a significantly lower FQ than non-industry funded studies (industry-funded FQ: 0.019, non-industry funded: 0.040,
ASD is a key consideration when comparing lumbar fusion to LDA. Studies have consistently highlighted that a major disadvantage of lumbar fusion is the development of ASD, often leading to the need for further surgery.6,15-19 The development of motion-preserving treatments such as LDA for symptomatic disc degeneration was largely driven by concerns that stabilizing one spinal segment, as seen in lumbar fusion, may inadvertently increase stress on adjacent levels, potentially accelerating degeneration in those areas.7,20 LDA aims to minimize the iatrogenic acceleration of degenerative disease at segments adjacent to the operative levels. However, the literature on this topic is mixed. While some studies demonstrate significant reductions in ASD following LDA, others report comparable rates between the two procedures.6,15,17,18,21,22 Even amongst systematic reviews and meta-analyses, there is disagreement on whether LDA truly reduces ASD rates.15,17,21,22 Given this variability, the statistical robustness of ASD-related outcomes warrants careful consideration. In this study, outcomes related to ASD were particularly fragile, with only 1.3% of patients requiring an outcome change to render results non-significant. Thus, while studies have reported significantly lower rates of ASD with LDA compared to lumbar fusion, the fragility of these data underscores the importance of cautious interpretation. Interestingly, the declining rate of LDA procedures suggests that the broader surgical community may already regard the RCTs supporting LDA as being of lower quality, and this paper quantifies that point with statistical fragility results.
Non-industry funded research is generally regarded as less prone to bias than industry-funded studies. Systematic reviews and meta-analyses have consistently shown that industry-sponsored research is more likely to report favorable outcomes for the sponsor’s product. For instance, a systematic review and meta-analysis by Lundh et al found that industry-sponsored trials were significantly more likely to report positive efficacy results and favorable conclusions compared to non-industry-sponsored studies.23,24 This suggests the presence of potential bias toward the sponsor’s product. Similarly, Jorgensen et al found that industry-supported meta-analyses exhibited lower methodological quality and transparency compared to those funded by non-profit sources or without external support, contributing to biased outcomes.25 Additionally, Riaz et al26 reported that industry-sponsored studies were almost four times more likely to yield positive outcomes than those funded by the NIH, further reinforcing the notion of bias in favor of the sponsor’s product in industry-funded research. Consistent with this literature, this study found that industry-funded studies were more statistically fragile, with a significantly lower FQ than non-industry funded ones. Of note, industry-funded studies had a higher median FI than non-industry funded studies (median FI: 6 vs 4), suggesting that more outcome reversals were required to alter statistical significance. However, when accounting for sample size using the FQ, industry-funded studies exhibited lower robustness (median FQ: 0.019 vs 0.040). This indicates that, despite a higher raw FI, the proportion of patients required to change outcomes to reverse significance was actually smaller in industry-funded studies.
This is likely due to the larger sample sizes typical of industry-funded trials, which can inflate the FI but may not truly reflect greater statistical robustness. The FQ provides essential context, revealing that these studies may remain statistically fragile despite appearing robust based on FI alone.
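The relationship between FI, sample size, and FQ can be made concrete with a small Python calculation. The sample sizes below are hypothetical and were chosen only so that the resulting quotients fall near the reported medians; because medians of FI and FQ are computed separately, they do not correspond to any single included trial.

```python
def fragility_quotient(fi, sample_size):
    """FQ is the FI expressed as a fraction of the total sample size."""
    return fi / sample_size


# Hypothetical larger trial: more reversals needed, but many more patients.
larger_trial_fq = fragility_quotient(fi=6, sample_size=300)   # 0.02

# Hypothetical smaller trial: fewer reversals needed, but far fewer patients.
smaller_trial_fq = fragility_quotient(fi=4, sample_size=100)  # 0.04

# Despite its higher FI, the larger trial is more fragile per patient.
larger_is_more_fragile = larger_trial_fq < smaller_trial_fq
```

Here the trial with the higher FI has the lower FQ, which is the pattern observed between industry-funded and non-industry funded studies.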
While our subgroup analysis accounted for differences in sample size through the use of the FQ, we acknowledge that other factors, such as differences in study design, methodological rigor, and population heterogeneity, may also influence statistical fragility. To mitigate this, we performed a formal risk of bias assessment using the Cochrane RoB 2 tool and found that the overall methodological quality was comparable between industry-funded and non-industry funded studies. However, due to variability in reporting in individual studies, we were unable to systematically control for differences in patient population heterogeneity across studies. This represents a limitation of our subgroup analysis and underscores the need for cautious interpretation when comparing fragility across funding sources.
The FQ reported in this study emphasizes the need for high-quality, controlled, and unbiased studies with sufficiently long follow-up periods. However, it is in line with other studies in the orthopaedic literature examining significance and fragility.27-32 For example, Ortiz-Babilonia et al found a median FI of 7 and FQ of 0.043 in a study examining RCTs in cervical disc arthroplasty vs anterior cervical discectomy and fusion, and Parisien et al found a median FI of 4 and FQ of 0.066 in a review of studies in the orthopaedic shoulder literature.11,28 This suggests a broader need to improve study design throughout the orthopaedic literature.
Understanding the statistical fragility of study outcomes enables clinicians to make more informed decisions. For instance, when a study has a low FI, clinicians may exercise greater caution in altering clinical practice based on its findings.33,34 Awareness of FI and FQ can also enhance future study design, as researchers can strive for higher FIs—indicating more robust results—by increasing sample sizes or implementing more rigorous follow-up protocols.33,35 FI and FQ offer additional context to
There are significant limitations to consider when interpreting the results of this study. While fragility measures are valuable for assessing the robustness of RCTs, they require the exclusion of non-RCTs, limiting their general applicability in clinical research. Small sample sizes and rare events can result in a low FI, potentially penalizing studies that are otherwise well-conducted and clinically significant.36,37 However, this issue is more relevant in fields where small sample sizes are common.37 Additionally, the FI can be misleading in the context of high loss to follow-up rates; if the number of patients lost exceeds the FI, the study’s findings could be overturned simply by more complete follow-up.38 Standardized thresholds for the FI and FQ have not yet been established to evaluate the fragility of outcomes in comparative trials. A standardized definition of these thresholds would enhance the ability to assess the robustness of study findings. Lastly, when using statistical fragility in clinical decision-making, it is crucial to understand that a study’s fragility is not the sole determinant of its clinical utility. A low FI indicates that a study’s significant findings could be overturned by a small number of events or minor changes in data. This highlights the vulnerability of the statistical significance, not necessarily the clinical importance of the intervention. Therefore, while a low FI should prompt a closer look at the study’s design, sample size, event rates, and follow-up completeness, it should not automatically lead to the dismissal of potentially beneficial interventions. Instead, the FI should be considered as one piece of evidence among many, including the biological plausibility of the intervention, the magnitude of the observed effect, and its consistency with other research, to make informed and patient-centered decisions.
Conclusion
This systematic review demonstrates that the statistically significant findings from RCTs comparing lumbar fusion to LDA are statistically fragile and could support a different conclusion with only small changes in event outcomes. The reversal of events in as few as 2.2 out of 100 patients overall, and 1.3 out of 100 patients for ASD outcomes, may be enough to reverse the statistical significance of results from the RCTs included in this analysis. Given the low FI and FQ across many outcomes, the current evidence in the LDA vs fusion literature should be interpreted with caution. Further high-quality trials are needed to confirm and strengthen these findings.
Footnotes
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The datasets generated and analyzed during this study are available from the corresponding author upon reasonable request. All data were extracted from publicly available randomized controlled trials (RCTs) identified through a systematic search of PubMed, Embase, and Medline. The search strategy and inclusion/exclusion criteria are detailed in the Methods section of this manuscript to ensure transparency and reproducibility.
IRB Statement
This research was not considered human subjects research given the availability of all data through the three databases mentioned in the manuscript.
Appendix
Risk of Bias Assessment using the Cochrane Risk of Bias 2.0 (RoB 2) Tool
| First author | Domain 1: Risk of bias arising from randomization process | Domain 2: Risk of bias due to deviations from the intended interventions | Domain 3: Risk of bias due to missing outcome data | Domain 4: Risk of bias in measurement of the outcome | Domain 5: Risk of bias in selection of the reported result | Overall risk of bias |
| --- | --- | --- | --- | --- | --- | --- |
| Auerbach | Low risk | Low risk | Low risk | Low risk | Low risk | Low risk |
| Auerbach | Low risk | Low risk | Low risk | Low risk | Low risk | Low risk |
| Berg | Low risk | Some risk | Low risk | Low risk | Low risk | Low risk |
| Berg | Low risk | Low risk | Low risk | Low risk | Low risk | Low risk |
| Berg | Low risk | Some risk | Low risk | Low risk | Low risk | Low risk |
| Blumenthal | Low risk | Low risk | Low risk | Low risk | Low risk | Low risk |
| Delamarter | Low risk | Some risk | Low risk | Some risk | Low risk | Some risk |
| Geisler | Low risk | Low risk | Low risk | Low risk | Low risk | Low risk |
| Gornet | Low risk | Low risk | Low risk | Low risk | Low risk | Low risk |
| Gornet | Low risk | Low risk | Low risk | Low risk | Low risk | Low risk |
| Guyer | Low risk | Low risk | Low risk | Low risk | Low risk | Low risk |
| Holt | Low risk | Low risk | Low risk | Low risk | Low risk | Low risk |
| Radcliff | Low risk | Low risk | Some risk | Low risk | Low risk | Low risk |
| Skold | Low risk | Low risk | Some risk | Some risk | Low risk | Some risk |
| Tropp | Low risk | Low risk | Low risk | Low risk | Low risk | Low risk |
| Zigler | Low risk | Low risk | Low risk | Low risk | Low risk | Low risk |
| Zigler | Low risk | Low risk | Low risk | Low risk | Low risk | Low risk |
