Abstract
Study Design
Narrative Review.
Objectives
Observational studies using real-world data (RWD) have become increasingly popular, though they are susceptible to selection bias. Propensity score methods offer a powerful statistical approach to mitigate bias by balancing patient characteristics across treatment groups. This review demystifies the four primary propensity score techniques and highlights the growing role of a newer strategy called inverse probability weighting in registry-based spine research.
Methods
We explore the applications of propensity score methods in spine surgery research through the presentation of a number of hypothetical and real-world examples from recent literature. Further, we compare their utility to traditional analytic techniques such as multivariable regression.
Results
The four primary applications are (1) covariate adjustment using the propensity score, (2) stratification based on the propensity score, (3) matching on the propensity score, and (4) inverse probability of treatment weighting. These techniques aid in the minimization of confounding leading to spurious results, allowing for similar effects to randomization within the setting of observational research.
Conclusions
While propensity score methods are not a substitute for randomization, these tools provide an essential framework for strengthening causal inference assessments when randomized controlled trials (RCTs) are not feasible or appropriate. When RCTs are practical, propensity score methods may aid spine care practitioners in deriving objective, complementary findings from observational studies of RWD. Inverse probability of treatment methods are particularly promising due to their greater sample size efficiency, capability for multivariable comparisons, and potentially reduced bias compared to traditional propensity score methods.
Introduction
Evaluating treatment effects in surgery is inherently challenging. Despite researchers’ best efforts to control for confounding variables, there remain numerous patient-, surgeon-, and treatment-specific factors that are difficult to measure and directly compare. Historically, randomized controlled trials (RCTs) have served as the gold standard for comparing treatments and outcomes in both medicine and surgery. While their utility is clear, RCTs are not without inherent challenges. Such trials are often prohibitively expensive and logistically difficult to conduct, particularly in surgical disciplines. Many randomized controlled trials do not study real-world populations; excluding patients with significant comorbidities may lead to unbiased results, though in a distinct target population from actual practice. Furthermore, simple randomization alone addresses, but cannot guarantee, covariate balance. Finally, there are substantial patient factors that interfere with patient enrollment for prospective RCTs, such as preference for one treatment over another, dislike, or distrust of randomization to the point of feeling like a ‘guinea pig,’ concerns over increased burdens involved with the study and other factors.1,2 As a result, observational studies have gained traction due to their relative greater feasibility and lower cost.
In this context, propensity score methods have emerged as a preferred statistical approach to address confounding in observational spine surgery research. A propensity score reflects the likelihood that a patient receives a particular treatment based on observed baseline characteristics. In spine surgery, many factors influence surgical decision-making, and it is common for patients undergoing different procedures for the same condition to differ systematically at baseline. Propensity scores provide a way to adjust for these imbalances when comparing outcomes across treatment groups. Their utility is evidenced by their rapid proliferation throughout spine research. A PubMed search found that, where studies on “Spine surgery” have increased almost fourfold from 2015 to 2025 (1336 articles in 2015 to 4372 in 2025), studies investigating both “Spine surgery” and “propensity score” have increased by over forty-fold over the same time period (3 articles in 2015 to 131 in 2025). Indeed, propensity scores have already been applied to regulatory processes, proving relevant in the approval of medical devices like the Simplify and M6-C artificial cervical discs.3-6 In such a dynamic landscape, we thus believe a firm understanding of propensity score techniques will prove highly beneficial for spine care practitioners to both interpret and design cutting-edge research.
There are four primary propensity score techniques: (1) covariate adjustment using the propensity score, (2) stratification based on the propensity score, (3) matching on the propensity score, and (4) inverse probability of treatment weighting (IPTW).7,8 In this article, we aim to demystify these methods through the lens of hypothetical spine surgery research examples.
To Randomize, or Not: Randomized Controlled Studies May Not Reflect Real-World Clinical Scenarios
With unlimited funding, dedicated research staff, and two treatments with clear equipoise based on a meaningful research question, a RCT has the chance to provide the highest level of evidence. However, particularly in surgical research, RCTs are often impractical due to logistical complexity, high costs, lengthy timelines with challenging enrollment dynamics, and concerns over generalizability or ultimately ethical feasibility. 9 Randomization ideally ensures that, on average, patients in each treatment arm are similar with respect to both measured and unmeasured factors that influence outcomes independently of the intervention. Yet even when budget, staffing, and patient participation are not limiting factors, the strict controls imposed in RCTs may not reflect real-world clinical practice.
As an alternative, real-world data (RWD) and real-world evidence (RWE) are increasingly recognized as valuable complements to traditional trials, even in regulated drug and device studies. The U.S. Food and Drug Administration (FDA) has endorsed the use of RWD and RWE in regulatory submissions. 10 Sources of RWD include electronic health records, administrative claims, clinical registries, patient-reported outcomes, wearable devices, and even social media, many of which are highly relevant to spine surgery research. However, the quality and completeness of these data sources can vary widely and should be carefully evaluated and transparently reported.
This distinction highlights a fundamental difference in research focus. RCTs are designed to answer the question of efficacy: “can the treatment work?” as applied under ideal conditions. Observational studies using RWD, by contrast, aim to evaluate effectiveness: “does the treatment work?” as applied in real-world settings. 11
Observational Studies Using Real-World Data Are Powerful Tools, but Confounding Factors Must Be Handled Thoughtfully to Avoid Misleading Findings
In observational studies, patients are observed based on the treatments they receive in clinical practice, without random assignment. These studies may be designed retrospectively or prospectively to compare an outcome. However, unlike randomized trials, observational studies face a fundamental challenge: patients in different treatment groups often differ in baseline characteristics, many of which may influence the measured outcomes. If these characteristics are associated with both the treatment and the outcome, they can confound the results. Failing to account for confounding can lead to biased estimates and misleading conclusions, undermining the applicability of findings to clinical care. Discretion as to what variables to include in a propensity score model is also consequential. Brookhart et al noted that, while neither will alter the bias of the results, including variables related to the treatment but not the outcome will increase the variance of the results and should be avoided, whereas variables related to the outcome but not the treatment will decrease the variance and should always be included. 12 Addressing potential confounding using properly selected variables is therefore critical in observational research to ensure the validity and trustworthiness of the result.
Traditional Methods to Deal With Confounding
Hypothetical Data Showing Unstratified Risk of Subsequent Compression Fracture Among Patients Undergoing Vertebroplasty Versus Conservative Care
At first glance, the data suggest that patients undergoing vertebroplasty have double the risk of subsequent vertebral fracture compared to those managed conservatively. The resultant ‘relative risk’ (RR) is 2. However, this conclusion may be misleading. It is essential to consider whether other variables such as age, BMI, and smoking status independently influence the risk of future fracture and differ between treatment groups. These differences, if present, can confound the observed association.
In this hypothetical example using RWD, baseline characteristics were reviewed at the time of the index fracture. Notably, a greater proportion of patients who underwent vertebroplasty were smokers compared to those in the conservative care group. Given that smoking is known to affect bone health and fracture risk, this imbalance could partially or fully explain the observed difference in outcomes.
Hypothetical Data Distribution Showing Age, BMI and Smoking Status Among Patients Undergoing Vertebroplasty Versus Conservative Care
Hypothetical Stratified Data Based on Smoking Status and Risk of Subsequent Compression Fracture Among Patients Undergoing Vertebroplasty Versus Conservative Care
While stratification is a useful method for illustrating confounding, it is limited in its ability to control for multiple confounders simultaneously, particularly when those confounders interact or are continuous variables. Multivariable regression offers a more powerful solution by incorporating several known confounders into a single statistical model. This approach enables the estimation of an adjusted treatment effect, accounting for baseline differences across groups. In essence, multivariable regression can be viewed as a more flexible and computationally sophisticated extension of stratification, allowing researchers to control for a broader set of covariates when comparing treatment outcomes.
Are Propensity Scores a Better Option?
Propensity scores are not new to clinical research. First introduced by Rosenbaum and Rubin in 1983, a propensity score represents the probability from 0 to 1 of an individual receiving a particular treatment (or exposure) based on observed baseline characteristics. 13 The score is typically estimated using multivariable statistical techniques such as logistic regression for different treatments. In observational research, propensity scores are used to address confounding by creating balance in baseline characteristics between treatment groups, thereby achieving a result like randomization. Simply stated, the propensity score is a balancing score. One major advantage over traditional regression methods is that it condenses multiple confounders into a single scalar value. This can be especially beneficial when studying outcomes with few events (eg, mortality, surgical site infection, postoperative fractures), or when the number of potential confounders is large relative to the sample size, as is common in surgical studies. However, it is critical to note that propensity scoring methods primarily address measured confounders. While bias from unmeasured covariates may be mitigated in the event that they are associated with measured confounders, the extent of this mitigation remains uncertain.14,15 Unlike randomization, propensity score methods may not reliably adjust for unmeasured or unknown variables, which can still bias the results.
Hypothetical Examples of Propensity Score Methodologies in Spine Surgery Research
VAS = Visual Analog Scale
Covariate Adjustment Using the Propensity Score
In this method, the propensity score representing the probability of receiving a particular treatment based on baseline covariates is included as an independent variable in a multivariable regression model, along with the treatment variable. The outcome of interest serves as the dependent variable. This approach allows estimation of the treatment effect while adjusting for selection bias, using a single balancing score rather than multiple covariates. Covariate adjustment is conceptually similar to traditional regression methods but provides a more streamlined and statistically efficient model. It is a common approach in spine and orthopedic research, with illustrative examples such as Phan et al.’s investigation of resident involvement in spine surgery outcomes. 16 In this study, propensity scores for complication risk were fed into a multivariate logistic regression model and divided into tertiles, allowing the Phan et al to adjust for baseline covariates. In turn, their adjusted analyses for each tertile distributed between risk, location of surgery, fusion vs non-fusion surgery, and inpatient vs outpatient status showed no significant differences in complication rates between surgeries with and without trainee involvement. 16 Covariate adjustment using the propensity score thus offers a readily applied method to decrease selection bias, one which has historically been amongst the most commonly employed propensity score models. As noted by Austin and Mamdani (2006), caution must still be applied with simple covariate adjustment; accurate and unbiased estimates of treatment effects generally require proper specification and fit testing for these models. 7
Stratification (Subclassification) Based on the Propensity Score
Stratification divides the study population into strata, commonly quintiles, based on their propensity scores. Within each stratum, outcomes are compared between treatment groups, and results are aggregated across strata to estimate the overall treatment effect. While less frequently used in spine literature, this method is well-established in observational research. Notable applications include the study by Pugely et al comparing complication rates between spinal and general anesthesia in total knee arthroplasty, and more recent work by Maislin et al evaluating RCT vs observational data methodologies.17,18 Pugely et al first recorded all patient characteristics, comorbidities, and operative characteristics before calculating unadjusted complication rates for both general and spinal anesthesia groups. Then, they controlled for selection bias between the anesthesia groups using a covariate adjustment to eliminate significant baseline differences, followed with a stratification by propensity score quintiles to enable adjusted comparison between the two cohorts. 17 Similarly, Maislin et al showed how stratification by quintiles could be used to compare operations like dynamic sagittal tethering to decompression and Transforaminal Lumbar Interbody Fusion, using quintile stratification by propensity score to eliminate differences in baseline covariates. 18 Stratification and subclassification into quintiles carries advantages in that it both preserves sample size, and thus statistical power, while eliminating at least 90% of covariate-related bias.19,20
Matching on the Propensity Score
Matching involves pairing treated and untreated individuals with similar propensity scores to create comparable groups. One-to-one or one-to-many matching can be used, and unmatched individuals are excluded, which may reduce sample size but improves internal validity. This method is widely used in spine surgery research. For example, Mahan et al conducted a matched analysis of postoperative infection rates in full-endoscopic vs traditional spine surgery, using propensity scores to control for the significantly higher rates of comorbidities in patients who underwent full-endoscopic spine surgery and enabling a valid comparison. 21 Echt et al used a similar matching protocol to evaluate fusion vs decompression in adult lumbar scoliosis, reporting that both groups had similar rates of significant improvements after adjusted comparison. 22 Compared to traditional regression techniques, matching offers an advantage in that it can better elucidate areas in which there is not substantial overlap between treated and untreated groups, thereby decreasing potential effects of extrapolation. 19
Inverse Probability of Treatment Weighting (IPTW)
IPTW assigns each patient a weight based on the inverse probability of receiving the treatment they actually received, creating a “pseudopopulation” where covariates are balanced across treatment groups. This allows for estimation of marginal (population-average) treatment effects and preserves sample size. Weights are defined as 1/propensity score for treated individuals and 1/(1 − propensity score) for controls. IPTW has become increasingly prevalent in spine research over the past decade, with recent examples by Nakajima et al (2023), Yadav et al (2024), Huang et al (2015), and Bovonratwet et al (2025).23-26 In the following, we provide a more holistic overview of IPTW techniques supplemented by another hypothetical example.
A Closer Look: Inverse Probability of Treatment Weighting May be a Better Choice
IPTW offers several advantages over alternative propensity score methods. Unlike propensity score matching, which requires discarding unmatched individuals, IPTW retains the entire sample, thereby preserving statistical power. Furthermore, while matching typically facilitates comparisons between only two groups, IPTW can accommodate multiple treatment categories or continuous exposure measures. Compared to traditional propensity score adjustment or stratification, IPTW has also been shown to yield less biased estimates of treatment effects. 27
To illustrate this, let us revisit the hypothetical vertebroplasty example. In this hypothetical dataset, a higher percentage of smokers undergo vertebroplasty compared to conservative treatment. We previously determined that smoking status is associated with a higher risk of subsequent fracture, regardless of treatment type. Therefore, to achieve balance between treatment groups with respect to smoking status, we apply IPTW.
Among smokers, the probability of receiving conservative treatment is 20 out of 60 (33.3%), equivalent to a propensity score of 0.333. To balance the groups, each smoker in the conservative care group is assigned a weight equal to the inverse of this score: 1/0.333 ≈ 3. This effectively triples their influence in the analysis, creating what is often referred to as a pseudopopulation. 28 Conversely, smokers who received vertebroplasty are assigned a weight of 1/(1 − 0.333) ≈ 1.5.
Comparison of Subsequent Vertebral Fracture Rates Between the Original Unadjusted Sample and the IPTW-Adjusted Pseudopopulation Using Hypothetical Data
Most modern statistical software programs can calculate IPTW for both single and multiple covariates. Once calculated, these weights can be applied within regression models such as weighted logistic regression for dichotomous outcomes or weighted linear regression for continuous outcomes, thereby allowing the estimation of treatment effects. Caution must be exercised when dealing with extreme weights, as they can disproportionately influence the model, inflating variance and widening confidence intervals. 27 To address this, various techniques such as weight truncation, trimming, or stabilization are available in most statistical packages. Trimming (also referred to as restriction or propensity score trimming) involves excluding individuals whose estimated propensity scores fall outside a pre-specified range, such as below the 1st percentile or above the 99th percentile. This approach is most appropriate when there is clear lack of overlap between treatment groups. By restricting the analytic sample to individuals with adequate overlap, trimming improves internal validity and estimation stability but changes the target population. In contrast, truncation (or weight capping) retains all observations but limits the magnitude of extreme inverse probability weights by capping them at a specified threshold (eg, the 99th percentile of the weight distribution). Truncation is generally used when overall overlap is acceptable, but a small number of observations generate disproportionately large weights that inflate variance and reduce effective sample size. Methodologically, both trimming and truncation should be implemented after propensity score estimation but before final treatment effect estimation.
Further, decisions regarding trimming or truncation should be prespecified whenever possible and should not be based on treatment effect estimates or statistical significance. Although a full review of these methods is beyond the scope of this article, their use is critical for robust estimation. Finally, while the propensity score functions as a balancing score, it assumes that all relevant confounders are correctly specified and measured. If these assumptions are violated, residual imbalance may persist. Accordingly, statistical software packages provide diagnostic tools (eg, standardized mean differences, balance plots) to assess covariate balance before estimating final treatment effects.
Well-designed and executed observational studies can often produce similar results as RCTs, with Rosenbaum and Rubin (1984) providing helpful guidance on an iterative approach to take when specifying propensity score models.9,29,30 As discussed above, each of the four primary propensity score methods carries its own strengths and weaknesses. Regardless of which propensity score methods researchers ultimately select for their investigation, ensuring the correct specification of the model is perhaps one of the most important considerations in any study design. There should be no systemic differences in baseline covariates between the treatment groups. 8 Adequacy of specification can be checked via a number of techniques including variance ratios, five-number summaries, graphical summaries, and the comparison of variable interactions.7,8,31 Researchers may also elect to consider other aspects of propensity score methods not discussed in detail here. There is a distinction between analysis of the Average Treatment Effect (ATE), which represents the average effect of treatment on the overall population of interest. 32 The average treatment effect for the treated (ATT), by contrast, measures the average effect only on treated subjects. 32 Though beyond the scope of this review, discernment regarding which of these metrics to evaluate, and in what contexts, may be of benefit when designing studies using propensity score methods.8,32,33 For interested readers, we have selected seven core articles as providing a foundational overview of propensity score methods and their applications.7,8,12,13,18,29,33 We have selected an additional ten references as further reading, providing a deeper understanding of these tools and their applications.3,4,9,14,19,20,27,28,30,32
Summary
Propensity score methods, particularly inverse probability of treatment weighting (IPTW), offer a powerful strategy for addressing the problem of confounding that may impair observational spine surgery research when randomization is not feasible. The core principle involves balancing treatment groups on key baseline characteristics that are both unequally distributed and associated with the outcome of interest. Among the available approaches, IPTW has gained traction for its ability to preserve sample size and estimate marginal treatment effects. However, like all methods, it has limitations: the most notable is the influence of extreme weights, which can increase variance and distort estimates. With careful implementation, diagnostic assessment, and attention to underlying assumptions, IPTW and other propensity score techniques can significantly enhance the validity of observational studies in spine research. We encourage researchers to consider incorporating these statistical tools thoughtfully in future analyses.
Footnotes
Ethical Considerations
This narrative review was based exclusively on publicly available research and did not involve the collection or analysis of real-world patient data. Institutional review board approval was not required.
Consent to Participate
No patients were included in this study; therefore, no informed consent was necessary.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: J.R. Chapman reports a relationship with Globus Medical Inc. that includes consulting or advisory. R.J. Oskouian reports a relationship with Globus Medical Inc., DePuy Synthes, NuVasive, and Stryker that includes consulting or advisory. The remaining authors have no conflicts to report.
