Propensity Scores for Observational Studies: Putting Real World Data and Experience to Good Use in Spine Research!

Abstract

Study Design

Narrative Review.

Objectives

Observational studies using real-world data (RWD) have become increasingly popular, though they are susceptible to selection bias. Propensity score methods offer a powerful statistical approach to mitigate bias by balancing patient characteristics across treatment groups. This review demystifies the four primary propensity score techniques and highlights the growing role of a newer strategy called inverse probability weighting in registry-based spine research.

Methods

We explore the applications of propensity score methods in spine surgery research through the presentation of a number of hypothetical and real-world examples from recent literature. Further, we compare their utility to traditional analytic techniques such as multivariable regression.

Results

The four primary applications are (1) covariate adjustment using the propensity score, (2) stratification based on the propensity score, (3) matching on the propensity score, and (4) inverse probability of treatment weighting. These techniques aid in the minimization of confounding leading to spurious results, allowing for similar effects to randomization within the setting of observational research.

Conclusions

While propensity score methods are not a substitute for randomization, these tools provide an essential framework for strengthening causal inference assessments when randomized controlled trials (RCTs) are not feasible or appropriate. When RCTs are practical, propensity score methods may aid spine care practitioners in deriving objective, complementary findings from observational studies of RWD. Inverse probability of treatment methods are particularly promising due to their greater sample size efficiency, capability for multivariable comparisons, and potentially reduced bias compared to traditional propensity score methods.

Keywords

propensity score randomized control trial inverse probability weighting

Introduction

Evaluating treatment effects in surgery is inherently challenging. Despite researchers’ best efforts to control for confounding variables, there remain numerous patient-, surgeon-, and treatment-specific factors that are difficult to measure and directly compare. Historically, randomized controlled trials (RCTs) have served as the gold standard for comparing treatments and outcomes in both medicine and surgery. While their utility is clear, RCTs are not without inherent challenges. Such trials are often prohibitively expensive and logistically difficult to conduct, particularly in surgical disciplines. Many randomized controlled trials do not study real-world populations; excluding patients with significant comorbidities may lead to unbiased results, though in a distinct target population from actual practice. Furthermore, simple randomization alone addresses, but cannot guarantee, covariate balance. Finally, there are substantial patient factors that interfere with patient enrollment for prospective RCTs, such as preference for one treatment over another, dislike, or distrust of randomization to the point of feeling like a ‘guinea pig,’ concerns over increased burdens involved with the study and other factors.^1,2 As a result, observational studies have gained traction due to their relative greater feasibility and lower cost.

In this context, propensity score methods have emerged as a preferred statistical approach to address confounding in observational spine surgery research. A propensity score reflects the likelihood that a patient receives a particular treatment based on observed baseline characteristics. In spine surgery, many factors influence surgical decision-making, and it is common for patients undergoing different procedures for the same condition to differ systematically at baseline. Propensity scores provide a way to adjust for these imbalances when comparing outcomes across treatment groups. Their utility is evidenced by their rapid proliferation throughout spine research. A PubMed search found that, where studies on “Spine surgery” have increased almost fourfold from 2015 to 2025 (1336 articles in 2015 to 4372 in 2025), studies investigating both “Spine surgery” and “propensity score” have increased by over forty-fold over the same time period (3 articles in 2015 to 131 in 2025). Indeed, propensity scores have already been applied to regulatory processes, proving relevant in the approval of medical devices like the Simplify and M6-C artificial cervical discs.^3-6 In such a dynamic landscape, we thus believe a firm understanding of propensity score techniques will prove highly beneficial for spine care practitioners to both interpret and design cutting-edge research.

There are four primary propensity score techniques: (1) covariate adjustment using the propensity score, (2) stratification based on the propensity score, (3) matching on the propensity score, and (4) inverse probability of treatment weighting (IPTW).^7,8 In this article, we aim to demystify these methods through the lens of hypothetical spine surgery research examples.

To Randomize, or Not: Randomized Controlled Studies May Not Reflect Real-World Clinical Scenarios

With unlimited funding, dedicated research staff, and two treatments with clear equipoise based on a meaningful research question, a RCT has the chance to provide the highest level of evidence. However, particularly in surgical research, RCTs are often impractical due to logistical complexity, high costs, lengthy timelines with challenging enrollment dynamics, and concerns over generalizability or ultimately ethical feasibility.⁹ Randomization ideally ensures that, on average, patients in each treatment arm are similar with respect to both measured and unmeasured factors that influence outcomes independently of the intervention. Yet even when budget, staffing, and patient participation are not limiting factors, the strict controls imposed in RCTs may not reflect real-world clinical practice.

As an alternative, real-world data (RWD) and real-world evidence (RWE) are increasingly recognized as valuable complements to traditional trials, even in regulated drug and device studies. The U.S. Food and Drug Administration (FDA) has endorsed the use of RWD and RWE in regulatory submissions.¹⁰ Sources of RWD include electronic health records, administrative claims, clinical registries, patient-reported outcomes, wearable devices, and even social media, many of which are highly relevant to spine surgery research. However, the quality and completeness of these data sources can vary widely and should be carefully evaluated and transparently reported.

This distinction highlights a fundamental difference in research focus. RCTs are designed to answer the question of efficacy: “can the treatment work?” as applied under ideal conditions. Observational studies using RWD, by contrast, aim to evaluate effectiveness: “does the treatment work?” as applied in real-world settings.¹¹

Observational Studies Using Real-World Data Are Powerful Tools, but Confounding Factors Must Be Handled Thoughtfully to Avoid Misleading Findings

In observational studies, patients are observed based on the treatments they receive in clinical practice, without random assignment. These studies may be designed retrospectively or prospectively to compare an outcome. However, unlike randomized trials, observational studies face a fundamental challenge: patients in different treatment groups often differ in baseline characteristics, many of which may influence the measured outcomes. If these characteristics are associated with both the treatment and the outcome, they can confound the results. Failing to account for confounding can lead to biased estimates and misleading conclusions, undermining the applicability of findings to clinical care. Discretion as to what variables to include in a propensity score model is also consequential. Brookhart et al noted that, while neither will alter the bias of the results, including variables related to the treatment but not the outcome will increase the variance of the results and should be avoided, whereas variables related to the outcome but not the treatment will decrease the variance and should always be included.¹² Addressing potential confounding using properly selected variables is therefore critical in observational research to ensure the validity and trustworthiness of the result.

Traditional Methods to Deal With Confounding

Several statistical methods can be employed to control for confounding in observational studies. Two of the most used are patient stratification and multivariable regression. To clarify, we present an example using data stratification. Consider the hypothetical study question: Does vertebroplasty increase the risk of subsequent vertebral compression fracture compared to conservative management in patients with an index osteoporotic fracture within the first year after treatment? What other factors may influence the risk of subsequent fracture in a patient independent of the treatment? Certainly, patient age, body mass index (BMI), and smoking status likely play a role. In this hypothetical example, suppose 400 patients with index vertebral fractures are studied. Two hundred receive vertebroplasty and 200 undergo conservative non-operative management. During the study period, we observe 45 subsequent fractures, distributed as demonstrated in Table 1.

Table 1.

Hypothetical Data Showing Unstratified Risk of Subsequent Compression Fracture Among Patients Undergoing Vertebroplasty Versus Conservative Care

Vertebroplasty (N = 200)	Conservative care (N = 200)	Relative risk (RR) (95% confidence interval)
30/200 (15%)	15/200 (7.5%)	2.0 (1.1, 3.6)

At first glance, the data suggest that patients undergoing vertebroplasty have double the risk of subsequent vertebral fracture compared to those managed conservatively. The resultant ‘relative risk’ (RR) is 2. However, this conclusion may be misleading. It is essential to consider whether other variables such as age, BMI, and smoking status independently influence the risk of future fracture and differ between treatment groups. These differences, if present, can confound the observed association.

In this hypothetical example using RWD, baseline characteristics were reviewed at the time of the index fracture. Notably, a greater proportion of patients who underwent vertebroplasty were smokers compared to those in the conservative care group. Given that smoking is known to affect bone health and fracture risk, this imbalance could partially or fully explain the observed difference in outcomes.

Given the additional data presented in Table 2, stratifying the results by smoking status reveals that the risk of subsequent fractures is similar between treatment groups within each stratum (smokers vs non-smokers). In both subgroups, the relative risk approximates 1.0, suggesting no significant treatment effect (Table 3). This finding indicates that the initial observed difference in fracture rates was likely due to confounding by smoking status, rather than the treatment itself. In other words, patients who smoked not only had a higher baseline risk of fracture but also exhibited a greater propensity to receive vertebroplasty compared to conservative management.

Table 2.

Hypothetical Data Distribution Showing Age, BMI and Smoking Status Among Patients Undergoing Vertebroplasty Versus Conservative Care

	Vertebroplasty N = 200	Conservative care N = 200
Age (years), mean ± SD	78.2 ± 4.1	79.0 ± 5.2
BMI (kg/m²), mean ± SD	31.5 ± 2.3	29.5 ± 2.1
Smoking status, n, (%)	40 (20%)	20 (10%)

Table 3.

Hypothetical Stratified Data Based on Smoking Status and Risk of Subsequent Compression Fracture Among Patients Undergoing Vertebroplasty Versus Conservative Care

Smoker (N = 60)			Non-smoker (N = 340)
Vertebroplasty	Conservative	RR (95%CI)	Vertebroplasty	Conservative	RR (95%CI)
10/40 (25%)	5/20 (25%)	1.0 (0.4, 2.5)	15/160 (9%)	15/180 (8%)	1.1 (0.6, 2.2)

While stratification is a useful method for illustrating confounding, it is limited in its ability to control for multiple confounders simultaneously, particularly when those confounders interact or are continuous variables. Multivariable regression offers a more powerful solution by incorporating several known confounders into a single statistical model. This approach enables the estimation of an adjusted treatment effect, accounting for baseline differences across groups. In essence, multivariable regression can be viewed as a more flexible and computationally sophisticated extension of stratification, allowing researchers to control for a broader set of covariates when comparing treatment outcomes.

Are Propensity Scores a Better Option?

Propensity scores are not new to clinical research. First introduced by Rosenbaum and Rubin in 1983, a propensity score represents the probability from 0 to 1 of an individual receiving a particular treatment (or exposure) based on observed baseline characteristics.¹³ The score is typically estimated using multivariable statistical techniques such as logistic regression for different treatments. In observational research, propensity scores are used to address confounding by creating balance in baseline characteristics between treatment groups, thereby achieving a result like randomization. Simply stated, the propensity score is a balancing score. One major advantage over traditional regression methods is that it condenses multiple confounders into a single scalar value. This can be especially beneficial when studying outcomes with few events (eg, mortality, surgical site infection, postoperative fractures), or when the number of potential confounders is large relative to the sample size, as is common in surgical studies. However, it is critical to note that propensity scoring methods primarily address measured confounders. While bias from unmeasured covariates may be mitigated in the event that they are associated with measured confounders, the extent of this mitigation remains uncertain.^14,15 Unlike randomization, propensity score methods may not reliably adjust for unmeasured or unknown variables, which can still bias the results.

Once the propensity score has been calculated, it can be incorporated into statistical analyses using one of four primary methods: (1) Adjustment using the propensity score (2) Stratification based on propensity score (3) Matching on propensity score and (4) Inverse probability weighting. Here we discuss all four strategies with both hypothetical examples and examples from the literature (Table 4).

Table 4.

Hypothetical Examples of Propensity Score Methodologies in Spine Surgery Research

Method	Illustrative hypothetical study	Implementation summary
Covariate adjustment	Does anterior cervical discectomy and fusion (ACDF) lead to better functional outcomes compared to posterior cervical laminectomy and fusion for multilevel cervical spondylotic myelopathy?	Propensity scores are calculated for each patient based on baseline characteristics (eg, age, sex, BMI, nurick grade, etc.). These scores are then included as covariates in a multivariable regression model alongside the treatment group to estimate the adjusted treatment effect (eg, 12-month mJOA score).
Stratification (subclassification)	Is lumbar fusion associated with lower rates of mechanical back pain at 1 year compared to laminectomy alone in patients with degenerative grade I spondylolisthesis?	Patients are stratified into quintiles based on their propensity scores, estimated using covariates such as age, sex, BMI, and pelvic incidence. Outcomes are compared within each stratum, and pooled to calculate an overall adjusted treatment effect (eg, VAS back pain)
Matching	Does minimally invasive TLIF result in lower postoperative infection rates compared to open TLIF?	Propensity scores are calculated from baseline factors (eg, age, sex, BMI, operative level). Each MIS-TLIF patient is matched 1:1 to an open TLIF patient with a similar score. Unmatched patients are excluded. Outcomes (eg, infection rates) are compared within the matched cohort.
Inverse probability of treatment weighting (IPTW)	Does early surgery (<24 hours) improve motor recovery compared to delayed surgery (>24 hours) in patients with acute traumatic central cord syndrome?	Propensity scores are calculated based on variables such as age, gender, preoperative motor scores, and MRI findings. Weights are assigned as: early = 1/PS, delayed = 1/(1–PS). These are incorporated into a weighted logistic regression model to estimate the likelihood of achieving a ≥2-grade ASIA motor improvement.

VAS = Visual Analog Scale

Covariate Adjustment Using the Propensity Score

In this method, the propensity score representing the probability of receiving a particular treatment based on baseline covariates is included as an independent variable in a multivariable regression model, along with the treatment variable. The outcome of interest serves as the dependent variable. This approach allows estimation of the treatment effect while adjusting for selection bias, using a single balancing score rather than multiple covariates. Covariate adjustment is conceptually similar to traditional regression methods but provides a more streamlined and statistically efficient model. It is a common approach in spine and orthopedic research, with illustrative examples such as Phan et al.’s investigation of resident involvement in spine surgery outcomes.¹⁶ In this study, propensity scores for complication risk were fed into a multivariate logistic regression model and divided into tertiles, allowing the Phan et al to adjust for baseline covariates. In turn, their adjusted analyses for each tertile distributed between risk, location of surgery, fusion vs non-fusion surgery, and inpatient vs outpatient status showed no significant differences in complication rates between surgeries with and without trainee involvement.¹⁶ Covariate adjustment using the propensity score thus offers a readily applied method to decrease selection bias, one which has historically been amongst the most commonly employed propensity score models. As noted by Austin and Mamdani (2006), caution must still be applied with simple covariate adjustment; accurate and unbiased estimates of treatment effects generally require proper specification and fit testing for these models.⁷

Stratification (Subclassification) Based on the Propensity Score

Stratification divides the study population into strata, commonly quintiles, based on their propensity scores. Within each stratum, outcomes are compared between treatment groups, and results are aggregated across strata to estimate the overall treatment effect. While less frequently used in spine literature, this method is well-established in observational research. Notable applications include the study by Pugely et al comparing complication rates between spinal and general anesthesia in total knee arthroplasty, and more recent work by Maislin et al evaluating RCT vs observational data methodologies.^17,18 Pugely et al first recorded all patient characteristics, comorbidities, and operative characteristics before calculating unadjusted complication rates for both general and spinal anesthesia groups. Then, they controlled for selection bias between the anesthesia groups using a covariate adjustment to eliminate significant baseline differences, followed with a stratification by propensity score quintiles to enable adjusted comparison between the two cohorts.¹⁷ Similarly, Maislin et al showed how stratification by quintiles could be used to compare operations like dynamic sagittal tethering to decompression and Transforaminal Lumbar Interbody Fusion, using quintile stratification by propensity score to eliminate differences in baseline covariates.¹⁸ Stratification and subclassification into quintiles carries advantages in that it both preserves sample size, and thus statistical power, while eliminating at least 90% of covariate-related bias.^19,20

Matching on the Propensity Score

Matching involves pairing treated and untreated individuals with similar propensity scores to create comparable groups. One-to-one or one-to-many matching can be used, and unmatched individuals are excluded, which may reduce sample size but improves internal validity. This method is widely used in spine surgery research. For example, Mahan et al conducted a matched analysis of postoperative infection rates in full-endoscopic vs traditional spine surgery, using propensity scores to control for the significantly higher rates of comorbidities in patients who underwent full-endoscopic spine surgery and enabling a valid comparison.²¹ Echt et al used a similar matching protocol to evaluate fusion vs decompression in adult lumbar scoliosis, reporting that both groups had similar rates of significant improvements after adjusted comparison.²² Compared to traditional regression techniques, matching offers an advantage in that it can better elucidate areas in which there is not substantial overlap between treated and untreated groups, thereby decreasing potential effects of extrapolation.¹⁹

Inverse Probability of Treatment Weighting (IPTW)

IPTW assigns each patient a weight based on the inverse probability of receiving the treatment they actually received, creating a “pseudopopulation” where covariates are balanced across treatment groups. This allows for estimation of marginal (population-average) treatment effects and preserves sample size. Weights are defined as 1/propensity score for treated individuals and 1/(1 − propensity score) for controls. IPTW has become increasingly prevalent in spine research over the past decade, with recent examples by Nakajima et al (2023), Yadav et al (2024), Huang et al (2015), and Bovonratwet et al (2025).^23-26 In the following, we provide a more holistic overview of IPTW techniques supplemented by another hypothetical example.

A Closer Look: Inverse Probability of Treatment Weighting May be a Better Choice

IPTW offers several advantages over alternative propensity score methods. Unlike propensity score matching, which requires discarding unmatched individuals, IPTW retains the entire sample, thereby preserving statistical power. Furthermore, while matching typically facilitates comparisons between only two groups, IPTW can accommodate multiple treatment categories or continuous exposure measures. Compared to traditional propensity score adjustment or stratification, IPTW has also been shown to yield less biased estimates of treatment effects.²⁷

To illustrate this, let us revisit the hypothetical vertebroplasty example. In this hypothetical dataset, a higher percentage of smokers undergo vertebroplasty compared to conservative treatment. We previously determined that smoking status is associated with a higher risk of subsequent fracture, regardless of treatment type. Therefore, to achieve balance between treatment groups with respect to smoking status, we apply IPTW.

Among smokers, the probability of receiving conservative treatment is 20 out of 60 (33.3%), equivalent to a propensity score of 0.333. To balance the groups, each smoker in the conservative care group is assigned a weight equal to the inverse of this score: 1/0.333 ≈ 3. This effectively triples their influence in the analysis, creating what is often referred to as a pseudopopulation.²⁸ Conversely, smokers who received vertebroplasty are assigned a weight of 1/(1 − 0.333) ≈ 1.5.

By applying these weights, the proportion of smokers becomes equal across both treatment groups, thus eliminating confounding from this variable and allowing a more valid comparison of treatment effects (Table 5).

Table 5.

Comparison of Subsequent Vertebral Fracture Rates Between the Original Unadjusted Sample and the IPTW-Adjusted Pseudopopulation Using Hypothetical Data

Percentage of smokers	Vertebroplasty	Conservative care
Original sample	40/200 = 20%	20/200 = 10%
Weighted sample	60/400 = 15%	60/400 = 15%

Most modern statistical software programs can calculate IPTW for both single and multiple covariates. Once calculated, these weights can be applied within regression models such as weighted logistic regression for dichotomous outcomes or weighted linear regression for continuous outcomes, thereby allowing the estimation of treatment effects. Caution must be exercised when dealing with extreme weights, as they can disproportionately influence the model, inflating variance and widening confidence intervals.²⁷ To address this, various techniques such as weight truncation, trimming, or stabilization are available in most statistical packages. Trimming (also referred to as restriction or propensity score trimming) involves excluding individuals whose estimated propensity scores fall outside a pre-specified range, such as below the 1st percentile or above the 99th percentile. This approach is most appropriate when there is clear lack of overlap between treatment groups. By restricting the analytic sample to individuals with adequate overlap, trimming improves internal validity and estimation stability but changes the target population. In contrast, truncation (or weight capping) retains all observations but limits the magnitude of extreme inverse probability weights by capping them at a specified threshold (eg, the 99th percentile of the weight distribution). Truncation is generally used when overall overlap is acceptable, but a small number of observations generate disproportionately large weights that inflate variance and reduce effective sample size. Methodologically, both trimming and truncation should be implemented after propensity score estimation but before final treatment effect estimation.

Further, decisions regarding trimming or truncation should be prespecified whenever possible and should not be based on treatment effect estimates or statistical significance. Although a full review of these methods is beyond the scope of this article, their use is critical for robust estimation. Finally, while the propensity score functions as a balancing score, it assumes that all relevant confounders are correctly specified and measured. If these assumptions are violated, residual imbalance may persist. Accordingly, statistical software packages provide diagnostic tools (eg, standardized mean differences, balance plots) to assess covariate balance before estimating final treatment effects.

Well-designed and executed observational studies can often produce similar results as RCTs, with Rosenbaum and Rubin (1984) providing helpful guidance on an iterative approach to take when specifying propensity score models.^9,29,30 As discussed above, each of the four primary propensity score methods carries its own strengths and weaknesses. Regardless of which propensity score methods researchers ultimately select for their investigation, ensuring the correct specification of the model is perhaps one of the most important considerations in any study design. There should be no systemic differences in baseline covariates between the treatment groups.⁸ Adequacy of specification can be checked via a number of techniques including variance ratios, five-number summaries, graphical summaries, and the comparison of variable interactions.^7,8,31 Researchers may also elect to consider other aspects of propensity score methods not discussed in detail here. There is a distinction between analysis of the Average Treatment Effect (ATE), which represents the average effect of treatment on the overall population of interest.³² The average treatment effect for the treated (ATT), by contrast, measures the average effect only on treated subjects.³² Though beyond the scope of this review, discernment regarding which of these metrics to evaluate, and in what contexts, may be of benefit when designing studies using propensity score methods.^8,32,33 For interested readers, we have selected seven core articles as providing a foundational overview of propensity score methods and their applications.^{7,8,12,13,18,29,33} We have selected an additional ten references as further reading, providing a deeper understanding of these tools and their applications.^{3,4,9,14,19,20,27,28,30,32}

Summary

Propensity score methods, particularly inverse probability of treatment weighting (IPTW), offer a powerful strategy for addressing the problem of confounding that may impair observational spine surgery research when randomization is not feasible. The core principle involves balancing treatment groups on key baseline characteristics that are both unequally distributed and associated with the outcome of interest. Among the available approaches, IPTW has gained traction for its ability to preserve sample size and estimate marginal treatment effects. However, like all methods, it has limitations: the most notable is the influence of extreme weights, which can increase variance and distort estimates. With careful implementation, diagnostic assessment, and attention to underlying assumptions, IPTW and other propensity score techniques can significantly enhance the validity of observational studies in spine research. We encourage researchers to consider incorporating these statistical tools thoughtfully in future analyses.

Footnotes

ORCID iDs

Luke L. Jouppi

Amir Abdul-Jabbar

Ethical Considerations

This narrative review was based exclusively on publicly available research and did not involve the collection or analysis of real-world patient data. Institutional review board approval was not required.

Consent to Participate

No patients were included in this study; therefore, no informed consent was necessary.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: J.R. Chapman reports a relationship with Globus Medical Inc. that includes consulting or advisory. R.J. Oskouian reports a relationship with Globus Medical Inc., DePuy Synthes, NuVasive, and Stryker that includes consulting or advisory. The remaining authors have no conflicts to report.

References

Abraham

Young

Solomon

. A systematic review of reasons for nonentry of eligible patients into surgical randomized controlled trials - PubMed. Surgery. 2006;139:469-483. doi:10.1016/j.surg.2005.08.014.

Phelps

Tutton

Griffin

Baird

TrAFFix study co-applicants . Facilitating trial recruitment: a qualitative study of patient and staff experiences of an orthopaedic trauma trial. Trials. 2019;20(1):492. doi:10.1186/s13063-019-3597-8.

Yue

. Statistical and regulatory issues with the application of propensity score analysis to nonrandomized medical device clinical studies - PubMed. J Biopharm Stat. 2007;17:1-13. doi:10.1080/10543400601044691.

Coric

Guyer

Bae

, et al. Prospective, multicenter study of 2-level cervical arthroplasty with a PEEK-on-ceramic artificial disc - PubMed. J Neurosurg Spine. 2022;37:357-367. doi:10.3171/2022.1.SPINE211264.

Guyer

Coric

Nunley

, et al. Single-level cervical disc replacement using a PEEK-on-Ceramic implant: results of a multicenter FDA IDE trial with 24-Month Follow-up. Int J Spine Surg. 2021;15:15-644. doi:10.14444/8084.

Phillips

Coric

Sasso

, et al. Prospective, multicenter clinical trial comparing M6-C compressible six degrees of freedom cervical disc with anterior cervical discectomy and fusion for the treatment of single-level degenerative cervical radiculopathy: 2-Year results of an FDA investigational device exemption study - PubMed. Spine J 2021;21:239-252. doi:10.1016/j.spinee.2020.10.014

Austin

Mamdani

. A comparison of propensity score methods: a case-study estimating the effectiveness of post-AMI statin use - PubMed. Statistics in medicine 06/30/. 2006;25:2084-2106. doi:10.1002/sim.2328.

Austin

. An introduction to propensity score methods for reducing the effects of confounding in observational studies - PubMed. Multivariate Behav Res. 2011;46:399-424. doi:10.1080/00273171.2011.568786.

Concato

Shah

Horwitz

. Randomized, controlled trials, observational studies, and the hierarchy of research designs - PubMed. N Engl J Med 2000;342:1887-1892. doi:10.1056/NEJM200006223422507

10.

United States Food and Drug

Administration.

Real-world data: assessing electronic health records and medical claims data to support regulatory decision-making for drug and biological products. Guidance for Industry

. July 2024.

11.

Thompson

. Replication of randomized, controlled trials using real-world data: what could Go wrong? - PubMed. Value Health. 2021;24:112-115. doi:10.1016/j.jval.2020.09.015.

12.

Brookhart

Schneeweiss

Rothman

Glynn

Avorn

Stürmer

. Variable selection for propensity score models. Am J Epidemiol. 2006;163:163-1156. doi:10.1093/aje/kwj149.

13.

Rosenbaum

Rubin

. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41-55. doi:10.2307/2335942.

14.

Zhou

Lam

. Discussion of: statistical and regulatory issues with the application of propensity score analysis to nonrandomized medical device clinical studies. J Biopharm Stat. 2007;17:25-27. doi:10.1080/10543400601044782.

15.

Maislin

Rubin

. Design of nonrandomized medical device trials based on subclassification using propensity score quintiles. In: 2010 Joint statistical meetings; Sunday, August 1, 2010; Vancouver, British Columbia.

16.

Phan

Stratton

Kingwell

Hoda

Wai

. Impact of resident involvement on cervical and lumbar spine surgery outcomes - PubMed. Spine J. 2019;19:1905-1910. doi:10.1016/j.spinee.2019.07.006.

17.

Pugely

Martin

Gao

Mendoza-Lattes

Callaghan

. Differences in short-term complications between spinal and general anesthesia for primary total knee arthroplasty - PubMed. J Bone Joint Surg Am. 2013;95:193-199. doi:10.2106/JBJS.K.01682.

18.

Maislin

Keenan

Alamin

, et al. Are randomized trials better? Comparison of baseline covariate balance of a propensity score-balanced lumbar spine IDE trial and comparable RCTs - PubMed. Glob Spine J. 2025;15:2923-2930. doi:10.1177/21925682251316287.

19.

Stuart

. Matching methods for causal inference: a review and a look forward - PubMed. Stat Sci. 2010;25:1-21. doi:10.1214/09-STS313.

20.

Rosenbaum

Rubin

. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Statistician 1985;39:33-88. doi:10.1080/00031305.1985.10479383

21.

Mahan

Prasse

Kim

, et al. Full-endoscopic spine surgery diminishes surgical site infections - a propensity score-matched analysis - PubMed. Spine J. 2023;23:695-702. doi:10.1016/j.spinee.2023.01.009.

22.

Echt

Bakare

Varela

, et al. Comparison of minimally invasive decompression alone versus minimally invasive short-segment fusion in the setting of adult degenerative lumbar scoliosis: a propensity score-matched analysis - PubMed. J Neurosurg Spine. 2023;39:394-403. doi:10.3171/2023.4.SPINE221047.

23.

Nakajima

Miyahara

Ohtomo

, et al. Impact of body mass index on outcomes after lumbar spine surgery - PubMed. Sci Rep. 2023;13:7862. doi:10.1038/s41598-023-35008-8.

24.

Yadav

Gold

Zaidi

Hwang

Wang

. Spinal fusion surgery use among adults with low back pain enrolled in a digital musculoskeletal program: an observational study - PubMed. BMC Muscoskelet Disord. 2024;25:520. doi:10.1186/s12891-024-07573-0.

25.

Huang

, et al. The relationship between changes of cervical sagittal alignment after anterior cervical discectomy and fusion and spino-pelvic sagittal alignment under roussouly classification: a four-year follow-up study - PubMed. BMC Muscoskelet Disord. 2017;18:87. doi:10.1186/s12891-017-1447-y.

26.

Bovonratwet

Holly

Xiang

, et al. Clinical outcomes following elective lumbar spine surgery in patients living with dementia - PubMed. Spine J. 2025;26:18-28. doi:10.1016/j.spinee.2025.01.040.

27.

Cole

Hernán

. Constructing inverse probability weights for marginal structural models - PubMed. Am J Epidemiol 2008;168:656-664. doi:10.1093/aje/kwn164

28.

Chesnaye

Stel

Tripepi

, et al. An introduction to inverse probability of treatment weighting in observational research - PubMed. Clin Kidney J. 2021;15:14-20. doi:10.1093/ckj/sfab158.

29.

Wang

Schneeweiss

Franklin

, et al. Emulation of randomized clinical trials with nonrandomized database analyses: results of 32 clinical trials - PubMed. JAMA 2023;329:1376-1385. doi:10.1001/jama.2023.4221

30.

Rosenbaum

Rubin

. Reducing bias in observational studies using subclassification on the propensity score. J Am Stat Assoc. 1984;79:79-524. doi:10.1080/01621459.1984.10478078.

31.

Austin

. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med. 2009;28:3083-3107. doi:10.1002/sim.3697.

32.

Imbens

. Nonparametric estimation of average treatment effects under exogeneity: a review. Rev Econ Stat. 2004;86:4-29. doi:10.1162/003465304323023651.

33.

Rubin

DB.

Matched Sampling for Causal Effects. Cambridge, England: Cambridge University Press; 2006:167-169. doi:10.1017/CBO9780511810725