Abstract
Background: Machine learning (ML) has emerged as a method to determine patient-specific risk for prolonged postoperative opioid use after orthopedic procedures. Purpose: We sought to analyze the efficacy and validity of ML algorithms in identifying patients who are at high risk for prolonged opioid use following orthopedic procedures. Methods: PubMed, EMBASE, and Web of Science Core Collection databases were queried for articles published prior to August 2021 that applied ML to predict prolonged postoperative opioid use following orthopedic surgeries. Features pertaining to patient demographics, surgical procedures, and ML algorithm performance were analyzed. Results: Ten studies met inclusion criteria: 4 spine, 3 knee, and 3 hip. Studies reported postoperative opioid use over 30 to 365 days and varied in defining prolonged use. Prolonged postsurgical opioid use frequency ranged from 4.3% to 40.9%. C-statistics for spine studies ranged from 0.70 to 0.81; for knee studies, 0.75 to 0.77; and for hip studies, 0.71 to 0.77. Brier scores for spine studies ranged from 0.039 to 0.076; for knee, 0.01 to 0.124; and for hip, 0.052 to 0.21. Seven articles reported calibration intercept (range: –0.02 to 0.16) and calibration slope (range: 0.88 to 1.08). Nine articles included a decision curve analysis. No investigations performed external validation. Thematic predictors of prolonged postoperative opioid use were preoperative opioid, benzodiazepine, or antidepressant use and extremes of age depending on procedure population. Conclusions: This systematic review found that ML algorithms created to predict risk for prolonged postoperative opioid use in orthopedic surgery patients demonstrate good discriminatory performance.
The frequency and predictive features of prolonged postoperative opioid use identified were consistent with existing literature, although algorithms remain limited by a lack of external validation and imperfect adherence to predictive modeling guidelines.
Introduction
Opioid misuse is an increasingly deadly and costly crisis in the United States. In 2020, opioids were involved in 74.8% of all drug overdose deaths [7]. The number of opioid-related fatalities is now more than 6 times that of the 1990s, when opioid prescriptions surged in response to a call for better treatment of pain [28,44]. Strikingly, a recent study found that orthopedic surgeons prescribe nearly 8% of all opioids in the United States [5]. This position endows orthopedic surgeons with both accountability and opportunity to be responsible stewards of opioid prescription practices.
The clinical use of opioid prescribing guidelines and opioid-sparing pain management protocols, including alternative postoperative analgesic regimens, is essential to protect orthopedic patients from the risks of prolonged postoperative opioid use and misuse. Studies analyzing large patient registries and insurance databases have attempted to retrospectively identify trends in postoperative opioid use and pinpoint risk factors for prolonged use after orthopedic surgery [4,24]. While some factors, such as preoperative opioid use, are widely reported as having important associations with prolonged use [24,35], there remain conflicting reports regarding other independent risk factors, such as age. Some studies have cited an increased risk at age over 50 years, while others cite an age less than 30 years [30,43]. Predictors of opioid use may vary between different orthopedic populations; thus, identification of specific risk factors is necessary to properly assess and educate patients on individualized risk. Furthermore, this information is essential when considering interventions to mitigate risk, such as opioid holidays prior to surgical intervention.
To identify predictive factors for prolonged postoperative opioid use, machine learning (ML) studies have emerged as a method to determine patient-specific risk. ML offers statistically driven predictive modeling wherein algorithm-based tools improve their predictive ability using new experience and data. ML models are also capable of modeling complex associations built upon large datasets, enabling the clinician to integrate a wide variety of patient-specific features to predict a personalized outcome. ML algorithms can be described as supervised, unsupervised, semi-supervised, or reinforcement based; supervised models are common in the medical literature as they are easily built upon preexisting patient databases in which input and output data have been collected [8]. Common supervised ML models include random forest, neural networks, support vector machines, and naive Bayes, among many others [18]. ML models arrive at conclusions through differing pathways, each with its own flaws and strengths and none being grossly superior across all scenarios. Thus, in designing an ML model, investigators traditionally employ several algorithmic methodologies, then select the best-performing model. Complex modeling achieved by ML is more dynamic than traditional statistical modeling, which is inherently static and limited by collinearity [17]. Also, ML tools can be delivered via readily accessible online applications, which enable clinician and patient to calculate individualized risk in the clinical setting. Given the predictive potential of these tools in estimating patients’ risk for prolonged opioid use, a better understanding of the efficacy and validity of these models may aid in the prevention of long-term opioid use by elucidating who may be at higher risk, thus allowing for targeted interventions.
We set out to conduct a systematic review to analyze the efficacy and validity of ML algorithms in identifying factors that increase patients’ risk for prolonged opioid use after orthopedic surgery. Performance of both internal and external validation was examined. Specifically, definitions of prolonged opioid use, quantitative measures of opioid consumption, risk factors for prolonged opioid use, and successful clinical implementation of ML algorithms were investigated. We hypothesized that current ML algorithms would demonstrate good-to-excellent performance for predicting prolonged postoperative opioid use in patients after orthopedic surgery, but that there would be substantial variability in the parameters used to define prolonged use and adherence to algorithm reporting guidelines.
Methods
Study identification and selection process for this systematic review was performed according to the 2020 Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) guidelines (Supplemental File 1) [33]. The review was registered with PROSPERO prior to commencement (ID: CRD42021259523). The following databases were searched for articles published prior to August 1, 2021: PubMed, EMBASE, and the Web of Science Core Collection. The search terminology used to query the databases is available in Supplemental File 2. All articles were reviewed with no additional restrictions.
Two independent reviewers (L.M.K. and K.J.) screened all abstracts of the identified articles for agreement with the following inclusion criteria: (1) available in English; (2) presenting original data; and (3) reporting on the use of ML in orthopedic literature to predict postoperative opioid use. The following exclusion criteria were applied to the queried articles: (1) basic science or biomechanics articles; (2) review articles; (3) case reports; (4) technical notes; (5) editorial notes; and (6) articles reporting on patient outcomes outside the context of postoperative opioid use. Full-length texts were reviewed when the article title and abstract were insufficient for screening purposes. The references of the included articles were also screened to ensure all relevant studies were included in this review. All queried articles were screened using Covidence, an online systematic review manager.
Two independent investigators (L.M.K. and K.J.) extracted the following from each study: surgical intervention, sample size, average patient age, number and type of study sites, level of preoperative opioid exposure, definition of prolonged postoperative opioid exposure, types of ML algorithms, and ML model performance metrics such as discrimination, Brier score, calibration, and decision curve analysis [41]. These performance metrics are summarized in Table 1; the pearls and pitfalls are detailed throughout the ML literature [3,10,42]. As available, feature selection, handling of missing data, predictive features of prolonged opioid use, and validation methods were also collected. Exact data and statistics were reported when provided. Disagreement in extracted content between investigators was settled by a third, independent reviewer (K.N.K.).
Table 1. Machine learning performance metrics.
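As a minimal sketch of how two of the metrics summarized in Table 1 are computed, the Python functions below calculate a Brier score and a c-statistic from observed outcomes (1 = prolonged use, 0 = no prolonged use) and predicted probabilities. The function names and the toy data are illustrative assumptions and are not taken from any included study.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def c_statistic(y_true, y_prob):
    """Probability that a random case is ranked above a random non-case.

    Equivalent to the area under the ROC curve; ties count as one half.
    """
    cases = [p for y, p in zip(y_true, y_prob) if y == 1]
    controls = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not cases or not controls:
        raise ValueError("both outcome classes must be present")
    concordant = 0.0
    for p_case in cases:
        for p_control in controls:
            if p_case > p_control:
                concordant += 1.0
            elif p_case == p_control:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))


# Hypothetical data: four patients, two with prolonged opioid use.
outcomes = [0, 0, 1, 1]
predicted = [0.1, 0.4, 0.35, 0.8]
print(brier_score(outcomes, predicted))
print(c_statistic(outcomes, predicted))
```

Because a model that assigns every patient the observed event rate p achieves a Brier score of p(1 − p), Brier scores are best read alongside the frequency of the outcome.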
The Methodological Index for Non-Randomized Studies (MINORS) criteria were used to assess the methodologic quality of each study [39]. Noncomparative studies are assessed with a maximum score of 16 and comparative studies with a maximum score of 24. Higher MINORS scores are representative of greater methodological quality. Two independent reviewers scored each article; inter-rater reliability is represented by Cohen’s κ coefficient calculated in Microsoft Excel (Version 16.51).
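The agreement statistic can be illustrated with a short sketch. The function below computes unweighted Cohen's κ for two raters who score each MINORS item as 0, 1, or 2. The unweighted form, the function name, and the example scores are assumptions made for illustration; the review itself reports only that κ was calculated in Microsoft Excel.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1.0 - expected)


# Hypothetical item-level scores (0, 1, or 2) from two reviewers.
reviewer_1 = [2, 2, 1, 0, 2, 1, 2, 0]
reviewer_2 = [2, 2, 1, 0, 2, 2, 2, 0]
print(round(cohens_kappa(reviewer_1, reviewer_2), 3))
```

A weighted κ, which gives partial credit for near agreement on ordinal scores, would be a reasonable alternative for item scores of this kind.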
Reported adherence to the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement and Journal of Medical Internet Research (JMIR) Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research was also recorded [9,27]. Articles were further assessed for adherence to each criterion of the TRIPOD checklist of items recommended to be included in the development, reporting, and interpretation of predictive modeling such as ML to enhance the reproducibility and transparency of the models.
Results
Of 1480 studies reviewed, 10 studies that used ML to predict prolonged postoperative opioid use in patients after orthopedic surgery met inclusion criteria (Fig. 1). All 10 studies developed new ML models and were published between 2019 and 2021; no studies performed external validation of an existing ML model. Four (40%) studies reported prolonged opioid use following spine surgery, 3 (30%) following knee surgery, and 3 (30%) following hip surgery (Table 2). The search strategy did not identify studies that used ML to predict prolonged postoperative opioid use following hand or wrist, elbow, shoulder, foot or ankle, pediatric, or trauma surgery.

Fig. 1. Preferred Reporting Items for Systematic Reviews and Meta-Analysis diagram.
Table 2. Study features and demographics.
IQR, interquartile range; ACL, anterior cruciate ligament; TKA, total knee arthroplasty; THA, total hip arthroplasty; FAIS, femoroacetabular impingement syndrome.
The average MINORS criteria score was 11.3 ± 0.32 points, with almost perfect inter-rater reliability between reviewers (κ = 0.92) [29]. Eight (80%) studies reported adherence to both the TRIPOD and JMIR guidelines; 2 (20%) studies did not report adherence to either set of guidelines [23,46]. All 10 investigations met at least 17 of 20 TRIPOD checklist items (average 18.8 ± 0.79). Items 12 (validation) and 17 (model updating) were omitted from the 22-item checklist as is standard for non-updated, development studies [31]. Missing checklist items included (1) lack of confidence interval reporting [19,23], (2) absence of supplemental information such as study protocol development or access to a web-based application [2,23], and (3) failure to address missing data [46]. Only 1 investigation performed risk grouping [14]. Three (30%) studies compared ML models to logistic regression (LR).
Frequency of Prolonged Opioid Use
Reported frequency of prolonged postoperative opioid use ranged from 4.3% to 40.9% (Table 2), with the lowest rates of prolonged postoperative opioid use reported in 3 of the 4 articles pertaining to spine surgery (range, 4.3%–9.9%). Six (60%) articles used definitions that required sustained opioid use over a predetermined follow-up period; the remaining 4 (40%) articles defined prolonged use by the filling of an opioid prescription after a predetermined time point in the follow-up period [2,14,25,46]. Two (20%) articles [2,46], 1 from each method of defining prolonged use, reported opioid use according to standardized dosing as defined by the Centers for Disease Control and Prevention [11,12]. Nine (90%) articles used a benchmark of at least 90 days to define prolonged postoperative opioid use, while the remaining study defined prolonged opioid use at ≥ 30 days following the index surgery [23].
Thematic Factors Associated With Prolonged Opioid Use
Preoperative opioid, antidepressant, or benzodiazepine use and age were the most frequently identified risk factors for prolonged postoperative opioid use (Fig. 2). Three investigations defined preoperative opioid use as use for > 180 days prior to surgery [20–22]; 2 studies defined preoperative opioid use as use between 30 and 365 days prior [2,14]; and 2 studies used conditional or undefined timelines in defining preoperative opioid use [23,46]. Preoperative benzodiazepine and antidepressant use were reported as binary risk factors (use vs no use) without a specified timeline. Four studies identified older age as a positive predictor [14,22,23,25]; 2 identified younger age as a positive predictor [2,14]; and 1 study noted age as a positive predictor but did not specify younger or older [26]. A total of 9 out of 10 studies applied a lower limit of age 18 as an inclusion criterion, with no upper limit to age; only Kunze et al [25] applied no lower age limit.

Fig. 2. Machine learning methods determine predictive features of prolonged postoperative opioid use following orthopedic surgery.
Spine Surgery
Four studies used ML to predict prolonged postoperative opioid use in patients undergoing spine surgery (Table 2). The average patient age was 52 years. All 4 studies excluded patients younger than 18 years old, with no upper age limit. Three studies investigated lumbar spine fusion and/or decompression, while 1 investigated anterior cervical discectomy and fusion (ACDF). All 4 studies developed novel ML algorithms and utilized a randomized 80:20 training:test population split, where the ML algorithm developed on the training set of patients was independently tested on the remaining 20% of patients not used for algorithm development. Three studies utilized 10-fold cross-validation to assess model performance; Zhang et al [46] did not report validation methodology. All 4 spine investigations converted the best-performing algorithm into an open-access web application capable of generating individualized predictions for risk of prolonged postsurgical opioid use.
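The internal validation design described here can be sketched as follows: patients are shuffled once, 20% are held out as a test set, and the remaining 80% are divided into 10 folds so that each fold serves once as the validation subset. The function names, the random seed, and the cohort size of 1000 are assumptions for illustration and do not reproduce the code of any included study.

```python
import random


def train_test_split_indices(n_patients, test_fraction=0.2, seed=0):
    """Shuffle patient indices and hold out a fraction as the test set."""
    indices = list(range(n_patients))
    random.Random(seed).shuffle(indices)
    n_test = round(n_patients * test_fraction)
    return indices[n_test:], indices[:n_test]  # (training, test)


def k_fold_indices(train_indices, k=10):
    """Yield (fitting, validation) index lists for k-fold cross-validation."""
    folds = [train_indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        fitting = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fitting, validation


# Hypothetical cohort of 1000 patients.
training, test = train_test_split_indices(1000, test_fraction=0.2, seed=42)
for fitting, validation in k_fold_indices(training, k=10):
    # A model would be fit on `fitting` and scored on `validation` here.
    pass
print(len(training), len(test))
```

The held-out test set is touched only once, after model selection, so that the reported performance is not inflated by tuning on the same patients.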
Karhade et al [21] utilized a stochastic gradient boosting algorithm to predict postoperative opioid use following ACDF in both opioid naive patients and opioid users (median age: 51, interquartile range (IQR): 44–59). Model performance was as follows: area under the curve (AUC): 0.81; Brier: 0.076; calibration intercept: −0.01; calibration slope: 1.05. Global explanations of this model highlighted 4 variables as predictors of prolonged opioid use following ACDF from the 12 features isolated by recursive feature selection with random forest algorithms: preoperative opioid use greater than 180 days, antidepressant use, tobacco use, and Medicaid insurance. Similar to their ACDF investigation, Karhade et al [20] identified preoperative opioid use >180 days as an essential predictor of prolonged opioid use following surgery for lumbar disk herniation, along with comorbid depression and instrumentation (median age: 46, IQR: 37–58). This investigation identified elastic net penalized logistic regression (ENPLR) as the optimal ML model (AUC: 0.81; Brier: 0.064; calibration intercept: 0.13; calibration slope: 1.02). In a separate study, this same group [19] applied ML to a study population of opioid naive patients, where they found that instrumentation, uninsured status, and preoperative use of benzodiazepines, antidepressants, or gabapentin were most predictive of prolonged postoperative opioid use (median age: 60, IQR: 46–71). Again, the ENPLR algorithm had the best relative performance among all ML algorithms developed in this study (AUC: 0.70; Brier: 0.039; calibration intercept: 0.06; calibration slope: 1.02).
Zhang et al [46] developed a least absolute shrinkage and selection operator (LASSO) regression model for feature selection and determined that documented preoperative opioid use conferred a 2.70 times greater odds of prolonged opioid use and was the maximally predictive feature. Median age of patients with prolonged opioid use was 52 years and of patients without prolonged opioid use was 51 years. In their comparison of 3 traditional LR models and 4 ML models, LR was shown to be superior to all ML models. Each model utilized a random 80:20 training:test split. Full LR (AUC: 0.847; Brier: 0.039; Sensitivity: 0.749) accurately predicted 80.2% of patients who demonstrated prolonged opioid use and was thus used to construct an online predictive tool. The best-performing ML model, a time-varying convolutional neural network (AUC: 0.800; Brier: 0.041; Sensitivity: 0.809), performed with greater sensitivity but underperformed on discrimination.
Knee Surgery
Three studies developed ML models to predict prolonged opioid use following knee surgeries including anterior cruciate ligament (ACL) reconstruction (median age: 27 years, IQR: 27–33) [2], total knee arthroplasty (TKA) (median age: 67 years, IQR: 60–74) [23], and knee arthroscopy (median age: 50.5 years, IQR: 37.3–60.7) [26] (Table 2).
Anderson et al [2] identified 4 positive predictive features (preoperative morphine equivalents, pharmacy location, shorter deployment time, and age ≤ 23 years) using the Boruta algorithm for elimination with random forest algorithms. Katakam et al [23] identified the following positive predictive features via recursive feature elimination with random forest algorithms: age > 68 years, marital status (unmarried), opioid use between days 30 and 365 preoperatively, diabetes, and preoperative medications (antidepressants, benzodiazepines, gabapentin, nonsteroidal anti-inflammatory drugs, and beta-2-agonists). Anderson et al [2] and Katakam et al [23] utilized a random 80:20 training:test split and cross-validation of the training set, but neither study converted their chosen algorithm (gradient boosting machine, AUC: 0.77, Brier: 0.010; stochastic gradient boosting, AUC: 0.76, Brier: 0.073; Calibration intercept: 0.16; Calibration slope: 1.08, respectively) into an open-access web application. Anderson et al also included a comparison of their ML algorithms to traditional LR, where LR performed similarly but inferiorly (AUC: 0.76; Brier: 0.10) to the gradient boosting machine and superiorly to the remaining ML models.
Lu et al [26] performed training and validation using bootstrapping [40]; their model was the only algorithm developed primarily with preoperative patient-reported outcomes as predictive features, determined by recursive feature elimination with random forest algorithms. They reported that the most important predictive features were the preoperative International Knee Documentation Committee (IKDC) score; the Knee Injury and Osteoarthritis Outcome Score (KOOS) pain, activities of daily living, and sports and activities subscales; the Veterans RAND 12 Mental Component Score (VR12 MCS); age (unspecified); duration of symptoms; perioperative oral morphine equivalents; previous injections or nerve blocks; and days of exercise per week. Reduced baseline patient-reported outcome metrics were associated with prolonged postoperative opioid use, although thresholds for identifying low preoperative scores were not defined. Their linear ensemble model demonstrated superior discrimination (AUC: 0.75; Brier: 0.124; Calibration intercept: 0.001; Calibration slope: 0.99) in comparison to LR when compared by decision curve analysis and was converted into a web application.
Hip Surgery
Three of the included studies developed ML models to predict prolonged opioid use following hip surgery (Table 2), including total hip arthroplasty (THA) (median patient age: 66 years, IQR: 57–74) [22] and hip arthroscopy for femoroacetabular impingement syndrome (median age: 34 years, IQR: 23–44 [25]; median age: 31 [14]). Karhade et al [22] and Kunze et al [25] tested 5 ML models (stochastic gradient boosting, random forest, support vector machine, neural network, and ENPLR); implemented recursive feature elimination with random forest algorithms for feature selection; utilized a random 80:20 training:test split; and performed model assessment via 10-fold cross-validation (Supplemental File 3). Karhade et al [22] identified the ENPLR (AUC: 0.77; Brier: 0.052; calibration intercept: 0.01; calibration slope: 0.97) as the best-performing model for predicting opioid use after THA and identified the following features as predictive: age > 66 years, opioid use >180 days preoperatively, preoperative hemoglobin (anemia), and preoperative medications (antidepressants, benzodiazepines, nonsteroidal anti-inflammatory drugs, and beta-2-agonists). Kunze et al [25] selected a stochastic gradient boosting algorithm (AUC: 0.75; Brier: 0.13; calibration intercept: −0.02; calibration slope: 0.88) and identified the following predictive factors for prolonged opioid use following hip arthroscopy: preoperative modified Harris hip score (mHHS), age, body mass index, preoperative visual analog scale (VAS) for pain, and workers’ compensation status.
Grazal et al [14] also examined hip arthroscopy, testing 6 algorithms (naive Bayes, gradient boosting machine, extreme gradient boosting, random forest, elastic net regularization, and artificial neural network); employed the Boruta algorithm for feature selection; and used a randomized 80:20 training:test split. Grazal et al did not report training model assessment methodology, such as cross-validation or bootstrapping. The artificial neural network demonstrated the best performance (AUC: 0.71; Brier: 0.21), and 5 features were identified as maximally predictive of prolonged opioid use (age > 40 or ≤ 25, opioid use between 30 and 365 days prior to surgery, opioid filling between 14 and 90 days postoperatively, mental health comorbidity, and preoperative substance misuse diagnosis, excluding tobacco dependence). Notably, the discriminatory ability of Kunze et al’s algorithm for hip arthroscopy outperformed that of Grazal et al, although significance of this comparison cannot be assessed with the information provided. All 3 investigations converted their optimal model into an open-access online web application capable of generating individualized predictions for risk of opioid use.
Discussion
This systematic review identified 10 studies published between 2019 and 2021 out of 1480 queried articles, underscoring the push for individualized preoperative risk stratification to assist patient management and expectations. In the majority of studies, ML discriminatory performance was good-to-excellent with strong performance metrics, confirming the efficacy of current ML in internally validated populations. Preoperative opioid use, benzodiazepine use, antidepressant use, and several procedure-specific age ranges were identified as predictors of prolonged opioid use. Finally, analysis of the methodologic execution of studies and adherence to TRIPOD guidelines highlights areas for needed improvement.
There are several limitations to this review. First, we could not assess the quality of data upon which each ML model was built, as only 80% of studies reported methodology to account for missing data. It is imperative that studies report methodology for handling missing data, such as the use of multiple imputation, as the quality of the results is dependent on the quality of input data [34]. Second, calibration intercept and slope were reported in only 70% of studies; failure to report calibration can generate misleading conclusions, as poorly calibrated ML models are subject to overprediction and underprediction [45]. In addition, a high degree of variability in selection criteria (ie, surgical procedure and degree of preoperative opioid exposure) and lack of external validation may limit applicability of ML algorithms to a general population. Third, heterogeneity of individual study populations on which ML models were built results in the inability to quantitatively pool ML results. This also limits the ability to quantitatively compare ML metrics to standard predictive modeling methods such as LR. However, the ambiguous nature of ML algorithms means that they are not designed to provide quantitative data amenable to meta-analysis. Complexities inherent in designing ML tools prohibit the dissemination of complete models or the associated programming code for that tool, challenging the replication of research in the field of ML.
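To make the calibration measures concrete, the following sketch estimates a calibration intercept and slope by regressing observed outcomes on the log-odds of the predicted probabilities, using a small Newton-Raphson logistic fit. This is one common convention (some authors instead estimate the intercept with the slope fixed at 1); the function name and toy data are assumptions, and the included studies may have used different software or definitions.

```python
import math


def calibration_intercept_slope(y_true, y_prob, max_iter=50, tol=1e-10):
    """Fit outcome ~ intercept + slope * logit(predicted probability)."""
    # Log-odds of each prediction; requires 0 < p < 1.
    x = [math.log(p / (1.0 - p)) for p in y_prob]
    a, b = 0.0, 1.0  # start from perfect calibration
    for _ in range(max_iter):
        mu = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
        w = [m * (1.0 - m) for m in mu]
        # Gradient of the log-likelihood.
        g0 = sum(y - m for y, m in zip(y_true, mu))
        g1 = sum((y - m) * xi for y, m, xi in zip(y_true, mu, x))
        # Entries of the 2 x 2 information matrix.
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:
            raise ValueError("predictions do not vary enough to fit a slope")
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) < tol and abs(db) < tol:
            break
    return a, b


# Hypothetical model whose predictions are too extreme: patients predicted
# at 10% risk had a 20% event rate, and those predicted at 90% had 80%.
outcomes = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0.1] * 5 + [0.9] * 5
intercept, slope = calibration_intercept_slope(outcomes, predicted)
print(round(intercept, 3), round(slope, 3))
```

Under this convention, a slope below 1 suggests predictions that are too extreme, as is typical of overfitting, whereas a slope near 1 with an intercept near 0 suggests good agreement between predicted and observed risk.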
Previous orthopedic literature has suggested that preoperative opioid use, chronic pain, and back pain are associated with prolonged postoperative opioid use [37]. In an opioid naive population, factors increasing risk of prolonged postoperative opioid use have been identified as age older than 50 years, male sex, and preoperative benzodiazepine or antidepressant use [43]. In the current review, predictive features identified by ML were preoperative opioid use, benzodiazepine use, and antidepressant use (Fig. 2). Age was also a frequently identified predictive factor, in 6 of the 10 articles. However, categorization of risk by age varied substantially in the studies, which may be attributable to heterogeneity between patient populations secondary to inherent variations in population age associated with specific pathologies. For example, the age range of patients with the highest incidence of hip arthroscopy is the fifth decade of life, which represents the lower end of the age range expected to undergo TKA or THA (the highest incidence is in the seventh decade of life) [13,38]. Notably, 9 of 10 studies excluded patients < 18 years of age; thus, the results of this review do not necessarily reflect the risk of prolonged postoperative opioid use in children following orthopedic surgery. Exploration of the utility of ML to this end is warranted. Therefore, the extremes of age implicated as predictive factors in the primary investigations in this review may, in fact, command less generalizable predictive value than do preoperative opioid, antidepressant, or benzodiazepine use, which was consistently identified as predictive of prolonged postoperative opioid use in orthopedics.
Rates of prolonged postoperative opioid use ranged widely from 4.3% to 40.9%, with articles pertaining to spine surgery reporting the majority of lower rates. While heterogeneity in definitions of prolonged opioid use contributes to the variability observed, these percentages are consistent with previously published literature. For instance, Karhade et al [20,21] found that 9.9% of patients undergoing ACDF met criteria for sustained opioid prescription, which was driven by several factors, including preoperative opioid prescription, antidepressant use, tobacco use, and Medicaid insurance status. This concurs with Harris et al [16], who used an insurance claims database to investigate over 28,000 patients undergoing ACDF and found that 17% of these patients met criteria for chronic postoperative opioid use. While Karhade et al’s utilization of institutional data allowed for broader consideration of Medicare, Medicaid, and uninsured patients, making the study results more applicable to these populations, the single-corporation nature of the study suggests that surgical, geographic, or patient-specific factors specific to their population may influence rates of prolonged opioid use. Performance of ML models should be confirmed in populations other than that in which the initial study was performed to assess external validity. Moreover, openly available web applications of ML models provide visualization tools and explanations of model predictions, overcoming the conventional drawbacks of traditional risk scores or nomograms. Clinically, this creates opportunities for patient-provider discussions, preoperative health modification, and subsequent improvements in probability of achieving clinically relevant outcomes.
Opioids may be necessary postsurgical analgesics in some settings, though recent literature suggests that it is possible to eliminate them altogether after some elective surgeries [32]. A randomized controlled trial by Hannon et al [15] found that prescribing fewer oxycodone immediate-release pills was associated with no differences in pain scores and a significant reduction in unused opioid pills in both hip and knee arthroplasty populations. Patients in this study stopped taking opioids at an average of 1 week after discharge, and about 30% of patients never took opioids after discharge. With increasing evidence that opioid use in arthroplasty patients can be reduced, using ML tools such as the one by Karhade et al to identify patients with refills at 2 weeks is clinically meaningful. Early identification may allow for direction of patients to multidisciplinary resources to reduce the potential for long-term opioid use in arthroplasty patients. Furthermore, the International Association for the Study of Pain defines chronic pain as pain that lasts beyond the normal healing time, which in their latest revision was reported as more than 3 months [36]. However, just as ML determines patient-specific risk from individual risk factors, definitions of persistent opioid use should be derived from evidence-based understandings of the natural course of postoperative pain after specific orthopedic surgeries. For example, in the Femoroacetabular Impingement RandomiSed controlled Trial (FIRST), Almasri et al [1] reported that the majority of patients undergoing primary hip arthroscopy for treatment of femoroacetabular impingement syndrome show stabilization in VAS pain scores 6 months after surgery. Variation in the timelines of pain resolution following specific procedures necessitates that definitions of prolonged opioid use, as an intervention for prolonged pain, be modified accordingly.
Opioid use may be considered prolonged only when the need for it outlives the course of the illness it is intended to treat. As it becomes available, procedure-specific literature should be cited when defining prolonged opioid use. However, when defining prolonged use, 5 studies [2,19–22] referenced an investigation exploring persistent opioid use after major surgical procedures, such as cardiothoracic, gastric, and pelvic operations, rather than specific orthopedic procedures [6]. Heterogeneity in the definitions for refill frequency, time-to-refill, and dosing strategies limits the interpretation and external validity of current ML investigations. Authors should interpret results with caution when attempting to apply this data to their populations, as this is a potential source of confounding bias. Nonetheless, these studies identified at-risk patients and useful prognostic data can be extracted from this review. Future ML studies should utilize the available evidence on average recovery periods for the surgical population of interest, in conjunction with society guidelines defining prolonged opioid use when determining time point cutoffs for ML tools. Improving the diagnostic classification of prolonged postoperative opioid use is a step closer to managing the opioid crisis in a pragmatic way.
A main finding of this study is fair adherence to predictive modeling reporting guidelines and good performance metrics of ML algorithms, including (1) Brier scores, which measure how close predictions are to the actual outcome; (2) the c-statistic, which calculates the area under the receiver operating characteristic curve to assess discriminative ability; and (3) calibration, which refers to the agreement between observed outcomes and model predictions [41]. Discrimination analysis demonstrated c-statistics indicating good model performance, ranging from 0.70 to 0.81 in spine surgery, 0.75 to 0.77 in knee, and 0.71 to 0.77 in hip. Brier scores ranged from 0.04 to 0.08 in spine surgery, 0.01 to 0.12 in knee, and 0.05 to 0.21 in hip, indicating excellent performance of ML predictions (Table 1). Each metric has inherent limitations and thus utilizing multiple metrics provides a more accurate understanding of the prediction model’s performance [3,10]. In addition, all 10 studies used objective measures with weighted variables (Boruta algorithm, multivariate variable selection, recursive feature selection with random forest algorithms, or LASSO regression; Supplemental File 3) for feature selection, which likely improved algorithm performance. In all, these features highlight the quality and precision of the prediction models, further supporting their use in identifying patients at risk of prolonged opioid use. Clinical practice metrics such as decision curve analysis may also assist the practitioner in deriving the clinical utility of an ML tool. Nine (90%) articles in this review reported decision curves to assist in translation to the clinical setting [42].
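Decision curve analysis plots net benefit across a range of risk thresholds. The sketch below uses the standard net benefit definition (true positives per patient minus false positives per patient, weighted by the odds of the threshold) and compares a model with the strategy of treating every patient as high risk. The function names and toy data are assumptions for illustration, not the code used in the included studies.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of flagging patients whose predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(y_true)
    flagged = [(y, p) for y, p in zip(y_true, y_prob) if p >= threshold]
    true_pos = sum(1 for y, _ in flagged if y == 1)
    false_pos = sum(1 for y, _ in flagged if y == 0)
    # False positives are penalized by the odds of the chosen threshold.
    return true_pos / n - (false_pos / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of treating every patient as high risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical cohort of 10 patients, 2 of whom had prolonged opioid use.
outcomes = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
predicted = [0.9, 0.6, 0.7, 0.2, 0.1, 0.1, 0.3, 0.05, 0.2, 0.1]
for pt in (0.1, 0.25, 0.5):
    print(pt, net_benefit(outcomes, predicted, pt),
          net_benefit_treat_all(outcomes, pt))
```

A model is clinically useful at a given threshold only if its net benefit exceeds both the treat-all strategy and the treat-none strategy, whose net benefit is 0 by definition.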
While model performance was adequate, current reporting methodology limits the utility of these models. Nine of 10 studies detailed their methodology for internal validation (Supplemental File 3, Training: Test Split), but no studies performed external validation. In the absence of external validation studies, adherence to predictive modeling guidelines such as the TRIPOD statement or JMIR guidelines acts as a surrogate for methodologic quality. Adherence to TRIPOD guidelines within this review was imperfect, despite 80% of studies reporting adherence. TRIPOD guidelines serve as a minimum standard for methodologic integrity; however, investigation-specific methodology must be optimized to enhance the clinical utility of study findings. Furthermore, methodologic assessment by MINORS score indicated limitations in study design. Lack of adherence to ML reporting guidelines must be addressed in future studies to support clinical implementation of predictive algorithms. Opioid-specific ML literature should report according to standardized dosing guidelines, use procedure-specific literature as a benchmark for defining prolonged use, and work to improve methodological transparency (such as by providing source code). Rigorous external validation and translation to the clinical setting are required before ML can become a ubiquitous tool for personalized patient care.
In conclusion, ML algorithms created to predict orthopedic surgery patients at risk for prolonged postoperative opioid use demonstrate good discriminatory performance. The frequency and predictive features of prolonged postoperative opioid use identified in this review are consistent with existing literature. However, algorithms remain limited by the absence of external validation efforts and imperfect adherence to predictive modeling guidelines.
Supplemental Material
Supplemental material for Machine Learning Algorithms Can Be Reliably Leveraged to Identify Patients at High Risk of Prolonged Postoperative Opioid Use Following Orthopedic Surgery: A Systematic Review by Laura M. Krivicich, Kyleen Jan, Kyle N. Kunze, Morgan Rice and Shane J. Nho in HSS Journal®: The Musculoskeletal Journal of Hospital for Special Surgery:
sj-docx-1-hss-10.1177_15563316231164138
sj-docx-2-hss-10.1177_15563316231164138
sj-docx-3-hss-10.1177_15563316231164138
sj-pdf-4-hss-10.1177_15563316231164138
Footnotes
Correction (April 2023):
This article has been updated to correct the affiliations since its original publication.
Declaration of Conflicting Interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Kyle N. Kunze, MD, reports a relationship with Arthroscopy. Shane J. Nho, MD, MS, reports relationships with Allosource, Arthrex, Inc, Athletico, DJ Orthopaedics, Linvatec, Miomed, Smith & Nephew, Ossur, Springer, Stryker, American Orthopaedic Association, American Orthopaedic Society for Sports Medicine, and Arthroscopy Association of North America. The other authors declare no potential conflicts of interest.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Human/Animal Rights
All procedures followed were in accordance with the ethical standards of the responsible committee on human experimentation (institutional and national) and with the Helsinki Declaration of 1975, as revised in 2013.
Informed Consent
Informed consent was not required for this review article.
Level of Evidence
Level III, systematic review of level III studies.
Supplemental Material
Supplemental material for this article is available online.
