Sage Journals: Discover world-class research

Abstract

Machine learning (ML), a branch of artificial intelligence, is rapidly transforming surgical complication and outcome prediction. Unlike traditional statistical approaches, ML can learn complex, nonlinear relationships across multiple variables, enabling more accurate and adaptable prognostication. Emerging ML-based tools have demonstrated strong performance across diverse surgical specialties, often surpassing conventional risk models. However, challenges remain, including opaque “black box” outputs, diminished performance during external validation, difficulty modeling rare events, and dependence on tabular data. These limitations can be mitigated but demand thoughtful design and rigorous validation. Importantly, ML introduces distinct methodological considerations unfamiliar to many surgeons. Successful clinical integration requires robust external validation and transparent sharing of trained models to ensure reproducibility and generalizability across diverse cohorts. By enhancing the precision of risk prediction, ML holds the potential to guide patient selection, optimize perioperative care, and strengthen shared decision-making between patients and surgeons.

Keywords

artificial intelligence, machine learning, deep learning, postoperative complications, postoperative outcomes

Introduction

Artificial intelligence (AI) is a rapidly advancing field that builds computer programs that are able to perform tasks which would normally require human intelligence.¹ AI research incorporates aspects of neuroscience, biology, statistics, linear algebra, and computer science.^1,2 Over the past few years, AI applications in medicine have emerged and grown exponentially, integrating AI closely with the daily functioning of healthcare systems, teams, and personnel.^2,3

Machine learning (ML) is a sub-discipline of AI that uses algorithms that autonomously learn from datasets. A wide range of ML algorithms have been developed, many of which are adaptations of traditional statistical methods designed to leverage modern computing power.⁴ Recently, much of the focus in ML has been on applications of the neural network (NN) family of algorithms, which form the foundation of the deep learning (DL) field of ML.³ The concept of NNs, first described in 1958 as a simplified model of neuronal connections in the human brain, was initially limited by computational constraints.^5,6 However, the emergence of the “AlexNet” image classification architecture in 2012 demonstrated the transformative potential of DL when coupled with modern computing capabilities.^6,7

The optimal integration of AI into medical decision-making is a complex and evolving challenge. The rapid advancement of DL has led to the commercial success of large language models (LLMs) such as ChatGPT (OpenAI, Inc, San Francisco, CA) and Grok (xAI Corp., Palo Alto, CA). While these models perform well on structured tasks such as United States Medical Licensing Exam (USMLE) questions, translating this success to the heterogeneous and nuanced presentations of real-world clinical scenarios has proven far more difficult.^1,8,9

ML vs Conventional Statistical Approaches

ML approaches to medical outcome prediction can be contrasted with conventional statistical approaches. In fact, many algorithms (such as linear and logistic regression) can be applied within either framework towards predicting outcomes.^9-11

Shortcomings of Conventional Statistics

Compared with conventional statistics, modern ML algorithms can model subtle multifactorial effects that might be overlooked by simpler models or human users.^9,11 Whereas a conventional statistician may opt for conservative selection of variables into a model, ML practitioners adopt a broader, relatively liberal approach during feature selection, thereby prioritizing predictive accuracy.^9,11 This can make ML more accurate at outcome prediction than traditional techniques.⁹ The contrast also highlights a few inherent shortcomings of conventional regression.

Conventional linear and logistic regression assume additive and either linear or log-linear relationships between predictors and outcomes.^12-14 Additive models assume that the effect of each predictor is independent of the presence, absence, or magnitude of the other predictors in the model.¹⁴ Linearity assumes that a given change in a predictor has a uniform effect regardless of baseline. This assumption often fails clinically; for example, a decrease in hemoglobin concentration from 8 to 5 g/dL likely confers more mortality risk than a decrease from 15 to 12 g/dL, despite both reflecting a 3 g/dL decline.^13,14 While simpler ML models may also use additive linear modeling techniques, many widely used algorithms are nonlinear and nonadditive.^11-15
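The linearity assumption can be made concrete with a minimal sketch. The intercept and coefficient below are hypothetical values chosen purely for illustration (they are not taken from any cited study). The point is only that a linear term in a logistic model assigns the same log-odds change to any 3 g/dL drop in hemoglobin, whatever the starting value.

```python
import math

# Hypothetical coefficients for illustration only (not from any cited study).
INTERCEPT = -1.0
BETA_HGB = -0.30  # change in log-odds of mortality per 1 g/dL of hemoglobin


def log_odds(hemoglobin_g_dl: float) -> float:
    """Linear predictor of a one-variable logistic regression."""
    return INTERCEPT + BETA_HGB * hemoglobin_g_dl


def risk(hemoglobin_g_dl: float) -> float:
    """Predicted probability obtained by applying the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds(hemoglobin_g_dl)))


# Both drops are 3 g/dL, so the model adds the same amount to the log-odds.
delta_high = log_odds(12.0) - log_odds(15.0)
delta_low = log_odds(5.0) - log_odds(8.0)
print(f"log-odds change 15 -> 12 g/dL: {delta_high:.2f}")
print(f"log-odds change  8 ->  5 g/dL: {delta_low:.2f}")
```

Capturing a steeper effect at low hemoglobin would require the analyst to add an explicit nonlinear term (for example, a polynomial or spline), which is the manual step that many ML algorithms avoid.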

Example of a Conventional Statistical Approach

An illustrative example of the limitations of conventional statistics can be seen in a 2018 study by Merath et al that reported how multiple comorbid perioperative complications following hepatopancreatic surgery synergistically increase risk of 30-day mortality.¹⁶ The authors found that while the presence of a single postoperative complication increased 30-day mortality by only 0.1% to 0.6%, subsequent complications increased mortality exponentially (Figure 1). Capturing such nonlinear effects in logistic regression would require the use of polynomial terms, which quickly complicates modeling even for a single predictor.¹⁷ The authors also analyzed the interactional effects between pairs of complications on mortality. Given 8 complications, a full exploration would require 28 separate regression models for bivariate interactions (one per pair), with even greater complexity for higher-order combinations.

Figure 1.

Exponential Increase in Mortality Risk From Additional Postoperative Complications. This Nonlinear Effect can be Learned by Most Machine Learning Models, but can be Difficult to Model via Conventional Statistics. Created Based on Data From Merath et al (2018).¹⁶

While conventional regression can be tuned to account for nonlinear and interactional effects, this requires prior knowledge of the functional form (eg, quadratic or square-root terms) or labor-intensive trial-and-error experimentation.^17,18 Similarly, because the number of possible interactions grows rapidly as more predictors are considered (polynomially for interactions of a fixed order, and exponentially once all orders are included), modeling these effects quickly becomes infeasible.^17,19,20
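The growth in candidate interaction terms can be checked with a short calculation. The sketch below is illustrative only; the function name is hypothetical. For 8 predictors it reproduces the 28 pairwise interactions noted for the Merath et al example, and it shows how quickly the count rises when higher-order combinations are included.

```python
from math import comb


def interaction_counts(n_predictors: int) -> dict:
    """Count candidate interaction terms among n predictors."""
    pairwise = comb(n_predictors, 2)
    three_way = comb(n_predictors, 3)
    # All subsets of size >= 2: 2^n minus the empty set and the n single terms.
    all_orders = 2 ** n_predictors - n_predictors - 1
    return {"pairwise": pairwise, "three_way": three_way, "all_orders": all_orders}


for n in (8, 16, 32):
    print(n, interaction_counts(n))
```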

Benefits of ML for Surgical Outcome Prediction

ML approaches address the aforementioned challenges by natively capturing nonlinear and interactional effects during model training. Nonparametric methods, such as decision trees (DTs) and neural networks (NNs), are particularly adept at identifying such relationships.^12-14,21 Even in the case that a researcher presumes a relatively straightforward linear and additive relationship, models like NNs can also approximate regression functions; indeed, the universal approximation theorem in DL states that, given a sufficiently complex NN, any continuous mathematical function can be approximated to an arbitrary degree of accuracy.^22-25 This inherent flexibility is a key advantage of ML over conventional techniques.

A unique benefit of ML is integrated validation testing. Typically, the datasets are randomly subdivided into a training and testing set in a 70:30 or 80:20 ratio.²⁶ Models are trained exclusively on the training dataset, and performance is then assessed on the testing dataset—referred to here as “internal validation.”^26,27 Robust studies also evaluate external validation (generalizability), either temporally (new time period) or geographically (different sites or multicenter).^26,28
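A random training/testing partition of the kind described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name, cohort size, and seed are assumptions made for the example. Published studies typically rely on library routines and may additionally stratify the split by outcome.

```python
import random


def train_test_split_indices(n_patients: int, test_fraction: float = 0.2, seed: int = 0):
    """Randomly partition patient indices into disjoint training and testing sets."""
    indices = list(range(n_patients))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split reproducible
    n_test = round(n_patients * test_fraction)
    return indices[n_test:], indices[:n_test]


# Hypothetical cohort of 1000 patients split 80:20.
train_idx, test_idx = train_test_split_indices(1000, test_fraction=0.2)
print(len(train_idx), len(test_idx))  # 800 200
```

The model is fit on `train_idx` only; `test_idx` is reserved for internal validation, and external validation requires a separate cohort from a different period or site.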

Comparison Example of an ML Approach

Building on their prior work, Merath et al (2019) developed a DT model that predicted 30-day morbidity after hepatic, pancreatic, and colorectal surgery.²⁷ This model achieved predictive performance comparable to the regression-based American College of Surgeons (ACS) National Surgical Quality Improvement Program (NSQIP) Surgical Risk Calculator (NSQIP-SRC). Due to statistical reporting differences, the accuracy of the 2019 DT-based study could not be directly compared to the authors’ 2018 regression-based study, but this example underscores how even a relatively simple ML model can match the predictive power of numerous regression models.^16,27

Limitations of ML

“Black Box” Predictions

A persistent limitation of ML is its agnosticism toward causality: while models identify associations, they do not establish whether predictors directly cause outcomes.^11,29 To generalize, where statisticians are concerned with P-values, odds ratios, and other measures of statistical significance and effect size, ML scientists are more concerned with predictive performance metrics such as calibration (eg, Brier score) and discrimination (eg, area under the receiver operating characteristic curve (AUROC)).^11,15 While ML models often outperform conventional statistics in predictive accuracy,¹¹ complex models can function as “black boxes,” obscuring relationships driving their predictions.^11,29 By contrast, conventional regression models are relatively transparent: linear and logistic regression can be expressed as straightforward equations, while DT models resemble intuitive decision trees, albeit constructed mathematically.^15,30
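The two performance metrics named above can be computed directly from predicted probabilities. The following is a minimal sketch on toy data invented for illustration; the function names are hypothetical, and the pairwise AUROC form is chosen for readability rather than efficiency.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auroc(y_true, y_prob):
    """Probability that a random event case is ranked above a random non-event case.

    Ties count as one half. This pairwise form is O(n^2) and meant for illustration.
    """
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for pe in events:
        for pn in non_events:
            if pe > pn:
                wins += 1.0
            elif pe == pn:
                wins += 0.5
    return wins / (len(events) * len(non_events))


# Toy data, invented for illustration: 1 = complication occurred.
outcomes = [0, 0, 1, 0, 1, 1]
predicted = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]
print(f"AUROC: {auroc(outcomes, predicted):.3f}")
print(f"Brier: {brier_score(outcomes, predicted):.3f}")
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, whereas a lower Brier score indicates predicted probabilities that sit closer to the observed outcomes.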

Overfitting

More advanced ML approaches may use more complex variants of regression algorithms, such as least absolute shrinkage and selection operator (LASSO) regression, ridge regression, or elastic net regression. These algorithms implement “regularization,” which aims to reduce “overfitting,” whereby a model learns to predict its training data so precisely that it no longer generalizes to new cases. Such overfitting manifests as a significant performance decline when predicting on external validation datasets.³¹ While these methods may enhance performance, they can make interpretation less intuitive.
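How regularization shrinks coefficients can be illustrated in the simplest textbook setting, an orthonormal design, where both penalties have closed-form solutions. This is a sketch under that stated assumption, with hypothetical unpenalized coefficients and penalty strength; real clinical predictors are correlated and are fit numerically.

```python
def ridge_shrink(ols_coef: float, lam: float) -> float:
    """Ridge estimate for one coefficient under an orthonormal design.

    Assumes the objective 0.5 * ||y - Xb||^2 + 0.5 * lam * ||b||^2 with X'X = I.
    """
    return ols_coef / (1.0 + lam)


def lasso_shrink(ols_coef: float, lam: float) -> float:
    """LASSO estimate (soft-thresholding) under the same orthonormal assumption.

    Assumes the objective 0.5 * ||y - Xb||^2 + lam * ||b||_1 with X'X = I.
    """
    magnitude = max(abs(ols_coef) - lam, 0.0)
    return magnitude if ols_coef >= 0 else -magnitude


# Hypothetical unpenalized coefficients, invented for illustration.
ols_coefficients = [2.0, -0.8, 0.3]
lam = 0.5
print([round(ridge_shrink(b, lam), 3) for b in ols_coefficients])
print([round(lasso_shrink(b, lam), 3) for b in ols_coefficients])
```

Ridge scales every coefficient toward zero, while LASSO sets small coefficients exactly to zero and therefore also performs variable selection. Elastic net combines the two penalties.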

Tabular Data

Certain ML algorithms, including advanced DL techniques, are known to underperform on tabular data, a common data format in clinical research.³² Thus, surgical ML research often uses algorithms that perform particularly well on tabular data, such as random forest (RF) and gradient boosting machine (GBM) variants, which combine multiple DT models in parallel or series configurations, respectively, to improve accuracy.^33,34 Two notable GBM implementations that have gained traction are the eXtreme Gradient Boosting (XGBoost) and Categorical Boosting (CatBoost) algorithms, as further discussed in our work on emerging AI models.^33,35

Despite their complexity, ML models are often more efficient and scalable for capturing nonlinearities and interactions. When conventional regression becomes so complicated that clinicians must treat it as a de facto “black box,” the advantages of ML algorithms, including automated modeling, streamlined evaluation, and better handling of complex relationships, become particularly evident.

Conventional Perioperative Risk Assessment Tools

Accurate perioperative risk prediction is central to patient counseling, operative decision-making, and outcome prognostication. Formalized risk-assessment tools are well established; examples include the American Society of Anesthesiologists (ASA) physical status classification and the American College of Surgeons (ACS) National Surgical Quality Improvement Program Surgical Risk Calculator (NSQIP-SRC).^36,37

Subjective Risk-Assessment Tools

Subjective tools rely on clinician judgment and can vary between assessors (ie, they are nondeterministic). The ASA classification, for instance, grades risk based on comorbidity severity as assessed by the anesthesiologist.³⁶ While clinically intuitive, this reliance on subjective judgment contributes to modest intra- and inter-rater reliability.^10,36,38

Scoring-Based Tools

Objective scoring systems are deterministic: identical inputs yield identical outputs. The Surgical Apgar Score (SAS) exemplifies this approach, using intraoperative variables translated into outcomes through univariate regression.³⁹ Although simple and reproducible, SAS has only moderate discriminative accuracy; a 2022 meta-analysis found a pooled AUROC (C-statistic) of only 0.63 for mortality.⁴⁰

Conventional Multivariable Regression Models

Multivariable regression models enhanced risk prediction by incorporating numerous preoperative variables. The original regression-based NSQIP-SRC achieved strong discrimination (AUROC 0.944 for mortality, 0.816 for morbidity), with subsequent studies reporting consistent values.^37,41-43 Similarly, the Physiological and Operative Severity Score for the enUmeration of Mortality and morbidity (POSSUM) is a regression-based model, and its Portsmouth-POSSUM (P-POSSUM) variant achieved respectable discrimination (AUROC 0.89 for mortality, 0.67-0.77 for morbidity).^44,45 However, because these methods assume additive linear or log-linear relationships, they may miss subtle nonlinear or multifactorial effects, leading to underperformance relative to modern ML approaches, particularly in nonelective surgery.^9,13,41,43

ML-Based Perioperative Risk Assessment Tools—Recent Advances

The following sections describe the various ML-based perioperative risk assessment tools across multiple subspecialties of surgery, with a focus on recently developed models. These tools are summarized in Table 1.

Table 1.

ML Predictive Models for Surgical Outcomes

Author (Year)	Key findings	AUROC by ML algorithm^a	Model availability	Notes/Limitations
Multiple surgical specialties
Bilimoria et al (2013)³⁷	Strong mortality and morbidity discrimination	Conventional regression model ^b	Formerly publicly hosted	NSQIP-SRC model as of 2013
		Mortality: 0.944, Morbidity: 0.816
Bertsimas et al¹⁴ (2018)	DT outperforms R-NSQIP-SRC for mortality and morbidity	DT (POTTER)	Publicly hosted (phone app)	POTTER model; designed for emergency surgeries
		Mortality: 0.916, Morbidity: 0.841
		R-NSQIP-SRC ^b
		Mortality 0.898, Morbidity: 0.806
Liu et al (2023)⁴¹	XGBoost performed equivalently to R-NSQIP-SRC	XGBoost (ML-NSQIP-SRC)	Publicly hosted	NSQIP-SRC model as of 2023
		Mortality: 0.949, Morbidity: 0.767
		R-NSQIP-SRC ^b
		Mortality: 0.944, Morbidity: 0.763
Cohen et al⁴² (2025)	XGBoost better calibrated for mortality; CatBoost better calibrated for all-cause morbidity	XGBoost (current ML-NSQIP-SRC)	Public hosting planned	Calibration measured by APE
		Mortality: 0.95, Morbidity: 0.77		CatBoost planned for NSQIP-SRC replacement in 2026
		CatBoost
		Mortality: 0.95, Morbidity: 0.77
Bonde et al (2021)²¹	DL performed similarly to POTTER on NSQIP surgical data	DL	Source code only	No external validation; successfully implemented CPT code feature embeddings
		Mortality: 0.912, Morbidity: 0.878
		DT (POTTER)
		Mortality: 0.920, Morbidity: 0.851
General surgery
Hassan et al (2022)⁴⁷	ML successfully predicts relevant outcomes following ventral hernia repair; ML shows superior discrimination to conventional regression techniques	Logit (ML-based)	Not publicly available	No external validation; used intraoperative predictors
		Hernia recurrence: 0.71, SSO: 0.68, 30-day readmission: 0.64
		XGBoost
		Hernia recurrence: 0.67, SSO: 0.67, 30-day readmission: 0.62
		Conventional logit ^b
		Hernia recurrence: 0.65, SSO: 0.68, 30-day readmission: 0.61
		MARS
		Hernia recurrence: 0.62, SSO: 0.65, 30-day readmission: 0.74
		NN
		Hernia recurrence: 0.60, SSO: 0.69, 30-day readmission: 0.57
		SVM
		Hernia recurrence: 0.57, SSO: 0.75, 30-day readmission: 0.62
		RF
		Hernia recurrence: 0.54, SSO: 0.72, 30-day readmission: 0.66
		DT
		Hernia recurrence: 0.54, SSO: 0.68, 30-day readmission: 0.73
		KNN
		Hernia recurrence: 0.47, SSO: 0.65, 30-day readmission: 0.59
		Voting ensemble
		Hernia recurrence: 0.54, SSO: 0.68, 30-day readmission: 0.60
Wu et al⁴⁸ (2024)	RF was the best predictor of SSI and SSO following elective open inguinal hernia repair	RF: SSI: 0.849, SSO: 0.740	Publicly hosted	No external validation; used intraoperative predictors
		GBM: SSI: 0.784, SSO: 0.738
		NN: SSI: 0.754, SSO: 0.665
		Logit: SSI: 0.737, SSO: 0.672
		SVM: SSI: 0.648, SSO: 0.708
Choi et al (2023)⁵¹	Preoperative risk factors were used to predict postoperative complications in >65-year-old general surgery patients. LASSO suffered the least performance degradation between internal and external validation	LASSO	Source code only	Including surgery type as a predictor in the LASSO model only improved predictions of postoperative and total LOS (by < 5%)
		90-day mortality/readmission: 0.703; Postoperative delirium: 0.750
		Prolonged postoperative LOS: 0.747; Prolonged total LOS: 0.707
		RF
		90-day mortality/readmission: 0.698; Postoperative delirium: 0.677
		Prolonged postoperative LOS: 0.708; Prolonged total LOS: 0.707
		AdaBoost
		90-day mortality/readmission: 0.695; Postoperative delirium: 0.703
		Prolonged postoperative LOS: 0.743; Prolonged total LOS: 0.708
		GBM
		90-day mortality/readmission: 0.652; Postoperative delirium: 0.699
		Prolonged postoperative LOS: 0.721; Prolonged total LOS: 0.712
		DT
		90-day mortality/readmission: 0.593; Postoperative delirium: 0.644
		Prolonged postoperative LOS: 0.623; Prolonged total LOS: 0.628
Trauma surgery
Du et al (2025)⁵³	RF model effectively predicts TIC in trauma patients, with moderate performance loss on external validation	RF: Internal validation: 0.820; External validation: 0.728	Not publicly available	Only RF model tested on external validation set.
		GBM: 0.808; NN: 0.801; XGBoost: 0.799
		SVM: 0.799; Naïve Bayes: 0.799; Logit: 0.798
Xiong et al (2025)⁵⁵	RF model predicts TIC occurrence in trauma patients; better performance with less degradation on external set than observed by Du et al	RF: 0.91; GBM: 0.89; NN: 0.89; XGBoost: 0.89	Source code only	All models exposed to RF.
		Naïve Bayes: 0.88; KNN: 0.88; Logit: 0.88; SVM: 0.87
		AdaBoost: 0.87; DT: 0.71
Cardiothoracic surgery
Weiss et al⁶² (2023)	XGBoost outperforms STS risk scores to predict postoperative mortality following cardiac surgery, and effectively predicted mortality in non-STS procedures	XGBoost: 0.978 (all procedures)	Not publicly available	No external validation; other algorithms evaluated, but no AUROC reported
		CABG: 1.000; AVR: 1.000; MVRepair: 0.967
		CABG + AVR: 0.939; CABG + MVRepair: 1.000
		STS ^b
		CABG: 0.580; AVR: 0.888; MVRepair: 0.914
		CABG + AVR: 0.894; CABG + MVRepair: 0.755
Vascular surgery
Li et al (2024)⁶³	Suprainguinal bypass: XGBoost was best-performing model to predict 1-year mortality or MALE. Adding postoperative but not intraoperative predictors improved XGBoost performance	XGBoost: 1-year death or MALE: 0.92	Source code only	No external validation; AUROC using only preoperative predictors is listed
		RF: 1-year death or MALE: 0.89
		SVM: 1-year death or MALE: 0.83
		Naïve Bayes: 1-year death or MALE: 0.81
		NN: 1-year death or MALE: 0.71
		Logit: 1-year death or MALE: 0.67
Li et al (2024)⁶⁴	In infrainguinal bypass, XGBoost was the best-performing model to predict 30-day mortality or MALE.	XGBoost: 30-day death or MALE: 0.93	Source code only	No external validation; only preoperative predictors studied
		RF: 30-day death or MALE: 0.82
		Naïve Bayes: 30-day death or MALE: 0.87
		SVM: 30-day death or MALE: 0.85
		NN: 30-day death or MALE: 0.80
		Logit: 30-day death or MALE: 0.63
Surgical oncology
Merath et al (2019)²⁷	DT predicted 30-day all-cause complications as well as R-NSQIP-SRC, and outperformed ASA.	DT: 30-day, any complication: 0.74; Individual complications: 0.76 - 0.98	Not publicly available	No external validation
		R-NSQIP-SRC^b: 30-day, any complication: 0.71
		ASA^b: 30-day, any complication: 0.58
Orthopedic surgery
Chong et al (2025)⁶⁵	BRF sufficiently addressed class imbalance, and best predicted PJI occurrence after total knee arthroplasty	BRF: 0.963; GBM: 0.931; Logit: 0.728	Not publicly available	No external validation
		Naïve Bayes: 0.719; SVM: 0.701
Neurosurgery
Yin et al⁶⁸ (2024)	DL models can effectively predict discharge neurologic recovery (GOS) following moderate or severe traumatic brain injury	SSL NN: Clinical data only: 0.890; With lab data: 0.766	Not publicly available
		Transductive SVM: Clinical data only: 0.851; With lab data: 0.816
		GBM: Clinical data only: 0.814; With lab data: 0.831
		IGTD + CNN: Clinical data only: 0.810; With lab data: 0.714
		FT-transformer: Clinical data only: 0.801; With lab data: 0.861
		Logit: Clinical data only: 0.567; With lab data: 0.639
Xu et al (2025)⁶⁹	PPC prediction in neurosurgical patients remained robust even after reducing the number of features	NN: 35 predictors: 0.835; 11 predictors: 0.840	Not publicly available	3 sites in external validation dataset
		Logit: 35 predictors: 0.829; 11 predictors: 0.831
		XGBoost: 35 predictors: 0.821; 11 predictors: 0.826
		RF: 35 predictors: 0.821; 11 predictors: 0.826
		SVM: 35 predictors: 0.813; 11 predictors: 0.825
		Naïve Bayes: 35 predictors: 0.786; 11 predictors: 0.807
Plastic and reconstructive surgery
Braun et al (2023)⁷²	RF predicted nipple-areolar complex necrosis in nipple-sparing mastectomy	RF: 0.95	Not publicly available	Temporal external validation
Meyer et al⁷³ (2025)	External validation of Braun et al model; performance loss seen, but still usable discrimination	RF: 0.70	Not publicly available	Geographic external validation

Abbreviations: AUROC, area under the receiver operating characteristic curve; ML, machine learning; NSQIP-SRC, National Surgical Quality Improvement Program Surgical Risk Calculator; DT, decision tree; R-NSQIP-SRC, regression-based National Surgical Quality Improvement Program Surgical Risk Calculator; POTTER, Predictive OpTimal Trees in Emergency Surgery Risk; XGBoost, eXtreme Gradient Boosting; ML-NSQIP-SRC, machine learning-based National Surgical Quality Improvement Program Surgical Risk Calculator; CatBoost, Categorical Boosting; APE, absolute percentage error; DL, deep learning; CPT, Current Procedural Terminology; logit, logistic regression; SSO, surgical site occurrence; MARS, multivariate adaptive regression splines; NN, neural network; SVM, support vector machine; RF, random forest; KNN, k-nearest neighbors; SSI, surgical site infection; GBM, gradient boosting machine; LASSO, least absolute shrinkage and selection operator; LOS, length of stay; AdaBoost, Adaptive Boosting; STS, Society of Thoracic Surgeons risk score; CABG, coronary artery bypass graft; AVR, aortic valve replacement; MVRepair, mitral valve repair; MALE, major adverse limb events; ASA, American Society of Anesthesiologists Physical Status Classification System; BRF, balanced random forest; PJI, periprosthetic joint infection; GOS, Glasgow Outcome Scale; SSL, self-supervised learning; IGTD + CNN, Image Generator for Tabular Data + Convolutional Neural Network; FT-Transformer, Feature Tokenizer Transformer; PPC, postoperative pulmonary complication.

^aExternal validation AUROC reported when available, unless otherwise specified;

^bConventional statistical or other non-machine-learning prediction tool.

Pooled Surgical Cohort ML Models

Several ML models have emerged to address the above-noted limitations. The Predictive OpTimal Trees in Emergency Surgery Risk (POTTER; Interpretable AI, Cambridge, MA) model, introduced in 2018, uses a decision tree requiring 4-11 binary responses to estimate morbidity and mortality. Trained on NSQIP data, POTTER outperformed the regression-based NSQIP-SRC (mortality AUROC 0.916 vs 0.898; morbidity 0.841 vs 0.806) and has since shown consistent performance (AUROC >0.80 across outcomes).¹⁴

In 2023, Liu et al created an XGBoost model that matched regression-based NSQIP-SRC performance (mortality AUROC 0.949 vs 0.944; morbidity 0.767 vs 0.763).⁴¹ XGBoost was subsequently adopted as the official NSQIP-SRC model in June 2023.^41,42 The same authors in 2025 assessed a newer CatBoost algorithm and noted better calibration of XGBoost for mortality and of CatBoost for all-cause morbidity.⁴²

Comparisons suggest that the ML-based NSQIP-SRC generally surpasses POTTER in mortality discrimination, while POTTER performs slightly better for morbidity.^14,41,42,46 POTTER’s strengths include simplicity (fewer inputs) and interpretability, which may aid patient communication, as well as an exact, transparent decision tree used for outcome prediction.^14,37 Importantly, unlike NSQIP-SRC, POTTER was designed specifically for emergency surgeries.^14,46

Neural Network Models

While neural network (NN) models have generally underperformed DT-based models on tabular data, a DL model by Bonde et al performed similarly to POTTER for mortality (AUROC 0.912 vs 0.920, respectively), and demonstrated a modest improvement for morbidity prediction (AUROC 0.878 vs 0.851).^21,32 This DL model also exceeded the performance of the original regression-based NSQIP-SRC.^21,37 However, this model has not been made available publicly and has not been directly compared to ML-based NSQIP-SRC. The future implications of DL-based prediction models and their comparison to DT-based models are discussed further in our work on emerging AI models.

General Surgery

Hassan et al (2022) evaluated 9 ML algorithms for predicting postoperative complications after ventral hernia repair in a single-center cohort.⁴⁷ The best-performing models achieved AUROC values of 0.71 for hernia recurrence, 0.75 for surgical site occurrence (SSO), and 0.74 for 30-day readmission, all of which consistently outperformed regression models.⁴⁷ Notably, their models incorporated surgical technique variables (including rectus muscle violation, wound class, bridged repair, and component separation), whereas many prior approaches considered only preoperative factors. These models, however, were not publicly available.⁴⁷

Wu et al⁴⁸ (2024) similarly tested 5 ML models for predicting both surgical site infection (SSI) and SSO following elective open inguinal hernia repair. Their RF model performed best for both SSI and SSO, achieving AUROCs of 0.849 and 0.740, respectively. These models also considered procedural factors, notably operative time and use of antibiotic prophylaxis. Both the SSI and SSO prediction models are hosted online, though no external validation studies of these models were identified.^49,50

Choi et al (2023) tested 5 ML models in a ≥65-year-old general surgery population, with outcomes including a composite endpoint (90-day all-cause mortality or emergency department visit), prolonged postoperative stay, postoperative delirium, and prolonged total length of stay.⁵¹ They included 21,766 patients in their model training datasets and internally validated on 5431 patients in their testing cohorts. The model was subsequently externally validated on a cohort of 32,857 patients from another hospital. While RF and gradient boosting (GBM) performed strongly in training and internal testing, they suffered a 4%-11% AUROC decline with external validation, suggesting overfitting (Figure 2).^31,52 In contrast, LASSO regression was more stable and emerged as a top performer on external validation. This underscores the necessity of external validation before applying ML-based surgical risk models in clinical practice.

Figure 2.

Model Discrimination Degradation on Training vs Testing vs External Validation Datasets; Created Based on Data From Choi et al (2023).⁵¹

Trauma Surgery

Du et al (2025) developed ML models to predict trauma-induced coagulopathy (TIC) in 2067 operative trauma patients across multiple centers.⁵³ Using perioperative demographics, injury features, labs, and intraoperative variables, random forest (RF) achieved the best performance (AUROC 0.820). External validation in 863 patients showed reduced accuracy (AUROC 0.73), likely reflecting inter-site heterogeneity and severe class imbalance (TIC incidence 25.4% vs 2.9%). This highlights a major challenge in medical ML: outcomes of interest often occur infrequently, impairing generalizability.⁵⁴

Xiong et al (2025) studied 10,023 trauma patients from the MIMIC-IV dataset for TIC prediction and validated their model on 3212 patients across 3 centers.^55,56 RF again performed best, with stable discrimination across internal and external cohorts (AUROC 0.92 vs 0.91).^53,55 This large-scale, multicenter design and code availability strengthen reproducibility, though models were not clinically deployed.^55,57
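The contrast between the two TIC studies can be expressed as a relative decline in AUROC from internal to external validation. The sketch below uses the AUROC values quoted in the two preceding paragraphs; the function name is hypothetical, and expressing the decline relative to the internal value (rather than as an absolute difference) is an assumption made for this example.

```python
def relative_auroc_decline(internal: float, external: float) -> float:
    """Percentage drop in AUROC from internal to external validation."""
    return (internal - external) / internal * 100.0


# AUROC values as quoted in the text above (internal, external).
reported = {
    "Du et al (2025), RF": (0.820, 0.73),
    "Xiong et al (2025), RF": (0.92, 0.91),
}
for study, (internal, external) in reported.items():
    print(f"{study}: {relative_auroc_decline(internal, external):.1f}% decline")
```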

TIC is associated with high mortality, massive transfusion, and multi-organ failure, making early recognition critical.^53,55,58 While the above models remain research tools, they demonstrate how ML could enable earlier TIC identification and intervention.^53,55,58

Cardiothoracic Surgery

Traditional risk models in cardiothoracic surgery, most notably the EuroSCORE II and Society of Thoracic Surgeons (STS) scores, estimate operative mortality using limited regression-based variables, with variable accuracy across populations.^59-61

Weiss et al (2023) trained an XGBoost model using data from 6392 cardiac surgery patients, leveraging 4016 preoperative variables from their institutional electronic medical record (EMR).⁶² Across 7 cardiac surgery types, the model achieved an AUROC of 0.978, outperforming STS risk scores. Although neither externally validated nor publicly shared, the study illustrates the potential of granular, institution-specific ML models that can utilize local EMR data beyond standardized registries.

Vascular Surgery

Li et al published 2 studies employing ML models for predicting postoperative outcomes following open vascular bypass procedures.^63,64

In suprainguinal bypass (16,832 patients, Vascular Quality Initiative data), XGBoost predicted 1-year death or major adverse limb event (MALE) with AUROC 0.92 using preoperative data, improving to 0.98 when including postoperative data. Secondary outcomes such as revision, graft loss, and mortality were similarly predicted with AUROC >0.85 preoperatively and >0.95 postoperatively.⁶³

In infrainguinal bypass (24,309 NSQIP patients), XGBoost achieved AUROC >0.90 for most 30-day outcomes, including death, MALE, major cardiovascular events, MI, stroke, reintervention, amputation, and bleeding. In both studies, source code but not trained models were released.⁶⁴

While performance was strikingly high, the lack of external validation raises concerns of overfitting. These studies underscore the importance of external validation and open access to trained models to ensure reproducibility.^31,52,61

Surgical Oncology

As discussed earlier, Merath et al (2019) developed a DT model for postoperative complication rates in hepatic, pancreatic, and colorectal surgery.²⁷ Their model achieved AUROC 0.74 for 30-day morbidity, outperforming ASA classification (0.58) and regression-based NSQIP-SRC (0.71). AUROCs for individual complications ranged from 0.76 to 0.98, demonstrating improved granularity compared with conventional tools.

Orthopedic Surgery

Most orthopedic ML models described in the literature have remained limited to internal validation, with few undergoing external testing.

Chong et al (2025) predicted periprosthetic joint infection (PJI) after total knee arthroplasty using 3483 patients (81 infections).⁶⁵ A balanced random forest (BRF) addressed class imbalance and achieved AUROC 0.963, outperforming prior regression and NN approaches.^66,67 Chong et al also used Shapley Additive Explanations (SHAP) plots to visualize and explain individual predictors’ contributions to model predictions (Figure 3). Operative time, male sex, and ASA >2 were the strongest positive predictors of PJI, while spinal anesthesia was protective. No external validation was performed, nor was the model made publicly available.

Figure 3.

Shapley Additive Explanations (SHAP) Summary Bar Plot Showing Relative Predictor Importance. Importance is Ranked by SHAP Value, which Measures the Average Effect a Predictor had on Determining the Model’s Overall Predictions. Reproduced With Permission From Springer Nature, Chong et al (2025), Figure 5.⁶⁵
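
SHAP builds on Shapley values from cooperative game theory, in which each predictor's attribution is its average marginal contribution across all subsets of the other predictors. The sketch below computes exact Shapley values by enumeration for a toy 3-predictor risk function. The model, its coefficients, the patient values, and the convention of replacing "absent" predictors with baseline values are all illustrative assumptions; SHAP software typically relies on faster approximations, and this is not the model shown in Figure 3.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley attributions for one patient by enumerating feature subsets.
    Features outside a subset take their baseline value (an assumed convention)."""
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(features):
    """Hypothetical risk score with one interaction term; not a published model."""
    operative_hours, male, asa_over_2 = features
    return (0.02 * operative_hours + 0.01 * male + 0.03 * asa_over_2
            + 0.01 * male * asa_over_2)


patient = [3.0, 1, 1]     # hypothetical patient
baseline = [1.5, 0, 0]    # hypothetical reference patient
phi = shapley_values(toy_model, patient, baseline)
print([round(p, 4) for p in phi])
```

The attributions sum to the difference between the patient's prediction and the baseline prediction, which is the property that lets SHAP plots decompose an individual prediction into per-predictor contributions.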

While the Chong et al model performed well, methodological issues warrant caution.⁶⁵ Predictors were selected using regression on the full dataset, which risks data leakage: test set patients’ data can inform which variables are selected for model inclusion.²⁸ Per the taxonomy of data leakage defined by Kapoor et al (2023), “Feature selection on the entire dataset results in using information about which feature performs well on the test set to make a decision about which features should be included in the model.”²⁸

While we would not necessarily expect Chong et al’s regression results to differ drastically if applied to a 70%-80% training subsample of their cohort, it is worth noting that 5 of the 6 variables included in their models (Figure 3) had final regression P-values between 0.01 and 0.05.⁶⁵ The authors did not clearly disclose whether they used a completely separate test set (per the explanation in Table 2 of the original Chong et al article), making it plausible that their reported AUROC does not come from evaluation on a completely unseen patient sample.⁶⁵ Such practices reduce reproducibility and highlight the need for rigorous validation and transparent ML methodology.²⁸
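
The leakage-avoiding order of operations can be sketched as follows: hold out the test patients first, then select predictors using training patients only. The helper names, the synthetic noise data, and the use of an absolute difference in group means as a stand-in for regression-based screening are all assumptions made for illustration; this is not a reconstruction of the Chong et al pipeline.

```python
import random


def split_indices(n, test_fraction, rng):
    """Shuffle patient indices once and hold out a test set before any modeling."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]  # (train, test)


def select_features(X, y, rows, k):
    """Rank predictors by absolute difference in group means, using ONLY `rows`."""
    pos = [r for r in rows if y[r] == 1]
    neg = [r for r in rows if y[r] == 0]
    scores = []
    for j in range(len(X[0])):
        mean_pos = sum(X[r][j] for r in pos) / len(pos)
        mean_neg = sum(X[r][j] for r in neg) / len(neg)
        scores.append((abs(mean_pos - mean_neg), j))
    top = sorted(scores, reverse=True)[:k]
    return sorted(j for _, j in top)


rng = random.Random(42)
n, p = 200, 30
y = [1 if i % 4 == 0 else 0 for i in range(n)]               # synthetic outcomes
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]  # pure-noise predictors
train, test = split_indices(n, 0.25, rng)

leaky = select_features(X, y, train + test, k=5)  # test patients influence selection
clean = select_features(X, y, train, k=5)         # training patients only
print("selected with leakage:   ", leaky)
print("selected without leakage:", clean)
```

Because `clean` is computed from training rows alone, nothing about the held-out patients can influence which predictors enter the model, so a later AUROC on `test` reflects genuinely unseen data.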

Neurosurgery

Yin et al (2024) tested 5 ML algorithms to predict discharge Glasgow Outcome Scale (GOS) after surgery in 416 patients with moderate-to-severe TBI.⁶⁸ Temporal external validation 6 months later showed strong performance, with AUROCs of 0.861 (without laboratory values) and 0.890 (with blood chemistry/coagulation values). SHAP visualization showed that predictors such as a high Glasgow Coma Scale (GCS) score consistently drove favorable outcome predictions (Figure 4). SHAP visualization to aid model comprehensibility is further explored in our work on emerging AI models.

Figure 4.

Shapley Additive Explanations (SHAP) Dotplot Showing Each Patient’s Predictor Values, With Red and Blue Dots Indicating Positive and Negative Predictor Values, Respectively (eg, Strong Red for Glasgow Coma Scale (GCS) Near 15). X-Axis Position Shows Positive or Negative Impact on GOS (Glasgow Outcome Scale) Prediction, Akin to a Forest Plot. Clustered Red/Blue Dots on Either Side of the Zero-Effect Line Indicate Consistent Predictor-Outcome Relationships. Strong Red Clustering (High GCS) on the Right Suggests High GCS Predicts Better GOS. Reproduced With Permission From Springer Nature, Yin et al (2024), Figure 4.⁶⁸

Xu et al (2025) developed models for predicting postoperative pulmonary complications (PPCs) in neurosurgical patients.⁶⁹ A DL neural network with 35 predictors performed best on external validation (AUROC 0.835), but a simplified LASSO-logistic regression model with 11 predictors achieved nearly identical discrimination (AUROC 0.831). Both models outperformed standard risk scores, including Assess respiratory RIsk in Surgical patients in CATalonia (ARISCAT; AUROC 0.672) and Local ASsessment of VEntilatory management during General Anesthesia for Surgery (LAS VEGAS; AUROC 0.663). A nomogram derived from the regression model is hosted online and demonstrates how ML can translate into practical bedside tools.⁷⁰

The benefit of decompressive craniectomy for TBI remains debated, with prior work indicating lower mortality but greater rates of severe disability.⁷¹ The findings by Yin et al (2024) may guide the development of accurate predictive models, which can aid prognostication and targeted selection of the optimal operative candidates.⁶⁸ Similarly, the Xu et al work on predicting postoperative pulmonary complications in a broad neurosurgery cohort may facilitate early identification and management of at-risk patients.⁶⁹ Both studies provide examples of different yet effective approaches to explaining how ML models make predictions, which may aid understanding by both clinicians and patients.^68,69

Plastic and Reconstructive Surgery

High-quality ML studies in plastic surgery are limited, though 2 recent investigations stand out.^72,73

Braun et al (2023) developed an RF model to predict nipple-areolar complex (NAC) necrosis after nipple-sparing mastectomy in 181 patients.⁷² Internal validation yielded AUROC 0.99, with temporal external validation on 62 patients showing AUROC 0.95. Meyer et al (2025) externally validated the same model in 388 patients from a different institution, and achieved an AUROC of 0.70 for predicting NAC necrosis.⁷³

These studies offer 2 important lessons on adopting ML for predicting surgical complications and risk. First, exceptionally high AUROCs in small, internally validated datasets often reflect overfitting, with inevitable performance loss in new populations.⁵² Second, this pair of studies demonstrates good methodology: the same model was subjected to temporal and multicenter external validation, showing that even imperfect but rigorously tested models can retain clinical value.^72-74 Public release would further enable independent validation and refinement.

Call for Open Access to ML Models

A consistent theme across specialties is the need for open access to trained ML models. Source code alone is insufficient: finished models encode hundreds or thousands of tuned parameters derived from the training dataset. For the field to progress, models that demonstrate strong initial performance must be shared to allow replication, external validation, iterative improvement, and eventual clinical deployment. ML can only improve surgical practice if tested models move beyond proof-of-concept studies and into real-world clinical evaluation with external validation.
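
The difference between sharing source code and sharing a trained model can be sketched as follows. In this minimal Python example, the fitted parameters of a logistic regression are exported to JSON and reloaded, as an external site might do, to reproduce the same predictions without access to the training data. The predictor names, coefficients, and patient values are hypothetical placeholders, and more complex models would need a format suited to their structure.

```python
import json
import math


def predict_risk(params, patient):
    """Logistic model: risk = 1 / (1 + exp(-(intercept + sum(coef * value))))."""
    z = params["intercept"]
    for name, coef in params["coefficients"].items():
        z += coef * patient[name]
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical fitted parameters; in practice these come from the training cohort.
trained = {
    "model": "logistic_regression",
    "intercept": -3.0,
    "coefficients": {"age_decades": 0.25, "asa_over_2": 0.8, "operative_hours": 0.3},
}

shared_file = json.dumps(trained, indent=2)  # what the authors would publish
reloaded = json.loads(shared_file)           # what an external site would load

patient = {"age_decades": 7.0, "asa_over_2": 1, "operative_hours": 2.5}
original_risk = predict_risk(trained, patient)
external_risk = predict_risk(reloaded, patient)
print(f"original site: {original_risk:.4f}, external site: {external_risk:.4f}")
```

With only the training code, an external group would need the original dataset to recover these parameters; with the exported file, they can validate the exact model on their own cohort.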

Conclusions

Machine learning has emerged as a transformative tool for predicting surgical complications, offering capabilities that extend beyond conventional statistical approaches. By capturing complex, nonlinear, and interactional effects inherent in surgical care, ML provides a more nuanced and reproducible framework for prognostication. As these models continue to mature, they hold the potential to enhance patient selection, refine risk-benefit discussions, and support truly informed shared decision-making. While ML carries both advantages and limitations, its thoughtful integration into practice can significantly elevate surgical precision, safety, and outcomes in the years to come.

Footnotes

ORCID iDs

David Limon

Niruktha Raghavan

Miranda X. Morris

Aashish Rajesh

Author Contributions

DL, VS – conceptualization, draft of preliminary manuscript, revision, approval of final version. MXM, NR, MM – conceptualization, critical review of manuscript with revision for incorporating intellectual content, approval of final version. AR – senior author, conceptualization, critical review of manuscript with revision for incorporating intellectual content, approval of final version.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Alowais, Alghamdi, Alsuhebany, et al. Revolutionizing healthcare: the role of artificial intelligence in clinical practice. BMC Med Educ. 2023;23(1):689. doi:10.1186/s12909-023-04698-z

2. Russell, Norvig. Artificial Intelligence: A Modern Approach. 3rd ed. Pearson; 2016.

3. Morris, Rajesh, Asaad, Hassan, Saadoun, Butler. Deep learning applications in surgery: current uses and future directions. Am Surg. 2023;89(1):36-42. doi:10.1177/00031348221101490

4. Pedregosa, Varoquaux, Gramfort, et al. Scikit-learn: machine learning in Python. arXiv. Preprint posted online June 5, 2018. doi:10.48550/arXiv.1201.0490

5. Schmidhuber. Annotated history of modern AI and deep learning. arXiv. Preprint posted online 2022.

6. Alom, Taha, Yakopcic, et al. The history began from AlexNet: a comprehensive survey on deep learning approaches. arXiv. Preprint posted online 2018.

7. Krizhevsky, Sutskever, Hinton. ImageNet classification with deep convolutional neural networks. Commun ACM. 2017;60(6):84-90. doi:10.1145/3065386

8. Brin, Sorin, Vaid, et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci Rep. 2023;13(1):16492. doi:10.1038/s41598-023-43436-9

9. Churpek, Yuen, Winslow, Meltzer, Kattan, Edelson. Multicenter comparison of machine learning methods and conventional regression for predicting clinical deterioration on the wards. Crit Care Med. 2016;44(2):368-374. doi:10.1097/CCM.0000000000001571

10. Hassan, Rajesh, Asaad, et al. Artificial intelligence and machine learning in prediction of surgical complications: current state, applications, and implications. Am Surg. 2023;89(1):25-30. doi:10.1177/00031348221101488

11. Rajula HSR, Verlato, Manchia, Antonucci, Fanos. Comparison of conventional statistical methods with machine learning in medicine: diagnosis, drug development, and treatment. Medicina. 2020;56(9):455. doi:10.3390/medicina56090455

12. Rajesh, Chartier, Asaad, Butler. A synopsis of artificial intelligence and its applications in surgery. Am Surg. 2023;89(1):20-24. doi:10.1177/00031348221109450

13. Loftus, Tighe, Filiberto, et al. Artificial intelligence and surgical decision-making. JAMA Surg. 2020;155(2):148-158. doi:10.1001/jamasurg.2019.4917

14. Bertsimas, Dunn, Velmahos, Kaafarani HMA. Surgical risk is not linear: derivation and validation of a novel, user-friendly, and machine-learning-based predictive OpTimal trees in emergency surgery risk (POTTER) calculator. Ann Surg. 2018;268(4):574-583. doi:10.1097/sla.0000000000002956

15. Hassan, Rajesh, Asaad, et al. A surgeon’s guide to artificial intelligence-driven predictive models. Am Surg. 2023;89(1):11-19. doi:10.1177/00031348221103648

16. Merath, Chen, Bagante, et al. Synergistic effects of perioperative complications on 30-day mortality following hepatopancreatic surgery. J Gastrointest Surg. 2018;22(10):1715-1723. doi:10.1007/s11605-018-3829-3

17. Chambers, Hastie. Chapter 2. Statistical models. In: Chambers, Hastie, eds. Statistical Models in S. Reprint. Chapman & Hall Computer Science Series. Chapman & Hall; 1997:13-31. https://mathematics.foi.hr/Rprojekti/knjige/statistical-models-in-s.pdf

18. O’Brien, Silcox. Nonlinear regression modelling: a primer with applications and caveats. Bull Math Biol. 2024;86(4):40. doi:10.1007/s11538-024-01274-4

19. Shiroshita, Yamamoto, Saka, et al. Expanding the scope: in-depth review of interaction in regression models. ACE. 2024;6(2):25-32. doi:10.37737/ace.24005

20. Rimpler, Kiers HAL, Van Ravenzwaaij. To interact or not to interact: the pros and cons of including interactions in linear regression models. Behav Res. 2025;57(3):92. doi:10.3758/s13428-025-02613-6

21. Bonde, Varadarajan, Bonde, et al. Assessing the utility of deep neural networks in predicting postoperative surgical complications: a retrospective study. Lancet Digit Health. 2021;3(8):e471-e485. doi:10.1016/S2589-7500(21)00084-4

22. Cybenko. Approximation by superpositions of a sigmoidal function. Math Control, Signals, Syst. 1989;2(4):303-314. doi:10.1007/BF02551274

23. Chapter. A visual proof that neural nets can compute any function. In: Neural Networks and Deep Learning. Determination Press; 2015. https://neuralnetworksanddeeplearning.com/index.html. Accessed August 9, 2025.

24. Zhou. Universality of deep convolutional neural networks. Appl Comput Harmon Anal. 2020;48(2):787-794. doi:10.1016/j.acha.2019.06.004

25. Heinecke, Hwang. Refinement and universal approximation via sparsely connected ReLU convolution nets. IEEE Signal Process Lett. 2020;27:1175-1179. doi:10.1109/LSP.2020.3005051

26. Sivakumar, Parthasarathy, Padmapriya. Trade-off between training and testing ratio in machine learning for medical image processing. PeerJ Comput Sci. 2024;10:e2245. doi:10.7717/peerj-cs.2245

27. Merath, Hyer, Mehta, et al. Use of machine learning for prediction of patient risk of postoperative complications after liver, pancreatic, and colorectal surgery. J Gastrointest Surg. 2019;24(8):1843-1851. doi:10.1007/s11605-019-04338-2

28. Kapoor, Narayanan. Leakage and the reproducibility crisis in machine-learning-based science. Patterns. 2023;4(9):100804. doi:10.1016/j.patter.2023.100804

29. Azzolina, Baldi I, Barbati, et al. Machine learning in clinical and epidemiological research: isn’t it time for biostatisticians to work on it? Ebph. 2022;16:4. doi:10.2427/13245

30. Song. Decision tree methods: applications for classification and prediction. Shanghai Arch Psychiatry. 2015;27(2):130-135. doi:10.11919/j.issn.1002-0829.215044

31. Montesinos López, Crossa. Chapter 4. Overfitting, model tuning, and evaluation of prediction performance. In: Multivariate Statistical Machine Learning Methods for Genomic Prediction. Springer International Publishing; 2022:109-139. doi:10.1007/978-3-030-89010-0_4

32. Yıldız, Kalayci. Gradient boosting decision trees on medical diagnosis over tabular data. arXiv. Preprint posted online July 6, 2025. doi:10.48550/arXiv.2410.03705

33. Chen, Guestrin. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016. 2016:785-794. doi:10.1145/2939672.2939785

34. Breiman. Random forests. Mach Learn. 2001;45(1):5-32. doi:10.1023/a:1010933404324

35. Dorogush, Ershov, Gulin. CatBoost: gradient boosting with categorical features support. arXiv. Preprint posted online October 24, 2018. doi:10.48550/arXiv.1810.11363

36. Hendrix, Garmon. American Society of Anesthesiologists physical status classification system. In: StatPearls. StatPearls Publishing; 2025. https://www.ncbi.nlm.nih.gov/books/NBK441940/. Accessed July 7, 2025.

37. Bilimoria, Liu, Paruch, et al. Development and evaluation of the universal ACS NSQIP surgical risk calculator: a decision aid and informed consent tool for patients and surgeons. J Am Coll Surg. 2013;217(5):833-842.e3. doi:10.1016/j.jamcollsurg.2013.07.385

38. Tollinche, Yang, Tan, Borchardt. Interrater variability in ASA physical status assignment: an analysis in the pediatric cancer setting. J Anesth. 2018;32(2):211-218. doi:10.1007/s00540-018-2463-2

39. Gawande, Kwaan, Regenbogen, Lipsitz, Zinner. An Apgar score for surgery. J Am Coll Surg. 2007;204(2):201-208. doi:10.1016/j.jamcollsurg.2006.11.011

40. Pittman, Dixon, Duttchen. The surgical Apgar score: a systematic review of its discriminatory performance. Ann Surg Open. 2022;3(4):e227. doi:10.1097/AS9.0000000000000227

41. Liu, Hall, Cohen. ACS NSQIP risk calculator accuracy using a machine learning algorithm compared to regression. J Am Coll Surg. 2023;236:1024-1030. doi:10.1097/xcs.0000000000000556

42. Cohen, Liu, Hall. ACS NSQIP risk calculator performance across multiple domains of operative risk and risk-associated features. Ann Surg. 2025. doi:10.1097/sla.0000000000006753

43. Lubitz, Chan, Zarif, et al. American College of Surgeons NSQIP risk calculator accuracy for emergent and elective colorectal operations. J Am Coll Surg. 2017;225(5):601-611. doi:10.1016/j.jamcollsurg.2017.07.1069

44. Copeland, Jones, Walters. POSSUM: a scoring system for surgical audit. Br J Surg. 1991;78(3):355-360. doi:10.1002/bjs.1800780327

45. Prytherch, Whiteley, Higgins, Weaver, Prout, Powell. POSSUM and Portsmouth POSSUM for predicting mortality. Br J Surg. 1998;85(9):1217-1220. doi:10.1046/j.1365-2168.1998.00840.x

46. HMW, Maurer, Levine, et al. Validation of the artificial intelligence-based predictive optimal trees in emergency surgery risk (POTTER) calculator in emergency general surgery and emergency laparotomy patients. J Am Coll Surg. 2021;232(6):912-919.e1. doi:10.1016/j.jamcollsurg.2021.02.009

47. Hassan, Asaad, et al. Novel machine learning approach for the prediction of hernia recurrence, surgical complication, and 30-day readmission after abdominal wall reconstruction. J Am Coll Surg. 2022;234(5):918-927. doi:10.1097/XCS.0000000000000141

48. Shi, Song, Peng, Yang. Application of machine learning algorithms to predict postoperative surgical site infections and surgical site occurrences following inguinal hernia surgery. Hernia. 2024;28(6):2343-2354. doi:10.1007/s10029-024-03167-w

49. Shi, Song, Peng, Yang. An online calculator for predcting SSO after groin hernia surgery. https://wuqian17.shinyapps.io/predictionSSO/. Published online September 17, 2024. Accessed August 15, 2025.

50. Shi, Song, Peng, Yang. An online calculator for predcting SSI after groin hernia surgery. https://wuqian17.shinyapps.io/predictionSSI/. Published online September 17, 2024. Accessed August 15, 2025.

51. Choi, Yoo, Song, et al. Development and validation of a prognostic classification model predicting postoperative adverse outcomes in older surgical patients using a machine learning algorithm: retrospective observational network study. J Med Internet Res. 2023;25:e42259. doi:10.2196/42259

52. van Leeuwen, Steyerberg, van Klaveren, Wessler, Kent, van Zwet. Instability of the AUROC of clinical prediction models. Stat Med. 2025;44(5):e70011. doi:10.1002/sim.70011

53. Wang, et al. Predicting postoperative trauma-induced coagulopathy in patients with severe injuries by machine learning. Sci Rep. 2025;15(1):27072. doi:10.1038/s41598-025-13283-x

54. Salmi, Atif, Oliva, Abraham, Ventura. Handling imbalanced medical datasets: review of a decade of research. Artif Intell Rev. 2024;57(10):273. doi:10.1007/s10462-024-10884-2

55. Xiong, et al. Ten machine learning models for predicting preoperative and postoperative coagulopathy in patients with trauma: multicenter cohort study. J Med Internet Res. 2025;27:e66612. doi:10.2196/66612

56. Johnson AEW, Bulgarelli, Shen, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data. 2023;10(1):1. doi:10.1038/s41597-022-01899-x

57. PPTIC-ML. https://github.com/kk1937/PPTIC-ML. Accessed August 20, 2025.

58. Rossaint, Afshari, Bouillon, et al. The European guideline on management of major bleeding and coagulopathy following trauma: sixth edition. Crit Care. 2023;27(1):80. doi:10.1186/s13054-023-04327-7

59. Nashef SAM, Roques, Sharples, et al. EuroSCORE II. Eur J Cardio Thorac Surg. 2012;41(4):734-745. doi:10.1093/ejcts/ezs043

60. Shahian, O’Brien, Filardo, et al. The Society of Thoracic Surgeons 2008 cardiac surgery risk models: part 1, coronary artery bypass grafting surgery. Ann Thorac Surg. 2009;88(1):S2-S22. doi:10.1016/j.athoracsur.2009.05.053

61. Dong, Sinha, Zhai, et al. Performance drift in machine learning models for cardiac surgery risk prediction: retrospective analysis. JMIRx Med. 2024;5:e45973. doi:10.2196/45973

62. Weiss, Yadaw, Meretzky, et al. Machine learning using institution-specific multi-modal electronic health records improves mortality risk prediction for cardiac surgery patients. JTCVS Open. 2023;14:214-251. doi:10.1016/j.xjon.2023.03.010

63. Eisenberg, Beaton, et al. Using machine learning to predict outcomes following suprainguinal bypass. J Vasc Surg. 2024;79(3):593-608.e8. doi:10.1016/j.jvs.2023.09.037

64. Verma, Beaton, et al. Predicting outcomes following lower extremity open revascularization using machine learning. Sci Rep. 2024;14(1):2899. doi:10.1038/s41598-024-52944-1

65. Chong, Lau CML, Jiang, et al. Predicting periprosthetic joint infection in primary total knee arthroplasty: a machine learning model integrating preoperative and perioperative risk factors. BMC Musculoskelet Disord. 2025;26(1):241. doi:10.1186/s12891-025-08296-6

66. Espindola, Vella, Benito, et al. Preoperative and perioperative risk factors, and risk score development for prosthetic joint infection due to Staphylococcus aureus: a multinational matched case-control study. Clin Microbiol Infect. 2022;28(10):1359-1366. doi:10.1016/j.cmi.2022.05.010

67. Yeo, Klemt, Robinson, Esposito, Uzosike, Kwon. The use of artificial neural networks for the prediction of surgical site infection following TKA. J Knee Surg. 2023;36(06):637-643. doi:10.1055/s-0041-1741396

68. Yin, Zhang, et al. Machine learning prediction models for in-hospital postoperative functional outcome after moderate-to-severe traumatic brain injury. Eur J Trauma Emerg Surg. 2024;50(4):1219-1228. doi:10.1007/s00068-023-02434-2

69. Zhu, Hou, et al. Development and multicenter validation of machine learning models for predicting postoperative pulmonary complications after neurosurgery. Chin Med J. 2025;13:2170-2179. doi:10.1097/CM9.0000000000003433

70. Dynnomap. https://xuming.shinyapps.io/dynnomapp/. Accessed August 22, 2025.

71. Zhu, Wang, Zhang. Decompressive craniectomy for patients with traumatic brain injury: a pooled analysis of randomized controlled trials. World Neurosurg. 2020;133:e135-e148. doi:10.1016/j.wneu.2019.08.184

72. Braun, Sinik, Meyer, Larson, Butterworth. Predicting complications in breast reconstruction: development and prospective validation of a machine learning model. Ann Plast Surg. 2023;91(2):282-286. doi:10.1097/SAP.0000000000003621

73. Meyer, Kim, Eom, et al. Predicting complications in breast reconstruction: external validation of a machine learning model. J Plast Reconstr Aesthetic Surg. 2025;107:176-181. doi:10.1016/j.bjps.2025.06.020

74. White, Parsons, Collins, Barnett. Evidence of questionable research practices in clinical prediction models. BMC Med. 2023;21(1):339. doi:10.1186/s12916-023-03048-6

Artificial Intelligence in Surgery Revisited: A 2025 Update on Machine Learning for Predicting Complications and Outcomes
