Abstract

Artificial intelligence (AI) and machine learning (ML) are rapidly transforming surgery, moving beyond traditional risk prediction to real-time clinical support and intraoperative assistance. However, successful integration requires clinicians to understand key methodological challenges, including overfitting, data bias, and the “black box” nature of many models, which can obscure interpretability and limit generalizability. Recent advances demonstrate AI’s growing ability to process text and audiovisual data to streamline documentation, enhance intraoperative decision-making, and even perform basic operative tasks through robotic automation. This review outlines core ML principles relevant to surgical applications, discusses data modalities and evaluation metrics, and highlights emerging models that exemplify the evolving role of AI in the operating room. As these systems progress from experimental to practical use, understanding both their potential and limitations will be essential to ensure safe, effective, and ethically sound adoption in surgical practice.

Keywords

artificial intelligence, machine learning, algorithms, predictive modeling, surgical innovation

Introduction

Artificial Intelligence (AI) is a multidisciplinary field focused on building computer programs that perform tasks requiring human-like cognition.¹ Current implementations are almost exclusively artificial narrow intelligence (ANI), which is designed for specific tasks.² In contrast, human intellect is remarkably general-purpose, in that our innate “programming” can apply our intelligence to a diverse set of cognitive tasks.³ The theoretical general-purpose AI is termed artificial general intelligence (AGI).²

Machine learning (ML) is an approach to AI that employs algorithms which autonomously learn from data to perform AI tasks, and makes up the preponderance of common AI implementations.^4,5 ML algorithms, as elaborated upon in this review, are numerous and diverse. Neural networks (NNs) are a large class of ML algorithms that are modeled after the synaptic wiring of human neuronal circuits.^6,7 Many current NN implementations qualify as deep learning (DL), where multiple hidden layers allow hierarchical processing of data, akin to how the human brain processes visual information.⁷

Computer vision (CV) and natural language processing (NLP) describe certain AI objectives: understanding, recognition, classification, and/or reproduction of visual-medium or human language data, respectively; neither term prescribes a particular implementation.^7,8 Large language models (LLMs), including ChatGPT (OpenAI, Inc, San Francisco, CA), Claude (Anthropic PBC, San Francisco, CA), Grok (xAI Corp., Palo Alto, CA), and others, are NLP systems built on massive DL architectures containing billions of parameters, enabling them to generate contextually relevant text at scale.⁹

Relevance to Surgical Practice

Fundamentally, ML models are predictive engines that differ in how they learn, generalize, and handle diverse data types. Many aspects of surgical care can be framed as predictive tasks, making ML relevant across the continuum of clinical practice. For example, documentation support with LLMs can be modeled as complex “auto-complete” tasks, incorporating patient context, structured data, and even audio from surgeon–patient encounters.⁹

Predictive ML models can improve clinical efficiency by automating the burden of certain administrative tasks.¹⁰ A 1-year retrospective study by The Permanente Medical Group (TPMG; Oakland, CA) following implementation of AI scribes (so-called “ambient AI,” see discussion of audio-based ML models) found that the heaviest users of AI tools saved time in note writing relative to infrequent and non-users.¹⁰ However, actual time saved by even heavy users was quite modest: less than 1 minute per note.¹¹ As models advance, their adoption will likely depend on whether they deliver efficiency gains or accuracy improvements substantial enough to outweigh familiarity with current workflows.

This review aims to contextualize the opportunities and limitations of AI for surgeons, highlighting tasks these tools are already impacting and areas where further evolution is needed. Specifically, we address the technical aspects of algorithm and AI utilization to guide a surgeon in clinical practice.

Fundamentals of ML Model Development

Data Preprocessing and Model Selection

ML model development begins with a defined clinical question: patient population, timeframe, intervention(s), and outcome(s). These decisions guide data requirements, algorithm selection, and evaluation strategies. High-quality data are essential, typically sourced from surgical registries or electronic health records (EHRs). Preprocessing includes handling missing/errant values and formatting raw variables into meaningful features.

The data set is then split into development and evaluation subsets. Typically, 70-80% of data are allocated for training and 20-30% for testing, without overlap to prevent information leakage. Feature selection may also be applied to prevent overfitting, wherein a model becomes too finely tuned to training data, learning not only meaningful patterns but also random noise, which reduces performance on new, unseen cases. Next, an algorithm is selected and undergoes hyperparameter tuning: adjusting the settings that control how a model learns, such as the number of trees in a random forest model, boundary complexity in support vector machines (used to separate outcomes), or layer and connection complexity in NNs. Hyperparameters determine model complexity and flexibility, in turn dictating underfitting and overfitting tendencies during cross-validation.¹²

If only one data set is available, a “hold-out split” randomly sequesters 20-30% of data for testing before model training, approximating a truly novel data set. However, testing on completely independent data sets is preferable whenever possible. Cross-validation provides more reliable estimates by repeatedly splitting the training data, training on some folds, and validating on the rest in rotation. Averaging results across folds gives more stable performance estimates. Careful preprocessing and validation design minimize overfitting and optimize models for training and evaluation.¹³
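As a minimal sketch of the split-then-tune workflow described above, the following Python example uses scikit-learn on synthetic data. The data, the 75/25 split, the small hyperparameter grid, and the use of AUROC as the tuning metric are all illustrative assumptions; they are not the settings used for the example model in this review.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for a registry extract (assumption): 500 patients,
# 10 preoperative features, roughly 15% with the adverse outcome.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4,
    weights=[0.85, 0.15], random_state=0,
)

# Hold-out split: 25% of patients are sequestered before any training.
# Stratifying keeps the outcome rate similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hyperparameter tuning with 5-fold cross-validation on the training set only.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid, scoring="roc_auc", cv=folds,
)
search.fit(X_train, y_train)

best_params = search.best_params_
cv_auroc = search.best_score_  # mean AUROC across the validation folds
# The test set is used once, after tuning is finished.
test_auroc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
```

Keeping the test subset out of the tuning loop is the design choice that matters here: the cross-validated score guides hyperparameter selection, while the hold-out score approximates performance on unseen patients.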

Example Model Creation

An example random forest (RF) classifier was built using the open-source University of California, Irvine (UCI) ML Repository Thoracic Surgery Data to illustrate how an ML model works.¹⁴ Mortality among patients with primary lung cancer following “major lung resections” was predicted from preoperative features. This model serves as the basis for further discussion of graphical ML model evaluation and interpretation throughout this section (Figures 1-3).

Figure 1.

Example Calibration Curve Showing Observed Proportion of Survivors (Positives) as a Function of Model’s Predicted Survival Probability. Constant-Slope (Blue) Line Represents Perfect Calibration: For Each Given Survival Probability Prediction, the Observed Proportion of Survivors Would Match Exactly. Variable (Orange) Line Indicates Observed Calibration; for Most Predicted Probabilities, Observed Proportion of Survivors Matched Acceptably, Except Around Predicted Survival Probability of 50%. Model Created Using Open-Source University of California, Irvine Machine Learning Repository Thoracic Surgery Data.¹⁴

Figure 2.

Notional Representation of Algorithms’ Approximate Explainability-Performance Tradeoff. In Practice, Relative Performance Depends on Data Format and Modality, Linearity, Class Imbalance, Sample Size, etc. (DL, Deep Learning; SVM, Support Vector Machines; KNN, K-Nearest Neighbors; DT, Decision Trees)

Figure 3.

Common Types of Shapley Additive Explanations (SHAP) Plots, Created From an Example Random Forest Model Using Preoperative Factors to Predict One-Year Mortality Following Pulmonary Resection for Lung Cancer, From the Open-Source University of California, Irvine Machine Learning Repository Thoracic Surgery Data.¹⁴ (A) SHAP Bar Plot Showing Predictors’ Average Impacts on Predicted Mortality Probability. (B) SHAP Beeswarm Plot Showing Each Predictor’s per-Patient Impact on Predicted Probability of Mortality; Red and Blue Dots Indicate High and Low Predictor Values, Respectively. Color Clustering to Either Side of the Zero-Effect Line Implies a Consistent Effect of the Predictor on Mortality Probability Prediction. (C) SHAP Waterfall for a Single Patient, Showing How Each Feature Affected the Model’s Mortality Probability Prediction for This Patient

Evaluation Parameters

Multiple candidate models are typically created, then evaluated across several performance metrics. Special care is needed with imbalanced data, common in surgical outcomes, as naive models may appear highly accurate by simply predicting the majority outcome. Therefore, reporting multiple evaluation measures (eg, sensitivity, specificity, area under the receiver operating characteristic [AUROC] curve, and average precision [AP]) is critical to ensure balanced performance and clinical relevance. Common ML evaluation parameters are described in Table 1.¹⁵

Table 1.

Common Discrimination and Calibration Metrics

Evaluation metric	Description
Discrimination
Accuracy	Proportion of correct predictions, among all predictions made. With imbalanced data, accuracy is biased toward the majority class; in a disease with 95% survival, a model achieves 95% accuracy by always predicting survival
Sensitivity (positive recall)	Proportion of true-positive predictions, among all patients with target outcome. For highly morbid outcomes, models may prioritize sensitivity to minimize false-negative predictions
Specificity (negative recall)	Proportion of true-negative predictions, among all patients without target outcome. Recall metrics (sensitivity and specificity) are intrinsic to a test (ie, a model), and agnostic to population prevalence
Positive predictive value (PPV; positive precision)	Proportion of true-positive predictions, among all positive predictions. Precision metrics (PPV and NPV) are affected by population prevalence (ie, class balance)
Negative predictive value (NPV; negative precision)	Proportion of true-negative predictions, among all negative predictions. Models predicting highly morbid outcomes may also maximize NPV to ensure that false-negative predictions are minimized
Area under the receiver operating characteristic curve (AUROC)	Measures how well a model can distinguish between patients with and without an outcome. It represents the likelihood that a randomly chosen patient with the outcome will have a higher predicted risk than one without it. AUROC of 0.5 indicates random chance, while 1.0 indicates perfect discrimination. However, because AUROC weighs false-positives and false-negatives equally, it can give overly optimistic results on imbalanced data sets with rare outcomes
Area under the precision-recall curve (AUPRC)	Balances sensitivity and PPV. Baseline chance performance is the prevalence of positive outcomes. AUPRC prioritizes detecting positive outcomes, even on imbalanced data with rare positive outcomes
Average precision (AP)	Weighted mean precision score across all decision thresholds. AP is a particular estimation method for AUPRC, which may be less biased particularly in class-imbalanced data
F₁ score (Dice score)	Harmonic mean of sensitivity and PPV, which equally penalizes false-positive vs false-negative predictions. F₁ is often called Dice score in CV tasks; it is closely related to, but not the inverse of, the intersection-over-union score (IoU), with IoU = F₁/(2 − F₁)
F₂ score	F₁ variant with an additional scaling factor to penalize false-negatives more than false-positives
F_β score	Generalization of F₁ and F₂, where β can be chosen to weight models’ “tolerance” toward false-positives vs false-negatives. β > 1 is more tolerant of false-positives; β < 1 is more tolerant of false-negatives
Calibration
Brier score	Mean squared error (MSE) between predicted probabilities vs actual outcomes. MSE increases with the square of increasing prediction error. Lower Brier score represents more accurate predictions
Log loss (cross-entropy loss)	Mean negative log-likelihood of predicted probabilities vs actual outcomes. Log loss penalizes large prediction errors more strongly than small errors. Lower log loss represents greater predictive accuracy
Expected calibration error (ECE)	Predicted probabilities are binned into ranges; within each bin, the mean absolute-value error (MAE) between predicted probabilities and actual outcomes is calculated. ECE is the sample-size-weighted mean of MAE across all bins. Uneven sample size across bins (seen in class imbalance) can skew ECE
Maximum calibration error (MCE)	Like ECE, MCE measures the mean absolute error in each predicted probability bin. MCE is the MAE of the worst-performing bin. MCE penalizes worst-case error, whereas ECE penalizes mean error

PPV, positive predictive value; NPV, negative predictive value; AUROC, area under the receiver operating characteristic curve; AUPRC, area under the precision-recall curve; AP, average precision; CV, computer vision; IoU, intersection-over-union score; MSE, mean squared error; MAE, mean absolute error; ECE, expected calibration error; MCE, maximum calibration error.
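To make the discrimination metrics in Table 1 concrete, the sketch below computes several of them from first principles in plain Python. The 100-patient cohort and both sets of predictions are hypothetical, chosen only to show how a model that always predicts the majority class can reach 95% accuracy while detecting no complications.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels, where 1 is the target outcome."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, tn, fn


def ratio(num, den):
    """Division that returns NaN when the denominator is zero."""
    return num / den if den else float("nan")


def discrimination_metrics(y_true, y_pred, beta=1.0):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    b2 = beta ** 2
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        # beta > 1 weights false-negatives more heavily than false-positives
        "f_beta": ratio((1 + b2) * tp, (1 + b2) * tp + b2 * fn + fp),
    }


def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical cohort: 100 patients, 5 of whom have the complication.
y_true = [1] * 5 + [0] * 95
naive_pred = [0] * 100                                # always predicts "no complication"
model_pred = [1] * 4 + [0] * 1 + [1] * 10 + [0] * 85  # 4 TP, 1 FN, 10 FP, 85 TN

naive = discrimination_metrics(y_true, naive_pred)
model_f1 = discrimination_metrics(y_true, model_pred, beta=1.0)
model_f2 = discrimination_metrics(y_true, model_pred, beta=2.0)
constant_auroc = auroc(y_true, [0.05] * 100)          # uninformative risk scores
```

In this example the naive model has higher accuracy (0.95 vs 0.89) but zero sensitivity, whereas the second model detects four of five complications at the cost of a low PPV, which is why a single metric rarely suffices.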

ML models can be benchmarked using many different measures, but classification metrics are broadly grouped into measures of “discrimination” or “calibration.” Whereas discrimination relates to a model’s ability to correctly classify outcomes (eg, whether or not a patient will experience a given complication), calibration describes the error between the model’s predicted vs observed probability of an outcome within the study population (Figure 1). For example, if a group of patients all have an estimated 30% chance of a given complication, 30% of this group should experience the complication in a well-calibrated model. While discrimination metrics are important, many studies overlook reporting calibration plots.¹⁶ Over-reliance on discrimination metrics can lead to overprediction or underprediction of events, which can be mitigated by also assessing calibration.¹⁷
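The calibration idea above can be sketched numerically. The example below computes the Brier score, log loss, ECE, and MCE in plain Python for a hypothetical cohort. It assumes the commonly used binned definition of calibration error (the gap between the mean predicted probability and the observed event rate in each bin) and 10 equal-width bins; both are assumptions made for illustration.

```python
import math


def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def calibration_errors(y_true, probs, n_bins=10):
    """Return (ECE, MCE) using equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        bins[min(int(p * n_bins), n_bins - 1)].append((y, p))
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        gap = abs(observed - predicted)
        ece += len(members) / len(y_true) * gap  # weighted by bin size
        mce = max(mce, gap)                      # worst-performing bin
    return ece, mce


# Hypothetical cohort: 10 patients predicted at 30% risk (3 events observed,
# well calibrated) and 10 patients predicted at 80% risk (5 events observed,
# risk overpredicted).
probs = [0.3] * 10 + [0.8] * 10
y_true = [1] * 3 + [0] * 7 + [1] * 5 + [0] * 5

brier = brier_score(y_true, probs)
loss = log_loss(y_true, probs)
ece, mce = calibration_errors(y_true, probs)
```

Here the 30% group contributes no calibration error, while the 80% group is off by 0.30, giving an ECE of 0.15 and an MCE of 0.30.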

Model Validation

Arshi et al¹⁸ (2025) concluded that only one in six clinical prediction models are externally validated, and impact assessments are completed for even fewer. The disparity between the number of clinical models being created vs validated is indeed concerning, and reflects important missing steps in the eventual path to model deployment and use. The lack of impact assessment studies makes it difficult to ascertain whether a model outperforms standard clinician practices, further impeding model deployment and adoption into regular clinical use.¹⁹ Clinicians critically appraising models should consider the use of temporal (collected at a separate time) or geographic (collected at a different institution) validation cohorts, and look for externally validated models when considering clinical implementation. For example, our illustrative RF model created from UCI Thoracic Surgery Data underwent internal validation only, which risks performance degradation in practice if applied to broader populations.¹⁴ Validating models across disparate cohorts promotes generalizability. This topic is further explored in our accompanying work on ML prediction models for surgical complications and outcomes.²⁰

Interpreting ML Models

While ML models are often trained to achieve maximum accuracy, clinicians may hesitate to adopt them if the reasoning behind their predictions is unclear. “Interpretability” in AI refers to the degree to which a human can understand and trust the decisions made by AI systems.²¹ Importantly, interpretability and performance often exist in tension: complex models such as DL-NNs may achieve higher accuracy but are frequently viewed as “black boxes.” Thus, selecting an appropriate model for surgical applications often requires balancing predictive performance with the ability to explain and justify predictions (Figure 2).²²

Shapley additive explanations (SHAP) techniques are often used to explain how much each feature contributes to a model’s predictions by attributing the difference between an individual prediction and the mean prediction to the features involved. SHAP assigns a contribution value to each feature by comparing predictions made with and without that feature. In practical terms, it helps clinicians understand why the model reached its decision for a specific patient. As an example, SHAP may determine that high intraoperative blood loss or advanced age most strongly influenced a predicted risk of postoperative complication. SHAP plots are intuitive, visually ranking variables by their contribution to an outcome, making them particularly helpful for clinical interpretation (Figure 3).
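The “with and without that feature” comparison can be illustrated by computing exact Shapley values for a very small model. The sketch below is not the SHAP library and not the RF model from this review: the risk function, its coefficients, the patient, and the single reference patient used to stand in for “feature absent” are all hypothetical. Practical SHAP implementations average over a background data set and use faster approximations.

```python
from itertools import combinations
from math import factorial


def toy_risk(patient):
    """Hypothetical risk model: additive terms plus one interaction."""
    risk = 0.05 + 0.004 * (patient["age"] - 60) + 0.0002 * patient["blood_loss_ml"]
    if patient["diabetes"] and patient["blood_loss_ml"] > 500:
        risk += 0.10
    return risk


def shapley_values(predict, patient, reference):
    """Exact Shapley values by enumerating every coalition of features.

    A feature outside the coalition takes its value from `reference`.
    """
    names = list(patient)
    n = len(names)

    def value(coalition):
        mixed = {k: patient[k] if k in coalition else reference[k] for k in names}
        return predict(mixed)

    contributions = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without = set(subset)
                total += weight * (value(without | {name}) - value(without))
        contributions[name] = total
    return contributions


patient = {"age": 75, "blood_loss_ml": 800, "diabetes": 1}
reference = {"age": 60, "blood_loss_ml": 200, "diabetes": 0}

phi = shapley_values(toy_risk, patient, reference)
explained_gap = toy_risk(patient) - toy_risk(reference)
```

The contributions sum to the difference between this patient’s predicted risk and the reference prediction, and the interaction between diabetes and blood loss is shared equally between those two features.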

A complementary approach is Local Interpretable Model-agnostic Explanations (LIME). LIME works by locally perturbing features in a data set and observing how predictions change. This allows users to approximate how a “black box” model is behaving in a specific case, effectively providing a simplified, more interpretable local model. In the clinical context, LIME can illustrate how small changes in patient characteristics (eg, comorbidity profile or operative time) influence outcome predictions, thereby clarifying the decision-making logic of otherwise opaque algorithms.²³
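The perturb-and-observe idea behind LIME can be sketched with NumPy as a proximity-weighted linear fit around one patient. This is a simplified illustration rather than the LIME library itself: the “black box” function, the patient values, the perturbation scales, and the Gaussian proximity kernel are all assumptions.

```python
import numpy as np


def black_box(X):
    """Hypothetical opaque model; columns are age (years) and operative time (min)."""
    logit = -5.0 + 0.04 * X[:, 0] + 0.01 * X[:, 1]
    return 1.0 / (1.0 + np.exp(-logit))


def local_surrogate(predict, x, scale, n_samples=2000, kernel_width=1.0, seed=0):
    """Fit a weighted linear model to predictions on perturbed copies of x.

    Returns the intercept and one coefficient per feature, expressed per
    `scale` units of change in that feature.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n_samples, x.size))       # perturbations in units of scale
    preds = predict(x + z * scale)
    # Perturbed samples closer to the patient receive larger weights.
    weights = np.exp(-np.sum(z ** 2, axis=1) / kernel_width ** 2)
    design = np.column_stack([np.ones(n_samples), z])
    root_w = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], preds * root_w, rcond=None)
    return coef[0], coef[1:]


patient = np.array([70.0, 180.0])                  # age 70, 180-minute operation
scale = np.array([5.0, 30.0])                      # 5 years, 30 minutes
intercept, effects = local_surrogate(black_box, patient, scale)
patient_risk = float(black_box(patient[None, :])[0])
```

The fitted coefficients approximate how much the predicted risk changes locally for a 5-year increase in age or a 30-minute increase in operative time, which is the kind of case-level explanation described above.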

AI Applications in the Surgical Workflow

The following discussion of surgical AI applications is organized according to surgeons’ routine clinical workflows and responsibilities: (1) nonoperative clinical decision-making, (2) administrative tasks, including documentation, and (3) intraoperative patient care.

I. AI for Clinical Decision Support

In surgical data analysis, an important question is whether DL truly surpasses traditional ML approaches on structured clinical data. Tree-based models such as decision trees, and tree ensembles such as random forests (RFs) and gradient boosting machines (GBMs; eg, XGBoost, LightGBM, and CatBoost), remain highly effective for tabular data sets. Recent comparisons across medical diagnosis data sets show GBMs often rank highest in accuracy, ahead of both older techniques and state-of-the-art tabular DL.²⁴ Strengths include robustness to heterogeneous inputs, lower computational cost, and easier optimization, whereas NNs struggle with sparse categorical variables and weak feature correlations.

DL is nonetheless competitive when adapted carefully. Bonde et al²⁵ (2024) developed a multilabel DL-NN for surgical outcome prediction, utilizing entity embeddings for high-dimensional categorical variables such as Current Procedural Terminology (CPT) codes. By mapping similar procedures close together in vector space, the model captures latent clinical relationships ignored by one-hot encoding. CPT embedding enabled their DL-NN to outperform both the American College of Surgeons (ACS) National Surgical Quality Improvement Program (ACS-NSQIP) Surgical Risk Calculator (NSQIP-SRC) and RF baselines across multiple prediction tasks.
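The contrast between one-hot encoding and entity embeddings can be sketched with NumPy. The three procedure codes and the 2-dimensional embedding values below are chosen by hand purely for illustration; in a trained network, such as the one described above, the embedding table is learned from data and is typically much larger.

```python
import numpy as np

# Illustrative codes (assumed for this example): two laparoscopic
# cholecystectomy variants and a total knee arthroplasty.
codes = ["47562", "47563", "27447"]
index = {code: i for i, code in enumerate(codes)}


def one_hot(code):
    """Sparse encoding: every code is equally distant from every other code."""
    vec = np.zeros(len(codes))
    vec[index[code]] = 1.0
    return vec


# Hand-set embedding table, one row per code. These numbers are NOT learned
# values; they only mimic what training might produce.
embedding = np.array([
    [0.9, 0.1],    # 47562
    [0.8, 0.2],    # 47563
    [-0.7, 0.6],   # 27447
])


def embed(code):
    """Dense encoding: a row lookup in the embedding table."""
    return embedding[index[code]]


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


one_hot_similarity = cosine(one_hot("47562"), one_hot("47563"))
related_similarity = cosine(embed("47562"), embed("47563"))
unrelated_similarity = cosine(embed("47562"), embed("27447"))
# An embedding lookup is equivalent to multiplying a one-hot vector by the table.
lookup_matches = bool(np.allclose(one_hot("47563") @ embedding, embed("47563")))
```

With one-hot vectors, the two cholecystectomy codes have zero similarity, exactly like any other pair of codes; with embeddings, related procedures can sit close together while unrelated ones sit far apart.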

Data volume is critical to DL model performance.²⁵ DL-NNs often need very large data sets to generalize. This is challenging in surgery; individual hospitals may only have hundreds or thousands of cases for a given outcome.²⁵ Tree-based models handle smaller data sets better via built-in regularization. Large-scale efforts such as ACS-NSQIP provide detailed data across many procedures, enabling NN training. Bonde et al²⁵ used pooled general surgery data to train a broad model, then applied transfer learning to a low-volume subset, improving prediction for pancreaticoduodenal surgery compared with training from scratch. This approach illustrates how pooled surgical data sets can be adapted for specialized tasks. With expanding national registries like ACS-NSQIP and the ACS Trauma Quality Improvement Program (ACS-TQIP), DL limitations are being overcome, making DL-NNs increasingly viable options for surgical risk prediction.

Our accompanying article on machine learning for predicting complications and outcomes compares specific ML models for prognostic prediction tasks in further detail.²⁰

II. AI for Administrative Tasks

Most surgical ML research leverages clinical data to predict outcomes such as complications, morbidity, or mortality. These data sets are typically tabular (rows: patients/encounters; columns: variables).²⁴ Text, audio, video, and image data are gaining relevance in surgical ML.⁷

Text-Based ML in Surgery

Text-based surgical ML primarily includes natural language processing (NLP) models.²⁶ NLP excels at tasks such as summarizing large volumes of text, potentially aiding clinical documentation, particularly for complex patients with extensive existing documentation.²⁶ These models can also adapt tone, audience, or linguistic complexity (style transfer), simplifying patient communications.²⁷ Beyond summarization, clinical applications for text data in surgery appear somewhat limited. While research has shown reasonable accuracy in diagnosis tasks, diagnosis based on existing clinical documentation including progress notes and radiology or procedure reports is not a practice-changing function; physicians still fundamentally perform most of the diagnostic work.²⁶ However, in surgical research, the ability to summarize and glean meaning can be transformative. NLP-based chart review can ease data collection burdens in retrospective studies. Prior literature has found that both LLMs and non-DL NLP models extract target data from medical records with accuracy comparable to that of human evaluators.^28,29 Importantly, NLP-assisted chart review still allows for human-in-the-loop (HITL) approaches for questionable classifications, and commercial LLMs (eg, Ollama models; ollama.org LLC, Fort Lauderdale, FL) can run on protected servers to ensure protected health information (PHI) compliance.^28,29

Audio-Based ML in Surgery

In surgical applications, audio-based ML largely consists of voice recognition and transcription technologies. Scribing and documentation software often utilizes audio-based ML to assist clinical documentation. Older systems like Dragon® Medical (Nuance Communications, Inc, Burlington, MA) use ML for simple speech-to-text transcription.³⁰ Newer “ambient AI” systems listen to clinical interactions and read clinical documentation, using NLP models to draft clinical documentation in real time.³⁰ Examples include DAX™ Copilot (Nuance Communications, Inc, Burlington, MA) and models implemented by The Permanente Medical Group (TPMG; Oakland, CA).¹⁰ While these tools can reduce documentation burden, they carry risks: beyond transcription errors, models may introduce fabricated details (“hallucinations”), even documenting procedures that were never performed.¹⁰ In an example by Mess et al¹⁰ (2025), an ambient AI wrote “Capsules removed during surgery will be sent to pathology for examination,” in an email to a patient, though no capsulectomy was performed during implant removal. Beyond documentation, specific surgical implementations of audio-based ML have included detection of intraoperative phase and events based on audio signature (eg, table movement or electrocautery tones), and verification of preoperative time-out information.^31,32

III. Intraoperative Applications of AI

Image- and Video-Based ML in Surgery

Computer vision (CV) is the AI discipline focusing on detection and classification of image and video data.³³ Operationally, CV takes the form of DL techniques such as convolutional neural networks (CNNs) and vision transformers.^33-35 CV is of particular interest in minimally invasive surgery (MIS), where existing video-based operative techniques make CV integration a natural adjunct.^33-35 Initial CV applications included operative phase recognition, such as recognizing the critical view of safety (CVS) in cholecystectomy, but recent efforts have sought real-time integration, such as detailed on-screen overlays (Figure 4) that assess operative difficulty and technique, aiding communication with surgical trainees and operative team members.^33,36 CV is also a critical element of research into autonomous surgical robotics, such as the Hierarchical Surgical Robot Transformer (SRT-H) model from Kim et al³⁷ (2025), which autonomously performed certain steps of a simulated robotic-assisted laparoscopic cholecystectomy.

Figure 4.

Examples of Potential Computer Vision (CV) Based Overlays, as They Might be Integrated Into Laparoscopic Surgical Visualization Systems in the Future, Using Robotic-Assisted Laparoscopic Cholecystectomy in This Example. Potential Applications Include (A) Rating of Expected Procedure Difficulty Based on Local Anatomy, in This Case Rated “Resident” Difficulty; (B) Issuing Warnings for Dissection Into Potentially Unsafe Areas; (C) a Heatmap-Style Overlay of Safe Area to Begin Dissection; (D) a Color-Coded CV-Based Identification of Elements of the Critical View of Safety, With On-Screen Checklist; (E) On-Screen Reminders for Proper Clip Application, as Might be Used in a Teaching Context; (F) an On-Screen Indicator of Procedure Steps, Which May Help Assist Surgical Technologists, Anesthesia, Circulator Nursing, etc. Image Reproduced Under Creative Commons 4.0 CC-BY Open License From Mascagni et al (2022), Figure 2.³³

Emerging Surgical AI Models

Much of the recent intraoperative ML literature focuses on laparoscopic cholecystectomy (LC). As one of the most common surgeries in the United States, LC is particularly well-suited to ML given its standardized workflow, relatively straightforward anatomy, and clear procedural endpoints: achieving the critical view of safety (CVS), clipping, and dividing the cystic duct and artery. Robotic-assisted LC (r-LC) further expands available data sets and provides a platform for ML integration. Per Mascagni et al,³³ laparoscopic procedures are natural choices for CV models given their consistent steps and visual structure, making LC a recurring testbed for surgical AI.

Autonomous Robotic Cholecystectomy

In July 2025, the Hierarchical Surgical Robot Transformer (SRT-H) model by Kim et al (2025) made news headlines such as “Experimental surgical robot performs gallbladder procedure autonomously.”^37,38 While such headlines were somewhat overstated, SRT-H successfully executed key steps of r-LC.^37,38 SRT-H controls a da Vinci Research Kit Si® (dVRK) surgical robot (Intuitive Surgical, Inc, Sunnyvale, CA), which differs from a standard clinical da Vinci Si® by the addition of two wrist cameras, providing additional visual data on instrument-tissue interaction.³⁷ SRT-H consists of two hierarchical DL models; the high-level NLP model uses image data to issue natural language instructions, while the low-level model combines instructions and image data to plan and execute dVRK instrument trajectories. SRT-H was trained via imitation learning on paired video and text data, analogous to surgical trainees learning from prior cases.

In testing, SRT-H completed grasping, clipping, and dividing the cystic duct and artery in eight ex-vivo porcine specimens (10% excluded for variant anatomy) with 100% accuracy.³⁷ Average task time was 5 minutes 17 seconds, which was slower than expert-operated dVRK, but with smoother and more efficient instrument paths. Notably, SRT-H did not perform dissection or skeletonization of the hepatocystic triangle, dissection of the gallbladder off the cystic plate, or specimen removal.

SRT-H demonstrates an effective pathway toward task-specific automation of operative tasks in surgery.³⁷ However, SRT-H has not addressed common intraoperative challenges such as bleeding, adhesiolysis, obscured visualization, or variant anatomy. These limitations underscore the considerable gap between controlled ex-vivo tasks and safe in-vivo autonomous surgery; overcoming such challenges is a prerequisite for the futuristic concept of AI-driven autonomous surgery.

Safe Zones for Laparoscopic Dissection

Protserov et al³⁴ (2024) sought to create and deploy a DL semantic segmentation model to identify safe (“go”) and unsafe (“no-go”) zones for dissection in LC video feeds. Semantic segmentation assigns each pixel to an anatomical class (eg, liver, gallbladder, cystic duct), here reframed as safe (go zone), unsafe (no-go zone), or background (neither).³⁹ The Protserov et al³⁴ models label pixels with a probability of belonging to each of three classes; the highest-probability class becomes the prediction for the pixel.
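The per-pixel labeling rule described above (choose the class with the highest probability) can be sketched in a few lines of NumPy. The 2 × 2 “image” and its raw class scores are made up for illustration, and the softmax step is an assumption about how scores are converted to probabilities; this is not the authors’ code.

```python
import numpy as np

CLASSES = ("background", "go", "no-go")


def softmax(logits):
    """Convert raw class scores to probabilities along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def label_pixels(logits):
    """Return per-pixel class probabilities and the index of the winning class."""
    probs = softmax(logits)
    return probs, probs.argmax(axis=-1)


# Made-up raw scores with shape (height, width, classes).
logits = np.array([
    [[2.0, 0.5, 0.1], [0.2, 3.0, 0.4]],
    [[0.1, 0.3, 2.5], [1.0, 1.2, 0.9]],
])

probs, mask = label_pixels(logits)
labels = [[CLASSES[k] for k in row] for row in mask]
```

Note that the bottom-right pixel is labeled “go” even though its three probabilities are nearly equal, which illustrates why a plain highest-probability rule can hide low-confidence predictions.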

The authors previously created “GoNoGoNet” for similar LC segmentation tasks, which was validated by an external panel of experts, and showed potential efficacy in avoiding bile duct injury.^40-42 However, the new architectures developed in Protserov et al³⁴ (2024) offer several notable advancements. Their overall focus was to create LC segmentation models that were useful in broader contexts, including real-time use, resource-limited or remote settings, and on more diverse or underpowered hardware. The authors compared two segmentation architectures: U-Net, a biomedical CNN-based image segmentation architecture, and SegFormer, a transformer-based segmentation architecture.^43,44 Both models included image processing techniques to facilitate prediction on wider arrays of video formats, for example, different laparoscope models, which vary by aspect ratio or other image features. Data processing optimizations, including image downscaling and bandwidth throttling, ensured that accurate, low-latency, real-time image overlays could be obtained when communicating with the models hosted on university cloud servers, even with limited internet speeds.

The models utilized two separate data sets. Training and internal validation occurred on 289 open-source LC videos from 37 countries. External validation included 25 expert-annotated cases from the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) Safe Cholecystectomy Task Force. In external validation, U-Net achieved PPV of 82% and 92% for predicting go and no-go zones, respectively, and incorrectly labeled dangerous zones as go zones in 4% of pixels. SegFormer achieved PPV of 75% (go) and 92% (no-go), with 1% mislabeling of unsafe zones. Their flow-control bandwidth calibration allowed both models to achieve >60 frames per second (fps) and <100 millisecond image latency, via a 32 megabits per second (Mbps) internet connection, without image downscaling (United States median home broadband speeds were 285 Mbps and 48 Mbps for download and upload, respectively, as of July 2025).⁴⁵ For internet speeds as slow as 2 Mbps, as might be encountered in remote or resource-limited settings, image downscaling still allowed frame rates >60 fps and latency <150 milliseconds for both models, while only reducing various accuracy metrics by approximately 2-5%. Thus, both U-Net and SegFormer achieved usable predictive accuracy and sufficient speeds for real-time image overlay, even in places with slow internet access.

Both U-Net and SegFormer, as well as the PSPNet architecture from the authors’ prior GoNoGoNet model, are available online.^41,46 The models allow real-time synchronous image overlay, streamed from laptops, laparoscopic video towers, or other internet-connected devices, as well as upload and prediction from pre-recorded operative video. The models also integrate a go zone threshold slider, allowing users to tune the minimum threshold to display go zones. Online, open-access model hosting facilitates validation studies in more diverse clinical populations.
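One plausible way a go zone threshold could work is sketched below: the overlay is drawn only where the predicted go-zone probability meets a user-selected minimum. This is an assumption about the general mechanism, with made-up probabilities, and is not a description of the authors’ implementation.

```python
import numpy as np


def go_zone_overlay(p_go, threshold):
    """Boolean mask of pixels whose go-zone probability meets the threshold."""
    return p_go >= threshold


# Made-up go-zone probabilities for a 2 x 3 patch of pixels.
p_go = np.array([
    [0.95, 0.70, 0.40],
    [0.85, 0.55, 0.10],
])

lenient = go_zone_overlay(p_go, threshold=0.5)  # more pixels displayed as go zone
strict = go_zone_overlay(p_go, threshold=0.8)   # only high-confidence pixels displayed
```

Raising the threshold shrinks the displayed go zone to the pixels the model is most confident about, trading coverage for a lower chance of marking an unsafe area as safe.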

Intraoperative Diagnosis

Chen et al⁴⁷ (2025) created a semantic segmentation model called AI laparoscopic exploration system (AiLES) to intraoperatively detect intra-abdominal metastases (IAMs) during diagnostic staging laparoscopy (DSL) for gastric cancer. The authors focused on small, occult IAM detection in a cohort of 100 gastric cancer patients. An expert surgeon gold standard was used, and AiLES was compared to both novice surgeons and five generalized image segmentation models.

AiLES was the most discriminative (Dice 0.76) and fastest-predicting (11 fps) ML model and was non-inferior to novice surgeon detection across IAM types, though instances of novice-missed but AiLES-detected IAMs were reported. AiLES achieved excellent detection (Dice ≥0.80) for uterine (Dice 0.93), mesenteric (Dice 0.80), single peritoneal (Dice 0.90), and “tiny” (≤5 mm; Dice 0.87) IAMs.
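For readers unfamiliar with the Dice coefficient, it measures the overlap between a predicted region A and a reference region B as 2|A ∩ B| / (|A| + |B|), ranging from 0 (no overlap) to 1 (perfect overlap). The sketch below computes it for binary masks; the function name, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions of this illustration and are not taken from the AiLES code.

```python
def dice_coefficient(predicted, reference):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two binary masks
    given as equally sized nested lists of 0/1 (or False/True)."""
    pred = [bool(v) for row in predicted for v in row]
    ref = [bool(v) for row in reference for v in row]
    intersection = sum(p and r for p, r in zip(pred, ref))
    total = sum(pred) + sum(ref)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * intersection / total


# Toy masks (not data from the study): 1 marks a lesion pixel.
predicted = [
    [1, 1, 0],
    [0, 1, 0],
]
reference = [
    [1, 0, 0],
    [0, 1, 1],
]

score = dice_coefficient(predicted, reference)
print(f"Dice = {score:.2f}")
```

Here two of the three predicted lesion pixels overlap the three reference pixels, giving a Dice of 4/6, or about 0.67, which would fall below the ≥0.80 level described above as excellent.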

In gastric cancer, DSL is a definitive diagnostic step for peritoneal carcinomatosis, which portends treatment failure and accounts for up to 60% of gastric cancer mortality.^48,49 DSL is prone to missing small, solitary, and peritoneal lesions, leading to under-staging and inappropriate treatment; AiLES excelled in detection of such lesions, and was non-inferior to surgical trainees.^47-49 Current performance may have surgical education applications, and modest performance improvements may justify evaluation as an intraoperative diagnostic adjunct. The AiLES architecture could also be suitable for training on surgical anatomy relevant to other procedures and tumor types. Image prediction was slower than that reported by Protserov et al, but the authors noted that 11 fps aligns with most surgical image segmentation models; the supplementary video qualitatively exhibits sufficient responsiveness for real-time intraoperative use.^34,50 The authors’ model code is available by request.

Other Surgical AI Frontiers

The above examples highlight two objectives of surgical AI development: CV-driven safety tools (Protserov et al and Chen et al) and more “agentic” models capable of real-world action (SRT-H).^34,37,47 Other CV applications include ML-based laparoscopic video-feed smoke removal, robotic instrument recognition, surgical phase, task, and workflow recognition, and segmentation of operative anatomy.^39,51-55 Agentic models include the STAR model by Saeidi et al (2022), which demonstrated autonomous robotic hand-sewn porcine intestinal anastomosis, while the micro-STAR model by Haworth et al (2024) similarly performed ex-vivo vascular anastomosis.^56,57 Gruijthuijsen et al (2022) autonomously targeted a rigid endoscope in a lab setting, while Ma et al (2019) demonstrated an autonomously tracked flexible laparoscope on the dVRK system.^58,59 These AI principles have also been extended to surgical education and trainee assessment; for example, Kawaharazuka et al⁶⁰ (2024) reported autonomous performance on the peg transfer task of the Fundamentals of Laparoscopic Surgery® course (SAGES, Los Angeles, CA; ACS, Chicago, IL). AI in contemporary surgical education is reviewed in detail in an accompanying article as part of this symposium.⁶¹ Together, these advances suggest a trajectory toward increasingly autonomous performance, though clinical translation is still evolving.

Ethical Considerations for Implementing Surgical AI

The clinical benefits offered by surgical AI are accompanied by novel ethical risks, which warrant additional considerations and protections. AI can “learn” the human sociocultural biases to which models are exposed during training, contributing to disparate or inequitable health outcomes.⁶²

AI use in surgical research also raises questions regarding authorship. Likewise, unique methodological aspects of ML research require new transparency standards for reporting results; to this end, ML-specific reporting guidelines have been promulgated. The use of AI as a clinical adjunct has led to the codification of new regulatory frameworks, which may regulate Software as a Medical Device (SaMD).⁶³ Finally, use of surgical AI may carry novel medicolegal implications, particularly as AI models gain agentic functionalities. Ethical aspects of surgical AI are addressed in detail in our upcoming work on evolving ethical challenges of AI in surgical research.

Conclusion

Although most agentic surgical AI systems remain confined to the laboratory, progress is rapid. Surgical machine learning has expanded far beyond prognostic modeling, with deep learning now enabling real-time intraoperative tools that enhance safety, efficiency, and training, particularly in laparoscopic and robotic-assisted surgery. Early work in laparoscopic cholecystectomy, a standardized and well-defined operation, illustrates how common procedures can serve as testbeds for broader applications. Many emerging models rely on convergent architectures, suggesting that once validated, proven designs could be rapidly adapted across diverse operations. These advances highlight that surgical ML is not only a computer science challenge but also a clinical one, and as capabilities move closer to real-world integration, a working knowledge of surgical AI is becoming essential to modern surgical practice.

Footnotes

ORCID iDs

David Limon

Niruktha Raghavan

Aashish Rajesh

Author Contributions

DL, VS: conceptualization, draft of preliminary manuscript, revision, approval of final version. NR, PN: conceptualization, critical review of manuscript with revision for incorporating intellectual content, approval of final version. AR: senior author, conceptualization, critical review of manuscript with revision for incorporating intellectual content, approval of final version.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Russell, Norvig. Artificial Intelligence: A Modern Approach. 3rd ed. Pearson, 2016.

2. Bubeck, Chandrasekaran, Eldan, et al. Sparks of artificial general intelligence: early experiments with GPT-4. arXiv. 2023. doi:10.48550/ARXIV.2303.12712

3. Colom, Karama, Jung, Haier. Human intelligence and brain networks. Dialogues Clin Neurosci. 2010;12(4):489-501. doi:10.31887/DCNS.2010.12.4/rcolom

4. Murphy. Chapter 1. What is machine learning? In: Li, ed. Machine Learning in Radiation Oncology: Theory and Applications. Springer International Publishing AG, 2015.

5. Bates, Auerbach, Schulam, Wright, Saria. Reporting and implementing interventions involving machine learning and artificial intelligence. Ann Intern Med. 2020;172(11_Supplement):S137-S144. doi:10.7326/M19-0872

6. Yuste. From the neuron doctrine to neural networks. Nat Rev Neurosci. 2015;16(8):487-497. doi:10.1038/nrn3962

7. Morris, Rajesh, Asaad, Hassan, Saadoun, Butler. Deep learning applications in surgery: current uses and future directions. Am Surg. 2023;89(1):36-42. doi:10.1177/00031348221101490

8. Hashimoto, Rosman, Rus, Meireles. Artificial intelligence in surgery: promises and perils. Ann Surg. 2018;268(1):70-76. doi:10.1097/SLA.0000000000002693

9. Shahzad, Mazhar, Tariq, Ahmad, Ouahada, Hamam. A comprehensive review of large language models: issues and solutions in learning environments. Discov Sustain. 2025;6(1):27. doi:10.1007/s43621-025-00815-8

10. Mess, Mackey, Yarowsky. Artificial intelligence scribe and large language model technology in healthcare documentation: advantages, limitations, and recommendations. Plast Reconstr Surg Glob Open. 2025;13(1):e6450. doi:10.1097/GOX.0000000000006450

11. Tierney, Gayre, Hoberman, et al. Ambient artificial intelligence scribes: learnings after 1 year and over 2.5 million uses. NEJM Catalyst. 2025;6(5). doi:10.1056/CAT.25.0040

12. Pudjihartono, Fadason, Kempa-Liehr, O’Sullivan. A review of feature selection methods for machine learning-based disease risk prediction. Front Bioinform. 2022;2:927312. doi:10.3389/fbinf.2022.927312

13. Pfob, Sidey-Gibbons. Machine learning in medicine: a practical introduction to techniques for data pre-processing, hyperparameter tuning, and model comparison. BMC Med Res Methodol. 2022;22(1):282. doi:10.1186/s12874-022-01758-8

14. Lubicz, Pawelczyk, Rzechonek, Kolodziej. Thoracic surgery data. Published online 2014. doi:10.24432/C5Z60N

15. Luu, Borisenko, Przekop, Patil, Forrester, Choi. Practical guide to building machine learning-based clinical prediction models using imbalanced datasets. Trauma Surg Acute Care Open. 2024;9(1):e001222. doi:10.1136/tsaco-2023-001222

16. Collins, Reitsma, Altman, Moons KGM. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. Br J Surg. 2015;102(3):148-158. doi:10.1002/bjs.9736

17. Huang, Macheret, Gabriel, Ohno-Machado. A tutorial on calibration measurements and calibration models for clinical prediction models. J Am Med Inf Assoc. 2020;27(4):621-633. doi:10.1093/jamia/ocz228

18. Arshi, Cowley, Rijnhart, Reeve, Smits, Wynants. External validation, impact assessment and clinical utilization of clinical prediction models: a prospective cohort study. J Clin Epidemiol. 2025;186:111902. doi:10.1016/j.jclinepi.2025.111902

19. Siontis GCM, Tzoulaki, Castaldi, Ioannidis JPA. External validation of new risk prediction models is infrequent and reveals worse prognostic discrimination. J Clin Epidemiol. 2015;68(1):25-34. doi:10.1016/j.jclinepi.2014.09.007

20. Limon, Satish, Raghavan, Morris, Muir, Rajesh. Artificial intelligence in surgery revisited: a 2025 update on machine learning for predicting complications and outcomes. Am Surg. 2025;29:00031348251393934. doi:10.1177/00031348251393934

21. Ennab, Mcheick. Enhancing interpretability and accuracy of AI models in healthcare: a comprehensive review on challenges and future directions. Front Robot AI. 2024;11:1444763. doi:10.3389/frobt.2024.1444763

22. Gunning, Aha. DARPA’s explainable artificial intelligence program. AI Mag. 2019;40(2):44-58. doi:10.1609/aimag.v40i2.2850

23. Fuhrman, Gorre, El Naqa, Giger. A review of explainable and interpretable AI with applications in COVID‐19 imaging. Med Phys. 2022;49(1):1-14. doi:10.1002/mp.15359

24. Yıldız, Kalayci. Gradient boosting decision trees on medical diagnosis over tabular data. arXiv. 2024. doi:10.48550/arXiv.2410.03705

25. Bonde, Kaafarani, Millarch, Sillesen. Assessing the value of deep neural networks for postoperative complication prediction in pancreaticoduodenectomy patients. PLoS One. 2024;19(12):e0316402. doi:10.1371/journal.pone.0316402

26. KDR, Tay SBP, Choy, Verjans, Sasanelli, Kong JCH. Applications of natural language processing tools in the surgical journey. Front Surg. 2024;11:1403540. doi:10.3389/fsurg.2024.1403540

27. Wang, Clark, McKelvey, et al. Science out of its ivory tower: improving accessibility with reinforcement learning. arXiv. 2025. doi:10.48550/arXiv.2410.17088

28. Dencker, Bonde, Troelsen, Sillesen. Assessing the utility of natural language processing for detecting postoperative complications from free medical text. BJS Open. 2024;8(2):zrae020. doi:10.1093/bjsopen/zrae020

29. Delk, Lai. A comparison of large language model versus manual chart review for extraction of data elements from the electronic health record. medRxiv. 2023;1:4924. doi:10.1101/2023.08.31.23294924

30. JJW, Wang, Zhou, et al. Evaluating the performance of artificial intelligence-based speech recognition for clinical documentation: a systematic review. BMC Med Inf Decis Making. 2025;25(1):236. doi:10.1186/s12911-025-03061-0

31. Fuchtmann, Riedel, Berlet, et al. Audio-based event detection in the operating room. Int J CARS. 2024;19(12):2381-2387. doi:10.1007/s11548-024-03211-1

32. Yoo, Kim, et al. Deep learning-based smart speaker to confirm surgical sites for cataract surgeries: a pilot study. PLoS One. 2020;15(4):e0231322. doi:10.1371/journal.pone.0231322

33. Mascagni, Alapatt, Sestini, et al. Computer vision in surgery: from potential to clinical value. Npj Digit Med. 2022;5(1):163. doi:10.1038/s41746-022-00707-5

34. Protserov, Hunter, Zhang, et al. Development, deployment and scaling of operating room-ready artificial intelligence for real-time surgical decision support. Npj Digit Med. 2024;7(1):231. doi:10.1038/s41746-024-01225-2

35. Schmidt, Mohareri, DiMaio, Yip, Salcudean. Tracking and mapping in medical computer vision: a review. Med Image Anal. 2024;94:103131. doi:10.1016/j.media.2024.103131

36. Mascagni, Alapatt, Laracca, et al. Multicentric validation of EndoDigest: a computer vision platform for video documentation of the critical view of safety in laparoscopic cholecystectomy. Surg Endosc. 2022;36(11):8379-8386. doi:10.1007/s00464-022-09112-1

37. Kim, Chen, Hansen, et al. SRT-H: a hierarchical framework for autonomous surgery via language conditioned imitation learning. arXiv. 2025. doi:10.48550/arXiv.2505.10251

38. Choudhury. Experimental surgical robot performs gallbladder procedure autonomously. Reuters. https://www.reuters.com/business/healthcare-pharmaceuticals/experimental-surgical-robot-performs-gallbladder-procedure-autonomously-2025-07-09/. Accessed September 2, 2025.

39. Gon Park, Park, Rock Choi, et al. Deep learning model for real-time semantic segmentation during intraoperative robotic prostatectomy. Eur Urol Open Sci. 2024;62:47-53. doi:10.1016/j.euros.2024.02.005

40. Laplante, Namazi, Kiani, et al. Validation of an artificial intelligence platform for the guidance of safe laparoscopic cholecystectomy. Surg Endosc. 2023;37(3):2260-2268. doi:10.1007/s00464-022-09439-9

41. Madani, Namazi, Altieri, et al. Artificial intelligence for intraoperative guidance: using semantic segmentation to identify surgical anatomy during laparoscopic cholecystectomy. Ann Surg. 2022;276(2):363-369. doi:10.1097/SLA.0000000000004594

42. Khalid, Laplante, Masino, et al. Use of artificial intelligence for decision-support to avoid high-risk behaviors during laparoscopic cholecystectomy. Surg Endosc. 2023;37(12):9467-9475. doi:10.1007/s00464-023-10403-4

43. Ronneberger, Fischer, Brox. U-Net: Convolutional networks for biomedical image segmentation. In: Navab, Hornegger, Wells, Frangi, eds. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Vol 9351. Lecture Notes in Computer Science. Springer International Publishing, 2015:234-241. doi:10.1007/978-3-319-24574-4_28

44. Xie, Wang, Anandkumar, Alvarez, Luo. SegFormer: simple and efficient design for semantic segmentation with transformers. arXiv. 2021. doi:10.48550/ARXIV.2105.15203

45. United States Median Country Speeds Updated July 2025. Speedtest Global Index. 2025. https://www.speedtest.net/global-index/united-states. Accessed September 8, 2025.

46. LIVE surgical AI demo. https://surg-ai.uhndata.io/. Accessed September 4, 2025.

47. Chen, Gou, Fang, et al. Artificial intelligence assisted real-time recognition of intra-abdominal metastasis during laparoscopic gastric cancer surgery. Npj Digit Med. 2025;8(1):9. doi:10.1038/s41746-024-01372-6

48. Ajani, D’Amico, Bentrem, et al. Gastric cancer, version 2.2025, NCCN clinical practice guidelines in oncology. J Natl Compr Cancer Netw. 2025;23(5):169-191. doi:10.6004/jnccn.2025.0022

49. Cao, et al. Prediction of peritoneal cancer index and prognosis in peritoneal metastasis of gastric cancer using NLR-PLR-DDI score: a retrospective study. Cancer Manag Res. 2022;14:177-187. doi:10.2147/CMAR.S343467

50. Chen, Gomez, Kapadia. Improving the American college of surgeons NSQIP surgical risk calculator with machine learning. J Am Coll Surg. 2023;237(2):385-386. doi:10.1097/XCS.0000000000000676

51. Wang, Sun. Surgical smoke removal via residual swin transformer network. Int J CARS. 2023;18(8):1417-1427. doi:10.1007/s11548-023-02835-z

52. De Backer, Van Praet, Simoens, et al. Improving augmented reality through deep learning: real-time instrument delineation in robotic renal surgery. Eur Urol. 2023;84(1):86-91. doi:10.1016/j.eururo.2023.02.024

53. Liu, Boels, Garcia-Peraza-Herrera, et al. LoViT: long video transformer for surgical phase recognition. Med Image Anal. 2025;99:103366. doi:10.1016/j.media.2024.103366

54. Kim, Rhu, Kim, Choi. Real-time segmentation of biliary structure in pure laparoscopic donor hepatectomy. Sci Rep. 2024;14(1):22508. doi:10.1038/s41598-024-73434-4

55. Pak, Park, et al. Application of deep learning for semantic segmentation in robotic prostatectomy: comparison of convolutional neural networks and visual transformers. Investig Clin Urol. 2024;65(6):551-558. doi:10.4111/icu.20240159

56. Saeidi, Opfermann, Kam, et al. Autonomous robotic laparoscopic surgery for intestinal anastomosis. Sci Robot. 2022;7(62):eabj2908. doi:10.1126/scirobotics.abj2908

57. Haworth, Biswas, Opfermann, et al. Autonomous robotic system with optical coherence tomography guidance for vascular anastomosis. arXiv. 2024. doi:10.48550/ARXIV.2410.07493

58. Gruijthuijsen, Garcia-Peraza-Herrera, Borghesan, et al. Robotic endoscope control via autonomous instrument tracking. Front Robot AI. 2022;9:832208. doi:10.3389/frobt.2022.832208

59. Song, Chiu. Autonomous flexible endoscope for minimally invasive surgery with enhanced safety. IEEE Rob Autom Lett. 2019;4(3):2607-2613. doi:10.1109/LRA.2019.2895273

60. Kawaharazuka, Okada, Inaba. Robotic constrained imitation learning for the peg transfer task in fundamentals of laparoscopic surgery. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE; 2024:606-612. doi:10.1109/ICRA57147.2024.10610059

61. Raghavan, Patel, Limon, Morris, Kempenich, Rajesh. Artificial intelligence in surgical education: a 2025 update on adaptive training, feedback, and competency-based education. Am Surg. 2025;31348251397597. doi:10.1177/00031348251397597

62. DeCamp, Lindvall. Latent bias and the implementation of artificial intelligence in medicine. J Am Med Inf Assoc. 2020;27(12):2020-2023. doi:10.1093/jamia/ocaa094

63. Artificial intelligence/machine learning (AI/ML)-Based software as a medical device (SaMD) action plan. 2021. https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-software-medical-device. Accessed October 20, 2025.

Artificial Intelligence in Surgery Revisited: A 2025 Guide to Understanding and Applying AI Models in Clinical Practice
