Sage Journals: Discover world-class research

Abstract

In the rapidly advancing landscape of surgical education, the traditional apprenticeship model is being increasingly complemented by individualized learning, competency-based assessment, and data-driven feedback. Work-hour restrictions, administrative burdens, and limited operative exposure have intensified the need for innovative solutions to supplement faculty-led training. Artificial intelligence (AI) has emerged as a promising adjunct, offering scalable platforms for technical skill acquisition, personalized feedback, and structured progress tracking. Early applications include AI-guided simulation, feedback, natural language processing for resident evaluation, and advanced applicant-screening systems, which hold the potential to streamline holistic review while reducing faculty workload. Despite these advances, significant challenges remain, including bias mitigation, ethical data governance, and the need for rigorous outcome-based validation. The greatest promise lies in hybrid models, where AI augments rather than replaces mentorship, freeing faculty for complex, context-dependent teaching. With careful implementation, AI is poised to meaningfully transform surgical education worldwide.

Keywords

artificial intelligence, machine learning, surgical education, entrustable professional activities, competency-based learning, feedback

Introduction

Surgical education is undergoing a paradigm shift away from sole reliance on the traditional apprenticeship model; while skills are still primarily acquired through observation, repetition, and incremental responsibility, there is now increased emphasis on individualized learning, efficiency, and measurable competency.¹ This shift has been accelerated by regulatory, technological, and cultural changes that challenge how residents acquire operative skills and prepare for independent practice. The Accreditation Council for Graduate Medical Education (ACGME) introduced duty hour restrictions in 2003, capping resident work at 80 h per week. While intended to reduce fatigue and improve patient safety, these reforms have raised concerns about reduced operative exposure and the readiness of trainees for unsupervised surgical practice.² Simultaneously, the increasing importance of electronic medical records and the emphasis on administrative documentation have further impeded opportunities for experiential learning in the operating room.³,⁴

To address these constraints, researchers have begun exploring artificial intelligence (AI) and immersive technologies as supplemental tools for technical surgical training, with promising initial results. Even with emerging evidence for AI as a teaching modality, thoughtful deliberation on integrating these new tools into existing curricula and leveraging AI to provide more equitable access to surgical education worldwide is imperative. AI may additionally aid residency leadership with resident selection, especially in the wake of growing applicant numbers and the desire for holistic review. Alongside the myriad opportunities that AI presents lie ethical and implementation challenges that must be addressed as widespread adoption is considered.

This review will explore the expanding role of AI in surgical education, emphasizing its ability to create structured, data-driven, and learner-centered pathways that complement clinical training and advance the goals of competency-based education (CBE). A tabulated summary of the evidence discussed is presented in Table 1.

Table 1.

Overview of AI in Surgical Education

First author (Year)	AI model	Summary	Key findings	Educational impact
Leveraging AI for simulation and skill acquisition
Stone (2025)⁶	CNN/DNN with XR integration	XR-guided placement of bulldog clamp on renal artery with real-time feedback	AI model identified the correct technique with 99.9% accuracy. All trainees achieved successful placement with high reported satisfaction	Demonstration of scalable and objective models for discrete procedural steps
Giglio (2025)⁷	Intelligent continuous expertise monitoring system	RCT comparing medical student performance of simulated brain tumor resections under intelligent tutor vs standardized expert feedback vs AI-informed personalized expert feedback	AI-informed personalized feedback group performed significantly better than the intelligent tutor alone group. Specific areas of improvement included bleeding and injury risk	Supports the use of AI models as a qualified co-instructor. Suggests best practice is AI as an adjunct alongside traditional methods
Fazlollahi (2023)⁸	“Virtual Operative Assistant” AI tutor	Follow-up cohort study of an AI-enhanced surgical skills curriculum in simulation training to detect performance changes vs control group	The AI-enhanced curriculum cohort had significant improvement in safety metrics, although had significantly lower dominant hand velocity/acceleration and speed of tumor removal compared to control	Signals a potential “hidden curriculum” within AI-assisted learning. Integration requires continued model validation and assessment
Gomez (2025)⁹	Explainable AI (XAI) video analysis and feedback generation	Prospective user study of medical students comparing XAI-generated feedback vs traditional video-based coaching	XAI group more closely mimicked expert practice, had improved cognitive load, and reported increased confidence	Validation of AI-generated feedback as a tool to improve learner engagement and cognitive load
Kim (2025)¹⁸	K-nearest neighbors model built on surgical simulation data to assess skills using time and distance-based calculations	AI was trained on 111 videos of four different procedures to generate predicted scores and compared to human-derived scores based on global rating scale competencies	When compared to human reviewer scores, the AI model achieved 42-100% accuracy for the 5-class (score of 1-5) evaluation and 68-100% for the 2-class (pass/fail) evaluation	Offers a low-resource AI model for assessment of procedural simulation skills across multiple procedure types
Ma (2024)²²	AI-based vision transformer for surgical activity	Pilot RCT evaluating effect of AI-feedback on technical skill using a da Vinci surgical robot	AI feedback group had significantly larger improvement in needle handling compared to control	Compelling support for AI-driven feedback model, particularly within robotics-associated surgical skill development
Chen (2025)²³	Four machine translation tools (DeepL, Google Gemini, Google Translate, Microsoft Copilot)	Selected phrases in an established critical care education program were translated into Chinese, Spanish, and Ukrainian and later compared with a professional translator and rated independently	All Chinese and Spanish versions received “understandable to good” to “high-quality” scores, while Ukrainian overall scored “hard to get the gist.” Ratings for blinded comparison to a human translator varied among languages	Highlights that machine translation tools can be used to enhance educational access internationally, but there is currently large variation among languages and models
Intraoperative decision support & assessment
Madani (2022)²⁴	Deep convolutional neural network	Model intraoperatively identified safe (Go) and dangerous (No-go) zones during laparoscopic cholecystectomy	Go zone IOU = 0.53, accuracy = 0.94; No-go zone IOU = 0.71, accuracy = 0.85; organ IOUs (liver = 0.86, gallbladder = 0.72, hepatocystic triangle = 0.65)	Demonstrates feasibility of intraoperative anatomical recognition and distinction between safe and unsafe areas, with potential to reduce adverse events such as bile duct injury
Checcucci (2023)²⁵	Multi-task learning CNN	Developed the bleeding artificial intelligence-based detector (BLAIR) to identify intraoperative bleeding during robot-assisted radical prostatectomy	Bleeding event recognition accuracy of 90.63%, and demonstrated faster response in pre-warning of bleeding events compared to human assessments	Strong potential for AI utility in quantifiable and objective identification of complication occurrence
Assessment & progression in CBE/EPAs with real-time feedback
Fazlollahi (2022)²⁷	ML algorithm integrated into “virtual operative assistant” (VOA) tutoring framework	RCT comparing VOA vs remote expert instruction for medical students performing a simulated brain tumor resection	VOA group had significantly improved practice and realistic expertise scores compared to control and the human instructor group	VOA can be a scalable tutor for basic simulation skill acquisition
Kumar (2025)³⁰	Adaptive learning through real-time haptic feedback	RCT of a cost-effective training platform that combines 3-D simulation with adaptive feedback over a wide variety of procedures	Treatment group demonstrated significant improvements compared to control groups, including 42% improved procedural accuracy and 38% reduction in training time	Offers an accessible VR/AI solution for broad deployment, especially in programs or areas with fewer resources
Lavanchy (2021)³¹	Multi-stage ML method using CNN for object identification and linear regression to predict surgical skills	Three-stage modeling approach to automate surgical skills assessment from laparoscopic cholecystectomy videos	Model achieved 87% accuracy in distinguishing good vs poor skill compared to expert ratings, and 70% accuracy in predicting within 1 score of a 1-5 Likert rating by experts	Supports objective, reproducible assessment for competency-based assessments which may assist in supplemental education and reduce faculty burden
Mirchi (2020)³²	“Virtual operative assistant” (VOA) iterative automated AI tutoring platform	Introduction of VOA among a cohort of attending neurosurgeons, resident physicians, and medical students who performed a VR-simulated brain tumor resection	Successful classification of skilled vs novice participants with an accuracy of 92%; provided an immediate visual representation of participants’ performance compared to benchmarks	Model-generated formative assessment data can be aligned with milestones and used to track resident progression over time
Khairnar (2025)³³	ML-based automated instrument tracking and performance assessment with supervised and unsupervised models	Assessment of novice and expert performance of Nissen fundoplication on porcine bowel	The most accurate model achieved 81.7% accuracy in performance classification when correlated with expert ratings	Useful for objective assessment of core skills, and can be integrated into large simulation curricula
Kiyasseh (2023)³⁴	Deep learning video-analysis model which provided performance-based feedback suggestions and explanations	Multi-institutional videos of single step of robot-assisted radical prostatectomy were used to train model and assess for feedback quality and compared to feedback provided by experts	Explainable AI model achieved near human scoring, although there was explanation bias expressed between novice vs experts; bias was mitigated when utilizing “training with explanations” (TWIX) strategy which used human explanations as supervision to teach AI systems	Validation of explainable AI for surgical skill assessment and interpretability in video-based training evaluation
Guo (2024)³⁶	GPT-3.5-turbo based chatbot peer-review platform	Developed peer feedback assistant tool that provided prompts and feedback on students’ comments based on multi-dimensional quality of feedback	Experimental group using AI-generated feedback had significantly improved feedback quality (F = 111.5, P < .001) and writing performance (F = 5.12, P = .025)	AI-assisted feedback improves both the quality of the feedback and learner outcomes, supporting the use of AI for timely, high-quality feedback dissemination from instructors to learners
Ayers (2023)³⁷	ChatGPT AI chatbot assistant	Comparison of chatbot-generated responses to human physician answers to patient questions on a public online forum; responses judged by independent raters for quality and empathy	AI-generated responses were preferable in 78.6% of cases, and were longer and more detailed than physician responses	Suggests AI language models can generate quality and empathetic communication, showing potential for AI-led use for basic tasks such as automated feedback
AI in surgical resident selection
Mahtani (2023)⁴⁰	NLP-based supervised model for narrative text analysis	Developed AI pipeline to predict interview invitation status using qualitative information from internal-medicine residency applications over 3 application cycles	The model effectively identified applicant characteristics associated with interview invitation (active leadership, research, social justice-related work), with an AUROC of 0.92 after the addition of structured data	Supports AI as a potential tool for holistic and equitable residency selection while reducing manual workload for human reviewers
Hassan (2025)⁴¹	LLM-based AI ranking system, trained through defined preference ranking	Compared AI-based application screening with traditional human review for general surgery residency applicants; the model incorporated weighted academic, research, and qualitative features	There was only a 7.3% total overlap in selected applicants between the AI model and human reviewers; AI selected significantly more white/Hispanic applicants, fewer applicants with signals, more honors society members, and those with more research publications	Shows that current models require direct supervision and rigorous quality control to ensure that selected applicant pools meet the desired standards of program leadership
Drum (2023)⁴²	ML model using supervised learning for thematic extraction from text snippets	Used internal-medicine pediatric applications to train model that identified specific values (eg, compassion, communication, professionalism)	After comparison with manual human review on a testing data set, the algorithm had moderate success in identifying the same values as determined by human reviewers (sensitivity = 0.64, specificity = 0.97)	Supports integration of AI-assisted narrative analysis in application review to allow for scalable and consistent identification while reducing reviewer burden
Varman (2025)⁴³	GPT-3.5-based scoring of text	Personal statements of general surgery applicants were scored by human assessors and AI using the same rubric	Human and AI assessment showed low agreement (κ = 0.184 for leadership domain, κ = 0.120 for pathway domain); AI gave lower leadership scores and higher pathway scores compared to humans	Showed that AI scored consistently, which may assist in early screening, although there was significant inconsistency with human review, suggesting AI is not currently a suitable replacement for applicant processing
Koleilat (2024)⁴⁴	NLP models (ChatGPT and BingAI) generated text based on 3 prompts	Reviewers attempted to identify whether personal statements were written by human or AI; also measured interview offer odds based on writer type	Human reviewers identified AI-generated statements with an accuracy of 55%, with kappa statistic for correct authorship of 0.19; odds ratio of offering an interview was 7 times higher for perceived human author compared to perceived AI author	Highlights the challenge in identifying AI vs human written personal statements, and suggests potential bias when believing a statement to be AI-generated
Ortiz (2022)⁴⁵	NLP with supervised machine learning	Prediction of match outcomes using either letters of recommendation (LORs) or demographic data among neurosurgery applicants; another model predicted standardized letter of recommendation (SLOR) rankings	The LOR and demographics models had similar discrimination of matching outcomes (AUCs = 0.75); SLOR model predicted rankings in the top 5%	Suggests narrative text in applications can be used for predictive value and may help programs streamline the selection process
Rees (2023)⁴⁶	Random forest supervised ML algorithm applied to structured data	Applicants to an IM residency program over 3 cycles – AI model differentiated ranked vs unranked applicants, and matriculants vs ranked non-matriculants	AUROC of 0.925 for predicting ranked vs unranked applicants, and AUROC of 0.597 for matriculants vs ranked non-matriculants	Suggests that ML-powered selection can help programs triage large applicant pools by predicting ranking likelihood, although there are substantial discrepancies in predicting matched applicants

AI, artificial intelligence; XR, extended reality; CNN, convolutional neural network; DNN, deep neural network; RCT, randomized controlled trial; IOU, intersection-over-union; ML, machine learning; GPT, generative pre-trained transformer; NLP, natural language processing; LLM, large language model; AUC, area under the curve; AUROC, area under the receiver operating characteristic curve; CBE, competency-based education; EPA, entrustable professional activities.
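
Several rows of Table 1 summarize model performance with agreement statistics such as Cohen's κ, sensitivity, and specificity. As a minimal illustrative sketch, and not the analysis used in any cited study, the Python below shows how these quantities are conventionally computed from paired human and AI labels. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(
        counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    # Undefined when expected == 1 (both raters always use one identical label).
    return (observed - expected) / (1 - expected)


def sensitivity_specificity(predicted, reference):
    """Binary labels (1 = positive); the reference is treated as ground truth."""
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels: 1 = value present in an application, 0 = absent.
human = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
ai = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

kappa = cohens_kappa(human, ai)
sensitivity, specificity = sensitivity_specificity(ai, human)
```

In this toy example the raters agree on 70% of items, yet κ is only about 0.4 because half of that agreement would be expected by chance, which illustrates why κ is typically reported alongside raw percent agreement.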

Leveraging AI for Simulation and Skill Acquisition

The increasing complexity of medical education, rapid innovation in surgical techniques, variability in teaching styles, variable exposure to rare procedures, and increasing demands on teaching faculty create uneven training experiences and potential skill gaps—challenges that could directly impact patient safety. In response, surgical education is shifting toward a CBE framework, which emphasizes measurable milestones. Central to CBE are Entrustable Professional Activities (EPAs), defined as discrete, observable tasks that a trainee may perform independently once competence is demonstrated.⁵ Successful implementation of this framework requires longitudinal performance tracking and frequent, individualized feedback—an area where AI can have transformative value.

Unlike traditional “one-size-fits-all” curricula, AI has the potential to dynamically assess each trainee’s performance and tailor learning accordingly. Simulation platforms powered by AI can evaluate progress toward specific EPAs and prescribe targeted practice in safe, low-stakes environments. As an example, Stone et al (2025) introduced an AI-driven extended reality (XR) system that guided trainees through renal artery clamp placement on a kidney phantom. Using deep learning applied to first-person video, the system distinguished between correct and incorrect clamp sites and provided real-time corrective feedback. In this proof-of-concept study, all 17 participants successfully completed the task, with algorithmic classification achieving nearly 100% accuracy and survey feedback showing strong acceptance.⁶ At McGill University’s Neurosurgical Simulation and Artificial Intelligence Learning Centre, researchers tested a hybrid model in which 87 medical students practiced neurosurgical tasks under either AI-only instruction, human-only instruction, or human instruction informed by AI feedback.⁷ The hybrid group significantly outperformed the others in both technical performance and skill transfer, demonstrating that AI is most effective when augmenting, not replacing, human experts. Another study from the same institution further revealed that AI-enhanced surgical curricula produced “positive unintended effects,” including improved safety metrics, enhanced bimanual control, and more deliberate operative actions.⁸

By assessing individual data such as case logs and performance metrics, models may identify knowledge gaps and technical deficiencies and subsequently use this information to prescribe targeted learning activities such as additional dexterity training, anatomy reviews, and reading assignments.⁹ This approach not only optimizes and modernizes the learning process but also ensures that available training hours are tailored to each trainee’s specific needs.

Augmented and Virtual Reality Platforms

AI can integrate with augmented reality (AR) and virtual reality (VR) platforms to deliver immersive, adaptive training experiences. AR uses visuospatial technology to overlay computer-generated content onto the real world, while VR creates fully simulated environments.¹⁰ Such systems have already shown measurable benefits. A randomized-controlled trial on laparoscopic salpingectomy found that proficiency-based VR training significantly enhanced surgical skills in novice registrars. The VR-trained group achieved a median performance score of 33 points, equivalent to that of an intermediately experienced surgeon, and completed the operation in half the time, with a median of 12 min compared to the control group’s 24 min.¹¹

AR/VR platforms commonly employ real-time hand and instrument tracking using sensors and cameras, allowing the system to immediately report objective data such as instrument path accuracy or hand stability.¹²,¹³ Emerging modalities including eye-tracking and physiological data are becoming increasingly integrated within AR/VR to provide rich assessments of performance.¹⁴⁻¹⁶ A randomized-controlled crossover trial investigated the effect of augmented reality telestration on surgical performance and gaze behavior in minimally invasive surgery training.¹⁷ The study compared an AR-based system (iSurgeon) with verbal-only instruction, measuring outcomes in 40 laparoscopically naive medical students. Trainees instructed with the AR system exhibited improved gaze behavior, as evidenced by a substantial reduction in gaze latency and an increase in collaborative gaze convergence with the instructor. This guidance translated to a lower number of errors and higher scores on both the global and task-specific Objective Structured Assessment of Technical Skills (OSATS) scales. The use of AR also reduced the trainees’ cognitive workload, as measured by the NASA Task Load Index and blink rate, suggesting a more efficient and less taxing learning process.

AI for Global Surgical Education through Simulation

AI has the potential to broaden access to surgical education by delivering scalable, consistent learning experiences that do not rely on faculty availability. In low-resource environments, AI-driven systems can provide structured practice and feedback, helping trainees achieve competency benchmarks with limited supervision. Additionally, low-cost platforms have been developed which can facilitate adoption of AI technology to advance global surgical education.

Bridging gaps between high- and low-resource countries requires low-cost tools, and several recent proof-of-concept and pilot programs have had some success building such models. For example, Kim et al (2025) developed an open-source model for laparoscopic simulation, specifically designed for use in low-resource areas. Notably, the authors demonstrated that AI trained on multiple different procedures (eg, appendectomy and salpingectomy) through the African Laparoscopic Learners – Safe Advancement for Ectopic pregnancy (ALL-SAFE) simulation platform could assess a different procedure (eg, enterectomy) within that same platform with moderate accuracy. AI trained in this manner most accurately assessed performance on laparoscopic appendectomies. The study provides pilot data demonstrating how AI models can be implemented in cost-effective, scalable ways in low-resource settings.¹⁸

Virtual mentoring through AI to personalize learning and assess a trainee’s performance and skill remotely, transcending geographic boundaries, currently remains a theoretical possibility that has not been studied.¹⁹,²⁰ Proof-of-concept works have evaluated AI-based feedback for skill training. As highlighted in the earlier sections, AI-augmented personalized expert feedback resulted in superior performance on a VR simulation platform at McGill University; this study provides a preliminary impetus for AI guidance in focusing instructor feedback and time towards achieving optimal trainee performance through deliberate practice.⁷ The concept of deliberate and directed practice in the acquisition of expert performance has been a fundamental, time-tested pillar of surgical education.²¹ In the era of AI, the potential for AI-driven automated skills assessment and feedback is enormous, as it addresses the key bottleneck of faculty time commitment in busy clinical practice settings. As an example, authors from the University of Southern California designed an AI-based automated feedback system that assessed novice performance on robotic suturing of a vesicourethral anastomosis. They noted that the AI-feedback group improved more than the control group (without AI-feedback) in specific domains of robotic suturing, and that AI-feedback most benefitted underperformers and receptive learners, while maintaining concordance with human assessment.²²

AI applications are also being used to address language and cultural barriers inherent to global education. Advances in language processing models have enabled researchers to translate health care curricula into many languages, increasing resource availability among diverse populations. Chen et al (2025) evaluated AI models such as Google Gemini and Microsoft Copilot for translating critical care education materials from English into Spanish, Chinese, and Ukrainian.²³ Blinded clinical assessments and automated metrics indicated generally satisfactory performance, although specific outcomes varied depending on AI model and target language.

Finally, automated evaluation systems can standardize performance metrics globally, mitigating differences in national standards and supporting universal benchmarks for competency. Such objectivity can enhance patient safety and equity in surgical care worldwide.

Intraoperative Decision Support & Assessment

Beyond simulation, AI has potential intraoperative training applications. For example, Madani et al (2022) developed a deep learning model (GoNoGoNet) to identify anatomy and “safe zones” during laparoscopic cholecystectomy, offering real-time AR guidance.²⁴ By accurately identifying safe and dangerous dissection zones during surgery, this technology can serve as a real-time digital mentor for trainees, enhancing their anatomical understanding and helping prevent critical errors. The model’s strong performance, with a mean F1 (Dice) score of 0.80 for No-Go zones, indicates its effectiveness in identifying high-risk areas. Such AI tools can be integrated into simulators and operating rooms to standardize training, objectively assess performance, and ultimately improve patient safety. In 2023, researchers at San Luigi Hospital developed and trained a dedicated artificial neural network (NN) to recognize active bleeding during robot-assisted radical prostatectomy (RARP).²⁵ The software was designed to analyze the video feed from the endoscope in real time, scanning every 3 s to identify and predict the occurrence of bleeding. A confidence score, represented as a percentage, was used to signal the likelihood of an upcoming bleeding event. The software identified active bleeding with an accuracy of 90.6%. The researchers also noted that, on average, the model predicted bleeding events 3 s faster than the human surgeon.
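
Segmentation results such as those in Table 1 are usually reported as intersection-over-union (IOU), F1 (Dice), and accuracy, and these can diverge: a small target zone can yield high accuracy with a modest IOU (eg, the Go zone IOU of 0.53 alongside an accuracy of 0.94). The following Python sketch, which assumes pixel-wise binary masks and uses hypothetical function names and toy data, illustrates how the three metrics are conventionally computed. It is not the evaluation code of the cited study.

```python
def confusion_counts(pred_mask, true_mask):
    """Count pixel-wise TP, FP, FN, TN for two equally sized binary masks."""
    tp = fp = fn = tn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(pred_mask, true_mask):
    """Return IOU, Dice (F1), and pixel accuracy for one predicted mask."""
    tp, fp, fn, tn = confusion_counts(pred_mask, true_mask)
    union = tp + fp + fn
    # Convention (an assumption here): two empty masks count as a perfect match.
    iou = tp / union if union else 1.0
    dice = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"iou": iou, "dice": dice, "accuracy": accuracy}


# Toy 4x4 example: 1 marks pixels inside an annotated zone.
true_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

metrics = segmentation_metrics(pred_mask, true_mask)
```

For a single mask, Dice and IOU are related by Dice = 2·IOU / (1 + IOU), so Dice is never lower than IOU; this identity does not carry over exactly to values averaged across many frames.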

Such tools foreshadow a future in which AI enhances both surgical training and intraoperative safety. Virtual mentoring can also be useful beyond surgical training; AI-driven telemedicine platforms can allow surgeons in rural settings to connect with specialists and seek real-time guidance on complex anatomy encountered during cases.¹⁹ Although continued development and validation of these tools on larger scales is necessary, implementing AI in such settings could improve performance globally.

Finally, despite the need for significant ongoing investment and early stages of technological development, AI-integrated AR/VR platforms could procedurally generate novel anatomic variations and operative scenarios, reducing practice redundancy and strengthening intraoperative decision-making. Together, these advances highlight AI’s potential to adapt training to each learner, optimize preparation, and reinforce patient safety.

Assessment & Progression in CBE/EPAs with Real-Time Feedback

Perhaps the most transformative role of AI in surgical education lies in performance evaluation and feedback. Traditional assessment relies heavily on faculty observation. While expert evaluators bring invaluable clinical judgment, human assessment is inherently limited by subjectivity, inter-rater variability, and observation time. A 2023 scoping review found that even well-intentioned evaluators may demonstrate implicit bias when rating trainee performance.²⁶ AI offers a powerful complement by generating objective, continuous, and reproducible reports of technical skill.

This aligns well with competency-based frameworks that emphasize EPAs. AI-enabled simulation platforms can track resident progress toward specific EPAs and prescribe practice regimens to meet performance targets. For example, Fazlollahi et al (2022) performed a randomized-controlled trial among 70 medical students, and the group that received feedback from an AI audiovisual model had greater improvement than students receiving instruction from a human expert, when assessed on a VR neurosurgical model.²⁷ The model provided goal-oriented, metric-based suggestions, which translate well into today’s EPA-based guidelines. AI could also be leveraged to devise suitable training schedules to achieve proficiency by a target date.

AI additionally offers opportunities to standardize resident assessment beyond written exams such as the In-Training Exam (ITE) or American Board of Surgery Qualifying Exam, which purely measure trainee knowledge.²⁸,²⁹ Using motion-tracking and haptic sensors, AI can evaluate parameters such as smoothness of operative motion, instrument path, and error frequency.³⁰ Ma et al (2024) found that surgical trainees who received AI-based skill assessment results from suturing tasks using the da Vinci surgical robot improved more than controls.²² In Lavanchy et al (2021), four expert surgeons rated videos of 949 cholecystectomy clip applications on a scale of 1 to 5, with “good” skill defined as a score of 3 or higher.³¹ The researchers then showed the same videos to the model, which distinguished good from poor skill with 87% accuracy. Likewise, Mirchi et al (2020) created a machine learning model that classified participants in a VR brain tumor resection task as “novice” or “skilled” with 92% accuracy.³² This “Virtual Operative Assistant” also provided automated, benchmark-guided feedback. Such metrics can provide immediate, actionable, and potentially subtle feedback (eg, needing to adjust suture tension or reduce excessive wrist rotation) that may elude the human eye, although direct translation of the haptics associated with VR simulation is an area that needs continued validation and development.
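
To make parameters such as instrument path and smoothness of motion concrete, the following Python sketch computes two commonly used kinematic summaries, total path length and mean squared jerk, from uniformly sampled instrument-tip positions. It is an illustration under stated assumptions (3-D positions, a fixed sampling interval `dt`, finite-difference jerk) and does not reproduce the feature set of any study cited here; all names and trajectories are hypothetical.

```python
import math


def path_length(points):
    """Total distance travelled by the instrument tip (same units as the input)."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def mean_squared_jerk(points, dt):
    """Mean squared magnitude of jerk (third derivative of position).

    Jerk is approximated with a third-order finite difference over four
    consecutive samples; lower values indicate smoother motion.
    """
    squared = []
    for p0, p1, p2, p3 in zip(points, points[1:], points[2:], points[3:]):
        jerk = [
            (d - 3 * c + 3 * b - a) / dt**3
            for a, b, c, d in zip(p0, p1, p2, p3)
        ]
        squared.append(sum(component**2 for component in jerk))
    return sum(squared) / len(squared) if squared else 0.0


# Hypothetical trajectories sampled once per second (dt = 1.0).
smooth = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0)]
jittery = [(0, 0, 0), (1, 1, 0), (2, 0, 0), (3, 1, 0), (4, 0, 0)]

smooth_summary = (path_length(smooth), mean_squared_jerk(smooth, 1.0))
jittery_summary = (path_length(jittery), mean_squared_jerk(jittery, 1.0))
```

Both trajectories reach the same endpoint, but the jittery one covers a longer path with higher jerk, which is the kind of difference a motion-based model could surface as feedback on economy of movement.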

Beyond individual assessment, AI enables benchmarking. Trainee performance can be compared against expert surgeon metrics, identifying gaps and prescribing structured training schedules to achieve proficiency within defined timelines. This is demonstrated by Khairnar et al (2025), who applied a machine learning model to videos of novice and expert surgeons performing laparoscopic suturing and found that their model correctly distinguished learner status with 81.7% accuracy.³³ Such stratification tools may allow earlier identification of underperforming learners, ensure equitable training outcomes across programs, and ultimately improve patient safety. By combining objective assessment with personalized feedback, AI has the potential to elevate surgical education into a more transparent, data-driven process.

Curriculum Integration of AI in Surgical Training

Despite AI’s promise, it is not a substitute for human mentorship, which remains the cornerstone of surgical education. Current evidence suggests the greatest gains occur when AI and faculty complement each other. For example, Kiyasseh et al (2023) compared AI-based feedback to expert feedback and found that while AI explanations are often in concordance with those of experts, they can be less reliable for certain learner groups.³⁴ Given this discrepancy, they proposed the “TWIX” (training with explanations) model, wherein AI systems are trained to mimic human explanations, rather than simply providing a numerical score. Effective integration therefore requires a balanced curriculum where AI augments expert teaching.

Several platforms already demonstrate how AI can be incorporated into surgical training. SIMPL (System for Improving and Measuring Procedural Learning) enables attendings to log resident operative performance in real time via a mobile app.³⁵ While convenient, its utility is limited by scorer variability and the time burden placed on faculty. AI could enhance such systems by aggregating faculty evaluations, adjusting for rater stringency, and generating concise anonymized reports highlighting resident strengths and weaknesses for clinical competency review. Although such an approach in surgical training is in its infancy, comparable approaches have been demonstrated to be efficient in other educational settings. For example, Guo et al (2024) introduced an AI chatbot designed to evaluate undergraduate student comments on their peers’ essays. Results showed that the integration of AI feedback improved the quality of peer reviews. These findings highlight the benefit of AI-generated feedback within an educational setting.³⁶
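As an illustration of what adjusting for rater stringency could look like in its simplest form, the sketch below subtracts each rater's average offset from the overall mean before averaging scores per resident. This is an assumed, deliberately simple approach and not a description of SIMPL or any cited system; the function name, data layout, and rater labels are hypothetical, and the method presumes that raters observe broadly comparable residents.

```python
from collections import defaultdict
from statistics import mean

def adjust_for_stringency(ratings):
    """ratings: list of (rater, resident, score) tuples.

    Returns the mean stringency-adjusted score per resident.
    """
    overall = mean(score for _, _, score in ratings)
    by_rater = defaultdict(list)
    for rater, _, score in ratings:
        by_rater[rater].append(score)
    # A rater's offset is how far their average sits from the overall average.
    offset = {rater: mean(scores) - overall for rater, scores in by_rater.items()}
    by_resident = defaultdict(list)
    for rater, resident, score in ratings:
        by_resident[resident].append(score - offset[rater])
    return {resident: mean(scores) for resident, scores in by_resident.items()}

# Hypothetical data: "hawk" scores everyone low, "dove" scores everyone high.
ratings = [
    ("hawk", "A", 2), ("hawk", "B", 3),
    ("dove", "A", 4), ("dove", "B", 5),
]
adjusted = adjust_for_stringency(ratings)
print(adjusted)
```

After adjustment both raters agree on each resident, so the remaining difference between residents A and B reflects performance rather than who happened to score them.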

AI tools are particularly well suited for repetitive technical tasks such as robotic suturing or needle passing, where objective performance metrics can be tracked longitudinally. For nuanced domains including intraoperative judgment, management of complications, or patient communication, human oversight remains the gold standard, although AI may have a place in improving these facets as well. In fact, Ayers et al (2023) found that evaluators preferred AI-generated responses over physician-generated responses to questions from an online medical forum.³⁷ A practical approach for curricular integration of AI may therefore involve using AI for repetitive skill reinforcement and progress tracking, while reserving faculty expertise for the complex, context-dependent teaching, where AI is most likely to falter.

AI in Surgical Resident Selection

With residency applications increasing annually, programs face mounting difficulty in reviewing applicants thoroughly and equitably.^38,39 The growing volume often limits holistic review, especially of narrative components such as personal statements, recommendation letters, and descriptions of meaningful activities. AI, long applied to analyzing complex data, is now being tested as a tool to process text-based information at scale, potentially easing this burden.

Mahtani et al (2023) used natural language processing to evaluate over 188 000 narratives from internal-medicine applicants, integrating thematic markers with structured data to predict interview offers with ∼92% accuracy.⁴⁰ In contrast, Hassan et al (2025) reported that an AI-based resident selection algorithm aligned with the program director selection in only 7% of cases, with AI-selected applicants being more likely to be White/Hispanic and with higher standardized exam scores and research metrics.⁴¹ Drum et al (2023) developed a model that identified traits such as compassion, teamwork, and work ethic from residency applications, highlighting the ability to quantify qualities often valued but difficult to measure.⁴² These studies underscore both the promise and variability of AI, emphasizing that outcomes depend heavily on model design and oversight. While these examples reflect the early developments of incorporating AI into the residency selection process, care must be taken to avoid over-reliance, which may pave the path for savvy applicants to alter their content towards securing a favorable score on AI-only review. This is highlighted by Varman et al (2025), who found that while AI scores of prospective general surgery resident personal statements were consistent, they differed significantly from those of human reviewers.⁴³ As AI use becomes more widespread and accessible, there are growing concerns that students may use these tools to generate their personal statements, allowing them to draft and revise their work far more quickly than those who wrote their essays themselves. To this effect, Koleilat et al (2024) found that surgeons from a resident selection committee were only able to distinguish AI-generated from human-written personal statements with an accuracy of 55%, reflecting the potential issues that may emerge with broad knowledge and adoption of AI tools.⁴⁴

The most compelling role for AI is not to replace human reviewers but to act as an intelligent filter, triaging large applicant pools and flagging candidates for deeper review. Ortiz et al (2022) used 2 natural language models: one that used applicant letters of recommendation and one that used demographic data like honor society membership, standardized examination scores, and number of research publications.⁴⁵ Both models were able to discriminate whether applicants did or did not match into a neurosurgical residency program, although the letter of recommendation model was superior (area under the curve = 0.80 vs 0.75). Likewise, Rees and Ryder (2023) used a machine learning algorithm to predict outcomes for 5067 internal-medicine residency applicants, and found high accuracy in distinguishing ranked from unranked applicants (area under the receiver operating characteristic curve [AUROC] = 0.925), but less success in distinguishing matched from unmatched candidates (AUROC = 0.597).⁴⁶
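For readers less familiar with the AUROC values quoted above, the metric can be read as the probability that a model scores a randomly chosen positive case (eg, a ranked applicant) higher than a randomly chosen negative one, so 0.5 corresponds to chance and 1.0 to perfect separation. The sketch below computes it directly from that definition. The function name and the example scores are hypothetical and are not drawn from the cited studies.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative.

    labels are 1 (positive) or 0 (negative); tied scores count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model scores for four applicants; the first two were ranked.
labels = [1, 1, 0, 0]
print(auroc([0.9, 0.8, 0.3, 0.1], labels))  # every positive outranks every negative
print(auroc([0.9, 0.2, 0.8, 0.1], labels))  # one positive is scored below a negative
```

This pairwise form is quadratic in the number of cases and is meant for explanation; established statistical libraries provide faster rank-based implementations.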

In summary, AI models can preserve faculty oversight, reduce time burden while ensuring rigorous standards, and manage repetitive, time-intensive tasks amid ever-growing application volumes. AI also has the potential to enhance holistic review by streamlining the extraction of key data points that programs prioritize, thereby reducing the need to rely on algorithms for interpreting context from letters, personal statements, or narrative experiences, areas where AI outputs may be discordant with human assessment. For example, one of the large graduate medical education selection platforms (Thalamus) has pioneered an applicant-screening platform (Cortex) that uses AI, NLP, and transcript normalization to extract relevant application datapoints, which are then aggregated into an interface optimized for the human reviewer.⁴⁷ The system claims to increase screening efficiency by 50% and allows programs to customize which datapoints matter by letting faculty define what parameters they consider predictive for success in their residency. Finally, as AI algorithms inevitably reflect the biases present in their training data, it is essential that a diverse group of stakeholders defines which datapoints are incorporated. Such deliberate design is critical to mitigate bias and ensure fair, equitable opportunities that align closely with human judgment.

The overview of AI in surgical education as elaborated in the preceding sections of this article is illustrated as a concept map in Figure 1. The corresponding literature discussed is tabulated in Table 1.

Figure 1.

Concept Map Overview of Artificial Intelligence in Surgical Education. NLP: Natural Language Processing; LOR: Letter of Recommendation; AI: Artificial Intelligence; XR: Extended Reality; AR: Augmented Reality; EPA: Entrustable Professional Activity.

Ethical Consideration of AI in Surgical Education

The adoption of AI raises significant ethical concerns, foremost among them the ownership and use of resident performance data. Many AI systems rely on continuous data collection sent to vendors for model refinement.⁴⁸ While useful for optimization, these data sets may include sensitive trainee information, raising risks of privacy breaches, misuse, or even monetization.⁴⁹ Public or insurance disclosure of raw performance data could also unfairly influence patient perceptions of individual surgeons.⁵⁰ While programs may value such data to identify struggling residents, critics argue that blinded or aggregate use is preferable to avoid discrimination. Clear governance frameworks remain lacking.^51,52 Informed consent is critical, and residents and patients should know what data is collected, its storage duration, and how it is shared. Opt-out provisions, institutional review board oversight, and explicit policies for human review and override are necessary safeguards.⁵³ AI-generated evaluations should be used as adjunctive, not definitive, measurements of performance. Finally, AI use for residency applicant review should avoid bias through input from a diverse group of individuals during model development as well as frequent post hoc analysis and adjustment. Time and effort need to be allocated to these tasks to mitigate risks of homogenization and lack of diverse personnel and perspective during recruitment.
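One simple form the post hoc analysis described above could take is a periodic comparison of selection rates across applicant groups. The sketch below is an assumed, minimal example rather than a validated fairness audit; the group labels, record layout, and function names are hypothetical, and a real analysis would also need to consider sample size and confounding factors.

```python
from collections import Counter

def selection_rates(records):
    """records: iterable of (group, selected) pairs, where selected is True/False."""
    totals, chosen = Counter(), Counter()
    for group, selected in records:
        totals[group] += 1
        if selected:
            chosen[group] += 1
    return {group: chosen[group] / totals[group] for group in totals}

def rate_ratio(rates):
    """Lowest selection rate divided by the highest; 1.0 means equal rates."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest else 1.0

# Hypothetical screening output: group X is advanced twice as often as group Y.
records = [
    ("X", True), ("X", True), ("X", False), ("X", False),
    ("Y", True), ("Y", False), ("Y", False), ("Y", False),
]
rates = selection_rates(records)
print(rates, rate_ratio(rates))
```

A ratio well below 1.0 does not by itself establish bias, but it flags a disparity that the program's reviewers would want to examine before relying on the model's output.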

Challenges and Future Directions

Despite early promise, several barriers must be addressed for AI to be integrated into surgical education. Faculty acceptance is pivotal. Resistance often stems from skepticism about validity and concerns that AI undermines traditional mentorship strategies.^54,55 Transparent development, pilot programs demonstrating improved trainee performance, and alignment with established simulation guidelines may foster quicker adoption. Data security also remains a critical challenge. Breaches like the 2024 Change Healthcare hack, which exposed data for over 100 million patients, highlight the stakes.⁵⁶ Vendors managing trainee or patient data must comply with strict regulations, emphasizing anonymity, secure storage, and scheduled deletion. Future large multicenter trials with long-term follow-up are needed to validate efficacy and generalizability of AI in surgical training and provide the leap from a proof-of-concept to trusted educational infrastructure.

Conclusion

Artificial intelligence has emerged as a powerful adjunct in surgical education, offering personalized feedback, scalable training, and standardized assessment that can bridge institutional and global gaps. The greatest promise lies in hybrid models where AI augments, rather than replaces, traditional mentorship. To realize this potential, rigorous outcome-based research, long-term validation, and careful attention to ethical, technical, and implementation challenges are essential. With deliberate governance and thoughtful integration, AI can become a transformative pillar of surgical training.

Footnotes

ORCID iDs

Niruktha Raghavan

David Limon

Miranda X. Morris

Aashish Rajesh

Author Contributions

NR, PCP – Conceptualization, draft of preliminary manuscript, revision, approval of final version. DL, MXM, JWK – conceptualization, critical review of manuscript with revision for incorporating intellectual content, approval of final version. AR – senior author, conceptualization, critical review of manuscript with revision for incorporating intellectual content, approval of final version.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Pakkasjärvi, Anttila, Pyhältö. What are the learning objectives in surgical training - a systematic literature review of the surgical competence framework. BMC Med Educ. 2024;24(1):119. doi:10.1186/s12909-024-05068-z

2. Watson, Flesher, Ruiz, Chung. Impact of the 80-hour workweek on surgical case exposure within a general surgery residency program. J Surg Educ. 2010;67(5):283-289. doi:10.1016/j.jsurg.2010.07.012

3. Cox, Farjat, Risoli, et al. Documenting or operating: where is time spent in general surgery residency? J Surg Educ. 2018;75(6):e97-e106. doi:10.1016/j.jsurg.2018.10.010

4. Maloney, Peterson, Kao, Sherrill, Green, Sachdev. Surgery resident time consumed by the electronic health record. J Surg Educ. 2020;77(5):1056-1062. doi:10.1016/j.jsurg.2020.03.008

5. Brasel, Klingensmith, Englander, et al. Entrustable professional activities in general surgery: development and implementation. J Surg Educ. 2019;76(5):1174-1186. doi:10.1016/j.jsurg.2019.04.003

6. Stone, Griffith, Zeller, Wilson. Autonomous educational system for surgical training utilizing deep learning combined with extended reality. J Med Ext Real. 2025;2(1):160-173. doi:10.1177/29941520251361898

7. Giglio, Albeloushi, Alhaj, et al. Artificial intelligence-augmented human instruction and surgical simulation performance: a randomized clinical trial. JAMA Surg. 2025;160(9):993-1003. doi:10.1001/jamasurg.2025.2564

8. Fazlollahi, Yilmaz, Winkler-Schwartz, et al. AI in surgical curriculum design and unintended outcomes for technical competencies in simulation training. JAMA Netw Open. 2023;6(9):e2334658. doi:10.1001/jamanetworkopen.2023.34658

9. Gomez, Seenivasan, Zou, et al. Explainable AI for automated user-specific feedback in surgical skill acquisition. arXiv. 2025. doi:10.48550/arXiv.2508.02593 Preprint posted online.

10. Abbas, O’Connor, Ganapathy, et al. What is virtual reality? A healthcare-focused systematic review of definitions. HPT. 2023;12(2):100741. doi:10.1016/j.hlpt.2023.100741

11. Larsen, Soerensen, Grantcharov, et al. Effect of virtual reality training on laparoscopic surgery: randomised controlled trial. BMJ. 2009;338:b1802. doi:10.1136/bmj.b1802

12. Borzelli, Boarini, Casile. A quantitative assessment of the hand kinematic features estimated by the oculus quest 2. Sci Rep. 2025;15(1):8842. doi:10.1038/s41598-025-91552-5

13. Reimer, Podkosova, Scherzer, Kaufmann. Evaluation and improvement of HMD-Based and RGB-Based hand tracking solutions in VR. Front Virtual Real. 2023;4:1169313. doi:10.3389/frvir.2023.1169313

14. Adhanom, MacNeilage, Folmer. Eye tracking in virtual reality: a broad review of applications and challenges. Virtual Real. 2023;27(2):1481-1505. doi:10.1007/s10055-022-00738-z

15. Schütz, Dehghani, Sommersperger, Faridpooya, Navab. The impact of intraoperative optical coherence tomography on cognitive load in virtual reality vitreoretinal surgery training. Sci Rep. 2025;15(1):24848. doi:10.1038/s41598-025-07670-7

16. Francia, Donno, Covarrubias Rodriguez, Cascini, Tarabini, Galli. Real-time monitoring of physiological and postural parameters to evaluate human reactions in virtual reality for safety training. Sensors (Basel). 2025;25(14):4400. doi:10.3390/s25144400

17. Felinska, Fuchs, Kogkas, et al. Telestration with augmented reality improves surgical performance through gaze guidance. Surg Endosc. 2023;37(5):3557-3566. doi:10.1007/s00464-022-09859-7

18. Kim, Rosenthal, Ryder, et al. Generalizability of artificial intelligence assessments in laparoscopic surgery simulation. J Surg Res. 2025;309:249-256. doi:10.1016/j.jss.2025.03.030

19. AI Has Potential to Transform Global Surgical Systems. ACS. Accessed September 28, 2025. https://www.facs.org/for-medical-professionals/news-publications/news-and-articles/bulletin/2024/june-2024-volume-109-issue-6/ai-has-potential-to-transform-global-surgical-systems/

20. Satapathy, Hermis, Rustagi, Pradhan, Padhi, Sah. Artificial intelligence in surgical education and training: opportunities, challenges, and ethical considerations - correspondence. Int J Surg. 2023;109(5):1543-1544. doi:10.1097/JS9.0000000000000387

21. Ericsson. Deliberate practice and acquisition of expert performance: a general overview. Acad Emerg Med. 2008;15(11):988-994. doi:10.1111/j.1553-2712.2008.00227.x

22. Kiyasseh, Laca, et al. Artificial intelligence-based video feedback to improve novice performance on robotic suturing skills: a pilot study. J Endourol. 2024;38(8):884-891. doi:10.1089/end.2023.0328

23. Chen, Dong, Castillo-Zambrano, et al. A systematic multimodal assessment of AI machine translation tools for enhancing access to critical care education internationally. BMC Med Educ. 2025;25(1):1022. doi:10.1186/s12909-025-07452-9

24. Madani, Namazi, Altieri, et al. Artificial intelligence for intraoperative guidance: using semantic segmentation to identify surgical anatomy during laparoscopic cholecystectomy. Ann Surg. 2022;276(2):363-369. doi:10.1097/SLA.0000000000004594

25. Checcucci, Piazzolla, Marullo, et al. Development of bleeding artificial intelligence detector (BLAIR) system for robotic radical prostatectomy. J Clin Med. 2023;12(23):7355. doi:10.3390/jcm12237355

26. Helliwell, Hyland, Gonte, et al. Bias in surgical residency evaluations: a scoping review. J Surg Educ. 2023;80(7):922-947. doi:10.1016/j.jsurg.2023.04.007

27. Fazlollahi, Bakhaidar, Alsayegh, et al. Effect of artificial intelligence tutoring vs expert instruction on learning simulated surgical skills among medical students: a randomized clinical trial. JAMA Netw Open. 2022;5(2):e2149008. doi:10.1001/jamanetworkopen.2021.49008

28. Rahimpour, Morrison, Denning, Bown, Ray, Barry. Navigating the american board of surgery in-training examination (ABSITE) success: insights from pre-assessment practices in preparing surgical residents for competitive sub-specialties. Cureus. 2024;16(6):e62896. doi:10.7759/cureus.62896

29. Stain, Matthews, Ata, Adams, Chen, Potts. US medical licensing exam performance and American board of surgery qualifying and certifying examinations. J Am Coll Surg. 2021;233(6):722-729. doi:10.1016/j.jamcollsurg.2021.08.674

30. Kumar, Saudagar, AKJ, Kumar, Alkhrijah, Raja. Innovating medical education using a cost effective and scalable VR platform with AI-Driven haptics. Sci Rep. 2025;15(1):26360. doi:10.1038/s41598-025-10543-8

31. Lavanchy, Zindel, Kirtac, et al. Automation of surgical skill assessment using a three-stage machine learning algorithm. Sci Rep. 2021;11(1):5197. doi:10.1038/s41598-021-84295-6

32. Mirchi, Bissonnette, Yilmaz, Ledwos, Winkler-Schwartz, Del Maestro. The virtual operative assistant: an explainable artificial intelligence tool for simulation-based training in surgery and medicine. PLoS One. 2020;15(2):e0229596. doi:10.1371/journal.pone.0229596

33. Khairnar, Nguyen, Desir, Holcomb, Scott, Sankaranarayanan. Machine learning-based automated assessment of intracorporeal suturing in laparoscopic fundoplication. Global Surg Educ. 2025;4(1):62. doi:10.1007/s44186-025-00373-7

34. Kiyasseh, Laca, Haque, et al. A multi-institutional study using artificial intelligence to provide reliable and fair feedback to surgeons. Commun Med. 2023;3(1):42. doi:10.1038/s43856-023-00263-3

35. Hsu, Wnuk, Leininger, et al. When the first try fails: re-implementation of SIMPL in a general surgery residency. BMC Surg. 2024;24(1):257. doi:10.1186/s12893-024-02557-2

36. Guo, Pan, Lai. Effects of an AI-supported approach to peer feedback on university EFL students’ feedback quality and writing ability. Internet High Educ. 2024;63:100962. doi:10.1016/j.iheduc.2024.100962

37. Ayers, Poliak, Dredze, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023;183(6):589-596. doi:10.1001/jamainternmed.2023.1838

38. Singh, Boyd. Rapidly increasing number and cost of residency applications in surgery. Am Surg. 2023;89(12):5729-5736. doi:10.1177/00031348231173947

39. Michaelson, Karasik, Polen, Rangarajan, Zhao. Preparing for AI in resident selection: a scoping review of current applications and limitations. Laryngoscope. 2025. doi:10.1002/lary.32308 Published online June 7.

40. Mahtani, Reinstein, Marin, Burk-Rafel. A new tool for holistic residency application review: using natural language processing of applicant experiences to predict interview invitation. Acad Med. 2023;98(9):1018-1021. doi:10.1097/ACM.0000000000005210

41. Hassan, Ayad, Nembhard, et al. Artificial intelligence compared to manual selection of prospective surgical residents. J Surg Educ. 2025;82(1):103308. doi:10.1016/j.jsurg.2024.103308

42. Drum, Shi, Peterson, Lamb, Hurdle, Gradick. Using natural language processing and machine learning to identify internal medicine–pediatrics residency values in applications. Acad Med. 2023;98(11):1278-1282. doi:10.1097/ACM.0000000000005352

43. Varman, Nicholas, Conner, Prabhu, French, Lipman. Feasibility of using AI to evaluate general surgery residency application personal statements. J Surg Educ. 2025;29:103655. doi:10.1016/j.jsurg.2025.103655 Published online August 2025.

44. Koleilat, Bongu, Chang, Nieman, Priolo, Patel. Residency application selection committee discriminatory ability in identifying artificial intelligence-generated personal statements. J Surg Educ. 2024;81(6):780-785. doi:10.1016/j.jsurg.2024.02.009

45. Ortiz, Feldman, Yengo-Kahn, et al. Words matter: using natural language processing to predict neurosurgical residency match outcomes. J Neurosurg. 2023;138(2):559-566. doi:10.3171/2022.5.JNS22558

46. Rees, Ryder. Machine learning for the prediction of ranked applicants and matriculants to an internal medicine residency program. Teach Learn Med. 2023;35(3):277-286. doi:10.1080/10401334.2022.2059664

47. Thalamus | GME residency interview scheduling software. Accessed September 28, 2025. https://www.thalamusgme.com/

48. Feng, Phillips, Malenica, et al. Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare. NPJ Digit Med. 2022;5(1):66. doi:10.1038/s41746-022-00611-y

49. Murdoch. Privacy and artificial intelligence: challenges for protecting health information in a new era. BMC Med Ethics. 2021;22(1):122. doi:10.1186/s12910-021-00687-3

50. Sherman, Gordon, Mahvi, et al. Surgeons’ perceptions of public reporting of hospital and individual surgeon quality. Med Care. 2013;51(12):1069-1075. doi:10.1097/MLR.0000000000000013

51. Zhang. Ethics and governance of trustworthy medical artificial intelligence. BMC Med Inf Decis Making. 2023;23(1):7. doi:10.1186/s12911-023-02103-9

52. Shiferaw, Roloff, Balaur, Welter, Waltemath, Zeleke. Guidelines and standard frameworks for artificial intelligence in medicine: a systematic review. JAMIA Open. 2025;8(1):ooae155. doi:10.1093/jamiaopen/ooae155

53. Tahri Sqalli, Aslonov, Gafurov, Nurmatov. Humanizing AI in medical training: ethical framework for responsible design. Front Artif Intell. 2023;6:1189914. doi:10.3389/frai.2023.1189914

54. Toni, Fereidooni, Ayatollahi. Acceptance and use of extended reality in surgical training: an umbrella review. Syst Rev. 2024;13(1):299. doi:10.1186/s13643-024-02723-w

55. Weykamp, Bingham. Generation learning differences in surgery: why they exist, implication, and future directions. Surg Clin. 2023;103(2):287-298. doi:10.1016/j.suc.2022.11.008

56. Jiang, Ross, Bai. Ransomware attacks and data breaches in US health care systems. JAMA Netw Open. 2025;8(5):e2510180. doi:10.1001/jamanetworkopen.2025.10180

Artificial Intelligence in Surgical Education: A 2025 Update on Adaptive Training, Feedback, and Competency-Based Education