Abstract
Study Design
Retrospective Multicenter Cohort Study.
Objectives
To develop and validate an AI-based high-speed multi-class instance segmentation system for lumbar spinal endoscopic surgery using multicenter surgical video data and to assess performance across hardware environments.
Methods
Endoscopic videos from 112 patients at 5 hospitals (2020-2025) were analyzed. One frame per 300 frames was sampled, yielding 58,087 annotated images for 7 classes (instrument, fat, soft tissue, bone, nerve, disc, vessel). A Segment Anything Model (SAM)-assisted workflow improved annotation efficiency, followed by expert refinement. A YOLOv11-seg model was trained with a patient-level 4:1 split. Performance was evaluated using precision, recall, F1-score, mAP50, and mAP50-95, stratified by surgical approach. Inference speed was benchmarked across CPU (Intel i5/i7) and GPU (RTX 4080/5080) configurations.
Results
In the biportal group, overall precision, recall, F1-score, and mAP50 were 0.975, 0.633, 0.768, and 0.629, respectively. The uniportal group demonstrated 0.659, 0.670, 0.664, and 0.682, respectively. Class-wise performance varied substantially by surgical approach: the instrument class showed exceptionally high mAP50 (0.949) in uniportal settings, whereas anatomical structures like vessels were detected with superior accuracy in biportal settings (mAP50 = 0.863). Benchmarking yielded 21.86-27.45 FPS with CPU-only, ∼92 FPS with RTX 4080, and ∼117 FPS with RTX 5080.
Conclusions
This multicenter study highlights the potential of high-speed, multi-class instance segmentation in endoscopic spine surgery. Improving model robustness in visually degraded environments requires further research. Prioritizing high precision to prevent surgeon distraction, supported by rapid inference to maintain temporal continuity, is a practical direction for future surgical AI models.
Keywords
Introduction
Endoscopic spine surgery provides advantages such as minimal tissue trauma, faster recovery, and improved outcomes.1–3 Nevertheless, it remains technically demanding due to the restricted operative corridor, frequent changes in visualization, and the resemblance of adjacent structures. These factors complicate identification of critical anatomy like nerve roots, dura mater, and ligamentum flavum, thereby elevating the risk of iatrogenic injury.4,5
Artificial intelligence (AI), especially deep learning–based computer vision, has already proven valuable in other surgical fields. For example, GI Genius™ (Medtronic) and ENDO-AID (Olympus) are commercially available systems in gastrointestinal endoscopy that provide real-time object detection to enhance polyp recognition and mucosal assessment.6,7 In spine surgery, however, applications have largely concentrated on preoperative imaging or postoperative predictions,1–3 while intraoperative assistance remains at an early stage. Existing reports in spinal endoscopy AI are typically single-center, cover only a limited set of objects, and rarely address real-time feasibility.5,6,8
To address these gaps, we developed a high-speed, multi-class segmentation model trained on a large multicenter dataset of endoscopic spine surgery videos. More than 58,000 frames were manually annotated by spine surgeons, and the model was implemented using the YOLOv11 architecture. By incorporating multicenter data and focusing on diverse anatomic and procedural targets, this study presents a high-speed multi-class segmentation system for spinal endoscopic surgery.
Methods
Study Design and Video Collection
This was a retrospective, multicenter study designed to develop and evaluate an AI-based multi-class instance segmentation for lumbar spinal endoscopic surgery. Ethical approval was obtained from the institutional review board (IRB No.2509-005-154). A total of 112 lumbar endoscopic spinal surgery videos were collected from 5 hospitals between 2020 and 2025. The included videos encompassed both interlaminar and extraforaminal approaches and were selected to ensure diversity in surgical technique, anatomy, and endoscope configuration.
Model Development Process
To construct the training dataset, 1 frame was extracted for every 300 frames from the full-length surgical videos using an automated Python script. (Figure 1) Given that the native recording frame rate varied by endoscopic system (typically 30-60 fps), this corresponds to approximately 1 sampled image every 5-10 seconds. This sampling strategy was selected to avoid frame redundancy while maintaining sufficient visual variation across the dataset. A total of 58,087 still images were extracted. Each extracted frame was resized and normalized for input into the training pipeline. Workflow of model development Endoscopic surgical videos were collected from 5 hospitals and sampled at 1 frame per 300 frames, yielding a total of 58,087 images. Pixel-level annotations were created for 7 anatomical and procedural structures. To improve annotation efficiency and consistency, a model-assisted annotation workflow was employed, in which initial segmentation masks were generated using the SAM and subsequently verified and refined by experienced spine surgeons through iterative correction. The finalized annotated dataset was used to train a YOLOv11-based instance segmentation model. Model performance was evaluated using standard segmentation metrics, including precision, recall, F1-score, and mAP, as well as inference speed measured in FPS
Each extracted frame was manually annotated by experienced spine surgeons. Annotations were performed using pixel-level polygon masks (instance segmentation) for the following 7 categories: surgical instrument, fat, soft tissue, bone, nerve, disc material, and vessel. Given the sheer volume of 58,087 frames, we adopted a ‘Model-Assisted Annotation’ workflow to ensure efficiency and consistency. Data annotation was performed using an iterative pipeline leveraging the Segment Anything Model (SAM).
9
Initially, approximately 1000 frames were manually annotated to train the preliminary model. This model generated pre-labels for the remaining dataset, which were subsequently verified and corrected by the participating surgeons (Figure 2). The model was iteratively retrained on the corrected data to refine the auto-annotation for the remaining images. Through this repeated process of prediction, verification, and retraining, a total of 58,087 frames were annotated. Examples of SAM-assisted pre-labeling during dataset construction. (A) Pre-labeled segmentation masks generated by the SAM at an early stage of dataset construction, prior to large-scale expert-refined annotations. (B) Pre-labeled segmentation masks generated by SAM at a later stage, after iterative incorporation of expert-refined annotations during the model-assisted annotation process
The YOLOv11-seg instance segmentation architecture was utilized for model development. 10 Training was conducted on a workstation equipped with an AMD Ryzen 7 9800X3D CPU and an NVIDIA RTX 5080 GPU. The dataset was partitioned at the patient level into training (45,829 frames) and validation (12,258 frames) sets (4:1 ratio). The model was trained using stochastic gradient descent (SGD) with a learning rate of 0.001 and a batch size of 16. Data augmentation techniques, including horizontal flipping, random brightness adjustments, and rotation, were applied. Transfer learning was implemented using weights pretrained on the COCO dataset. Training was set for a maximum of 1000 epochs, and the final model weights were selected based on the epoch achieving the highest validation mean average precision (mAP).
Performance Evaluation
Model performance was quantitatively evaluated using standard computer vision metrics. The Intersection over Union (IoU), defined as the ratio of the intersection area to the union area of the predicted (Mp) and ground truth (Mgt) masks, served as the fundamental metric for localization accuracy:
Based on the IoU threshold, predicted segmentation masks were classified as true positives (TP), false positives (FP), or false negatives (FN). Precision and Recall were calculated for each class using the following equations:
The F1-score was computed as the harmonic mean of Precision and Recall to provide a balanced assessment of model performance:
The mean Average Precision (mAP) was calculated by averaging the Average Precision (AP) across all classes. We reported mAP at an IoU threshold of 0.5 (mAP50) and the average mAP across IoU thresholds from 0.5 to 0.95 in 0.05 increments (mAP50-95). Furthermore, to evaluate the model’s robustness against potential domain shifts caused by different endoscopic configurations, a stratified analysis was conducted. Performance metrics, including precision, recall, F1-score, and mAP, were independently calculated and compared between the biportal and uniportal endoscopy subgroups. Finally, inference speed was measured in frames per second (FPS).
To evaluate the model’s performance consistency and practical feasibility across varying computational resources, inference speed was benchmarked across different hardware configurations using a cloud-based computing environment. All tests were performed under an identical software stack with consistent versions of the operating system, Docker container, and deep learning libraries. Each hardware configuration was evaluated under the same input conditions (identical preprocessing, image resolution, and batch size), and the mean FPS was calculated from repeated measurements. The configurations included: (1) CPU-only (Intel i5, i7), (2) CPU (i5, i7) + NVIDIA GeForce RTX 4080 and (3) CPU (i5, i7) + NVIDIA GeForce RTX 5080. All tests were conducted within the same cloud infrastructure to minimize variability due to local hardware or thermal throttling effects.
Results
Demographic and Clinical Characteristics
Demographics and Clinical Characteristics
HNP = herniated nucleus pulposus; SS = spinal stenosis; FS = foraminal stenosis; SPLT = spondylolisthesis; ULBD = unilateral laminectomy for bilateral decompression; LD = lumbar discectomy; ULIF = unilateral lumbar interbody fusion; EFD = endoscopic foraminal decompression.
The participating centers utilized different hardware configurations. Center 2 (uniportal) used the Arthrex Synergy 4K system (Arthrex, Naples, FL, USA) with Joimax iLESSYS Delta and TESSYS TransSAP endoscopes (Joimax GmbH, Karlsruhe, Germany). Centers 1 and 4 (biportal) used the Conmed IM8000 system with Linvatec 0° 4-mm scopes (Conmed, Largo, FL, USA). Center 3 (biportal) utilized the Smith & Nephew LENS 4K system (Smith & Nephew, London, UK) with 4K 0° 4-mm scopes. Center 5 (biportal) used the Stryker 1688 system (Stryker, Kalamazoo, MI, USA) with 0° 4-mm scopes.
Regarding underlying pathology, herniated nucleus pulposus was the most common diagnosis (n = 43, 38.4%), followed by spinal stenosis (n = 32, 28.6%), foraminal stenosis (n = 20, 17.9%), and spondylolisthesis (n = 17, 15.2%). Cases were contributed by 5 centers, with patient distribution ranging from 10 (8.9%) to 34 (30.4%) per center, corresponding to 3940 to 19,084 video frames.
With respect to surgical approach, the interlaminar approach accounted for the majority of cases (n = 94, 83.9%), including unilateral laminectomy for bilateral decompression (n = 47, 42.0%), lumbar discectomy (n = 32, 28.6%), and unilateral lumbar interbody fusion (n = 15, 13.4%). The transforaminal approach was used in 18 cases (16.1%), all of which involved endoscopic foraminal decompression.
Dataset Composition and Class Distribution
Class Distribution of Training and Validation Data
Model Performance
Performance of the AI Model for Each Anatomical Structure
mAP (mean Average Precision) 50: Calculates average precision at a fixed IoU threshold of 0.5.
mAP50-95: Calculates the mean of APs at 10 IoU thresholds from 0.5 to 0.95 (in steps of 0.05).
IoU(Intersection over Union): a measure of the overlap between the predicted segmentation mask and the ground truth mask.
In the uniportal group, the overall precision and F1-score were 0.659 and 0.664, which were lower than those of the biportal group, while the recall (0.670) and mAP50 (0.682) were slightly higher. Class-wise analysis in the uniportal setting showed that the instrument class achieved the highest performance (mAP50 = 0.949, F1-score = 0.912). However, the vessel class exhibited a significant performance drop, with a precision of 0.393 and mAP50 of 0.260. Other anatomical structures, including bone (0.737) and soft tissue (0.718), maintained mAP50 levels above 0.7 in the uniportal environment. Representative examples are provided in Supplemental Videos 3.
Inference Speed According to CPU-GPU Combination
Inference Speed (FPS) of the AI Model According to CPU–GPU Combinations
All values represent the average inference speed of the AI model measured in frames per second (FPS).
Each combination was tested under identical software environments.
Discussion
In this study, we developed a high-speed, multi-class segmentation AI model for endoscopic spine surgery using a large-scale multicenter dataset comprising over 58,000 frames. While AI applications in spine surgery have been expanding, real-time intraoperative assistance has remained a challenge due to technical limitations.1,8,11–17 Our model simultaneously identified 7 anatomical and surgical targets with high precision, maintaining rapid inference speeds of up to 117 frames per second, thereby demonstrating its potential for intraoperative assistance. Although the recall was relatively lower compared to the high precision, this trade-off should be interpreted in the context of the system’s intended role as an intraoperative assistant rather than an autonomous decision-making tool. In surgical settings, minimizing false-positive alerts is critical, as excessive or erroneous visual cues may distract the surgeon and disrupt surgical workflow. Furthermore, recall was evaluated on a frame-by-frame basis, which does not fully reflect how surgeons perceive anatomical structures in continuous intraoperative video. Our model achieved an inference speed exceeding commonly accepted real-time thresholds. In general, real-time computer-vision systems for tasks such as pedestrian monitoring operate satisfactorily at ∼10-30 FPS, whereas high-speed applications (eg, autonomous driving or drone video) typically target ∼60 FPS. 18 The fast inference speed allows anatomical structures to be detected repeatedly across consecutive frames. Even when frame-level segmentations are intermittent, such temporal redundancy can support continuous perception of critical structures and may help reinforce surgeon confidence during dissection.
From a practical standpoint, inference speed is inherently dependent on the underlying hardware environment. In the present study, although the highest frame rates were achieved using high-end GPU configurations, our benchmarking demonstrated that prior-generation GPUs and even CPU-only settings were able to maintain frame rates within or near the lower bounds of real-time performance. While reduced throughput in these environments may limit temporal density, the achieved processing speeds remain sufficient to provide meaningful intraoperative assistance in selected workflows. These findings suggest that the proposed system retains flexibility for deployment across a range of hardware conditions, and that its clinical applicability is not restricted to optimized laboratory settings.
Several recent studies have demonstrated the potential of deep learning in spinal endoscopy. However, most have been limited to single-task applications or smaller datasets. Early works primarily addressed isolated problems, such as instrument detection or single-class tissue segmentation. For instance, Cho et al. 12 and Chen et al. 16 focused on lightweight instrument detection tasks, which inherently allow high inference speed due to sparse prediction requirements. In contrast, Lee et al. 11 and Peng et al. 14 developed models specifically for neural tissue segmentation, a substantially more demanding task requiring dense, pixel-level prediction. Although these models demonstrated favorable accuracy (Dice coefficient >0.8) for specific targets, their reported inference speeds suggest limited performance margins for maintaining stable intraoperative assistance during rapid instrument motion and frequent viewpoint changes typical of endoscopic spine surgery. The most recent and closely related study is the lightweight multi-scale attention framework (LMSF-A) proposed by Lai et al., 15 which demonstrated high-speed instance segmentation in spinal endoscopic surgery using a compact, task-optimized network. Trained on image-level data from 61 patients undergoing percutaneous endoscopic lumbar discectomy, the authors reported high segmentation accuracy together with very high inference speed, highlighting the feasibility of lightweight, real-time deployment under constrained computational settings.
While LMSF-A and the present study share the objective of high-speed intraoperative perception, they differ in scope and intended clinical role. LMSF-A focuses on algorithmic efficiency for a limited number of anatomical targets, whereas our study adopts a system-level approach by simultaneously segmenting 7 anatomical and procedural structures using a large multicenter surgical video dataset. This broader task definition increases segmentation complexity, particularly for nerve structures, but enables scalable intraoperative assistance across heterogeneous surgical environments. In this context, our model achieved clinically usable performance (mAP50 = 0.629) with very high precision (0.975) and stable rapid inference speed (∼115 FPS), supporting temporal continuity and surgeon confidence during dynamic endoscopic procedures.
The selection of AI models in the present study was guided by practical considerations related to dataset scalability, annotation reliability, and real-time deployability, rather than task-specific architectural novelty. During dataset construction, the SAM was employed as a model-assisted annotation tool to address the challenges inherent to large-scale, multicenter surgical video data. By enabling rapid generation of initial segmentation masks that were subsequently reviewed and refined by expert annotators, SAM substantially improved annotation efficiency while reducing inter-annotator variability, thereby facilitating the creation of a consistent and clinically reliable training dataset. For rapid inference, a YOLO-based architecture was adopted to support stable, low-latency performance required for continuous intraoperative assistance. Unlike U-Net–based or multi-stage segmentation pipelines, which may introduce cumulative latency through sequential processing steps, YOLO performs end-to-end prediction in a single forward pass, offering a favorable balance between computational efficiency and segmentation performance. This design choice enabled temporally consistent visualization under diverse hardware conditions and aligns well with the operational requirements of intraoperative video analysis. Taken together, the combination of model-assisted annotation using SAM and high-speed inference with YOLO represents a system-oriented strategy that may be broadly applicable to future surgical AI studies requiring scalable data curation and reliable rapid processing performance.
Beyond the general performance and hardware metrics, our subgroup analysis identified a significant performance gap between biportal and uniportal environments. Notably, while the uniportal group exhibited lower overall precision (0.659 vs 0.975), its overall recall (0.670 vs 0.633) and mAP50 (0.682 vs 0.629) were slightly higher than those of the biportal group. This seemingly counterintuitive result is primarily driven by the exceptional performance of the instrument class in the uniportal setting (Recall = 0.928, mAP50 = 0.949). In uniportal endoscopy, the instrument is introduced coaxially through the working channel, maintaining a relatively constant appearance, position, and trajectory at the center of the visual field. This consistent spatial prior allows the model to detect instruments with high sensitivity, significantly inflating the overall average metrics. However, this high detection rate for instruments contrasts sharply with the compromised segmentation performance for surrounding anatomical structures (eg, vessel, nerve, fat). As qualitatively demonstrated in the supplementary materials, while Videos 1 and 2 exhibit robust and continuous tracking in the biportal setting, Video 3 reveals noticeably lower temporal continuity and accuracy in the uniportal environment. It suggests that the reduced temporal continuity is likely attributable to the inherent mechanical workflow of uniportal surgery, which requires frequent rotation, tilting, and pivoting of the endoscope to navigate the restricted operative corridor. These continuous reorientations cause rapid, unpredictable changes in the angle and morphology of anatomical structures, increasing intra-class variance and making consistent segmentation highly challenging. Furthermore, this discrepancy is exacerbated by a substantial data imbalance, as the dataset contained significantly fewer uniportal frames. While the current dataset size was sufficient for the model to map the relatively stable anatomical features of biportal surgery, it was inadequate to fully capture the highly dynamic visual field unique to uniportal endoscopy. Therefore, we anticipate that disproportionately expanding the uniportal training data in future iterations will be essential to bridge this performance gap and overcome these approach-specific perceptual challenges.
Several limitations of this study should be acknowledged. First, model performance was primarily evaluated on a frame-by-frame basis, which may not fully reflect temporal perception in continuous intraoperative video. To partially address this limitation, supplementary videos are provided to qualitatively demonstrate temporal continuity and rapid processing capabilities of the proposed system. Second, this work focused on system-level feasibility and did not assess clinical outcomes such as complication rates or operative efficiency. Third, inference performance was dependent on hardware configuration; although real-time or near–real-time speed was achieved across multiple settings, reduced throughput may occur under constrained environments. Fourth, the dataset exhibited substantial imbalances in both class distribution and surgical approaches. There was a relative scarcity of visually ambiguous and intermittently exposed structures, such as vessels and discs. Moreover, the dataset contained significantly fewer uniportal cases than biportal cases, which limited the model’s ability to generalize to the dynamic rotational changes typical of uniportal surgery and contributed to the lower performance on anatomical structures in this setting. Fifth, although we employed a standardized model-assisted annotation workflow with expert refinement to minimize variability, formal inter-rater reliability metrics (eg, Cohen’s kappa) were not quantitatively evaluated due to the practical constraints of the large-scale dataset. Finally, the present study did not explicitly evaluate or optimize performance under blood-contaminated or degraded visualization, which remains a major limitation. Although our system’s high inference speed allows for rapid re-identification of structures once the visual field is cleared by irrigation, this does not fundamentally resolve the model’s vulnerability to active bleeding or severe visual artifacts. We acknowledge that segmentation accuracy for critical structures, particularly nerves and vessels, inevitably degrades under such conditions. Because a stratified analysis of these adverse environments and targeted algorithmic solutions are lacking in the current study, claims regarding the immediate clinical utility of this system must be appropriately tempered. Future studies should focus on constructing datasets with graded severity of bleeding and developing algorithms robust to these specific degradations.
Conclusions
In conclusion, this multicenter study explores the developmental direction and potential of real-time, multi-class instance segmentation in endoscopic spine surgery. The proposed model successfully demonstrated reliable object detection and high inference speeds, establishing a baseline for intraoperative AI assistance.
Our stratified analysis revealed that while biportal endoscopy yielded robust and precise segmentation performance, the highly dynamic visual environment of uniportal surgery posed challenges, primarily due to an insufficient volume of approach-specific training data. Furthermore, the current system exhibits inherent limitations in accurately segmenting anatomical structures under conditions of obscured visibility, such as active bleeding or severe visual artifacts.
Addressing these limitations will require further research and development to improve the model’s robustness, particularly in visually degraded environments. We suggest that prioritizing high precision to prevent surgeon distraction, supported by rapid inference to maintain temporal continuity, represents a practical and clinically sound direction for the future development of real-time surgical AI models.
Supplemental Material
Supplemental Material
Supplemental Material
Footnotes
ORCID iDs
Ethical Considerations
This study was approved by the Institutional Review Board of Pusan National University Hospital (IRB No. 2509-005-154).
Consent to Participate
The requirement for informed consent was waived due to the retrospective nature of the study.
Funding
The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: This work was supported by clinical research grant from Pusan National University Hospital in 2026.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
