Abstract
Background
Artificial intelligence (AI) enables hand motion tracking from standard surgical video recordings; however, translating these data into meaningful performance metrics remains challenging. We evaluated the preliminary validity of a markerless, AI-driven system that generates interpretable technical skill scores from an open-surgery task.
Methods
Sixteen medical students and one instructor performed a one-handed knot-tying task recorded with a smartphone camera while wearing motion sensors beneath surgical gloves. A deep learning algorithm tracking 21 hand joints mapped wrist trajectories and generated visualization boundaries from which kinematic parameters were derived and grouped into three domains scored from 0 to 10: economy of motion (EM), flow of motion (FM), and spatial organization (SO). AI metrics were validated against sensor-based data. Parameters and domain scores were correlated with expert-rated product quality (PQ) and technical performance (TP) using validated checklists.
Results
Nineteen performances were analyzed. AI metrics demonstrated strong correlations with sensor-based measures (r = 0.79-0.88, P < 0.01). EM metrics (path length, number of movements, task time) were associated with PQ and TP (r = 0.59-0.67, P < 0.01). Smoothness within the FM domain correlated with PQ and TP (r = 0.56-0.57, P < 0.01), while the composite FM score moderately correlated with TP (r = 0.44, P = 0.057). Working area within the SO domain demonstrated a moderate association with TP (r = 0.41, P = 0.08).
Conclusion
This prototype AI framework translated hand kinematics into interpretable, cohort-normalized domain-level scores that aligned with expert assessment. The findings support the feasibility of video-based kinematic scoring and provide preliminary evidence of construct validity. Further studies are warranted to determine reliability and generalizability.
Introduction
Surgical proficiency remains challenging to assess in a universally unbiased and practical manner. For decades, evaluation has relied primarily on direct observation by expert surgeons. While foundational and still essential, observational assessment is inherently subjective, episodic, and time intensive, and it may vary across evaluators and settings.1,2 To improve standardization, structured assessment tools such as the Objective Structured Assessment of Technical Skills (OSATS) and global rating scales were introduced. 3 These instruments enhanced reliability and reduced some observer bias, yet they remain largely retrospective and outcome focused. They grade performance quality but provide limited insight into the underlying motor processes that generate errors or inefficiencies.
In response, hand motion analysis (HMA) technologies have emerged as more objective approaches for quantifying technical skills. Wearable electromagnetic sensors, virtual reality platforms, and proprietary motion tracking systems have demonstrated the ability to distinguish expertise levels in simulated environments.4-6 Although these systems reduce subjectivity, they often require specialized hardware, controlled settings, or complex calibration, limiting scalability and real-time application.
More recently, artificial intelligence (AI) and computer vision methods have been applied to surgical skill assessment. 7 Several existing approaches rely on video-based tracking of surgical instruments to derive objective kinematic features, providing an indirect method for assessing hand dexterity during task performance.8,9 These systems include convolutional neural network models that leverage standard video recordings to track motion in 2D planes10-14 using parameters derived from velocity and path length calculations. However, the vast majority have been developed for minimally invasive procedures, where camera-fixed views and instrument visibility facilitate motion tracking.
In contrast, objective evaluation of open surgical hand motion remains underdeveloped. To date, a limited number of groups have developed markerless models in simulated scenarios,11,15,16 with few studies demonstrating feasibility in real-world open procedures.10,17-20 However, interpretation of hand-dexterity metrics remains limited, as many existing approaches emphasize performance classification or aggregate scoring rather than parameter-level interpretability. This has constrained progress in defining fundamental technical skills, particularly those involving fine motor control of the wrist and fingers, that could be integrated for correction and intuitive, actionable feedback prior to advancing to more complex techniques.
The present study evaluated the preliminary evidence of construct validity of a markerless, AI-driven computer vision HMA system that tracks wrist and finger kinematics, integrates them into defined technical skill domains, and generates interpretable performance scores from an open surgery simulated task with simplified motion visualizations. We hypothesized that AI-derived kinematic metrics would correlate with established measures of technical skill quantification.
Methods
A cross-sectional study was conducted in March–April 2023, including participants from Baylor College of Medicine (BCM, Houston, TX). During a 6-week Surgical Clerkship Boot Camp, trainees were invited to participate following completion of their weekly session. Interested individuals were enrolled after providing written informed consent under an Institutional Review Board–approved protocol (H-38994).
Study Design
A standardized one-handed knot-tying task consisting of 10 throws21-23 was used to evaluate system performance. The task was performed on a Knot Tying Board (Ethicon, NJ) mounted with a pre-tied 2.5-foot #2-0 silk suture (Ethicon, NJ), ensuring symmetrical suture ends. A smartphone (iPhone 11 or 13, Apple, CA) mounted on a tripod was positioned in front of the setup to record the task. No predefined camera angle or distance was required, provided both hands remained visible within the frame. To compare the AI-driven system with validated objective assessment tools,2,24 wearable motion sensors (Biostamp, MC10 Inc., MA) were affixed to the dorsum of each hand and covered with standard surgical gloves.
Participants received standardized instructions from the research team (A.Z., E.D.) and began the task upon verbal confirmation. After task completion, surgical gloves were removed and the sensors were placed on a docking platform that automatically uploaded accelerometer and gyroscope data to a secure cloud server for subsequent sensor-data analysis.
Hand Motion Analysis
Markerless hand motion tracking was performed using custom-developed software built upon the MediaPipe deep-learning framework,25 which automatically extracted 21 wrist and finger joint landmarks from 2D smartphone video recordings (Figure 1). A multilayer convolutional neural network predicted joint coordinates in each frame.26,27 For spatial scaling and calibration, the metacarpophalangeal and proximal interphalangeal joints of the index finger were used to estimate proximal phalanx length, selected for its anatomical consistency and reliable visual identification. A standardized reference length of 45 mm25,28 was applied to ensure consistent scaling across participants, independent of camera distance, angle, or lighting condition.

Figure 1. AI-based markerless computer vision system designed to map and analyze hand motion. (A) Twenty-one joint landmarks are detected using deep learning. (B) The calibration system detects the proximal phalanx of the index finger (landmarks 5 and 6) to standardize the tracking of parameters. (C) The wrist joint (landmark 0) is used to map left- and right-hand motion during the task. (D) A hand trajectory pattern from each hand is mapped upon completion of the task, from which working area (pink boundary) and kinematic metrics are derived.
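The calibration step described above can be illustrated with a minimal Python sketch. It assumes landmark coordinates are already available in pixels for a single frame; the function names (`mm_per_pixel`, `wrist_to_mm`) and the per-frame scaling strategy are hypothetical and are not taken from the study software.

```python
import math

REFERENCE_PHALANX_MM = 45.0  # reference proximal phalanx length stated in the text


def mm_per_pixel(mcp_xy, pip_xy, reference_mm=REFERENCE_PHALANX_MM):
    """Scale factor from the pixel distance between the index-finger MCP
    (landmark 5) and PIP (landmark 6) joints. Hypothetical helper."""
    pixel_len = math.dist(mcp_xy, pip_xy)
    if pixel_len == 0:
        raise ValueError("MCP and PIP landmarks coincide; cannot calibrate")
    return reference_mm / pixel_len


def wrist_to_mm(wrist_xy, scale):
    """Convert a wrist position (landmark 0) from pixels to millimetres."""
    return (wrist_xy[0] * scale, wrist_xy[1] * scale)


# Example frame: the phalanx spans 90 px, so one pixel corresponds to 0.5 mm.
scale = mm_per_pixel((100.0, 200.0), (100.0, 290.0))
wrist_mm = wrist_to_mm((640.0, 360.0), scale)
```

Because the scale is derived from an anatomical segment visible in the frame, the same trajectory recorded closer to or farther from the camera maps to comparable physical units, which is the property the calibration is intended to provide.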
Sensor-based motion signals (X, Y, Z axes) were processed in MATLAB (MathWorks, MA) following our prior methodology.2,24,29 Velocity, path length, and number of movements were derived from the three-dimensional positional sensor data. A dynamic velocity threshold (30th percentile of the velocity distribution) was applied to attenuate micro-movements and baseline oscillations. Adaptive thresholding approaches are commonly used in signal processing to distinguish meaningful motion from background noise.30 Sensor-derived data were analyzed solely for validation of the AI-driven metrics and were not integrated into the AI scoring model.
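As an illustration of percentile-based thresholding, the following Python sketch derives a threshold from the 30th percentile of a speed series and zeroes samples at or below it. The sensor processing in the study was done in MATLAB; this is not that code. Treating "attenuation" as zeroing, the strict comparison, and the helper names are all assumptions.

```python
def percentile(values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a non-empty sequence."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def suppress_micro_movements(speeds, q=30.0):
    """Return the data-driven threshold and a copy of the series in which
    samples at or below the threshold are set to zero (assumed behaviour)."""
    threshold = percentile(speeds, q)
    return threshold, [v if v > threshold else 0.0 for v in speeds]


# Hypothetical speed samples mixing baseline noise and purposeful motion.
speeds = [0.01, 0.02, 0.03, 0.20, 0.25, 0.02, 0.30, 0.01, 0.22, 0.28, 0.26]
threshold, cleaned = suppress_micro_movements(speeds)
```

A percentile-based threshold adapts to each recording's own velocity distribution, so the cut-off moves with the overall noise level of the signal rather than being fixed in advance.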
Computer Vision–Derived Hand Dexterity and Technical Proficiency Parameters
Following calibration, the wrist landmark was selected to compute the spatial trajectory of each hand across all video frames (Figure 1). From these trajectories, dexterity-related kinematic parameters were derived (Figure 2).

Figure 2. Graphical description of hand dexterity parameters derived from the hand trajectory pattern. 2D wrist landmark trajectories generate a hand trajectory pattern from which three skill domains are derived. Economy of Motion quantifies task efficiency using total path length, number of hand movements, and time of task. Flow of Motion evaluates dynamic motor control of the dominant hand using sub-threshold velocity time, smoothness (LDLJ), steadiness (SPARC), and speed consistency (CVspeed). Spatial Organization characterizes operative field utilization through working area and inter-hand spatial alignment (horizontal distance and vertical elevation).
Movement Efficiency
Path length was calculated as the cumulative distance traveled by the wrist landmark throughout the task. The total path length of the left and right hands was summed. Shorter cumulative path length indicated greater movement economy. The number of hand movements was determined using a velocity-based segmentation approach. Instantaneous velocity was computed based on frame rate and spatial displacement (m/s). A movement event was defined when velocity exceeded 0.05 m/s, consistent with threshold-based segmentation methods used to exclude tremor, fidgeting, and non-purposeful oscillations. 31 Fewer discrete movements indicated greater motor efficiency. Sub-threshold motion (<0.05 m/s) was quantified as total jerkiness time, 32 representing non-purposeful oscillatory activity. Lower jerkiness time corresponded to improved motor control.
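The three quantities described in this subsection can be sketched in Python as follows. The sketch assumes calibrated wrist positions in metres, one per frame, and treats a "movement event" as a contiguous run of frames above the 0.05 m/s threshold; that segmentation rule and the function name `kinematics` are assumptions, not the study's implementation.

```python
import math

SPEED_THRESHOLD_M_S = 0.05  # movement threshold stated in the text


def kinematics(points_m, fps, threshold=SPEED_THRESHOLD_M_S):
    """Return (path length in m, number of movements, sub-threshold time in s)
    for a list of (x, y) wrist positions in metres sampled at `fps`."""
    dt = 1.0 / fps
    steps = [math.dist(a, b) for a, b in zip(points_m, points_m[1:])]
    speeds = [d / dt for d in steps]

    path_length = sum(steps)

    # Count rising edges: each run of frames above threshold is one movement.
    movements = 0
    moving = False
    for v in speeds:
        if v > threshold and not moving:
            movements += 1
        moving = v > threshold

    # Time spent below the threshold ("total jerkiness time" in the text).
    sub_threshold_time = sum(dt for v in speeds if v < threshold)
    return path_length, movements, sub_threshold_time


# Hypothetical 10 fps trajectory: move, pause, move again.
points = [(0.0, 0.0), (0.0, 0.0), (0.01, 0.0), (0.02, 0.0),
          (0.02, 0.0), (0.02, 0.0), (0.03, 0.0)]
path_length, movements, sub_threshold_time = kinematics(points, fps=10)
```

In this example the hand travels 0.03 m in two separate movements and spends three frame intervals (0.3 s) below the threshold.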
Spatial Organization and Holistic Visualization of the Operative Field
Spatial utilization of the operative field was quantified in 2D to provide an interpretable representation of hand organization during task execution (Figure 2). For each hand, the spatial trajectory was computed across all video frames, and the centroid was defined as the mean X and Y coordinates of the trajectory. To characterize working area dispersion, a boundary encompassing 99.7% of positional samples (±3 standard deviations from the centroid) was constructed, forming an elliptical region representing each hand’s functional workspace.33 The enclosed area (cm²) reflected the spatial extent of motor activity during the task. The combined working areas of both hands quantified overall operative field utilization, where smaller, more compact regions indicated greater spatial efficiency and controlled motor execution.
Horizontal and vertical separation between hand centroids were computed to quantify relative bimanual alignment. Values closer to zero indicated tighter spatial coupling and coordinated hand positioning, while larger deviations reflected broader spatial distribution within the surgical field. Collectively, spatial parameters provided an intuitive, visual representation of spatial organization and bimanual coordination.
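A minimal Python sketch of the spatial parameters follows. It assumes the elliptical boundary is a covariance ellipse whose semi-axes are three standard deviations along the principal directions, and that inter-hand separation is reported as absolute centroid differences; both choices, and the function names, are assumptions made for illustration.

```python
import math


def centroid(points):
    """Mean X and Y coordinates of a trajectory."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def working_area(points, n_sd=3.0):
    """Area of the covariance ellipse with semi-axes n_sd * sqrt(eigenvalue).
    For a 2x2 covariance matrix this equals pi * n_sd^2 * sqrt(det)."""
    cx, cy = centroid(points)
    n = len(points)
    sxx = sum((p[0] - cx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - cy) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points) / (n - 1)
    det = max(sxx * syy - sxy ** 2, 0.0)  # guard against tiny negative rounding
    return math.pi * n_sd ** 2 * math.sqrt(det)


def hand_separation(left_points, right_points):
    """Absolute horizontal and vertical distance between hand centroids."""
    lx, ly = centroid(left_points)
    rx, ry = centroid(right_points)
    return abs(rx - lx), abs(ry - ly)


# Hypothetical trajectories in centimetres.
left = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
right = [(10.0, 1.0), (12.0, 1.0), (10.0, 3.0), (12.0, 3.0)]
area_left = working_area(left)
horizontal, vertical = hand_separation(left, right)
```

Using the determinant of the covariance matrix makes the area independent of how the ellipse is oriented in the image, which is convenient when camera angle varies between sessions.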
Motion Consistency, Smoothness, and Steadiness
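To characterize higher-order motor control features, established kinematic metrics were computed from the dominant hand trajectory, as this hand performed the primary manipulative knot-forming actions across the one-handed task. Motion consistency, which evaluates variability in movement pacing, was quantified using the coefficient of variation of speed (CVspeed), the ratio of the standard deviation to the mean of the speed profile:

$$\mathrm{CV}_{\mathrm{speed}} = \frac{\sigma_v}{\mu_v}$$

Motion smoothness was calculated using log dimensionless jerk (LDLJ), which quantifies fluctuations in acceleration and normalizes jerk relative to trial duration ($T$) and peak speed ($v_{\mathrm{peak}}$), given here in its standard form:

$$\mathrm{LDLJ} = -\ln\!\left(\frac{T^{3}}{v_{\mathrm{peak}}^{2}} \int_{0}^{T} \left|\frac{d^{2}v(t)}{dt^{2}}\right|^{2} dt\right)$$

Higher (less negative) LDLJ values indicated smoother motion.

Motion steadiness was assessed using spectral arc length (SPARC), which evaluates regularity of the speed profile in the frequency domain,36 distinguishing uninterrupted movements from fragmented or stop-and-go patterns. SPARC was computed as the negative arc length of the normalized Fourier magnitude spectrum up to cutoff frequency $\omega_c$, given here in its standard form:

$$\mathrm{SPARC} = -\int_{0}^{\omega_c} \sqrt{\left(\frac{1}{\omega_c}\right)^{2} + \left(\frac{d\hat{V}(\omega)}{d\omega}\right)^{2}}\, d\omega, \qquad \hat{V}(\omega) = \frac{V(\omega)}{V(0)}$$

Values closer to zero indicated smoother, more continuous movement.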
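The three dominant-hand control metrics can be sketched in plain Python as below. This is a simplified illustration based on the standard published definitions, not the study code: the 10 Hz cutoff, the number of frequency samples, the finite-difference jerk estimate, and the omission of SPARC's adaptive amplitude threshold are all assumptions.

```python
import cmath
import math


def cv_speed(speed):
    """Coefficient of variation of speed: standard deviation / mean."""
    mean = sum(speed) / len(speed)
    var = sum((v - mean) ** 2 for v in speed) / len(speed)
    return math.sqrt(var) / mean


def ldlj(speed, fs):
    """Log dimensionless jerk of a speed profile sampled at fs Hz."""
    dt = 1.0 / fs
    # Second time-derivative of speed via central differences.
    jerk = [(speed[i + 1] - 2 * speed[i] + speed[i - 1]) / dt ** 2
            for i in range(1, len(speed) - 1)]
    duration = (len(speed) - 1) * dt
    integral = sum(j * j for j in jerk) * dt
    return -math.log(duration ** 3 / max(speed) ** 2 * integral)


def sparc(speed, fs, fc=10.0, n_freq=200):
    """Negative arc length of the DC-normalised magnitude spectrum on [0, fc] Hz.
    fc and n_freq are illustrative values, not taken from the study."""
    freqs = [fc * k / (n_freq - 1) for k in range(n_freq)]
    mags = [abs(sum(v * cmath.exp(-2j * math.pi * f * n / fs)
                    for n, v in enumerate(speed))) for f in freqs]
    mags = [m / mags[0] for m in mags]  # normalise by the DC magnitude
    df = 1.0 / (n_freq - 1)             # frequency axis rescaled to [0, 1]
    return -sum(math.hypot(df, mags[k + 1] - mags[k])
                for k in range(n_freq - 1))


# Hypothetical 1 s speed profiles at 100 Hz: one bell-shaped, one with a
# superimposed 6 Hz oscillation mimicking stop-and-go corrections.
fs = 100.0
t = [i / fs for i in range(101)]
smooth = [math.sin(math.pi * x) ** 2 for x in t]
jerky = [s * (1.0 + 0.5 * math.sin(2 * math.pi * 6.0 * x))
         for s, x in zip(smooth, t)]
```

With these inputs, `ldlj(smooth, fs)` is less negative than `ldlj(jerky, fs)` and `sparc(smooth, fs)` is closer to zero than `sparc(jerky, fs)`, matching the interpretation that higher values of both metrics indicate smoother motion.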
Hand Dexterity Scoring
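To facilitate interpretability and domain integration, each kinematic parameter was normalized across participants using min–max scaling and converted to a standardized 10-point score, such that higher scores uniformly represented better performance. For parameters in which lower raw values indicate better performance, the scale was inverted:

$$\mathrm{Score}_i = 10 \times \frac{x_{\max} - x_i}{x_{\max} - x_{\min}}$$

Similarly, dominant-hand control metrics were normalized to preserve directional consistency. Motion consistency (CVspeed), where lower values indicate better performance, was normalized using the same inverted scaling. Conversely, smoothness (LDLJ) and steadiness (SPARC), where higher values indicate better performance, were normalized using standard min–max scaling:

$$\mathrm{Score}_i = 10 \times \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_i$ is the raw value for performance $i$, and $x_{\min}$ and $x_{\max}$ are the cohort minimum and maximum.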
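The directional min–max scoring can be illustrated with a short Python sketch. The function name `score_10` and the example values are hypothetical; the sketch only assumes that the cohort minimum and maximum anchor the 0 and 10 ends of the scale.

```python
def score_10(values, higher_is_better):
    """Min-max scale a cohort's raw values to a 0-10 score where 10 is best.

    When lower raw values are better (e.g. path length), the scale is
    inverted so that the smallest raw value receives a score of 10."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; score is undefined")
    if higher_is_better:
        return [10.0 * (v - lo) / (hi - lo) for v in values]
    return [10.0 * (hi - v) / (hi - lo) for v in values]


# Hypothetical cohort of three performances.
path_length_m = [12.0, 20.0, 16.0]   # lower is better
ldlj_values = [-6.0, -10.0, -7.0]    # higher (less negative) is better

path_scores = score_10(path_length_m, higher_is_better=False)
ldlj_scores = score_10(ldlj_values, higher_is_better=True)
```

Here the shortest path and the least negative LDLJ both score 10, and the worst value in each cohort scores 0, which illustrates why scores produced this way depend on who else is in the cohort.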
This normalization framework ensured comparability across heterogeneous kinematic measures and enabled aggregation into higher-order technical skill domains.
Technical Skills Determination and Scoring
Three fundamental technical skill domains35,37,38 were derived by integrating normalized hand dexterity and control parameter scores.
Economy of Motion Score (EMS)
EMS was calculated as the average of total path length, total number of hand movements, and time-of-task scores. This domain reflects the ability to execute the task efficiently by minimizing unnecessary movements, reducing travel distance, and completing the task within an optimal timeframe.
Spatial Organization Score (SOS)
SOS was calculated as the average of total working area and horizontal and vertical inter-hand distance scores. This domain represents the ability to utilize the operative field in a compact, coordinated manner, reflecting efficient spatial strategy and appropriate relative positioning of the hands throughout task execution.
Flow of Motion Score (FMS)
FMS was calculated as the average of total jerkiness time and dominant-hand motion consistency, smoothness, and steadiness scores. This domain captures the continuity, stability, and rhythmic coordination of movements over time, reflecting the degree to which motion is executed in a steady and uninterrupted manner by the dominant (active) hand during the task. 39
Together, these three domains translate individual kinematic measures into interpretable constructs representing efficiency, spatial organization, and dynamic motor control.
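The aggregation into domain scores amounts to an unweighted mean of the relevant parameter scores. The Python sketch below shows this under the groupings described above; the dictionary keys are hypothetical labels chosen for illustration.

```python
from statistics import mean

# Parameter groupings as described in the text; key names are illustrative.
DOMAINS = {
    "EMS": ("path_length", "n_movements", "task_time"),
    "SOS": ("working_area", "horizontal_distance", "vertical_distance"),
    "FMS": ("jerkiness_time", "consistency", "smoothness", "steadiness"),
}


def domain_scores(parameter_scores):
    """Average the 0-10 parameter scores belonging to each domain."""
    return {domain: mean(parameter_scores[k] for k in keys)
            for domain, keys in DOMAINS.items()}


# Hypothetical normalized scores for one performance.
example = {
    "path_length": 9.0, "n_movements": 8.0, "task_time": 10.0,
    "working_area": 6.0, "horizontal_distance": 7.0, "vertical_distance": 8.0,
    "jerkiness_time": 4.0, "consistency": 6.0,
    "smoothness": 8.0, "steadiness": 6.0,
}
scores = domain_scores(example)
```

Equal weighting keeps each domain score traceable to its components, so a low domain score can be followed back to the specific parameter responsible.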
Construct Validity Evaluation
Task performance was independently evaluated by a vascular surgeon and Associate Director of Surgical Simulation (N.H.), who was blinded to AI-derived kinematic outputs, domain-level scores, and sensor-based metrics. Videos were de-identified with coded participant IDs, with no indication of attempt order. Scoring was completed prior to review of AI processing using modified validated knot-tying assessment scales: (1) product quality 40 (1–18 points) and (2) technical performance 41 (0–10 points). Higher scores indicated superior performance (Supplemental Table 1).
To evaluate the accuracy of the AI-derived metrics obtained from smartphone video recordings, the estimated core kinematic parameters (velocity, path length, and number of hand movements for each hand) were systematically compared with corresponding metrics derived from hand-mounted sensors using previously validated algorithms.2,24,29 These kinematic measures are well-established descriptors of motor performance 42 and constituted the foundational parameters from which higher-order AI-based metrics were subsequently derived. Associations between technologies were assessed using Spearman correlation, while agreement was evaluated using Bland–Altman analysis to calculate systematic bias (mean difference, d) and 95% limits of agreement.
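The two agreement statistics named above can be sketched with the Python standard library. The study analyses were run in SPSS and Python; this is not that code. The difference direction (video minus sensor), the 1.96 multiplier for the 95% limits, and the function names are assumptions.

```python
import math
from statistics import mean, stdev


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    diffs = [p - q for p, q in zip(a, b)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias, bias - half_width, bias + half_width


# Hypothetical paired path-length estimates from video and sensors.
video = [10.0, 12.0, 15.0, 20.0, 18.0]
sensor = [11.0, 12.5, 14.0, 21.0, 17.0]
rho = spearman(video, sensor)
bias, lower, upper = bland_altman(video, sensor)
```

Correlation and agreement answer different questions: `rho` is 1.0 here because both methods order the performances identically, while the Bland–Altman limits describe how far individual paired values may differ in absolute terms.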
Discriminatory Power of Kinematic Parameters and Technical Skills
Spearman correlation analysis was performed to examine the relationship between individual kinematic parameters, composite technical skill domain scores (EMS, SOS, FMS), and expert-rated knot-tying performance measures (product quality and technical performance). Statistical significance was defined as P < 0.05, and effect size strength was categorized as low (r = 0.1-0.3), moderate (r = 0.3-0.5), and high (r > 0.5). 43 All statistical analyses were conducted using SPSS (IBM, Chicago, IL) and Python version 3.10 (Wilmington, DE).
Results
Participant Characteristics
Table 1. Participants’ Demographics and Expert Rating Scores
a, data not analyzed; F, female; M, male; NA, not applicable.
Data Collection and Agreement Between Technologies
Sensor-based data from 4 participants were lost due to signal interruption or noise and were therefore excluded from the analysis. This left 19 performances analyzed with both the computer-vision and sensor-based systems: 13 collected in a first session, 5 in a second session, and 1 in a third session. Data from the right and left hands were tracked separately in each performance, yielding n = 38 samples (hands). Sensor-based and computer-vision hand dexterity parameters were significantly correlated for velocity (r = 0.79, P < 0.01), path length (r = 0.82, P < 0.01), and number of hand movements (r = 0.88, P < 0.01). Bland–Altman analysis likewise showed agreement between technologies for velocity (d = −1.26; 95% limits of agreement, −49.32 to 46.8), path length (d = −0.36; 95% limits of agreement, −4.27 to 4.99), and number of hand movements (d = −0.16; 95% limits of agreement, −19.46 to 19.15).
Task Results
Data from all computer-vision samples (n = 38 hands) in 19 performances were used to calculate the EMS, SOS, and FMS. Individual participant data are shown in Supplemental Tables 2-4. Among first attempts, the instructor achieved the highest EMS (9.34), FMS (8.79), and SOS (9.72). MS008 obtained a higher EMS (9.56) and FMS (9.81), but only in their second attempt. Figure 3 illustrates the spatial organization parameter for selected participants. The median product quality score was 15 points (13-18), and the median technical performance score was 4 points (2-10) (Table 1).

Figure 3. Real-time task recordings demonstrating the AI-based markerless computer vision system mapping hand kinematics. Camera angles varied depending on the day of recording but did not impact the system’s ability to track motion data. Environmental factors such as lighting conditions and background objects did not interfere with parameter extraction. Because the system uses calibration, the hand trajectory pattern is scaled to real workspace occupation. This scaling permits comparison of parameters between subjects and calculation of scores based on the same performance. (A) Subject IN001-1; (B) MS013-1; (C) MS008-2. Expert-rated scores are shown in Table 1. AI-derived scores are shown in Supplemental Tables 2-4.
Discriminatory Power of Parameters and Technical Skills
Table 2. Correlation of Individual Motion Parameters With Expert-Rated Surgical Skill
Total refers to left- and right-hand data combined. Green: high correlation (r > 0.5); yellow: moderate correlation (r = 0.3-0.5); red: low correlation (r = 0.1-0.3).
When parameters were grouped into technical skill domains, EMS correlated with technical performance and product quality (r = 0.66 and r = 0.58, respectively), FMS correlated with technical performance (r = 0.44), and SOS did not correlate with either validated checklist (Figure 4).

Figure 4. Association of computer-vision technical skills with technical performance during a one-handed knot-tying task. Association between EMS, FMS, and SOS (range 1-10) and expert-rated scores (product quality, range 1-18; technical performance, range 0-10). EMS, economy of motion score; SOS, spatial organization score; FMS, flow of motion score.
Discussion
This study developed an AI-based markerless computer vision framework to quantify technical skills by integrating: (1) standard video recordings, (2) a calibrated deep learning algorithm tracking hand dexterity parameters, (3) grouping of these metrics into technical skill categories, and (4) a cohort-normalized prototype score derived from the performance data of participants executing an open surgery task. This work presents a prototype assessment framework for a single task (one-handed knot tying) rather than a validated objective grading system, because the scoring is normalized to the best- and worst-performing individuals within this cohort rather than to an externally anchored proficiency threshold or multi-expert consensus benchmark. We also note that agreement between AI- and sensor-derived kinematics supports concurrent validation of motion capture accuracy, whereas associations between domain-level scores and expert-rated product quality and technical performance provide only preliminary evidence of construct validity. This approach allowed us to explore, understand, and refine our system for future application at higher expertise levels and in more complex tasks. Establishing educational utility will require prospective studies demonstrating discrimination across expertise levels, responsiveness to training, and predictive validity.
Direct evaluation of hand motion is crucial, as precise movements are essential for developing surgical expertise. Hand dexterity serves as a key differentiator among proficient surgeons, minimizing errors and intraoperative complications while ensuring patient safety and optimal outcomes. 44 Achieving technical proficiency requires reducing unnecessary hand movements, maintaining bimanual hand dexterity, motion flow, and ensuring stable hand positioning within the operative field. 6 Our goal was to quantify these fundamental skills during a one-handed knot-tying task in an objective and realistic manner and determine their discriminatory power in distinguishing among varying levels of technical performance.
Aligned with our previous work2,24 and with others assessing knot-tying tasks via objective motion tools,21,22,45 the present study identified economy of motion, captured via computer vision, as a technical skill strongly associated with both product quality and technical performance. The fact that economy of motion correlated with both outcomes suggests that efficient and purposeful motion is not only perceived by human raters during knot-tying tasks but also reliably captured by algorithm-based assessment. Supporting this, Kasa et al23 used deep learning to integrate sensor-based kinematic data and product quality images from knot-tying tasks performed by 72 surgeons, demonstrating superior discrimination for economy of motion, product quality, and overall performance by their system compared with expert raters’ assessments. Although we did not combine kinematic and outcome data or compare assessment approaches, we observed that the instructor achieved the highest scores across all economy of motion components, including path length, distance, and time of task, and was above the 95th percentile of product quality and technical performance scores despite not being aware of the parameters under evaluation (Supplemental Table 2). This indicates that differences in technical experience are detectable by our system. Given its simplicity and speed compared with time-consuming checklist assessments, the proposed economy of motion metrics may be tested for external validation and may support future training tasks.
Flow of motion has been consistently described in the surgical skills literature as reflecting the continuity and regularity of instrument handling rather than isolated movement efficiency.46,47 Movement smoothness is characteristic of healthy, well-trained motor behavior and effort minimization.48 Prior research on surgical tasks has shown that smoothness, quantified by LDLJ, correlates with expert-based assessment49 and serves as an indicator of coordinated, higher-level performance, with experts exhibiting smoother motion and fewer abrupt corrections than novices.50 In the present study, motion smoothness was the most sensitive parameter within the flow of motion domain, as it strongly correlated with both product quality and technical performance scores (Table 2). Moreover, total jerkiness time, defined as inefficient movements below the active threshold (Figure 2), showed a moderate correlation with technical performance. This finding may reflect that unnecessary movements and increased inactive hand periods32,51,52 are readily perceived by human expert raters and similarly quantified by algorithm-based assessment. In contrast, motion steadiness (SPARC) and speed consistency (CVspeed) did not correlate with technical outcomes in this task. This could be explained by the reduced sensitivity of SPARC to brief tracking noise or small local variations in speed, as seen in a simple one-handed knot-tying task.53 Indeed, SPARC has been shown to distinguish surgical skill levels primarily among higher-skilled performers based on smoother continuous tool or hand motion,39,54-56 while CVspeed has demonstrated utility predominantly in more complex tasks, such as robotic surgery.34 Although only jerkiness time and smoothness correlated with outcomes, it remains an open question whether more complex tasks (eg, anastomosis) performed by more highly trained surgeons (eg, residents, attendings) would yield greater sensitivity for SPARC and CVspeed.
While efficiency and motion smoothness directly reflect technical performance and tissue handling,39,57 spatial utilization metrics could reflect how compactly surgeons operate within the surgical field. Maintaining a consistent distance between hands, allowing wrists to operate within a controlled range, may not only support surgical hygiene and spatial awareness but also promote efficient, anchored elbow positioning within shoulder boundaries.58,59 Although the task was a one-handed knot-tying task, these concepts were evident in the performance of the instructor, who consistently scored above the 95th percentile across all spatial organization parameters (Supplemental Table 3); however, this domain was not associated with either product quality or technical performance. This divergence may be explained by the weak correlation of the distance and elevation scores with expert ratings. Although the instructor scored the highest, this pattern did not hold across all participants. In a simple goal-directed knot-tying task, hand spacing can vary by individual strategy and ergonomics without necessarily affecting knot formation; thus, workspace-separation metrics may be less sensitive to skill differences than economy and flow of motion parameters.60,61 Nonetheless, hand working area showed a moderate correlation (r = 0.41, P = 0.08) with technical performance, suggesting that expert raters may value compact operative field use and controlled spatial execution.62,63 By translating wrist trajectory data into simplified geometric representations, our system enables a holistic visualization of operative field utilization. D’Angelo et al33 reported that working volume, represented as spheres, was inversely proportional to surgeon expertise during open tasks measured via optical motion tracking.
Similarly, robotic surgery studies using 3D sensor-based systems have shown that workspace-range measures (eg, path length tracing) distinguish surgeon skill levels by quantifying how broadly instruments occupy the operative field during simulation tasks and real-world procedures.42,60,63 This remains speculative in the context of a one-handed knot-tying task, and further validation using more complex surgical tasks is needed to clarify its role. Notably, despite excluding depth tracking (Z axis), our model showed a significant association between computer-vision-derived (2D) and sensor-based (3D) parameters, even across variations in camera angle and distance. This suggests that complex tracking systems may not be necessary for evaluating spatial organization, and simple video cameras (eg, a smartphone) could provide a feasible alternative for assessment.
The successful assessment of technical skills in our study (validated against wearable metrics) relies on accurately predicting hand motion patterns from video recordings despite confounding environmental factors. For instance, Nagaraj et al11 developed a deep-learning pass/fail system to detect critical errors in instrument handling and knot tying using 213 medical student video recordings. They used a convolutional neural network to measure object velocity and capture temporal features of sequential 2D images. This work represents a significant step toward human-machine scoring of open surgery tasks, facing only minor challenges with lighting, camera positioning, and background noise; however, the model did not incorporate glove use. Other approaches attempted to mitigate these gaps by using color-differentiated gloves,51,64 but at the expense of real-world applicability. Our model addresses these limitations through a calibration system capable of tracking hand motion independently of background, lighting, or camera placement, and it is compatible with standard surgical gloves (Figure 3). While this prototype currently uses the wrist joint for hand dexterity assessment, it is adaptable to other anatomical landmarks, including fingers, elbows, or broader body regions. Currently, the system has been tested only in simulated open surgery settings, and its accuracy in discriminating expertise remains untested, as comparisons between large cohorts of expert and novice groups were not conducted.
Aside from Nagaraj et al,11 few models have directly assessed hand motion through computer vision in a way that yields interpretable performance scores from standard 2D video recordings. Goodman et al10 developed a multitask neural network to quantify action recognition and hand/tool detection using large volumes of online real-world procedural videos. The model was trained to extract hand motion metrics (eg, translation, rotation) and task execution variables (eg, knot-tying, suturing, cutting) and was then validated on an independent set of surgical videos, demonstrating that more localized movements were correlated with surgical expertise. They also identified procedure-specific surgical signatures for 3 open procedures (appendectomy, pilonidal cystectomy, and thyroidectomy) and suggested that deviation from these patterns may represent altered flow. These findings offer strong features for the development and validation of future open surgery models with interpretable data, although the approach was not validated against expert surgical assessment tools. Similar to the present study, Thomas On et al65 tracked 21 hand landmarks in five neurosurgeons performing simulated microvascular anastomoses, calculating path length and velocity to assess economy and flow of motion patterns unique to each participant. Despite categorizing technical skills, they acknowledged that a scoring system is still needed to simplify the tracking output.
Azari et al developed a computer vision-based model using simulated anastomosis tasks, defining OSATS-aligned domains derived from quantitative motion metrics and trained to reproduce expert-rated performance scales. 66 A key distinction is that their algorithm predicts expert consensus scores, anchoring outputs to expert-defined performance. Our study’s expert ratings were used for validation only, and the composite score is internally normalized and performer-dependent, which may limit generalizability and lacks external anchoring. Importantly, Azari et al applied their model in the operating room across multiple surgical specialties, demonstrating feasibility of motion tracking and association with expert assessment in real-world open surgical procedures.18-20 Their work represents an early translation of motion tracking-based assessment from controlled environments to intraoperative application.
Building on this, our model converts tracked motion into interpretable scores across technical skill domains, offering a structured framework for evaluation and targeted improvement. While it maps kinematic data to literature-defined skill categories, it does not identify task execution variables or surgical mistakes. This preliminary study focused on verifying the model’s ability to track universal kinematic parameters and transform them into meaningful performance scores, rather than on pattern recognition. Further studies are necessary to determine whether these scores inform actionable feedback and improve surgical training outcomes.
Limitations
The main limitation is the small sample size, composed mostly of novice medical students. The cohort was heavily clustered at the novice end, with 16 third-year medical students and only a single expert comparator (1 instructor). Consequently, the observed correlations may partly reflect variability among novices performing a basic task rather than discrimination of expertise per se, and the present data do not allow firm conclusions about the framework’s ability to distinguish across the full spectrum of surgical proficiency. Confirmatory studies in larger, stratified cohorts spanning novice, intermediate, and expert levels are essential before the framework can be recommended for educational or credentialing use. Five of the 16 medical students repeated the task in two separate weekly sessions, and their repeated attempts were included in the correlation analyses without applying a repeated-measures or mixed-effects framework. This approach was adopted pragmatically to increase the effective sample size given the small cohort. We acknowledge that this does not fully account for within-subject dependency; however, inter-session variability in camera setup and the learning effect associated with novice participants performing the task for the first time may have rendered each attempt sufficiently distinct to partially mitigate this concern. Nonetheless, this remains a methodological limitation, and future work with larger cohorts will employ a formal repeated-measures or mixed-effects framework to ensure the validity of statistical inference. Blinding of the expert rater was implemented with de-identified videos, masked participant IDs, and scoring performed prior to sharing AI outputs; however, because videos were reviewed sequentially in a small cohort, we cannot fully exclude the possibility that familiarity with an individual participant’s hand appearance or performance style influenced scoring. 
Future studies will incorporate randomized video review order and independent dual-rater scoring to further mitigate this risk. The cohort-normalized nature of the present scoring means that scores are dependent on the sample; anchoring future versions to predefined proficiency thresholds or multi-expert consensus benchmarks will be required to establish a truly externally valid grading system. Another limitation is the pre-trained model employed; training a full model requires larger amounts of data. The model includes data from 2 axes (X and Y vectors), excluding depth data (Z vector). Feedback was not provided to the participants. The grading scale and automation are limited to a single task and a single landmark (wrist joint). The videos included in this model are short (<2 min). This model does not detect surgical mistakes or integrate product quality data in the scoring system; it is designed as a tool for assessment of universal parameters that can be applied in different tasks for either open or minimally invasive surgery. This model was designed to continuously refine a grading system that dynamically adjusts based on participant performance. Because our algorithm’s scoring currently relies on comparisons with the best-performing individual, future assessments should incorporate multiple experts to establish a robust performance benchmark.
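The sample dependence described above can be illustrated with a minimal sketch. It contrasts a score normalized to the best performer in the cohort with a score anchored to fixed reference values. The formulas, function names, reference thresholds, and path-length values are hypothetical assumptions chosen for illustration; they are not the scoring equations used in this study.

```python
def cohort_normalized_scores(values, scale=10.0):
    """Score lower-is-better values relative to the cohort's best
    (smallest) value. Assumes all values are positive."""
    best = min(values)
    return [scale * best / v for v in values]

def anchored_score(value, expert_ref, novice_ref, scale=10.0):
    """Score against fixed references: `expert_ref` maps to `scale`,
    `novice_ref` maps to 0, and results are clamped to [0, scale]."""
    fraction = (novice_ref - value) / (novice_ref - expert_ref)
    return scale * min(1.0, max(0.0, fraction))

# Hypothetical path lengths for three participants (arbitrary units).
path_lengths = [200.0, 250.0, 400.0]

# Cohort-normalized: the best performer defines the top of the scale.
cohort_a = cohort_normalized_scores(path_lengths)

# Adding one stronger performer lowers every existing participant's score.
cohort_b = cohort_normalized_scores(path_lengths + [100.0])

# Anchored: scores depend only on the fixed (hypothetical) references.
anchored = [anchored_score(v, expert_ref=100.0, novice_ref=500.0)
            for v in path_lengths]
```

In this sketch, the first participant scores 10 in the original cohort but 5 once a stronger performer is added, whereas the anchored scores are unaffected by cohort composition. This is the motivation for anchoring future versions to predefined thresholds or multi-expert benchmarks.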
Conclusion
The AI-based markerless computer vision framework developed in this study demonstrated the ability to extract kinematic parameters and generate interpretable, cohort-normalized domain-level scores based on performance, without relying on external tracking devices or standardized checklists. By capturing genuine hand movements, this approach enables assessments to be conducted more naturally and efficiently. Its key innovation lies in breaking down technical skills into distinct categories, allowing for precise identification of specific performance gaps and enabling focused, efficient training. In the context of one-handed knot-tying, economy of motion emerged as the most relevant skill, strongly correlating with both expert-rated product quality and technical performance scores. Flow of motion and spatial organization domains showed weaker correlation with expert assessments, although some important trends were seen when dichotomizing parameters. We acknowledge that the sample size is small and that the results will need to be confirmed in a larger, expertise-stratified cohort before firm conclusions can be drawn. Despite this constraint, we hope that the innovation demonstrated here helps advance the field and paves the way for future confirmatory studies. Future studies enrolling surgeons with different expertise levels and performing more complex tasks are warranted to confirm whether any of the assessed technical skills should be given more weight or be excluded from the model. Planned extensions include recruitment of a larger multi-expertise cohort, incorporation of real-time feedback, empirical reweighting or recalibration of the spatial organization domain, and deployment on platform-agnostic applications (including macOS) that require neither wearable sensors nor specialized hardware for human–computer interaction. 
This study contributes to the advancement of standardized grading systems by introducing components that offer a consistent framework for assessing technical proficiency, with potential for broader application in more advanced surgical tasks.
Supplemental Material
Supplemental Material for AI-Based Markerless Computer Vision Framework for Open Surgery Skill Assessment: A Prototype Assessment Framework by Alejandro Zulbaran-Rojas, MD, Mohammad Dehghan Rouzi, MSc, Natasha Hansraj, MD, Derek Erstad, MD, Miguel Bargas-Ochoa, MD, Ethan D’Silva, MD, Randall Parker Kirby, MD, Nilson Salas, MD, Yesenia Rojas, MD, Bijan Najafi, PhD in Surgical Innovation
Footnotes
Author contributions
Conceptualization: BN, AZ, MR.
Data collection: AZ, ED, RK.
Data curation: MR, AZ.
Data visualization: MR.
Clinical design of model: AZ.
Developing deep learning model and codes: MR.
Formal analysis: MR, AZ, MB.
Writing – original draft: AZ, MR.
Writing – review and editing: BN, MB, NH, DE, YR, NS.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The dataset of this study is not publicly available due to the confidentiality of the participants. However, it may be available on request to the senior author (B.N.).
Supplemental Material
Supplemental material for this article is available online.
References
