AI-Based Markerless Computer Vision Framework for Open Surgery Skill Assessment: A Prototype Assessment Framework

Abstract

Background

Artificial intelligence (AI) enables hand motion tracking from standard surgical video recordings; however, translating these data into meaningful performance metrics remains challenging. We evaluated the preliminary validity of a markerless, AI-driven system that generates interpretable technical skill scores from an open-surgery task.

Methods

Sixteen medical students and one instructor performed a one-handed knot-tying task recorded with a smartphone camera while wearing motion sensors beneath surgical gloves. A deep learning algorithm tracking 21 hand joints mapped wrist trajectories and generated visualization boundaries from which kinematic parameters were derived and grouped into three domains scored from 0 to 10: economy of motion (EM), flow of motion (FM), and spatial organization (SO). AI metrics were validated against sensor-based data. Parameters and domain scores were correlated with expert-rated product quality (PQ) and technical performance (TP) using validated checklists.

Results

Nineteen performances were analyzed. AI metrics demonstrated strong correlations with sensor-based measures (r = 0.79-0.88, P < 0.01). EM metrics (path length, number of movements, task time) were associated with PQ and TP (r = 0.59-0.67, P < 0.01). Smoothness within the FM domain correlated with PQ and TP (r = 0.56-0.57, P < 0.01), while the composite FM score moderately correlated with TP (r = 0.44, P = 0.057). Working area within the SO domain demonstrated a moderate association with TP (r = 0.41, P = 0.08).

Conclusion

This prototype AI framework translated hand kinematics into interpretable, cohort-normalized domain-level scores that aligned with expert assessment. The findings support the feasibility of video-based kinematic scoring and provide preliminary evidence of construct validity. Further studies are warranted to determine reliability and generalizability.

Keywords

surgical education, surgical skill assessment, artificial intelligence, computer vision, hand kinematics, markerless motion capture, deep learning, motion visualization, interpretable performance metrics, human motion analysis

Introduction

Surgical proficiency remains challenging to assess in a universally unbiased and practical manner. For decades, evaluation has relied primarily on direct observation by expert surgeons. While foundational and still essential, observational assessment is inherently subjective, episodic, and time-intensive, and it may vary across evaluators and settings.^1,2 To improve standardization, structured assessment tools such as the Objective Structured Assessment of Technical Skills (OSATS) and global rating scales were introduced.³ These instruments enhanced reliability and reduced some observer bias, yet they remain largely retrospective and outcome-focused. They grade performance quality but provide limited insight into the underlying motor processes that generate errors or inefficiencies.

In response, hand motion analysis (HMA) technologies have emerged as more objective approaches for quantifying technical skills. Wearable electromagnetic sensors, virtual reality platforms, and proprietary motion tracking systems have demonstrated the ability to distinguish expertise levels in simulated environments.^4-6 Although these systems reduce subjectivity, they often require specialized hardware, controlled settings, or complex calibration, limiting scalability and real-time application.

More recently, artificial intelligence (AI) and computer vision methods have been applied to surgical skill assessment.⁷ Several existing approaches rely on video-based tracking of surgical instruments to derive objective kinematic features, providing an indirect method for assessing hand dexterity during task performance.^8,9 These systems include convolutional neural network models that leverage standard video recordings to track motion in 2D planes^10-14 using parameters derived from velocity and path length calculations. However, the vast majority have been developed for minimally invasive procedures, where camera-fixed views and instrument visibility facilitate motion tracking.

In contrast, objective evaluation of open surgical hand motion remains underdeveloped. To date, a limited number of groups have developed markerless models in simulated scenarios,^11,15,16 with few studies demonstrating feasibility in real-world open procedures.^10,17-20 However, interpretation of hand-dexterity metrics remains limited, as many existing approaches emphasize performance classification or aggregate scoring rather than parameter-level interpretability. This has constrained progress in defining fundamental technical skills, particularly those involving fine motor control of the wrist and fingers, that could be targeted for correction and intuitive, actionable feedback before trainees advance to more complex techniques.

The present study evaluated the preliminary evidence of construct validity of a markerless, AI-driven computer vision HMA system that tracks wrist and finger kinematics, integrates them into defined technical skill domains, and generates interpretable performance scores from an open surgery simulated task with simplified motion visualizations. We hypothesized that AI-derived kinematic metrics would correlate with established measures of technical skill quantification.

Methods

A cross-sectional study was conducted in March–April 2023, including participants from Baylor College of Medicine (BCM, Houston, TX). During a 6-week Surgical Clerkship Boot Camp, trainees were invited to participate following completion of their weekly session. Interested individuals were enrolled after providing written informed consent under an Institutional Review Board–approved protocol (H-38994).

Study Design

A standardized one-handed knot-tying task consisting of 10 throws^21-23 was used to evaluate system performance. The task was performed on a Knot Tying Board (Ethicon, NJ) mounted with a pre-tied 2.5-foot #2-0 silk suture (Ethicon, NJ), ensuring symmetrical suture ends. A smartphone (iPhone 11 or 13, Apple, CA) mounted on a tripod was positioned in front of the setup to record the task. No predefined camera angle or distance was required, provided both hands remained visible within the frame. To compare the AI-driven system with validated objective assessment tools,^2,24 wearable motion sensors (Biostamp, MC10 Inc., MA) were affixed to the dorsum of each hand and covered with standard surgical gloves.

Participants received standardized instructions from the research team (A.Z., E.D.) and began the task upon verbal confirmation. After task completion, surgical gloves were removed and the sensors were placed on a docking platform that automatically uploaded accelerometer and gyroscope data to a secure cloud server for subsequent sensor-data analysis.

Hand Motion Analysis

Markerless hand motion tracking was performed using custom-developed software built upon the MediaPipe deep-learning framework,²⁵ which automatically extracted 21 wrist and finger joint landmarks from 2D smartphone video recordings (Figure 1). A multi-layer convolutional neural network predicted joint coordinates in each frame.^26,27 For spatial scaling and calibration, the metacarpophalangeal and proximal interphalangeal joints of the index finger were used to estimate proximal phalanx length, selected for its anatomical consistency and reliable visual identification. A standardized reference length of 45 mm^25,28 was applied to ensure consistent scaling across participants, independent of camera distance, angle, or lighting condition.
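The scaling step can be illustrated with a minimal sketch. It assumes the landmark coordinates have already been extracted as pixel (x, y) pairs per frame. The function names and the choice of a single per-clip scale taken from the median phalanx length are illustrative assumptions, not the study software.

```python
import math
from statistics import median

REFERENCE_PHALANX_MM = 45.0  # reference proximal-phalanx length stated in the text


def phalanx_pixel_length(mcp_xy, pip_xy):
    """Pixel distance between the index-finger MCP (landmark 5) and PIP (landmark 6)."""
    return math.dist(mcp_xy, pip_xy)


def mm_per_pixel(mcp_track, pip_track, reference_mm=REFERENCE_PHALANX_MM):
    """Scale factor for one clip.

    Assumption: one scale per clip, taken from the median phalanx
    length over all frames to damp frame-to-frame landmark jitter.
    """
    lengths = [phalanx_pixel_length(m, p) for m, p in zip(mcp_track, pip_track)]
    typical = median(lengths)
    if typical <= 0:
        raise ValueError("degenerate phalanx length; cannot calibrate")
    return reference_mm / typical


def to_millimetres(wrist_track, scale):
    """Convert a wrist trajectory (landmark 0) from pixels to millimetres."""
    return [(x * scale, y * scale) for x, y in wrist_track]
```

For example, a phalanx spanning 90 pixels yields a scale of 0.5 mm per pixel, which is then applied to every wrist coordinate.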

Figure 1.

AI-based markerless computer vision system designed to map and analyze hand motion. (A) 21 joint landmarks are detected using deep learning. (B) The calibration system detects the proximal phalanx of the index finger (landmarks 5 and 6) to standardize the tracking of parameters. (C) The wrist joint (landmark 0) is used to map left- and right-hand motion during the task. (D) A hand trajectory pattern for each hand is mapped upon completion of the task, from which the working area (pink boundary) and kinematic metrics are derived.

Sensor-based motion signals (X, Y, Z axes) were processed in MATLAB (MathWorks, MA) following our prior methodology.^2,24,29 Velocity, path length, and number of movements were derived from the three-dimensional positional sensor data. A dynamic velocity threshold (30th percentile of the velocity distribution) was applied to attenuate micro-movements and baseline oscillations. Adaptive thresholding approaches are commonly used in signal processing to distinguish meaningful motion from background noise.³⁰ Sensor-derived data were analyzed solely for validation of the AI-driven metrics and were not integrated into the AI scoring model.
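The percentile-based threshold can be sketched as follows. The study's sensor pipeline was implemented in MATLAB, so this Python version is only an illustration. Setting sub-threshold samples to zero is an assumed form of "attenuation", and the function names are hypothetical.

```python
from statistics import quantiles


def percentile_threshold(speeds, percentile=30):
    """Return the given percentile (1..99) of a speed series.

    quantiles(..., n=100, method="inclusive") yields the 1st..99th
    percentiles using linear interpolation between data points.
    """
    cuts = quantiles(speeds, n=100, method="inclusive")
    return cuts[percentile - 1]


def suppress_micro_movements(speeds, percentile=30):
    """Zero out samples below the per-recording percentile threshold (assumed rule)."""
    threshold = percentile_threshold(speeds, percentile)
    filtered = [s if s >= threshold else 0.0 for s in speeds]
    return filtered, threshold
```

Because the threshold is computed from each recording's own velocity distribution, it adapts to the overall pace of the performer rather than relying on a fixed cutoff.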

Computer Vision–Derived Hand Dexterity and Technical Proficiency Parameters

Following calibration, the wrist landmark was selected to compute the spatial trajectory of each hand across all video frames (Figure 1). From these trajectories, dexterity-related kinematic parameters were derived (Figure 2).

Figure 2.

Graphical description of hand dexterity parameters derived from the hand trajectory pattern. 2D wrist landmark trajectories generate a hand trajectory pattern from which three skill domains are derived. Economy of Motion quantifies task efficiency using total path length, number of hand movements, and time of task. Flow of Motion evaluates dynamic motor control of the dominant hand using sub-threshold velocity time, smoothness (LDLJ), steadiness (SPARC), and speed consistency (CVspeed). Spatial Organization characterizes operative field utilization through working area and inter-hand spatial alignment (horizontal distance and vertical elevation)

Movement Efficiency

Path length was calculated as the cumulative distance traveled by the wrist landmark throughout the task. The total path length of the left and right hands was summed. Shorter cumulative path length indicated greater movement economy. The number of hand movements was determined using a velocity-based segmentation approach. Instantaneous velocity was computed based on frame rate and spatial displacement (m/s). A movement event was defined when velocity exceeded 0.05 m/s, consistent with threshold-based segmentation methods used to exclude tremor, fidgeting, and non-purposeful oscillations.³¹ Fewer discrete movements indicated greater motor efficiency. Sub-threshold motion (<0.05 m/s) was quantified as total jerkiness time,³² representing non-purposeful oscillatory activity. Lower jerkiness time corresponded to improved motor control.
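The economy-of-motion quantities described above can be sketched in a few lines. The sketch assumes a calibrated wrist trajectory in metres sampled at a constant frame rate. Treating each uninterrupted run of supra-threshold frames as one movement is an assumption about the segmentation rule, and the function and key names are illustrative.

```python
import math

SPEED_THRESHOLD_M_S = 0.05  # movement threshold stated in the text


def economy_metrics(track_m, fps, threshold=SPEED_THRESHOLD_M_S):
    """Path length, movement count, sub-threshold time, and task time for one hand."""
    # Frame-to-frame displacement (m) and instantaneous speed (m/s).
    steps = [math.dist(a, b) for a, b in zip(track_m, track_m[1:])]
    speeds = [s * fps for s in steps]
    active = [v > threshold for v in speeds]

    # Assumption: a movement starts wherever speed crosses above the threshold.
    n_movements = sum(
        1 for i, on in enumerate(active) if on and (i == 0 or not active[i - 1])
    )
    return {
        "path_length_m": sum(steps),
        "n_movements": n_movements,
        "jerkiness_time_s": sum(1 for on in active if not on) / fps,
        "task_time_s": len(steps) / fps,
    }
```

Per the text, left- and right-hand path lengths would then be summed to obtain the total path length.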

Spatial Organization and Holistic Visualization of the Operative Field

Spatial utilization of the operative field was quantified in 2D to provide an interpretable representation of hand organization during task execution (Figure 2). For each hand, the spatial trajectory was computed across all video frames, and the centroid was defined as the mean X and Y coordinates of the trajectory. To characterize working area dispersion, a boundary encompassing 99.7% of positional samples (±3 standard deviations from the centroid) was constructed, forming an elliptical region representing each hand’s functional workspace.³³ The enclosed area (cm²) reflected the spatial extent of motor activity during the task. The combined working areas of both hands quantified overall operative field utilization, where smaller, more compact regions indicated greater spatial efficiency and controlled motor execution.

Horizontal and vertical separation between hand centroids were computed to quantify relative bimanual alignment. Values closer to zero indicated tighter spatial coupling and coordinated hand positioning, while larger deviations reflected broader spatial distribution within the surgical field. Collectively, spatial parameters provided an intuitive, visual representation of spatial organization and bimanual coordination.
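The spatial parameters can be sketched as follows. The sketch assumes an axis-aligned ellipse whose semi-axes equal three standard deviations of the X and Y coordinates, and it reports inter-hand separation as absolute centroid differences. Both choices are illustrative assumptions, since the text does not specify ellipse orientation or sign convention.

```python
import math
from statistics import mean, stdev


def centroid(track):
    """Mean X and Y coordinates of a wrist trajectory."""
    return mean(p[0] for p in track), mean(p[1] for p in track)


def working_area(track, n_sd=3.0):
    """Area of an axis-aligned ellipse with semi-axes n_sd * SD in X and Y.

    The result is in squared units of the input (cm^2 for a track in cm).
    """
    sd_x = stdev(p[0] for p in track)
    sd_y = stdev(p[1] for p in track)
    return math.pi * (n_sd * sd_x) * (n_sd * sd_y)


def centroid_separation(left_track, right_track):
    """Absolute horizontal and vertical distance between hand centroids."""
    (lx, ly), (rx, ry) = centroid(left_track), centroid(right_track)
    return abs(rx - lx), abs(ry - ly)
```

The combined working area is the sum of `working_area` over both hands, and the two separation values correspond to the horizontal distance and vertical elevation parameters.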

Motion Consistency, Smoothness, and Steadiness

To characterize higher-order motor control features, established kinematic metrics were computed from the dominant hand trajectory, as this hand performed the primary manipulative knot-forming actions across the one-handed task. Motion consistency, which evaluates variability in movement pacing, was quantified using the coefficient of variation of speed ($CV_{\mathrm{speed}}$), calculated as:

$$CV_{\mathrm{speed}} = \frac{\sigma_{v}}{\mu_{v}}$$

where $\sigma_{v}$ represents the standard deviation and $\mu_{v}$ the mean of instantaneous speed. Lower values indicated more consistent pacing.³⁴
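A minimal sketch of this ratio is given below. It assumes speed is obtained from frame-to-frame wrist displacement and uses the population standard deviation, which the text does not specify. The function name is illustrative.

```python
import math
from statistics import mean, pstdev


def cv_speed(track, fps):
    """Coefficient of variation of instantaneous speed for one trajectory."""
    speeds = [math.dist(a, b) * fps for a, b in zip(track, track[1:])]
    mu = mean(speeds)
    if mu == 0:
        raise ValueError("hand never moved; CV of speed is undefined")
    # Assumption: population SD (pstdev); a sample SD would differ slightly.
    return pstdev(speeds) / mu
```

A hand moving at constant speed gives a value of 0, and alternating speeds of 1 and 3 units/s give 0.5.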

Motion smoothness was quantified using the log dimensionless jerk (LDLJ), which captures fluctuations in acceleration by normalizing the integrated squared magnitude of the jerk vector ($J$) by trial duration ($T$) and path length ($L$)³⁵:

$$\mathrm{LDLJ} = -\ln\left(\frac{T^{5}}{L^{2}} \int_{0}^{T} |J(t)|^{2}\, dt\right)$$

Higher (less negative) LDLJ values indicated smoother motion.
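The LDLJ expression can be approximated numerically as sketched below. Jerk is estimated with a third-order forward difference of position and the integral with a rectangle rule. Both are assumed discretization choices, and no filtering of the trajectory is applied here.

```python
import math


def ldlj(track, fps):
    """Log dimensionless jerk of a 2D trajectory sampled at a constant rate."""
    dt = 1.0 / fps
    duration = (len(track) - 1) * dt                                      # T
    path_length = sum(math.dist(a, b) for a, b in zip(track, track[1:]))  # L

    # Third forward difference of position approximates the jerk vector J(t).
    jerk_sq = []
    for p0, p1, p2, p3 in zip(track, track[1:], track[2:], track[3:]):
        jx = (p3[0] - 3 * p2[0] + 3 * p1[0] - p0[0]) / dt**3
        jy = (p3[1] - 3 * p2[1] + 3 * p1[1] - p0[1]) / dt**3
        jerk_sq.append(jx * jx + jy * jy)

    integral = sum(jerk_sq) * dt  # rectangle-rule estimate of the jerk integral
    if integral == 0 or path_length == 0:
        raise ValueError("LDLJ is undefined for zero jerk or zero path length")
    return -math.log(duration**5 / path_length**2 * integral)
```

Because the measure is dimensionless, rescaling the trajectory in space or time leaves the value unchanged, whereas adding oscillation to an otherwise smooth reach makes it more negative.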

Motion steadiness was assessed using spectral arc length (SPARC), which evaluates regularity of the speed profile in the frequency domain,³⁶ distinguishing uninterrupted movements from fragmented or stop-and-go patterns. SPARC was computed as the negative arc length of the normalized Fourier magnitude spectrum, $\hat{V}(\omega)$, up to cutoff frequency $\omega_{c}$:

$$\mathrm{SPARC} = -\int_{0}^{\omega_{c}} \sqrt{\left(\frac{1}{\omega_{c}}\right)^{2} + \left(\frac{d\hat{V}(\omega)}{d\omega}\right)^{2}}\, d\omega$$

Values closer to zero indicated smoother, more continuous movement.
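The arc-length integral can be discretized as in the sketch below. It evaluates the Fourier magnitude directly on a uniform frequency grid and normalizes by the zero-frequency magnitude. The 10 Hz cutoff and the grid size are assumptions, since the text does not state the cutoff value. The cutoff should not exceed half the frame rate.

```python
import cmath
import math


def sparc(speeds, fps, cutoff_hz=10.0, n_freqs=200):
    """Spectral arc length of a speed profile (values <= -1; closer to 0 is smoother)."""

    def magnitude(freq_hz):
        # Direct evaluation of the discrete-time Fourier transform at one frequency.
        return abs(sum(
            v * cmath.exp(-2j * math.pi * freq_hz * n / fps)
            for n, v in enumerate(speeds)
        ))

    dc = magnitude(0.0)
    if dc == 0:
        raise ValueError("speed profile is all zeros")

    freqs = [cutoff_hz * k / (n_freqs - 1) for k in range(n_freqs)]
    spectrum = [magnitude(f) / dc for f in freqs]  # normalized magnitude spectrum

    # Each segment contributes sqrt((d_omega / omega_c)^2 + (d_V)^2).
    d_omega = 1.0 / (n_freqs - 1)
    return -sum(math.hypot(d_omega, b - a) for a, b in zip(spectrum, spectrum[1:]))
```

A single bell-shaped speed profile produces a short spectral arc, while a stop-and-go profile adds higher-frequency content that lengthens the arc and yields a more negative value.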

Hand Dexterity Scoring

To facilitate interpretability and domain integration, each kinematic parameter was normalized across participants using min–max scaling and converted to a standardized 10-point score. Higher scores uniformly represented better performance:

$$\text{Hand dexterity parameter (HDP) score} = 10 \times \left(1 - \frac{\mathrm{HDP}_{i} - \mathrm{HDP}_{\min}}{\mathrm{HDP}_{\max} - \mathrm{HDP}_{\min}}\right)$$

where HDP represents total path length, number of movements, time of task, jerkiness time, total working area, or inter-hand distances. In this equation, $\mathrm{HDP}_{i}$ denotes the participant-specific absolute value, while $\mathrm{HDP}_{\min}$ and $\mathrm{HDP}_{\max}$ represent the minimum and maximum values observed across the cohort. The term “1 −” ensures that parameters in which lower absolute values indicate superior performance are appropriately scaled so that higher normalized scores reflect better proficiency.

Similarly, dominant-hand control metrics were normalized to preserve directional consistency. Motion consistency ($CV_{\mathrm{speed}}$), where lower values indicate better performance, was normalized using an inverse min–max approach:

$$CV_{\mathrm{speed}}\ \text{score} = 10 \times \left(\frac{x_{\max} - x_{i}}{x_{\max} - x_{\min}}\right)$$

Conversely, smoothness (LDLJ) and steadiness (SPARC), where higher values indicate better performance, were normalized using standard min–max scaling:

$$\text{SPARC or LDLJ score} = 10 \times \left(\frac{x_{i} - x_{\min}}{x_{\max} - x_{\min}}\right)$$

This normalization framework ensured comparability across heterogeneous kinematic measures and enabled aggregation into higher-order technical skill domains.
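The two scaling directions can be sketched as below. Note that $10 \times (1 - \frac{x_i - x_{\min}}{x_{\max} - x_{\min}})$ is algebraically identical to $10 \times \frac{x_{\max} - x_i}{x_{\max} - x_{\min}}$, so a single "lower is better" function covers both the HDP and $CV_{\mathrm{speed}}$ scores. Function names are illustrative, and the guard against a zero range is an added assumption.

```python
def score_lower_is_better(values):
    """Cohort min-max score where the smallest raw value earns 10 and the largest 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all participants share the same value; score undefined")
    return [10.0 * (hi - v) / (hi - lo) for v in values]


def score_higher_is_better(values):
    """Cohort min-max score where the largest raw value earns 10 and the smallest 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all participants share the same value; score undefined")
    return [10.0 * (v - lo) / (hi - lo) for v in values]
```

For raw values of 2, 4, and 6, the first function returns 10, 5, and 0, and the second returns 0, 5, and 10. Because the minimum and maximum come from the cohort, adding or removing a participant can change every score.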

Technical Skills Determination and Scoring

Three fundamental technical skill domains^35,37,38 were derived by integrating normalized hand dexterity and control parameter scores.

Economy of Motion Score (EMS)

EMS was calculated as the average of total path length, total number of hand movements, and time-of-task scores. This domain reflects the ability to execute the task efficiently by minimizing unnecessary movements, reducing travel distance, and completing the task within an optimal timeframe.

Spatial Organization Score (SOS)

SOS was calculated as the average of total working area and horizontal and vertical inter-hand distance scores. This domain represents the ability to utilize the operative field in a compact, coordinated manner, reflecting efficient spatial strategy and appropriate relative positioning of the hands throughout task execution.

Flow of Motion Score (FMS)

FMS was calculated as the average of total jerkiness time and dominant-hand motion consistency, smoothness, and steadiness scores. This domain captures the continuity, stability, and rhythmic coordination of movements over time, reflecting the degree to which motion is executed in a steady and uninterrupted manner by the dominant (active) hand during the task.³⁹

Together, these three domains translate individual kinematic measures into interpretable constructs representing efficiency, spatial organization, and dynamic motor control.
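The aggregation into domains amounts to an unweighted mean of the normalized parameter scores, as sketched below. The dictionary keys are hypothetical labels for the parameters named in the text.

```python
from statistics import mean

# Parameter groupings as described in the text; key names are illustrative.
DOMAINS = {
    "EMS": ("path_length", "n_movements", "task_time"),
    "SOS": ("working_area", "horizontal_distance", "vertical_elevation"),
    "FMS": ("jerkiness_time", "motion_consistency", "smoothness", "steadiness"),
}


def domain_scores(parameter_scores):
    """Average 0-10 parameter scores into the three domain scores."""
    return {
        domain: mean(parameter_scores[name] for name in names)
        for domain, names in DOMAINS.items()
    }
```

With equal weighting, a parameter that correlates weakly with expert ratings influences its domain score as much as one that correlates strongly.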

Construct Validity Evaluation

Task performance was independently evaluated by a vascular surgeon and Associate Director of Surgical Simulation (N.H.), who was blinded to AI-derived kinematic outputs, domain-level scores, and sensor-based metrics. Videos were de-identified with coded participant IDs, with no indication of attempt order. Scoring was completed prior to review of AI processing using modified validated knot-tying assessment scales: (1) product quality⁴⁰ (1–18 points) and (2) technical performance⁴¹ (0–10 points). Higher scores indicated superior performance (Supplemental Table 1).

To evaluate the accuracy of the AI-derived metrics obtained from smartphone video recordings, the estimated core kinematic parameters (velocity, path length, and number of hand movements for each hand) were systematically compared with corresponding metrics derived from hand-mounted sensors using previously validated algorithms.^2,24,29 These kinematic measures are well-established descriptors of motor performance⁴² and constituted the foundational parameters from which higher-order AI-based metrics were subsequently derived. Associations between technologies were assessed using Spearman correlation, while agreement was evaluated using Bland–Altman analysis to calculate systematic bias (mean difference, d) and 95% limits of agreement.
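The Bland–Altman quantities can be sketched as follows, assuming paired measurements of the same parameter from the two technologies. The 1.96 multiplier for the 95% limits of agreement is the conventional choice. The function name is illustrative.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return (bias, lower limit, upper limit) for paired measurements.

    bias is the mean difference d; the 95% limits of agreement are
    d +/- 1.96 times the sample SD of the differences.
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias, bias - half_width, bias + half_width
```

By construction, the bias lies midway between the two limits of agreement, which provides a quick consistency check on reported values.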

Discriminatory Power of Kinematic Parameters and Technical Skills

Spearman correlation analysis was performed to examine the relationship between individual kinematic parameters, composite technical skill domain scores (EMS, SOS, FMS), and expert-rated knot-tying performance measures (product quality and technical performance). Statistical significance was defined as P < 0.05, and effect size strength was categorized as low (r = 0.1-0.3), moderate (r = 0.3-0.5), and high (r > 0.5).⁴³ All statistical analyses were conducted using SPSS (IBM, Chicago, IL) and Python version 3.10 (Wilmington, DE).
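The rank correlation and the effect-size bands can be sketched as below. The sketch computes Spearman's coefficient as the Pearson correlation of average ranks and omits the P-value. How the shared boundaries at 0.3 and 0.5 are assigned, and the "negligible" label for values below 0.1, are assumptions not stated in the text.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman correlation as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def effect_size_label(r):
    """Map |r| onto the bands used in the text (boundary handling assumed)."""
    magnitude = abs(r)
    if magnitude > 0.5:
        return "high"
    if magnitude >= 0.3:
        return "moderate"
    if magnitude >= 0.1:
        return "low"
    return "negligible"
```

Tied expert scores, which are frequent on short ordinal checklists, are handled by the average-rank step.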

Results

Participant Characteristics

Seventeen participants were enrolled and performed a one-handed knot-tying task using 10 throws. The cohort comprised sixteen third-year medical students (56% male, 25.6 ± 0.6 years old) and one surgical instructor with 25+ years’ experience in surgical training. None of the medical students had surgical experience, and 25% were interested in pursuing a career in surgery. All participants executed the task in a single session, and 5 participants repeated the task in different weekly sessions, yielding 23 performances (Table 1). None of the participants who repeated the task received feedback between or after the performances.

Table 1.

Participants’ Demographics and Expert Rating Scores

ID	Sex	Level	Age	Interested in surgery	Attempt	Product quality score	Technical performance score
MS001	F	MSY3	26	No	1^a	NA	NA
MS002	F	MSY3	25	No	1	14	4
MS003	M	MSY3	26	No	1	13	3
MS004	F	MSY3	25	No	1^a	NA	NA
MS005	M	MSY3	26	Yes	1^a	NA	NA
MS006	M	MSY4	27	No	1	14	3
					2	15	3
MS007	M	MSY3	26	No	1^a	NA	NA
MS008	M	MSY3	25	Yes	1	18	10
					2	18	10
					3	16	7
MS009	M	MSY3	26	No	1	13	3
MS010	M	MSY3	25	No	1	13	3
MS011	M	MSY3	26	Yes	1	16	4
					2	15	2
MS012	M	MSY3	25	Yes	1	14	2
MS013	F	MSY3	26	No	1	17	6
MS014	F	MSY3	26	No	1	17	4
					2	14	4
MS015	F	MSY3	25	No	1	17	8
MS016	F	MSY3	25	No	1	15	4
					2	14	4
IN001	M	Instructor	59	NA	1	17	8

^aData not analyzed. F, female; M, male; NA, not applicable.

Data Collection and Agreement Between Technologies

Sensor-based data from 4 participants were lost due to signal interruption or noise and were therefore excluded from the analysis. This left 19 performances analyzed with both the computer-vision and sensor-based systems: 13 collected in a first session, 5 in a second session, and 1 in a third session. Data for the right and left hands were tracked separately in each performance, yielding n = 38 samples (hands). Sensor-based and computer-vision hand dexterity parameters were significantly correlated for velocity (r = 0.79, P < 0.01), path length (r = 0.82, P < 0.01), and number of hand movements (r = 0.88, P < 0.01). Similarly, Bland–Altman analysis indicated agreement between technologies for these parameters: velocity (d = −1.26, 95% limits of agreement: −49.32 to 46.8), path length (d = −0.36, 95% limits of agreement: −4.27 to 4.99), and number of hand movements (d = −0.16, 95% limits of agreement: −19.46 to 19.15).

Task Results

Data from all computer-vision samples (n = 38 hands) in 19 performances were used to calculate the EMS, SOS, and FMS. Individual participant data are presented in Supplemental Tables 2-4. Among first attempts, the instructor achieved the highest EMS (9.34), FMS (8.79), and SOS (9.72). MS008 obtained a higher EMS (9.56) and FMS (9.81), but only on their second attempt. Figure 3 illustrates the spatial organization visualization for selected participants. The median product quality score was 15 points (range 13-18), and the median technical performance score was 4 points (range 2-10) (Table 1).

Figure 3.

Real-time task recordings demonstrating the AI-based markerless computer vision system mapping hand kinematics. Camera angles varied depending on the day of recording, but they did not impact the system’s ability to track motion data. Environmental factors such as lighting conditions and background objects did not interfere with parameter extraction. The hand trajectory pattern is scaled to real workspace volume occupation as it utilizes a calibration system. This scale permits comparison of parameters between subjects and calculation of scores based on the same performance. (A) Subject IN001-1; (B) MS013-1; (C) MS008-2. Expert-rated scores are depicted in Table 1. AI-derived scores are depicted in Supplemental Tables 2-4

Discriminatory Power of Parameters and Technical Skills

Individual correlations between parameters and expert-rated scores are shown in Table 2. All economy of motion parameters, including total path length, total number of movements, and time of task, correlated highly with both product quality and technical performance scores (r ≥ 0.59, P ≤ 0.01). Among spatial organization measures, total working area showed a moderate correlation with technical performance that did not reach statistical significance (r = 0.41, P = 0.08) but not with product quality, while inter-hand distance and elevation demonstrated no correlations with either checklist. Within the flow of motion domain, dominant hand smoothness correlated highly with both product quality and technical performance (r ≥ 0.56, P = 0.01). Total jerkiness time showed a moderate, nonsignificant correlation with technical performance (r = 0.41, P = 0.08) but not with product quality. Dominant hand motion consistency and steadiness showed weaker, nonsignificant correlations with technical performance (r = 0.34-0.38), but not with product quality.

Table 2.

Correlation of Individual Motion Parameters With Expert-Rated Surgical Skill

Skill category	Parameter	Product quality score r (P-value)	Technical performance score r (P-value)
Economy of motion	Total path length score	0.60 (0.01)	0.67 (0.001)
	Total number of movements score	0.64 (0.001)	0.73 (0.001)
	Time of task score	0.59 (0.01)	0.61 (0.01)
Spatial organization	Total working area score	0.33 (0.17)	0.41 (0.08)
	Distance score	−0.06 (0.80)	−0.20 (0.42)
	Elevation score	−0.05 (0.82)	−0.24 (0.33)
Flow of motion	Total jerkiness time score	0.24 (0.32)	0.41 (0.08)
	Dominant hand motion consistency score	0.27 (0.26)	0.34 (0.16)
	Dominant hand steadiness score	0.12 (0.63)	0.38 (0.11)
	Dominant hand smoothness score	0.57 (0.01)	0.56 (0.01)

Total means both left- and right-hand data together. Green: high correlation (r > 0.5); Yellow: moderate correlation (r = 0.3-0.5); Red: low correlation (r = 0.1-0.3).

When parameters were grouped into technical skill domains, EMS correlated with technical performance and product quality (r = 0.66 and r = 0.58, respectively), FMS correlated moderately with technical performance (r = 0.44, P = 0.057), and SOS did not correlate with either validated checklist (Figure 4).

Figure 4.

Association of computer-vision technical skill scores with expert-rated performance during a one-handed knot-tying task. EMS, economy of motion score; SOS, spatial organization score; FMS, flow of motion score. Association between EMS, FMS, SOS (range 0-10), and expert-rated scores (product quality, range 1-18; technical performance, range 0-10).

Discussion

This study developed an AI-based markerless computer vision framework to quantify technical skills by integrating: (1) standard video recordings, (2) a calibrated deep learning algorithm tracking hand dexterity parameters, (3) grouping of these metrics into technical skill categories, and (4) creation of a cohort-normalized prototype score fed by performance data from participants executing an open surgery task. This work is a prototype assessment framework evaluated on a single task (one-handed knot tying) rather than a validated objective grading system, because the scoring is normalized to the best- and worst-performing individuals within this cohort rather than to an externally anchored proficiency threshold or multi-expert consensus benchmark. We also note that agreement between AI- and sensor-derived kinematics supports concurrent validation of motion capture accuracy, whereas associations between domain-level scores and expert-rated product quality and technical performance provide only preliminary evidence of construct validity. This approach allowed us to explore, understand, and refine our system for future application at higher expertise levels and in more complex tasks. Establishing educational utility will require prospective studies demonstrating discrimination across expertise levels, responsiveness to training, and predictive validity.

Direct evaluation of hand motion is crucial, as precise movements are essential for developing surgical expertise. Hand dexterity serves as a key differentiator among proficient surgeons, minimizing errors and intraoperative complications while ensuring patient safety and optimal outcomes.⁴⁴ Achieving technical proficiency requires reducing unnecessary hand movements, maintaining bimanual dexterity and motion flow, and ensuring stable hand positioning within the operative field.⁶ Our goal was to quantify these fundamental skills during a one-handed knot-tying task in an objective and realistic manner and to determine their discriminatory power in distinguishing among varying levels of technical performance.

Aligned with our previous work^2,24 and others assessing knot-tying tasks via objective motion tools,^21,22,45 the present study identified economy of motion, captured via computer vision, as a technical skill strongly associated with both product quality and technical performance. The fact that economy of motion correlated with both outcomes suggests that efficient and purposeful motion is not only perceived by human raters during knot-tying tasks but also reliably captured by algorithm-based assessment. Supporting this, Kasa et al²³ used deep learning to integrate sensor-based kinematic data and product quality images from knot-tying tasks performed by 72 surgeons, demonstrating superior discrimination for economy of motion, product quality, and overall performance by their system compared to expert raters’ assessments. Although we did not combine kinematic and outcome data or compare assessment approaches, we observed that the instructor achieved the highest scores across all economy of motion components, including path length, distance, and time of task, and was above the 95th percentile of product quality and technical performance scores despite not being aware of the parameters under evaluation (Supplemental Table 2). This suggests that differences in technical experience are detectable by our system. Given its simplicity and speed compared with time-consuming checklist assessments, the proposed economy of motion metrics may be tested for external validation and used to support future training tasks.

Flow of motion has been consistently described in the surgical skills literature as reflecting the continuity and regularity of instrument handling rather than isolated movement efficiency.^46,47 Movement smoothness is characteristic of healthy, well-trained motor behavior and effort minimization.⁴⁸ Prior research on surgical tasks has shown that smoothness, quantified by LDLJ, is correlated with expert-based assessment⁴⁹ and serves as an indicator of coordinated, higher-level performance, with experts exhibiting smoother motion and fewer abrupt corrections than novices.⁵⁰ In the present study, motion smoothness was the most sensitive parameter within the flow of motion domain, as it strongly correlated with both product quality and technical performance scores (Table 2). Moreover, total jerkiness time, defined as inefficient movements below the active threshold (Figure 2), showed a moderate correlation with technical performance. This finding may reflect that unnecessary movements and increased inactive hand periods^32,51,52 are readily perceived by human expert raters and similarly quantified by algorithm-based assessment. In contrast, motion steadiness (SPARC) and speed consistency (CVspeed) did not correlate with technical outcomes in this task. This could be explained by the reduced sensitivity of SPARC to brief tracking noise or small local variations in speed as seen in a simple one-handed knot-tying task.⁵³ Indeed, SPARC has been shown to distinguish surgical skill levels primarily among higher-skilled performers based on smoother continuous tool or hand motion,^39,54-56 while CVspeed has demonstrated utility predominantly in more complex tasks, such as robotic surgery.³⁴ Even though only jerkiness and smoothness correlated with outcomes, it remains an open question whether more complex tasks (eg, anastomosis) performed by more highly trained surgeons (eg, residents, attendings) would yield greater sensitivity to SPARC and CVspeed.

While efficiency and motion smoothness directly reflect technical performance and tissue handling,^39,57 spatial utilization metrics could reflect how compactly surgeons operate within the surgical field. Maintaining a consistent distance between hands, allowing wrists to operate within a controlled range, may not only support surgical hygiene and spatial awareness but also promote efficient, anchored elbow positioning within shoulder boundaries.^58,59 Although the task was a one-handed knot tie, these concepts were evident in the performance of the instructor, who consistently scored above the 95th percentile across all spatial organization parameters (Supplemental Table 3); however, this domain was not associated with either product quality or technical performance. This divergence may be explained by the weak correlation of the distance and elevation scores with human expert scores. Although the instructor scored the highest, this pattern did not hold across all participants. In a simple goal-directed knot-tying task, hand spacing can vary by individual strategy and ergonomics without necessarily affecting knot formation; thus, workspace-separation metrics may be less sensitive to skill differences than economy and flow of motion parameters.^60,61 Nonetheless, hand working area showed a moderate correlation (r = 0.41, P = 0.08) with technical performance, suggesting that expert raters may value compact operative field use and controlled spatial execution.^62,63 By translating wrist trajectory data into simplified geometric representations, our system enables a holistic visualization of operative field utilization. D’Angelo et al³³ reported that working volume area, represented as spheres, was inversely proportional to surgeon expertise during open tasks measured via optical motion tracking.
Similarly, robotic surgery studies using 3D sensor-based systems have shown that workspace-range measures (ie, path length tracing) distinguish surgeon skill levels by quantifying how broadly instruments occupy the operative field during simulation tasks and real-world procedures.^42,60,63 While this remains speculative in the context of a one-handed knot-tying task, further validation using more complex surgical tasks is needed to clarify its role. It is important to note that despite excluding in-depth motion tracking (Z plane), our model showed a significant association between computer-derived (2D) and sensor-based (3D) parameters, particularly across variations in camera angle and distance. This suggests that complex tracking systems may not be necessary for evaluating spatial organization, and simple video cameras (ie, smartphone) could provide a feasible alternative for assessment.
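As an illustration of how a wrist trajectory can be reduced to a simple geometric summary of workspace use, the sketch below computes the area of the convex hull enclosing a set of 2D wrist positions. The convex hull is an assumption chosen for illustration; the boundary shape used by the framework in this study may differ, and the function names and sample coordinates are hypothetical.

```python
def convex_hull(points):
    """Andrew's monotone chain: return hull vertices in counter-clockwise order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive means a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    area = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0


def working_area(trajectory):
    """Area enclosed by the convex hull of 2D wrist positions (squared input units)."""
    return polygon_area(convex_hull(trajectory))


# Hypothetical wrist positions: four corners of a 2 x 2 region plus an interior point.
wrist_positions = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
print(working_area(wrist_positions))  # 4.0
```

A smaller value indicates that the wrist stayed within a more compact region; any comparison between recordings assumes the coordinates have been calibrated to a common scale.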

The successful assessment of technical skills in our study (validated against wearable metrics) relies on accurately predicting hand motion patterns from video recordings despite confounding environmental factors. For instance, Nagaraj et al¹¹ developed a deep-learning pass/fail system to detect critical errors in instrument handling and knot tying using 213 medical student video recordings. They used a convolutional neural network to measure object velocity and capture temporal features of sequential 2D images. This work represents a significant step toward human-machine scoring of open surgery tasks and faced only minor challenges with lighting, camera positioning, and background noise; however, the model did not incorporate glove use. Other approaches attempted to close this gap by using color-differentiated gloves,^51,64 but at the expense of real-world applicability. Our model addresses these limitations through a calibration system capable of tracking hand motion independently of background, lighting, or camera placement, and it is compatible with standard surgical gloves (Figure 3). While this prototype currently uses the wrist joint for hand dexterity assessment, it is adaptable to other anatomical landmarks, including fingers, elbows, or broader body regions. To date, the system has been tested only in simulated open surgery settings, and its accuracy in discriminating skill levels remains untested, as comparisons between large cohorts of expert and novice groups were not conducted.

Aside from Nagaraj et al,¹¹ few models have directly assessed hand motion through computer vision in a way that yields interpretable performance scores from standard 2D video recordings. Goodman et al¹⁰ developed a multitask neural network to quantify action recognition and hand/tool detection using large volumes of online real-world procedural videos. The model was trained to extract hand motion metrics (ie, translation, rotation) and task execution variables (ie, knot-tying, suturing, cutting), then validated on an independent set of surgical videos and demonstrated that more localized movements were correlated with surgical expertise. They also identified procedure-specific surgical signatures for 3 open procedures (appendectomy, pilonidal cystectomy and thyroidectomy) and suggested that deviation from these patterns may represent altered flow. These findings offer strong features for the development and validation of future open surgery models with interpretable data, although the approach was not validated against expert surgical assessment tools. Similar to the present study, Thomas On et al⁶⁵ tracked 21 hand landmarks in five neurosurgeons performing simulated microvascular anastomoses, calculating path length and velocity to assess economy and flow of motion patterns unique to each participant. Despite categorization of technical skills, they acknowledged that a scoring system is still needed to simplify the tracking output.

Azari et al developed a computer vision-based model using simulated anastomosis tasks, defining OSATS-aligned domains derived from quantitative motion metrics and trained to reproduce expert-rated performance scales.⁶⁶ A key distinction is that their algorithm predicts expert consensus scores, anchoring outputs to expert-defined performance. In our study, by contrast, expert ratings were used for validation only, and the composite score is internally normalized and performer-dependent; it therefore lacks external anchoring, which may limit generalizability. Importantly, Azari et al applied their model in the operating room across multiple surgical specialties, demonstrating the feasibility of motion tracking and its association with expert assessment in real-world open surgical procedures.^18-20 Their work represents an early translation of motion tracking-based assessment from controlled environments to intraoperative application.

Building on this, our model converts tracked motion into interpretable scores across technical skill domains, offering a structured framework for evaluation and targeted improvement. While it maps kinematic data to literature-defined skill categories, it does not identify task execution variables or surgical mistakes. This preliminary study focused on verifying the model’s ability to track universal kinematic parameters and transform them into meaningful performance scores, rather than on pattern recognition. Further studies are necessary to determine whether these scores can inform actionable feedback and improve surgical training outcomes.

Limitations

The main limitation is the small sample size, composed mostly of surgically untrained participants. The cohort was heavily clustered at the novice end, with 16 third-year medical students and only a single expert comparator (1 instructor). Consequently, the observed correlations may partly reflect variability among novices performing a basic task rather than discrimination of expertise per se, and the present data do not allow firm conclusions about the framework’s ability to distinguish across the full spectrum of surgical proficiency. Confirmatory studies in larger, stratified cohorts spanning novice, intermediate, and expert levels are essential before the framework can be recommended for educational or credentialing use. Five of the 16 medical students repeated the task in two separate weekly sessions, and their repeated attempts were included in the correlation analyses without applying a repeated-measures or mixed-effects framework. This approach was adopted pragmatically to increase the effective sample size given the small cohort. We acknowledge that this does not fully account for within-subject dependency; however, inter-session variability in camera setup and the learning effect associated with novice participants performing the task for the first time may have rendered each attempt sufficiently distinct to partially mitigate this concern. Nonetheless, this remains a methodological limitation, and future work with larger cohorts will employ a formal repeated-measures or mixed-effects framework to ensure the validity of statistical inference. Blinding of the expert rater was implemented with de-identified videos, masked participant IDs, and scoring performed prior to sharing AI outputs; however, because videos were reviewed sequentially in a small cohort, we cannot fully exclude the possibility that familiarity with an individual participant’s hand appearance or performance style influenced scoring. Future studies will incorporate randomized video review order and independent dual-rater scoring to further mitigate this risk.

The cohort-normalized nature of the present scoring means that scores depend on the sample; anchoring future versions to predefined proficiency thresholds or multi-expert consensus benchmarks will be required to establish a truly externally valid grading system. Another limitation is the use of a pre-trained model, as training a full model would require larger amounts of data. The model uses data from two dimensions (X and Y vectors) and excludes depth (Z vector). Feedback was not provided to the participants. The grading scale and automation are limited to a single task and a single landmark (wrist joint). The videos included in this model are short (<2 min). This model does not detect surgical mistakes or integrate product quality data into the scoring system; it is designed as a tool for assessing universal parameters that can be applied to different tasks in either open or minimally invasive surgery. This model was designed to continuously refine a grading system that dynamically adjusts based on participant performance. Because our algorithm’s scoring currently relies on comparisons with the best-performing individual, future assessments should incorporate multiple experts to establish a robust performance benchmark.
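The sample dependence of cohort-normalized scoring can be illustrated with a short sketch. The min-max rescaling to a 0-10 range below is an assumed stand-in, not the scoring formula used in this study; the function name and the path length values are hypothetical. It shows that adding one better performer to the cohort lowers the scores of everyone else even though their raw measurements are unchanged.

```python
def cohort_scores(values, lower_is_better=True):
    """Rescale raw metric values to 0-10 relative to the cohort's best and worst."""
    best = min(values) if lower_is_better else max(values)
    worst = max(values) if lower_is_better else min(values)
    if best == worst:  # all performers identical: no spread to rank on
        return [10.0] * len(values)
    return [10.0 * (worst - v) / (worst - best) for v in values]


# Hypothetical path lengths (shorter is better) for three performers.
path_lengths = [100.0, 150.0, 200.0]
original = cohort_scores(path_lengths)
print(original)  # [10.0, 5.0, 0.0]

# Adding a fourth, better performer shifts the scores of the first three.
expanded = cohort_scores(path_lengths + [50.0])
print(expanded)
```

Here the performer who originally scored 10 drops to roughly 6.7 once the new best performer is added, which is why anchoring to predefined thresholds or multi-expert benchmarks is needed for externally valid grading.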

Conclusion

The AI-based markerless computer vision framework developed in this study demonstrated the ability to extract kinematic parameters and generate interpretable, cohort-normalized domain-level scores based on performance, without relying on external tracking devices or standardized checklists. By capturing genuine hand movements, this approach enables assessments to be conducted more naturally and efficiently. Its key innovation lies in breaking down technical skills into distinct categories, allowing for precise identification of specific performance gaps and enabling focused, efficient training. In the context of one-handed knot-tying, economy of motion emerged as the most relevant skill, strongly correlating with both expert-rated product quality and technical performance scores. Flow of motion and spatial organization domains showed weaker correlations with expert assessments, although some important trends were seen when dichotomizing parameters. We acknowledge that the sample size is small and that the results will need to be confirmed in a larger, expertise-stratified cohort before firm conclusions can be drawn. Despite this constraint, we hope that the innovation demonstrated here helps advance the field and paves the way for future confirmatory studies. Future studies enrolling surgeons with different expertise levels and performing more complex tasks are warranted to determine whether any of the assessed technical skills should be given more weight or be excluded from the model. Planned extensions include recruitment of a larger multi-expertise cohort, incorporation of real-time feedback, empirical reweighting or recalibration of the spatial organization domain, and deployment on platform-agnostic applications (including macOS) that require neither wearable sensors nor specialized hardware for human–computer interaction.
This study contributes to the advancement of standardized grading systems by introducing components that offer a consistent framework for assessing technical proficiency, with potential for broader application in more advanced surgical tasks.

Supplemental Material

Supplemental Material for AI-Based Markerless Computer Vision Framework for Open Surgery Skill Assessment: A Prototype Assessment Framework by Alejandro Zulbaran-Rojas, MD, Mohammad Dehghan Rouzi, MSc, Natasha Hansraj, MD, Derek Erstad, MD, Miguel Bargas-Ochoa, MD, Ethan D’Silva, MD, Randall Parker Kirby, MD, Nilson Salas, MD, Yesenia Rojas, MD, Bijan Najafi, PhD in Surgical Innovation

Footnotes

Author contributions

Conceptualization: BN, AZ, MR.

Data collection: AZ, ED, RK.

Data curation: MR, AZ.

Data visualization: MR.

Clinical design of model: AZ.

Developing deep learning model and codes: MR.

Formal analysis: MR, AZ, MB.

Writing – original draft: AZ, MR.

Writing – review and editing: BN, MB, NJ, DE, YR, NS.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The dataset of this study is not publicly available due to the confidentiality of the participants. However, it may be available on request from the senior author (B.N.).

Supplemental Material

Supplemental material for this article is available online.

References

1. Aggarwal, Grantcharov, Darzi. Framework for systematic training and assessment of technical skills. J Am Coll Surg. 2007;204:697-705.

2. Boyajian, Zulbaran-Rojas, Najafi, et al. Development of a sensor technology to objectively measure dexterity for cardiac surgical proficiency. Ann Thorac Surg. 2024;117:635-643.

3. Martin, Regehr, Reznick, et al. Objective structured assessment of technical skill (OSATS) for surgical residents. Br J Surg. 1997;84:273-278.

4. Cardoso, Suyambu, Iqbal, et al. Exploring the role of simulation training in improving surgical skills among residents: a narrative review. Cureus. 2023;15:e44654.

5. Tsuyuki, Miyahara, Hoshina, et al. Motion capture device reveals a quick learning curve in vascular anastomosis training. Surg Today. 2024;54:275-281.

6. Yilmaz, Winkler-Schwartz, Mirchi, et al. Continuous monitoring of surgical bimanual expertise using deep neural networks in virtual reality simulation. npj Digital Medicine. 2022;5:54.

7. Park, Tiefenbach, Demetriades. The role of artificial intelligence in surgical simulation. Front Med Technol. 2022;4:1076755.

8. Lam, Chen, Wang, et al. Machine learning for technical skill assessment in surgery: a systematic review. npj Digital Medicine. 2022;5:24.

9. Levin, McKechnie, Khalid, Grantcharov, Goldenberg. Automated methods of technical skill assessment in surgery: a systematic review. J Surg Educ. 2019;76:1629-1639.

10. Goodman, Patel, Zhang, et al. Analyzing surgical technique in diverse open surgical videos with multitask machine learning. JAMA Surg. 2024;159:185-192.

11. Nagaraj, Namazi, Sankaranarayanan, Scott. Developing artificial intelligence models for medical student suturing and knot-tying video-based assessment and coaching. Surg Endosc. 2023;37:402-411.

12. Yang, Goodman, Dawes, et al. Using AI and computer vision to analyze technical proficiency in robotic surgery. Surg Endosc. 2023;37:3010-3017.

13. Lavanchy, Zindel, Kirtac, et al. Automation of surgical skill assessment using a three-stage machine learning algorithm. Sci Rep. 2021;11:5197.

14. Pan, Wang, Yang, Liang. An automated skill assessment framework based on visual motion signals and a deep neural network in robot-assisted minimally invasive surgery. Sensors. 2023;23:4496.

15. Azari, Miller, Greenberg, Radwin. Quantifying surgeon maneuevers across experience levels through marker-less hand motion kinematics of simulated surgical tasks. Appl Ergon. 2020;87:103136.

16. Gonzalez-Romo, Hanalioglu, Mignucci-Jiménez, et al. Quantification of motion during microvascular anastomosis simulation using machine learning hand detection. Neurosurg Focus. 2023;54:E2.

17. Azari, Frasier, Miller, et al. Modeling performance of open surgical cases. Simul Healthc. 2021;16:e188-e193.

18. Azari, Frasier, Quamme, et al. Modeling surgical technical skill using expert assessment for automated computer rating. Ann Surg. 2019;269:574-581.

19. Frasier, Azari, et al. A marker-less technique for measuring kinematics in the operating room. Surgery. 2016;160:1400-1413.

20. Glarner, Chen, et al. Quantifying technical skills during open operations using video-based motion analysis. Surgery. 2014;156:729-734.

21. Brydges, Classen, Larmer, Xeroulis, Dubrowski. Computer-assisted assessment of one-handed knot tying skills performed within various contexts: a construct validity study. Am J Surg. 2006;192:109-113.

22. Huffman, Anton, Martin, et al. Optimizing assessment of surgical knot tying skill. J Surg Educ. 2020;77:1577-1582.

23. Kasa, Burns, Goldenberg, Selim, Whyne, Hardisty. Multi-modal deep learning for assessing surgeon technical skill. Sensors. 2022;22:7328.

24. Zulbaran-Rojas, Najafi, Arita, Rahemi, Razjouyan, Gilani. Utilization of flexible-wearable sensors to describe the kinematics of surgical proficiency. J Surg Res. 2021;262:149-158.

25. Zhang, Bazarevsky, Vakunov, et al. MediaPipe Hands: On-Device real-time Hand Tracking. 2020.

26. Wei, Ramakrishna, Kanade, et al. Convolutional Pose Machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016:4724-4732.

27. Cao, Simon, Wei, et al. Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017:1302-1310.

28. Özsoy, Öner, Oner. An attempt to gender determine with phalanx length and the ratio of phalanxes to whole phalanx length in direct hand radiography. Medicine Science International Medical Journal. 2019;8(3):692-697.

29. Zulbaran-Rojas, Rouzi, Zahiri, et al. Objective assessment of postural ergonomics in neurosurgery: integrating wearable technology in the operating room. J Neurosurg Spine. 2024;41:135-145.

30. Winter. Biomechanics and Motor Control of Human Movement. John Wiley & Sons; 2009.

31. Dhaif, Paparoidamis, Sideris, et al. The role of anxiety in simulation-based dexterity and overall performance: does it really matter? J Invest Surg. 2019;32:164-169.

32. D'Angelo, Rutherford, Ray, et al. Idle time: an underdeveloped performance metric for assessing surgical skill. Am J Surg. 2015;209:645-651.

33. D'Angelo, Rutherford, Ray, Laufer, Mason, Pugh. Working volume: validity evidence for a motion-based metric of surgical efficiency. Am J Surg. 2016;211:445-450.

34. Narazaki, Oleynikov, Stergiou. Robotic surgery training and performance: identifying objective variables for quantifying the extent of proficiency. Surg Endosc. 2006;20:96-103.

35. Ghanem, Podolsky, Fisher, et al. Economy of hand motion during cleft palate surgery using a high-fidelity cleft palate simulator. Cleft Palate Craniofac J. 2019;56:432-437.

36. Balasubramanian, Melendez-Calderon, Burdet. A robust and sensitive metric for quantifying movement smoothness. IEEE Trans Biomed Eng. 2012;59:2126-2136.

37. Matern, Waller. Instruments for minimally invasive surgery: principles of ergonomic handles. Surg Endosc. 1999;13:174-182.

38. Vaidya, Aydin, Ridgley, Raison, Dasgupta, Ahmed. Current status of technical skills assessment tools in surgery: a systematic review. J Surg Res. 2020;246:342-378.

39. Aghazadeh, Zheng, Tavakoli, Rouhani. Motion smoothness-based assessment of surgical expertise: the importance of selecting proper metrics. Sensors (Basel). 2023;23:3146.

40. Pintér, Kardos, Varga, et al. Effectivity of near-peer teaching in training of basic surgical skills - a randomized controlled trial. BMC Med Educ. 2021;21:156.

41. Huang, Vaughn, Chern, O'Sullivan, Kim. An objective assessment tool for basic surgical knot-tying skills. J Surg Educ. 2015;72:572-576.

42. Hung, Chen, Jarc, Hatcher, Djaladat, Gill. Development and validation of objective performance metrics for robot-assisted radical prostatectomy: a pilot study. J Urol. 2018;199:296-304.

43. Schober, Boer, Schwarte. Correlation coefficients: appropriate use and interpretation. Anesth Analg. 2018;126:1763-1768.

44. Stulberg, Huang, Kreutzer, et al. Association between surgeon technical skills and patient outcomes. JAMA Surg. 2020;155:960-968.

45. Mackay, Datta, Chang, Shah, Kneebone, Darzi. Multiple Objective Measures of Skill (MOMS): a new approach to the assessment of technical ability in surgical trainees. Ann Surg. 2003;238:291-300.

46. Grewal, Kianercy, Gerrah. Characterization of surgical movements as a training tool for improving efficiency. J Surg Res. 2024;296:411-417.

47. Uemura, Tomikawa, Kumashiro, et al. Analysis of hand motion differentiates expert and novice surgeons. J Surg Res. 2014;188:8-13.

48. Balasubramanian, Melendez-Calderon, Roby-Brami, Burdet. On the analysis of movement smoothness. J NeuroEng Rehabil. 2015;12:112.

49. Estrada, Duran, Schulz, Bismuth, Byrne, O'Malley. Smoothness of surgical tool tip motion correlates to skill in endovascular tasks. IEEE Trans Hum-Mach Syst. 2016;46:647-659.

50. Ghasemloonia, Maddahi, Zareinia, Lama, Dort, Sutherland. Surgical skill assessment using motion quality and smoothness. J Surg Educ. 2017;74:295-305.

51. Mackenzie, Yang, Garofalo, et al. Enhanced training benefits of video recording surgery with automated hand motion analysis. World J Surg. 2021;45:981-987.

52. Mohamadipanah, Perrone, Peterson, et al. Sensors and psychomotor metrics: a unique opportunity to close the gap on surgical processes and outcomes. ACS Biomater Sci Eng. 2020;6:2630-2640.

53. Singh, Bible, Liu, Zhang, Singapogu. Motion smoothness metrics for cannulation skill assessment: what factors matter? Front Robot AI. 2021;8:625003.

54. Belvroy, Murali, Sheahan, O'Malley, Bismuth. In the fundamentals of endovascular and vascular surgery model motion metrics reliably differentiate competency. J Vasc Surg. 2020;72:2161-2165.

55. Duran, Estrada, O'Malley, et al. The model for Fundamentals of Endovascular Surgery (FEVS) successfully defines the competent endovascular surgeon. J Vasc Surg. 2015;62:1660-6.e3.

56. O'Malley, Byrne, Estrada, Duran, Schulz, Bismuth. Expert surgeons can smoothly control robotic tools with a discrete control interface. IEEE Trans Hum-Mach Syst. 2019;49:388-394.

57. Rahimi, Hardon, Willuth, et al. Force-based assessment of tissue handling skills in simulation training for robot-assisted surgery. Surg Endosc. 2023;37:4414-4420.

58. Fazlollahi, Yilmaz, Winkler-Schwartz, et al. AI in surgical curriculum design and unintended outcomes for technical competencies in simulation training. JAMA Netw Open. 2023;6:e2334658.

59. Scott, Dunnington. The new ACS/APDS skills curriculum: moving the learning curve out of the operating room. J Gastrointest Surg. 2008;12:213-221.

60. Brinkman, Luursema, Kengen, Schout, Witjes, Bekkers. da Vinci skills simulator for assessing learning curve and criterion-based training of robotic basic skills. Urology. 2013;81:562-566.

61. Law, Jenewein, Gannon, et al. Exploring hand coordination as a measure of surgical skill. J Surg Res. 2016;205:192-197.

62. Constable, Shum, Clark. Enhancing surgical performance in cardiothoracic surgery with innovations from computer vision and artificial intelligence: a narrative review. J Cardiothorac Surg. 2024;19:94.

63. Liss, Kane, Chen, Baumgartner, Derweesh. Virtual reality suturing task as an objective test for robotic experience assessment. BMC Urol. 2015;15:63.

64. Zia, Sharma, Bettadapura, et al. Automated assessment of surgical skills using frequency analysis. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2015.

65. Chen, et al. Deep learning detection of hand motion during microvascular anastomosis simulations performed by expert cerebrovascular neurosurgeons. World Neurosurg. 2024;192:e217-e232.

66. Azari, Miller, et al. A comparison of expert ratings and marker-less hand tracking along OSATS-derived motion scales. IEEE Trans Hum-Mach Syst. 2021;51:22-31.
