Sage Journals: Discover world-class research

Abstract

Objective

This study aims to systematically review the current literature on the application of machine learning to predict return-to-sport (RTS) decisions after athletic injuries. The review focuses on identifying the types of machine learning models used, the commonly used predictive variables, and the methodological characteristics and limitations across studies in terms of design, model development, evaluation, and reporting.

Method

A comprehensive literature search was conducted on 1 May 2025 in three electronic databases: Web of Science, PubMed, and SPORTDiscus (EBSCO). Two independent reviewers selected the retrieved studies based on predefined inclusion and exclusion criteria. The Prediction Model Risk of Bias Assessment Tool (PROBAST) was used to assess the risk of bias in the included prognostic modeling studies.

Results

Of the 56 studies initially identified, 11 met the inclusion and exclusion criteria. Knee injuries were the most frequently modeled injury type for RTS decision-making (n = 4). The area under the receiver operating characteristic curve (ROC AUC) was the most commonly reported performance metric, presented in 82% of the included studies. Random Forest (RF) was the most widely used machine learning algorithm, applied in six studies (55%), and demonstrated the best predictive performance in four of them, with two studies reporting an AUC greater than 0.9. Some studies employed feature importance analysis or interpretability methods (e.g. SHAP) to identify key predictive variables. However, challenges remain in translating these models into clinical practice.

Conclusions

Machine learning techniques demonstrate promising potential for predicting RTS in athletes. Nevertheless, substantial heterogeneity across studies, particularly in RTS definitions, feature selection, and model development, limits the generalizability and clinical applicability of current models.

Keywords

Return to sport, machine learning, systematic review, sports injury, recovery, prediction models

Introduction

Return to sport (RTS) represents one of the most critical and challenging phases in an athlete's rehabilitation process, involving complex, multistakeholder decision-making that includes the athlete, medical staff, and coaching team. Due to the high heterogeneity in injury types and recovery trajectories,¹ coupled with the current lack of consensus on optimal functional recovery standards and objective physiological RTS criteria,² RTS decisions are often fraught with uncertainty and debate. Anterior cruciate ligament (ACL) injury, one of the most prevalent types of sports-related injuries,³ remains a central focus in sports medicine research.⁴ However, reinjury rates following RTS after ACL reconstruction remain alarmingly high.^5,6 Similarly, hamstring strain injury (HSI) carries a substantial risk of recurrence, with reported reinjury rates ranging from approximately 14% to 63%. Such variability is largely attributable to differences in the severity of the initial trauma, rehabilitation protocols, and RTS criteria applied across studies.^7,8 These persistent risks raise concerns regarding the reliability of traditional clinical assessments and functional tests used in RTS decision making.⁹

Premature RTS has been strongly associated with increased reinjury risk,¹⁰ whereas delayed RTS does not necessarily guarantee complete functional recovery. Thus, developing a scientifically grounded and individualized RTS decision-making framework is essential to minimize secondary injuries and support long-term athletic health. Although prior studies have identified significant associations between specific functional benchmarks and reinjury risk (for example, athletes who fail to meet discharge criteria after ACL reconstruction are four times more likely to suffer graft rupture post-RTS than those who meet them¹¹), these conclusions often rely on static, predetermined thresholds that fail to capture interindividual variability. In reality, RTS should not be reduced to merely satisfying a set of standardized functional criteria or achieving a certain score on performance-based assessments. The athlete is a complex, dynamic, and adaptive system. This complexity manifests not only within neuromuscular and musculoskeletal systems but also through multilayered interactions between individuals, organizations, sociocultural contexts, and environmental factors.¹² Variations in psychological readiness, physiological baselines, cognitive behaviors, and emotional regulation further challenge the applicability of one-size-fits-all RTS criteria.² Consequently, single-dimensional assessment strategies may be insufficient to comprehensively evaluate an athlete's readiness to safely return to sport.

In recent years, with growing interest in the complex interactions among multidimensional athlete-specific data, such as physiological, biomechanical, and psychological variables, machine learning (ML) techniques have been increasingly applied across various domains of sports medicine. These include medical image recognition,^3,13 sports injury prediction,^14,15 and decision support for RTS planning.² The occurrence of sports injuries and the subsequent recovery process leading to RTS represent a highly dynamic, open-ended, and nonlinear trajectory, encompassing psychophysiological healing, functional assessments, psychological adaptation, and environmental influences.^2,12 Traditional statistical methods often struggle to manage such high-dimensional, multivariate, and interactive data structures. In contrast, ML offers promising advantages in modeling complex nonlinear relationships and extracting latent patterns,¹⁴ positioning it as a potentially powerful tool in RTS decision-making.

Despite the growing interest in applying ML for RTS prediction, the current body of literature remains fragmented, and a comprehensive synthesis of its application landscape is lacking. Specifically, it remains unclear which types of ML models have been employed, how they perform across various RTS prediction tasks, and whether their interpretability and clinical usability are comparable. In the context of increasingly accessible, large-scale, and multisource datasets, systematically reviewing the modeling strategies, performance evaluation methods, and explainability techniques used in ML-based RTS prediction is essential. Such an effort could facilitate the development of more individualized, data-driven RTS decision frameworks.

Accordingly, this systematic review aims to examine and synthesize the existing literature on the application of ML techniques in predicting RTS outcomes among athletes. The specific objectives of this study are to:

Summarize the modeling strategies and ML algorithms used in RTS prediction tasks;

Evaluate the performance characteristics and explainability approaches of different ML models within this context.

Methods

Study design

This systematic review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.¹⁶ The review protocol was prospectively registered in PROSPERO (Registration ID: CRD420251090967).

Search strategy

A comprehensive literature search was performed in May 2025 across three electronic bibliographic databases: PubMed, Web of Science, and SPORTDiscus (EBSCO). The search strategy combined the following terms: (“return to sports” OR “return to play”) AND (“machine learning” OR “transfer learning”) AND (“athletic injuries” OR “sports injuries”). The complete search syntax is provided in Supplemental material 1.

Inclusion and exclusion criteria

The inclusion and exclusion criteria for this systematic review were established as follows. Studies were eligible for inclusion if they met the following conditions: (1) the population of interest comprised athletes or physically active individuals, including both professional athletes and recreational sports participants; (2) the primary focus of the study was the prediction of RTS following sports-related injuries, where RTS is considered a continuum encompassing return to participation, return to sport, and return to performance¹⁷; (3) the study employed ML methodologies for modeling or predictive analysis; (4) the predicted outcomes were directly related to RTS, including but not limited to whether RTS occurred, the time required for RTS, or the level of RTS achieved (e.g. competition load and playing time); (5) only original research studies were included, such as retrospective or prospective cohort studies, case-control studies, data-mining analyses, or modeling studies; (6) studies were required to be published in full-text format and appear in peer-reviewed academic journals; and (7) to ensure recency and methodological relevance, only studies published between 2015 and 2025 were included. Studies were excluded if they did not primarily focus on RTS prediction (e.g. those limited to describing treatment approaches, investigating postinjury psychological status, or examining injury risk factors); if they did not utilize ML techniques for modeling or prediction; if the predictive targets were unrelated to injury-based RTS (e.g. studies predicting training outcomes or athletic performance without injury context); or if they were review articles, commentaries, conference abstracts, or opinion pieces. Additionally, studies involving nonathlete populations, those focusing solely on injury diagnosis or risk scoring without addressing RTS, and articles unavailable in full text were excluded.

Study selection and data collection process

After removing duplicates, two reviewers (JY and QZ) independently screened the titles, abstracts, and full texts for eligibility. Studies were included or excluded only upon consensus between the two reviewers. In cases of disagreement, a third reviewer (YZ) was consulted to reach a final decision through discussion. Reference lists of the included studies were also manually searched to identify additional eligible studies. These records were evaluated according to predefined inclusion and exclusion criteria. Studies that met all inclusion criteria and did not meet any exclusion criteria were included in this review.

Data extraction

Data from each included study were independently extracted by two reviewers (JY and QZ). Any discrepancies were resolved through discussion or adjudication by a third reviewer (YZ). Extracted data included study design, participant characteristics and background, the specific definition and type of RTS, details of the ML methodology, RTS-related prediction outcomes, model performance metrics, and model interpretability.

Risk of bias and applicability assessment

The risk of bias in the included prediction models was systematically assessed using the Prediction model Risk Of Bias ASsessment Tool (PROBAST).¹⁸ PROBAST is specifically designed for evaluating the risk of bias and applicability of diagnostic and prognostic multivariable prediction model studies. Two reviewers (JY and QZ) independently assessed each study across four domains: participants, predictors, outcome, and analysis. Each domain was rated as having low, high, or unclear risk of bias and applicability concerns. Disagreements were resolved through discussion and consensus with a third reviewer (YZ). Applicability was assessed based on the relevance of study participants, predictors, and outcomes to the review question.

Results

The search strategy and screening process are illustrated in Figure 1. A systematic search was conducted across three electronic databases: Web of Science, PubMed, and SPORTDiscus, yielding an initial total of 56 potentially relevant studies. After removing 6 duplicates, 50 articles remained for title and abstract screening. Of these, 27 articles were deemed eligible based on the inclusion criteria. Independent screening results were then merged, and any discrepancies were resolved through discussion among the authors (JY, YZ, and QZ). Ultimately, 11 studies were included in the final analysis.

Figure 1.

PRISMA flow diagram.

Study quality

Among the included studies, four were rated as having an overall low risk of bias, four as unclear, and three as high risk (see Figures 2 and 3). The primary sources of bias were concentrated in the analysis domain. Specific issues included insufficient sample sizes in some studies and inadequate handling and reporting of missing data during model development. In terms of applicability assessment, six studies were rated as having low concern, four as unclear, and one as high concern. The main applicability issue stemmed from the limited relevance of the predicted outcomes to RTS decisions, thereby reducing the practical utility of the study findings in real-world RTS contexts. (The full PROBAST checklist is provided in Supplemental material 2.)

Figure 2.

Risk of bias and applicability assessment.

Figure 3.

Summary of risk of bias and applicability assessment within the study.

Sporting contexts and participant characteristics

Table 1 summarizes the fundamental characteristics of the 11 included studies. The majority employed a retrospective cohort design (n = 6, 55%),^19–24 followed by prospective cohort studies (n = 3),^25–27 as well as one case-control study²⁸ and one cross-sectional study.²⁹ Regarding the type of sports involved, five studies specifically focused on contact sports,^19,20,22,24,26 while two included both contact and noncontact sports.^25,27 The remaining four studies did not explicitly report whether the sport involved contact. The sports represented included football, American football, basketball, and track and field, among others. Four studies covered multiple sport disciplines, whereas five did not specify the type of sport involved. Sample sizes varied considerably, ranging from 32 to 1611 participants. Most studies (n = 9, 82%) reported participant age characteristics, with mean ages ranging from 15.0 ± 2.4 years (adolescent populations) to 30.0 ± 9.3 years (adult populations). The types of athletic injuries studied were diverse, with knee-related injuries being the most frequently reported (n = 4),^21–23,28 followed by sport-related concussion (SRC) (n = 2).^19,25 Other injury types included Achilles tendon rupture (ATR),²⁰ HSI,²⁷ general muscle injuries,²⁶ shoulder instability,²⁹ and mild traumatic brain injury (mTBI).²⁴ Follow-up periods for the prospective studies ranged from 12 to 24 months, while three of the retrospective studies did not clearly report follow-up durations. All studies performed predictive modeling of RTS-related outcomes, which were classified into three categories: (1) time-based outcomes, such as time to symptom resolution or days to RTS^19,26,27; (2) status-based outcomes, including readiness for RTS, reinjury occurrence, or game absence^21–25,28; and (3) performance-level outcomes,^20,29 such as return to the preinjury performance level or achievement of a patient-reported functional threshold.

To provide a clearer comparison of study characteristics and RTS outcomes, Table 1 also summarizes the outcome classification, type of RTS criteria, and assessment tools for all included studies.

Table 1.

Study characteristics and RTS criteria.

Reference	Study design	Sport	Contact (yes/no/both)	Participants (n, age mean ± SD)	Injury type	Follow-up duration	RTS definition	Outcome classification	Type of RTS criteria	Assessment tools
Bergeron et al.¹⁹	Retrospective cohort	Multiple	Yes	1611 (NR)	SRC	NR	Days until all reported symptoms resolved	Continuous (days)	Time based	17-item symptom checklist (NCAA Injury Surveillance Program)
Diniz et al.²⁰	Retrospective cohort	Soccer	Yes	209 (28.2 ± 4.0)	ATR	24 months	Difference in average minutes played per match pre- vs. postinjury (ΔMPM)	Continuous (minutes)	Performance based	Transfermarkt database (minutes played, match starts)
Kerr et al.²⁵	Prospective cohort	Multiple	Both	193 (15.0 ± 2.4)	SRC	12 months	Medical clearance after being asymptomatic and passing neurocognitive/physical tests	Binary (ready/not ready)	Status based	ImPACT neurocognitive test, SCAT3/5, Equilibrate Balance Platform, clinical and neurological exam
Skoki et al.²⁶	Prospective cohort	Soccer	Yes	41 (24.0 ± 4.2)	Muscle injury	24 months	Days from injury to full recovery/medical clearance	Continuous (days)	Time based	BAMIC, clinical examination, imaging
Hwang et al.²¹	Retrospective cohort	NR	NR	102 (30.0 ± 9.3)	ACLR	12 months	Functional recovery at 12 months post-ACLR	Continuous (performance scores)	Status based	Single-leg hop, vertical jump, Tegner activity score
Lipps Lene et al.²²	Retrospective cohort	Multiple	Yes	96 (21.5 ± 2.1)	Knee injury	NR	Presence of knee injury sequelae at final RTS stage	Binary (yes/no)	Status based	SAS, KOOS, force-velocity test, dual force plates
Lu et al.²⁸	Case-control	NR	NR	1497 (NR)	ACL injury	24 months	Postoperative ACL graft integrity (graft failure or contralateral ACL injury)	Binary (yes/no)	Status based	Electronic medical records, arthroscopy, MRI
Magni et al.²⁹	Cross-sectional	NR	NR	79 (22.6 ± 5.8)	Shoulder instability	12 months	Self-reported ability to resume preinjury sport	Binary (yes/no)	Performance based	Shoulder Instability RTS Index-5 questionnaire
Torres-Velazquez et al.²⁷	Prospective cohort	Multiple	Both	32 (19.6 ± 1.4)	HSI	NR	Clearance to resume sport ≤25 vs. >25 days postinjury	Binary (≤25/>25 days)	Time based	Clinical evaluation (ROM, palpation, sports-specific tests)
Hwang et al.²³	Retrospective cohort	NR	NR	113 (29.9 ± 9.5)	ACLR	12 months	Functional and psychological readiness at 12 months post-ACLR (IKDC ≥75.9 & ACL-RSI ≥56)	Binary (SRPAS 1/0)	Status based	IKDC, ACL-RSI
Yates et al.²⁴	Retrospective cohort	NR	Yes	375 (24.19 ± 8.3)	mTBI	NR	Missing >5 games due to concussion	Binary (yes/no)	Status based	MRI reports, SCAT5

ACL: anterior cruciate ligament; ACLR: anterior cruciate ligament reconstruction; ACL-RSI: ACL-return to sport after injury; ATR: Achilles tendon rupture; BAMIC: British athletic muscle injury classification; HSI: hamstring strain injury; IKDC: international knee documentation committee; KOOS: knee injury and osteoarthritis outcome score; MRI: magnetic resonance imaging; mTBI: mild traumatic brain injury; NR: not reported; ROM: range of motion; RTS: return to sport; SAS: sport anxiety scale; SRC: sport-related concussion; SRPAS: successful recovery of patient acceptable symptom state.

Data analysis characteristics

Among the included studies, the majority (n = 7, 64%) relied on data clinically recorded by healthcare professionals, such as club medical staff or athletic trainers.^19,21,23–27 The remaining studies utilized publicly available datasets (n = 2)^20,28 or athlete self-reported data (n = 2).^22,29 Despite considerable heterogeneity in the selection of predictor variables across studies, several common categories were identified. Demographic characteristics (e.g. age, sex, and body mass index (BMI)) were collected in most studies (n = 9). Injury- and rehabilitation-related factors (e.g. injury location, severity, and treatment methods) were included in five studies.^20,26–29 Clinical assessments and symptom measures (e.g. SCAT scores, balance tests, and muscle strength evaluations) were present in seven studies. Three studies incorporated psychological factors or self-perception scales (e.g. SIRSI (shoulder instability return to sport after injury), KOOS (knee injury and osteoarthritis outcome score), sport-related anxiety scores) as predictors.^22,28,29 Additionally, two studies utilized radiomic features derived from magnetic resonance imaging (MRI) sequences, such as structural attributes or manually annotated muscle injury regions.^24,27

Regarding data preprocessing, approximately 45% of the studies (n = 5) did not report specific strategies. Approaches to handling missing data varied and included multiple imputation (n = 2),^23,28 random forest-based imputation (missForest, n = 1),²⁴ spline interpolation with backward filling (n = 1)²⁰ and listwise deletion (n = 1).²¹ Two studies applied Z-score normalization to standardize the distribution of continuous variables across different measurement scales,^27,29 while one study employed min–max normalization.²² Only four studies explicitly reported feature selection methods, including forward selection,²⁰ random forest-based feature importance ranking,²⁸ logistic regression-based filtering,²⁷ and a combined univariate-multivariate statistical approach.²⁵ The remaining seven studies (64%) did not specify their feature selection strategies.
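
The two scaling approaches named above can be illustrated with a minimal sketch. The readings below are hypothetical and do not come from any included study, and the function names are illustrative rather than taken from a particular library.

```python
from statistics import mean, pstdev

def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-score is undefined for a constant feature")
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined for a constant feature")
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical peak-force readings (N); illustrative only.
peak_force = [1200.0, 1500.0, 1800.0, 2100.0]
z_scaled = z_score(peak_force)
mm_scaled = min_max(peak_force)
```

In a cross-validated workflow, the scaling parameters (mean, standard deviation, minimum, maximum) would normally be estimated on the training folds only and then applied to the held-out fold, so that no information from the test data leaks into model development.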

Cross-validation was the most commonly used model training approach. Ten-fold cross-validation was applied in four studies,^19,20,27,28 and five-fold cross-validation was used in three studies.^21,23,26 Two studies with relatively small sample sizes adopted leave-one-out cross-validation (LOOCV).^25,29 In addition, two studies used random or repeated random splits for model validation.^22,24 Notably, only one study addressed the issue of class imbalance by applying the Synthetic Minority Oversampling Technique (SMOTE), a method designed to mitigate data imbalance and improve model training stability.²⁹ A structured overview of analytical characteristics is presented in Table 2.
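
How k-fold cross-validation and LOOCV partition a cohort can be sketched as follows. This is a simplified illustration that assumes contiguous, unshuffled folds; the helper names are hypothetical, and the sample sizes are borrowed from Table 1 purely for scale.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds.

    Fold sizes differ by at most one sample; shuffling is omitted for clarity.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def loocv_indices(n_samples):
    """Leave-one-out is k-fold with k equal to the number of samples."""
    return k_fold_indices(n_samples, n_samples)

# 32 and 79 participants match two cohort sizes listed in Table 1.
ten_fold = list(k_fold_indices(32, 10))
loocv = list(loocv_indices(79))
```

With 32 participants, each of the ten test folds holds only three or four cases, which illustrates why performance estimates from small cohorts can vary considerably between folds.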

Table 2.

Data analysis characteristics.

Reference	Injury data collection	Predictor variables	Data preprocessing	Feature selection	Train strategy
Bergeron et al.¹⁹	AT	Demographics (sex, class year); injury context (sport type, playing surface); symptom and clinical assessment (presence of symptoms, return to play status)	NR	NR	10-fold cross-validation
Diniz et al.²⁰	Public data	Demographics (age, height, playing position); injury and recovery-related; performance and participation metrics (minutes played, starting status)	Spline interpolation and backward filling	Forward selection	10-fold cross-validation
Kerr et al.²⁵	Medical staff	ImPACT test; symptom and clinical assessment (headache, sensitivity to light/noise, balance test); temporal (days since injury)	NR	Univariate analysis and multivariate feature combination	LOOCV
Skoki et al.²⁶	Medical staff (club)	Injury context (injury location [training/game/other], side of body [left/right/middle]); injury anatomical (muscle position, injury depth, affected body part); clinical assessment (swelling level, tone level)	NR	NR	5-fold cross-validation
Hwang et al.²¹	Medical staff	Demographics (age, BMI); muscular strength and power measures (60°/s knee extensor PT, 180°/s knee extensor AP); balance test (Y-Balance, Biodex)	Listwise deletion	NR	5-fold cross-validation
Lipps Lene et al.²²	Self-reported	Demographics (height, BMI, level of practice); psychological and self-reported outcomes (sport anxiety scale score, KOOS symptoms, KOOS pain); muscular strength and power measures (total peak force, mean velocity, push-off distance)	Min–max normalization	NR	Repeated random train-test split
Lu et al.²⁸	Public data	Demographics (age, BMI, occupation); injury characteristics (tear type, meniscus injury); treatment characteristics (physical therapy, injection)	Multiple imputation	Random Forest	10-fold cross-validation
Magni et al.²⁹	Self-reported	Demographics (age, sex); injury and recovery-related (time since injury, surgical treatment, SIRSI-5 score); sport type (contact vs. noncontact)	Z-score normalization	NR	SMOTE and LOOCV
Torres-Velazquez et al.²⁷	Medical staff	MRI sequences (T1-weighted, T2-weighted, diffusion tensor imaging); Region of Interest (4 hamstring muscles manually segmented, lesion region segmented based on signal abnormality)	Z-score normalization	Logistic regression	10-fold cross-validation
Hwang et al.²³	Medical staff	Demographics (age, BMI); muscular strength and power measures (60°/s knee extensor deficit, 180°/s knee extensor deficit); balance test (Y-Balance test)	Multiple imputation	NR	5-fold cross-validation
Yates et al.²⁴	Medical staff	Demographics (age, sex, height); imaging findings (MRI); injury context (loss of consciousness, post-traumatic amnesia); symptom and clinical assessment (baseline SCAT5 total symptom score)	Missing data imputation using random forest	NR	Random split

AP: average power; AT: athletic trainer; BMI: body mass index; KOOS: knee injury and osteoarthritis outcome score; LOOCV: leave-one-out cross-validation; MRI: magnetic resonance imaging; NR: not reported; PT: peak torque; SIRSI: shoulder instability return to sport after injury; SMOTE: synthetic minority oversampling technique.

Study results characteristics

Commonly used machine learning models

Among the 11 included studies, Random Forest (RF) was the most frequently used algorithm, applied in 6 studies (55%),^19,21,23,24,28,29 with 2 studies utilizing it as the sole modeling method.^24,28 Support Vector Machine (SVM) was employed in 5 studies,^19,21,23,28,29 valued for its strong classification performance in high-dimensional feature spaces, which remains advantageous in clinical predictive modeling. Logistic Regression (LR), despite its limited modeling capacity as a traditional linear classifier, was used in 6 studies^19,21,23,26,28,29 due to its interpretability and role as a benchmark model. Other commonly applied algorithms included XGBoost (n = 4),^20,22,26,28 Decision Tree (DT) (n = 4),^21–23,26 and Multilayer Perceptron (MLP) (n = 3).^19,22,29 Less frequently used algorithms comprised Fisher Discriminant Analysis (FDA),²⁵ k-Nearest Neighbors (KNN),²⁹ and Radial Basis Function Network (RBFN)¹⁹ (see Table 3).

Table 3.

Study results characteristics.

Reference	ML algorithms applied	Top-performing algorithm	Model performance metrics	Feature importance methods
Bergeron et al.¹⁹	LR; NB; SVM; 5-Nearest Neighbors (5NN); C4.5 decision trees with discretized and normalized features (C4.5D and C4.5N); RF100 and RF500; MLP; Radial Basis Function Network (RBFN)	RF	AUC = 0.74	Chi-square, Information Gain, Gain Ratio
Diniz et al.²⁰	XGBoost	XGBoost	AUC = 0.81, Brier = 0.12	Feature selection was performed via forward selection, sequentially adding features that improved model performance
Kerr et al.²⁵	Fisher Discriminant Analysis (FDA)	FDA	AUC = 0.82, Sensitivity = 0.86, Specificity = 0.89	NR
Skoki et al.²⁶	LR; XGBoost; DT	XGBoost	R² = 0.73, MAPE = 0.33	NR
Hwang et al.²¹	LR; DT; RF; Gradient Boosting; SVM; Neural Network (NN)	RF	Single-leg hop test (AUC = 0.95); Tegner activity score (AUC = 0.95); single-leg vertical jump test (AUC = 0.87)	Feature permutation importance and SHapley Additive exPlanations (SHAP)
Lipps Lene et al.²²	DT; XGBoost; MLP	MLP	Without psychological attributes (AUC = 0.57, Precision = 0.46, Sensitivity = 0.24); with psychological attributes (AUC = 0.60, Precision = 0.51, Sensitivity = 0.31)	SHAP
Lu et al.²⁸	XGBoost; SVM; RF; LR; Elastic Net Penalized Regression	XGBoost	Graft failure (AUC = 0.70, Brier = 0.07); contralateral ACL injury (AUC = 0.67, Brier = 0.08)	Global variable importance analysis, partial dependence curves
Magni et al.²⁹	MLP; LR; SVM; RF; Classification and regression tree (CART); K-Nearest Neighbor (KNN)	MLP	Accuracy = 0.72, Sensitivity = 0.64, Specificity = 0.76	Signal-to-noise ratio
Torres-Velazquez et al.²⁷	Support Vector Classifier (SVC)	SVC	Binary (AUC = 0.95); Multiclass (AUC = 0.81)	Frequency-based feature selection importance
Hwang et al.²³	LR; DT; RF; Gradient Boosting; SVM	Gradient boosting and RF	IKDC PASS (AUC = 0.84); ACL-RSI PASS (AUC = 0.84)	Feature permutation importance and SHAP
Yates et al.²⁴	RF	RF	AUC = 0.96, Sensitivity = 1.00, Specificity = 0.94	NR

ACL: anterior cruciate ligament; AUC: area under the receiver operating characteristic curve; DT: decision tree; LR: logistic regression; MAPE: mean absolute percentage error; ML: machine learning; MLP: multilayer perceptron; NB: naïve Bayes; NR: not reported; RF: random forest; RSI: return to sport after injury; SVM: support vector machine; XGBoost: extreme gradient boosting.

Top-performing machine learning models

RF and XGBoost emerged as the top-performing models in the greatest number of studies, being identified as the best-performing algorithms in 4^19,21,23,24 and 3 studies,^20,26,28 respectively. Notably, the RF-based model developed by Yates et al.²⁴ achieved the highest area under the receiver operating characteristic curve (AUC; 0.96) reported in this review for predicting whether athletes with mTBI would miss more than five games postinjury, demonstrating exceptional discriminatory power. Additionally, Torres-Velazquez et al.²⁷ applied Support Vector Classifier (SVC) to both binary and multiclass classification tasks predicting rehabilitation duration following hamstring strain injury, achieving AUCs of 0.95 and 0.81, respectively, highlighting its robustness across different outcome types. Hwang et al.²¹ constructed predictive models for multiple RTS-related outcomes (e.g. single-leg hop test and Tegner activity score), with RF models achieving or approaching AUC values of 0.95, further validating the consistent performance and wide applicability of tree-based models in multioutput settings. Overall, 7 studies (64%) identified tree-based models (including RF and XGBoost) as the top-performing approach, underscoring the strong capability of ensemble learning methods to capture complex, nonlinear relationships, and to generalize effectively in the context of sports injury prediction. These findings suggest that ensemble learning remains one of the most promising strategies for constructing models related to RTS outcomes (see Table 3).

Model evaluation methods

Among the 11 included studies, researchers employed a wide array of performance metrics to evaluate the effectiveness and robustness of ML models in predicting RTS among athletes. The most commonly reported metric was the ROC AUC, presented in 9 studies (82%). Reported AUC values ranged from 0.57 to 0.96, reflecting substantial variability in model architecture, feature selection strategies, and data quality (see Table 3). According to established classification criteria, AUC values between 0.50 and 0.69 indicate poor discriminatory ability, values of 0.70-0.79 fair, 0.80-0.89 good, and ≥0.90 excellent performance.³⁰ In this review, two studies reported AUC values below 0.70, indicating poor performance^22,28; one study fell within the fair range¹⁹; four studies demonstrated good discrimination^20,23,25,27; and two studies achieved excellent performance.^21,24 These findings suggest that some models possess strong discriminatory power for identifying high-risk athletes, especially when tailored to specific data structures and prediction tasks.
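
The AUC bands described above, together with the rank-based definition of AUC, can be sketched as follows. The predicted probabilities and outcomes are hypothetical, the "worse than chance" label for values below 0.50 is an assumption that is not part of the cited criteria, and the band cut-offs are treated as continuous thresholds.

```python
def auc_band(auc):
    """Map an AUC value to the qualitative band used in the text above."""
    if not 0.0 <= auc <= 1.0:
        raise ValueError("AUC must lie in [0, 1]")
    if auc >= 0.90:
        return "excellent"
    if auc >= 0.80:
        return "good"
    if auc >= 0.70:
        return "fair"
    if auc >= 0.50:
        return "poor"
    return "worse than chance"  # assumption: not defined in the cited criteria

def rank_auc(scores, labels):
    """AUC as the probability that a positive case outscores a negative one.

    Ties count as one half (the Mann-Whitney formulation).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities and outcomes; illustrative only.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
auc = rank_auc(scores, labels)
band = auc_band(auc)
```

In this toy example, eight of the nine positive-negative pairs are ranked correctly, so the AUC is 8/9 (about 0.89) and falls in the "good" band.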

In addition to AUC, several studies reported complementary performance indicators to capture dimensions beyond pure discrimination. Sensitivity was the most frequently used secondary metric, reported in four studies,^22,24,25,29 with values ranging from 0.24 to 1.00 and a mean of 0.61. Notably, the RF-based model developed by Yates et al.²⁴ achieved a sensitivity of 1.00, indicating perfect identification of positive cases. In contrast, the model by Lipps Lene et al.²² demonstrated a sensitivity of only 0.24 in the absence of psychological variables, underscoring the critical influence of feature composition on predictive performance. Specificity was reported in three studies,^24,25,29 ranging from 0.76 to 0.94 (mean = 0.86), suggesting good ability to correctly identify nonrisk individuals. Accuracy was reported in two studies,^22,29 with an average of 0.72. Furthermore, the Brier score, a composite metric quantifying the mean squared error between predicted probabilities and actual outcomes, was used in two studies,^20,28 yielding values of 0.12 and 0.07-0.08, respectively. These relatively low scores indicate well-calibrated probability estimates that closely approximate observed results.
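
The secondary metrics discussed above can be made concrete with a short sketch. The confusion-matrix counts and forecasts are hypothetical; the two lists at the end are transcribed from Table 3, and the reported mean sensitivity of 0.61 appears consistent with averaging all five listed values (two of which come from reference 22).

```python
from statistics import mean

def sensitivity(tp, fn):
    """True-positive rate: share of actual positives the model flags."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True-negative rate: share of actual negatives the model clears."""
    return tn / (tn + fp)

def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if len(probabilities) != len(outcomes):
        raise ValueError("inputs must have equal length")
    return mean((p - y) ** 2 for p, y in zip(probabilities, outcomes))

# Hypothetical confusion-matrix counts and forecasts; illustrative only.
sens = sensitivity(tp=18, fn=2)        # 18 / 20
spec = specificity(tn=45, fp=5)        # 45 / 50
brier = brier_score([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0])

# Values transcribed from Table 3 (two sensitivity values for reference 22).
sensitivities = [0.86, 0.24, 0.31, 0.64, 1.00]
specificities = [0.89, 0.76, 0.94]
mean_sens = round(mean(sensitivities), 2)
mean_spec = round(mean(specificities), 2)
```

A Brier score of 0 corresponds to perfect probabilistic predictions, while a model that always predicts 0.5 for a binary outcome scores 0.25, which gives a rough reference point for the values reported in the included studies.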

Model interpretability

Of the 11 studies reviewed, 8 explicitly reported methods for assessing feature importance, reflecting a growing emphasis on model interpretability in the field of sports injury prediction (see Table 3). Common approaches included ranking-based methods such as information gain, chi-square test, and gain ratio¹⁹; model-agnostic techniques like permutation importance^21,23; and SHapley Additive exPlanations (SHAP) values,^21–23 a game-theoretic approach to quantifying the marginal contribution of each feature to the model's output. Some studies integrated multiple interpretability techniques to enhance robustness and clinical relevance. For example, Hwang et al.²¹ employed both SHAP and permutation methods to evaluate the contribution of features to various RTS-related outcomes (e.g. International Knee Documentation Committee (IKDC) PASS, Tegner activity score). Lu et al.²⁸ combined global variable importance with partial dependence plots to illustrate nonlinear relationships between predictors and outcomes, thus strengthening clinical interpretability. Additionally, Magni et al.²⁹ utilized the signal-to-noise ratio (SNR) for feature evaluation, offering an alternative perspective on variable selection strategies. In general, these interpretability approaches facilitated the identification of key predictors of RTS, including symptom duration, anatomical location of the injury, muscle strength metrics, balance performance, and psychological factors. Such insights enhance the clinical utility of predictive models and inform targeted interventions. However, three studies did not clearly describe how feature importance was evaluated.^24–26
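As an illustration of the permutation importance idea mentioned above, the sketch below shuffles one feature at a time and records the resulting drop in accuracy. The `predict` rule is a hypothetical stand-in for a trained model (it is not the method of any included study), and the feature names and synthetic data are assumptions made for the example.

```python
import random


def predict(row):
    # Hypothetical stand-in for a trained classifier: it uses the first two
    # features and ignores the third by construction.
    strength, balance, noise = row
    return 1 if 0.6 * strength + 0.4 * balance >= 0.5 else 0


def accuracy(rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the outcome
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Synthetic data: labels are generated by the rule itself, so baseline accuracy is 1.0.
data_rng = random.Random(42)
rows = [(data_rng.random(), data_rng.random(), data_rng.random()) for _ in range(200)]
labels = [predict(r) for r in rows]

importances = permutation_importance(rows, labels)
for name, value in zip(("strength", "balance", "noise"), importances):
    print(f"{name}: {value:.3f}")
```

The ignored feature receives an importance of exactly zero because shuffling it cannot change any prediction, whereas shuffling either of the two features the rule uses lowers accuracy.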

Discussion

Among the 11 included studies, the RF algorithm emerged as the most frequently employed model, appearing in 55% of the studies, reflecting its widespread recognition and practical value in the domain of RTS prediction following athletic injury. The AUC was the most commonly reported performance metric, with 9 studies (82%) providing AUC values. As a result, AUC serves as a crucial benchmark for cross-study comparison of predictive model performance within this field. Notably, the highest reported AUC value across all included studies was 0.96 for an RF model, underscoring RF's strong empirical performance in this domain.

This observed advantage can be explained by several methodological strengths of the RF framework that align closely with the structure of RTS-related datasets. RF inherently accommodates heterogeneous data types (continuous, ordinal, and categorical variables), such as demographic characteristics, clinical indicators, psychometric measures, and imaging-derived metrics. Unlike many parametric models, it imposes minimal distributional assumptions, which reduces the need for intensive preprocessing or encoding transformations.³¹ Another important strength of RF lies in its robustness to outliers and missing data. RF-based imputation algorithms such as missForest have demonstrated superior performance compared to conventional approaches, particularly in datasets containing mixed variable types and nonlinear associations.³² This property helps preserve statistical efficiency and mitigates bias that often results from listwise deletion or naïve imputations, an issue that is common in sports injury datasets with limited sample sizes. As an ensemble learning approach, RF aggregates a large number of decorrelated decision trees to generate a stable and generalizable prediction. This ensemble mechanism reduces variance, enhances predictive reliability, and enables the model to capture complex, nonlinear interactions among predictors.³³ Such characteristics are especially beneficial for RTS prediction, where recovery outcomes are influenced by interacting biomechanical, physiological, and psychosocial factors. In addition, RF models can implicitly detect higher-order feature interactions through their hierarchical tree structures and provide interpretable outputs such as variable importance rankings and partial dependence plots.
These features make it possible to identify synergistic relationships—such as between neuromuscular function and psychological readiness—that are often difficult to specify a priori in parametric models.³¹ Together, these methodological characteristics explain why RF consistently outperformed other models in the reviewed studies, achieving AUCs as high as 0.96 and demonstrating strong suitability for complex, multifactorial prediction tasks in sports injury research.
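The variance-reduction argument made above for ensembles of decorrelated trees can be stated with a standard textbook identity, which is not taken from the reviewed studies. Assume, as a simplification, that each of the B trees yields a prediction T_b(x) with the same variance σ² and the same pairwise correlation ρ. The variance of the averaged prediction is then:

```latex
\[
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
\]
```

Adding trees shrinks only the second term, whereas lowering the correlation between trees (through bootstrap resampling and random feature subsets) reduces the first term, which is why decorrelation matters for the stability of RF predictions.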

This review also revealed substantial heterogeneity in feature selection across the included studies. Such heterogeneity in variable composition complicates cross-study comparison and may limit the generalizability and clinical applicability of predictive models. Most models primarily relied on demographic and physiological-functional variables. For instance, Hwang et al.²¹ used isokinetic knee extension strength and Y-Balance test metrics in ACL injury prediction models. Similarly, Bergeron et al.^19,23 and Diniz et al.²⁰ incorporated athlete exposure data (e.g. minutes played and starting status) and demographic characteristics (e.g. age, sex, and height) into their model inputs. Skoki et al.²⁶ further extended feature construction by incorporating injury context (e.g. training vs. competition) and clinical indicators of soft tissue status (e.g. swelling grade and tension level), thereby enriching the models with injury-site-specific and experiential clinical information. Although prior studies have highlighted the significance of psychological factors in RTS decision-making,^34–37 only a minority (18%) of the included studies incorporated psychological variables. For example, Lipps Lene et al.²² introduced athletic anxiety scores, KOOS symptom and pain subscales, and self-reported functional ratings, and found that including psychological factors improved model performance (AUC increased from 0.57 to 0.60; precision from 0.46 to 0.51; sensitivity from 0.24 to 0.31), supporting the importance of psychological readiness in RTS prediction. Similarly, Magni et al.²⁹ used the SIRSI-5 questionnaire as a psychosocial indicator to capture athletes' perceived recovery. Nonetheless, the small number of studies that included psychological features, together with inconsistencies in variable types, measurement tools, and standardization, restricts the generalizability and cross-study comparability of these factors.

Another methodological source of variability that affects both model development and cross-study comparability lies in the substantial heterogeneity of sample sizes among the included studies, ranging from as few as 32 to as many as 1611 participants. Such disparities have important implications for model development and evaluation. Models trained on small datasets are particularly vulnerable to overfitting, as they tend to capture idiosyncratic or study-specific patterns rather than generalizable relationships. This often results in inflated performance metrics (e.g. accuracy or AUC) that fail to replicate in independent cohorts, thereby undermining external validity. In contrast, larger samples provide greater data diversity and more robust representation of interindividual variability, facilitating more stable model training and reliable validation. Consequently, the imbalance in sample sizes across studies complicates cross-study comparisons of predictive accuracy and likely contributes to the heterogeneity in model performance observed in this review. Empirical evidence further supports this interpretation. Increasing sample size has been shown to enhance classification accuracy and reduce variability in effect size estimation across datasets, thereby improving model generalizability.³⁸ Conversely, studies based on small samples are prone to report overly optimistic results that fail to replicate in larger, independent cohorts. For instance, in neuroimaging-based prediction research, models trained on limited samples (e.g. N ≈ 20) often achieved unrealistically high accuracies that substantially decreased when evaluated on more representative datasets.³⁹ Similarly, methodological analyses have demonstrated that cross-validation applied to very small datasets produces wide confidence intervals for performance estimates, undermining the stability and reliability of model evaluation.⁴⁰ Collectively, these findings underscore that sample size heterogeneity is not a trivial methodological issue but a critical determinant of model credibility and cross-study comparability.

In addition to sample size variability, considerable heterogeneity in the operational definitions of RTS outcomes further complicates model evaluation and cross-study comparability. Definitions of RTS spanned multiple dimensions: symptom resolution (e.g. duration of concussion symptoms in Bergeron et al.¹⁹), functional status (e.g. “readiness to return” in Kerr et al.²⁵), competitive level (e.g. return to elite competition in Diniz et al.²⁰ and Magni et al.²⁹), participation status (e.g. absence from five or more games in Yates et al.²⁴), rehabilitation duration (e.g. Skoki et al.²⁶ and Torres-Velazquez et al.²⁷), and combined objective-subjective criteria (e.g. PASS standards used by Hwang et al.^21,23).
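The earlier point about sample size and the width of performance confidence intervals can be made concrete with a rough calculation. The sketch below uses a normal-approximation (Wald) interval for an observed accuracy, which is a simplification chosen here for illustration: it assumes independent test cases and ignores the dependence between cross-validation folds, so it should be read as an optimistic lower bound on the real uncertainty. The accuracy of 0.80 is a hypothetical value. The sample sizes 32 and 1611 are the extremes reported in this review, and 20 mirrors the neuroimaging example.

```python
import math


def wald_interval(p_hat, n, z=1.96):
    """Approximate 95% interval for a proportion observed on n independent cases."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)


observed_accuracy = 0.80  # hypothetical value, identical for every sample size
widths = {}
for n in (20, 32, 1611):
    low, high = wald_interval(observed_accuracy, n)
    widths[n] = high - low
    print(f"n={n}: {low:.3f} to {high:.3f} (width {widths[n]:.3f})")
```

The same observed accuracy is compatible with a far wider range of true performance when it is estimated on a few dozen athletes than when it is estimated on more than a thousand.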

Among all included studies, two demonstrated excellent performance (AUC > 0.90). Yates et al.²⁴ developed a model to predict whether an athlete with mTBI would miss more than five games. This model achieved an AUC of 0.96, supported by a clearly defined outcome, high feature–outcome correlation, and the use of structured neurobehavioral data, thereby reducing model complexity. However, the relative simplicity of this task, despite its practical relevance, may limit its ability to capture the full complexity of clinical RTS decision making. Similarly, Hwang et al.²¹ constructed a model to predict whether patients following ACL reconstruction would achieve RTS goals at 12 months based on functional assessments at 3 months (e.g. single-leg hop, Y-Balance, muscle strength recovery), achieving an AUC of 0.95. The model's strong performance may be attributed to two factors: (1) the inclusion of structured, RTS-relevant functional variables that have been widely validated in rehabilitation settings and (2) the use of well-defined RTS criteria, such as recovery of the Tegner activity score, minimizing the risk of outcome label ambiguity. However, this highly standardized modeling framework may face limitations in real-world applications, where variability in rehabilitation trajectories, sport-specific demands, and psychosocial factors may compromise its generalizability. A further study by Hwang et al.²³ employed PASS thresholds from the IKDC and ACL-RSI scales as RTS outcome criteria, constructing models based on postoperative functional data. The resulting Gradient Boosting and RF models yielded AUC values between 0.835 and 0.844, indicating good predictive performance. Despite efforts to integrate subjective symptom perception and psychological readiness, the inclusion of these dimensions introduced added complexity and challenges in outcome standardization, potentially limiting cross-study comparability and interpretive clarity.
It is therefore important to emphasize that reported model performance should not be interpreted solely as a reflection of the intrinsic superiority of specific ML algorithms. Model performance is shaped by multiple interacting factors, including the quality of feature selection, outcome definition, sample representativeness, and methodological rigor.^15,41,42 The high AUC values observed in several studies may largely result from the use of well-defined and clearly dichotomized outcome variables. However, such “idealized” evaluation scenarios may obscure the limitations of these models when applied in complex, real-world clinical environments.

Despite the promising predictive performance reported in some studies, model interpretability remains a major bottleneck hindering the clinical translation of ML algorithms in RTS decision making. When the internal mechanisms of ML models are opaque or difficult for clinical experts to understand and validate, their practical utility in real-world decision making is substantially diminished. Approximately 73% of the reviewed studies explicitly reported using some form of interpretability method, although the depth and sophistication of these approaches varied considerably. Early studies on interpretability primarily focused on feature selection strategies. For example, Bergeron et al.¹⁹ employed statistical methods such as chi-square testing, information gain, and gain ratio to assess the relevance of individual features to RTS outcomes. Similarly, Diniz et al.²⁰ used forward selection to sequentially incorporate features based on their incremental contribution to model performance. While these methods can enhance training efficiency to some extent, they fall short of providing a comprehensive understanding of the model's decision-making process and fail to capture complex feature interactions or their joint influence on predictions. In contrast, the recent development of explainable ML (XML) tools has substantially mitigated the “black box” nature of ML models.⁴³ Several studies, including those by Hwang et al.^21,23 and Lipps Lene et al.,²² incorporated SHAP to enhance model transparency.^44,45 SHAP, grounded in cooperative game theory, quantifies the marginal contribution of each feature to a specific prediction and supports both global feature importance ranking and local, individual-level interpretability. This capability offers clinicians a more intuitive framework for understanding model outputs, thereby enhancing trust and likelihood of adoption. 
It is important to note that SHAP provides insights into statistical associations rather than causal relationships. Therefore, its outputs should not be interpreted as causal inferences. Nonetheless, in clinical decision making, SHAP-based explanations retain substantial value. On the one hand, they help clinicians identify the most influential factors driving individual predictions; on the other hand, they facilitate the identification of modifiable intervention targets, thereby informing tailored rehabilitation strategies. This is particularly relevant in RTS contexts, where decisions are influenced by multifactorial and highly interactive elements. SHAP provides a structured pathway to trace contributing variables, offering a promising avenue for practical implementation.^46,47
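The game-theoretic idea behind SHAP described above can be illustrated by computing exact Shapley values for a very small model. The sketch below is a simplification under stated assumptions: the `model` function, its coefficients, and the feature names are hypothetical, and a single all-zero reference profile stands in for the background distribution that SHAP implementations typically average over. It is not the SHAP library's algorithm and not the approach of any included study.

```python
from itertools import permutations


def model(x):
    # Hypothetical risk score with an interaction between two features;
    # "age" is present in the input but deliberately unused.
    return (0.2 * x["fear"]
            + 0.3 * x["strength_deficit"]
            + 0.5 * x["fear"] * x["strength_deficit"])


def shapley_values(model, x, baseline):
    """Average each feature's marginal contribution over all feature orderings."""
    names = list(x)
    totals = {name: 0.0 for name in names}
    orders = list(permutations(names))
    for order in orders:
        current = dict(baseline)      # start from the reference profile
        previous = model(current)
        for name in order:
            current[name] = x[name]   # switch this feature to the athlete's value
            value = model(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orders) for name, total in totals.items()}


athlete = {"fear": 1.0, "strength_deficit": 1.0, "age": 1.0}
reference = {"fear": 0.0, "strength_deficit": 0.0, "age": 0.0}

phi = shapley_values(model, athlete, reference)
print(phi)
print(sum(phi.values()), model(athlete) - model(reference))
```

The attributions sum to the difference between the athlete's prediction and the reference prediction, the interaction term is split equally between the two interacting features, and the unused feature receives zero. This is the kind of local, individual-level breakdown the text refers to, and it describes the model's behavior rather than any causal effect.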

Although current research has demonstrated the potential of SHAP to enhance model transparency, a major translational gap remains: how to convert interpretability outputs into actionable clinical guidance for RTS decision making. Several directions may help bridge this gap. First, integrating SHAP-derived importance values into existing RTS assessment frameworks could lead to hybrid, data-informed decision-support tools. By mapping key predictive variables to conventional RTS risk stratification systems, researchers can identify modifiable bio-psycho-social factor clusters that guide structured, evidence-based interventions.^17,48 Second, interpretability methods can facilitate individualized RTS planning. Instead of relying solely on population-level probabilities, patient-specific SHAP profiles can highlight dominant risk drivers—such as psychological fear versus neuromuscular imbalance—allowing clinicians to tailor interventions accordingly (e.g. psychological counseling vs. targeted physical rehabilitation).⁴⁹ Finally, dynamic interpretability approaches, such as WindowSHAP, could enable time-resolved monitoring of predictor importance throughout rehabilitation. By tracking temporal changes in key features, clinicians may identify high-risk recovery phases and adjust treatment intensity or focus accordingly, forming a closed-loop, adaptive management model for RTS.⁵⁰ Collectively, these strategies provide a feasible pathway for transforming interpretability insights into clinically actionable tools, thereby narrowing the gap between ML research and its practical implementation in sports medicine.

Limitations

Despite the implementation of rigorous inclusion and exclusion criteria, there remains considerable heterogeneity across studies in the definition and determination of RTS outcomes, which introduces variability in both the statistical synthesis and interpretability of results. Substantial methodological differences exist among the included studies regarding the types of ML algorithms employed, hyperparameter tuning strategies, feature engineering approaches, and data preprocessing procedures. These discrepancies further undermine the comparability of model performance metrics. Moreover, several studies were constrained by relatively small sample sizes or potential selection bias, which may weaken model generalizability and external validity. Another critical limitation lies in the insufficient integration of psychological variables within RTS prediction models. Although the biopsychosocial model underscores that psychological readiness—such as fear of reinjury, motivation, and confidence—plays a decisive role in determining successful return to sport, only a small proportion of the included studies (18%) incorporated such variables into their predictive frameworks. This omission not only restricts the models’ ability to capture the full complexity of recovery processes but also reduces their applicability in real-world rehabilitation decision making. Future research should therefore prioritize the inclusion of validated psychological assessments alongside physiological and biomechanical indicators to achieve a more holistic and ecologically valid prediction of RTS outcomes. Finally, most included studies employed static modeling approaches without longitudinal tracking or adaptive feature updating, thereby limiting real-time adaptability in dynamic clinical contexts. It is also important to note that RTS represents a progressive continuum—from “return to participation” to “return to performance”—rather than a single endpoint. 
Given the limited number of ML-based studies available, this review did not restrict inclusion to specific RTS phases, which may have introduced further variability and affected the consistency and interpretability of findings.

Conclusion

This systematic review analyzed current ML-based studies aimed at predicting RTS in athletes. Although some models demonstrated high predictive performance, substantial heterogeneity in RTS outcome definitions, feature selection, and study designs limits the utility of model performance as the sole criterion for evaluating algorithm quality. Notably, limited model interpretability and barriers to clinical application remain significant challenges. In the context of the complex and multidimensional RTS process, reliance on singular functional or subjective indicators is insufficient to comprehensively capture an athlete's recovery status. Future research should focus on developing a standardized and multidimensional RTS decision-making framework that integrates physiological, psychological, and behavioral indicators to enhance ecological validity and model generalizability. Additionally, improving model interpretability is critical to facilitating the translation of ML algorithms into clinically actionable tools that support individualized and dynamic rehabilitation management.

Supplemental Material

sj-docx-1-dhj-10.1177_20552076251408523, sj-xlsx-2-dhj-10.1177_20552076251408523, and sj-docx-3-dhj-10.1177_20552076251408523 - Supplemental material for From injury to comeback: A systematic review of machine learning models predicting return to sport in athletes by Jin Yuan, Zhuojia Li, Quanwen Zeng, Jun Li, Anjie Wang, Yong Zhang and Fei Xu in DIGITAL HEALTH

Footnotes

Acknowledgements

The authors would like to gratefully acknowledge all those who helped us during the writing of this manuscript.

ORCID iDs

Jin Yuan

Quanwen Zeng

Jun Li

Anjie Wang

Yong Zhang

Ethical approval

Ethical approval was not required for this study.

Consent for publication

Not applicable.

Contributorship

JY and YZ were involved in conceptualization; JY in methodology; YZ and FX in funding acquisition and project administration; JY, QZ, and YZ in database search and risk of bias assessment; JY and ZL in data extraction; JY and JL in data analysis; JL and AW in supervision; and YZ, JY, ZL, FX, and AW in writing (review and editing). All authors have read and agreed to the published version of the manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Key Project of Humanities and Social Sciences in Anhui Province Universities (2024AH052247); the Fundamental Research Funds for the Central Universities: The Generative Logic and Action Path of Digital Technology Empowering Youth Sports Participation (HIT.HSS.202314); the Basic Research Fund for Provincial Universities of Heilongjiang Province: An Experimental Study on Health Qigong Rehabilitation Prescriptions for Chronic Low Back Pain (2020KYYWF-FC13); Major Project of Philosophy and Social Sciences in Anhui Province Universities (2023AH040116); and Outstanding Research Team of Universities under Anhui Provincial Department of Education, “Cognitive Neuroscience Innovation Team” (2022AH010060).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Guarantor

Fei Xu

Availability of data

All data generated or analyzed during this study are included in this manuscript and Supplemental materials.

Supplemental material

Supplemental material for this article is available online.

References

1. Valle, Mechó, Alentorn-Geli, et al. Return to play prediction accuracy of the MLG-R classification system for hamstring injuries in football players: a machine learning approach. Sports Med 2022; 52: 2271–2282.

2. Yung, Ardern, Serpiello, et al. Characteristics of complex systems in sports injury rehabilitation: examples and implications for practice. Sports Med Open 2022; 8: 24.

3. Siouras, Moustakidis, Giannakidis, et al. Knee injury detection using deep learning on MRI studies: a systematic review. Diagnostics 2022; 12: 537.

4. Anderson, Browning III, Urband, et al. A systematic summary of systematic reviews on the topic of the anterior cruciate ligament. Orthop J Sports Med 2016; 4: 2325967116634074.

5. Paterno, Rauh, Schmitt, et al. Incidence of second ACL injuries 2 years after primary ACL reconstruction and return to sport. Am J Sports Med 2014; 42: 1567–1573.

6. Della Villa, Hägglund, Della Villa, et al. High rate of second ACL injury following ACL reconstruction in male professional footballers: an updated longitudinal analysis from 118 players in the UEFA Elite Club Injury Study. Br J Sports Med 2021; 55: 1350–1357.

7. De Visser, Reijman, Heijboer, et al. Risk factors of recurrent hamstring injuries: a systematic review. Br J Sports Med 2012; 46: 124–130.

8. Dalton, Kerr, Dompier. Epidemiology of hamstring strains in 25 NCAA sports in the 2009–2010 to 2013–2014 academic years. Am J Sports Med 2015; 43: 2671–2679.

9. Schut, Wangensteen, Maaskant, et al. Can clinical evaluation predict return to sport after acute hamstring injuries? A systematic review. Sports Med 2017; 47: 1123–1144.

10. Carling, Le Gall, Orhant. A four-season prospective study of muscle strain reoccurrences in a professional football club. Res Sports Med 2011; 19: 92–102.

11. Kyritsis, Bahr, Landreau, et al. Likelihood of ACL graft rupture: not meeting six clinical discharge criteria before return to sport is associated with a four times greater risk of rupture. Br J Sports Med 2016; 50: 946–951.

12. Bittencourt, Meeuwisse, Mendonça, et al. Complex systems approach for sports injuries: moving from risk factor identification to injury pattern recognition—narrative review and new concept. Br J Sports Med 2016; 50: 1309–1314.

13. Germann, Marbach, Civardi, et al. Deep convolutional neural network-based diagnosis of anterior cruciate ligament tears: performance comparison of homogenous versus heterogeneous knee MRI cohorts with different pulse sequence protocols and 1.5-T and 3-T magnetic field strengths. Invest Radiol 2020; 55: 499–506.

14. Leckey, Van Dyk, Doherty, et al. Machine learning approaches to injury risk prediction in sport: a scoping review with evidence synthesis. Br J Sports Med 2025; 59: 491–500.

15. Van Eetvelde, Mendonça, Ley, et al. Machine learning methods in sport injury prediction and prevention: a systematic review. J Exp Orthop 2021; 8: 27.

16. Page, McKenzie, Bossuyt, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. Br Med J 2021; 372: n71.

17. Ardern, Glasgow, Schneiders, et al. 2016 Consensus statement on return to sport from the First World Congress in Sports Physical Therapy, Bern. Br J Sports Med 2016; 50: 853–864.

18. Fernandez-Felix, López-Alcalde, Roqué, et al. CHARMS and PROBAST at your fingertips: a template for data extraction and risk of bias assessment in systematic reviews of predictive models. BMC Med Res Methodol 2023; 23: 44.

19. Bergeron, Landset, Maugans, et al. Machine learning in modeling high school sport concussion symptom resolve. Med Sci Sports Exercise 2019; 51: 1362–1371.

20. Diniz, Abreu, Lacerda, et al. Pre-injury performance is most important for predicting the level of match participation after Achilles tendon ruptures in elite soccer players: a study using a machine learning classifier. Knee Surg Sports Traumatol Arthrosc 2022; 30: 4225–4237.

21. Hwang U-J, Kim J-S, Kim K-Y, et al. Machine learning models for predicting return to sports after anterior cruciate ligament reconstruction: physical performance in early rehabilitation. Digit Health 2024; 10. DOI: https://doi.org/10.1177/20552076241299065.

22. Lipps Lene, Frere, Weissland. Machine learning in knee injury sequelae detection: unravelling the role of psychological factors and preventing long-term sequelae. J Exp Orthop 2024; 11: e70081–e70081.

23. Hwang U-J, Kim J-S, Chung. Machine learning predictions of subjective function, symptoms, and psychological readiness at 12 months after ACL reconstruction based on physical performance in the early rehabilitation stage: retrospective cohort study. Orthop J Sports Med 2025; 13. DOI: https://doi.org/10.1177/23259671251319512.

24. Yates, et al. Developing a multivariate model for the prediction of concussion recovery in sportspeople: a machine learning approach. BMJ Open Sport Exerc Med 2025; 11: e002090–e002090.

25. Kerr, Ledet, Hahn, et al. Quantitative assessment of balance for accurate prediction of return to sport from sport-related concussion. Sports Health Multidiscip Approach 2022; 14: 875–884.

26. Skoki, Napravnik, Polonijo, et al. Revolutionizing soccer injury management: predicting muscle injury recovery time using ML. Appl Sci Basel 2023; 13: 6222.

27. Torres-Velazquez, Wille, Hurley, et al. MRI Radiomics for hamstring strain injury identification and return to sport classification: a pilot study. Skeletal Radiol 2024; 53: 637–648.

28. Till, Labott, et al. Graft failure and contralateral ACL injuries after primary ACL reconstruction: an analysis of risk factors using interpretable machine learning. Orthop J Sports Med 2024; 12. DOI: https://doi.org/10.1177/23259671241282316.

29. Magni, Webster, Olds. Validation of the short form shoulder instability return to sport after injury (SIRSI-5) and its association with return to sports. Orthop J Sports Med 2024; 12. DOI: https://doi.org/10.1177/23259671241276865.

30. Hosmer Jr, Lemeshow, Sturdivant. Applied logistic regression. Hoboken, NJ, USA: John Wiley & Sons, 2013.

31. Biau, Scornet. A random forest guided tour. Test 2016; 25: 197–227.

32. Stekhoven, Bühlmann. Missforest—non-parametric missing value imputation for mixed-type data. Bioinformatics 2012; 28: 112–118.

33. Breiman. Random forests. Mach Learn 2001; 45: 5–32.

34. Olds, Webster. Factor structure of the shoulder instability return to sport after injury scale: performance confidence, reinjury fear and risk, emotions, rehabilitation and surgery. Am J Sports Med 2021; 49: 2737–2742.

35. Webster, Feller, Lambros. Development and preliminary validation of a scale to measure the psychological impact of returning to sport following anterior cruciate ligament reconstruction surgery. Phys Ther Sport 2008; 9: 9–15.

36. Ardern, Webster, Taylor, et al. Return to sport following anterior cruciate ligament reconstruction surgery: a systematic review and meta-analysis of the state of play. Br J Sports Med 2011; 45: 596–606.

37. Lentz, Zeppieri Jr, George, et al. Comparison of physical impairment, functional, and psychosocial measures based on fear of reinjury/lack of confidence and return-to-sport status after ACL reconstruction. Am J Sports Med 2015; 43: 345–353.

38. Rajput, Wang W-J, Chen C-C. Evaluation of a decided sample size in machine learning applications. BMC Bioinform 2023; 24: 48.

39. Flint, Cearns, Opel, et al. Systematic misestimation of machine learning performance in neuroimaging studies of depression. Neuropsychopharmacology 2021; 46: 1510–1517.

40. Varoquaux. Cross-validation failure: small sample sizes lead to large error bars. Neuroimage 2018; 180: 68–77.

41. Remeseiro, Bolon-Canedo. A review of feature selection methods in medical applications. Comput Biol Med 2019; 112: 103375.

42. Haury A-C, Gestraud, Vert J-P. The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLoS ONE 2011; 6: e28210.

43. Petch, Nelson. Opening the black box: the promise and limitations of explainable machine learning in cardiology. Can J Cardiol 2022; 38: 204–213.

44. Lundberg, Lee S-I. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst 2017; 30: Paper ID 476.

45. Sahakyan, Aung, Rahwan. Explainable artificial intelligence for tabular data: a survey. IEEE Access 2021; 9: 135392–135422.

46. Kumar, Venkatasubramanian, Scheidegger, et al. Problems with Shapley-value-based explanations as feature importance measures. In: International Conference on Machine Learning, 2020, pp. 5491–5500. PMLR.

47. Tourani. Predictive and causal implications of using Shapley value for model interpretation. In: Proceedings of the 2020 KDD Workshop on Causal Discovery, 2020, pp. 23–38. PMLR.

48. Slater, Kvist, Ardern. Biopsychosocial factors associated with return to preinjury sport after ACL injury treated without reconstruction: NACOX cohort study 12-month follow-up. Sports Health 2023; 15: 176–184.

49. Webster, Nagelli, Hewett, et al. Factors associated with psychological readiness to return to sport after anterior cruciate ligament reconstruction surgery. Am J Sports Med 2018; 46: 1545–1550.

50. Nayebi, Tipirneni, Reddy, et al. WindowSHAP: an efficient framework for explaining time-series classifiers based on Shapley values. J Biomed Inform 2023; 144: 104438.
