Predicting time to clearance of sport-related concussions using machine learning

Abstract

Objective

To evaluate whether integrating longitudinal clinical data improves machine learning (ML)-based prediction of time to medical clearance following sport-related concussion (SRC) and to identify clinical features most strongly associated with classification of either ‘prolonged’ recovery (≥ 30 days) or ‘normal’ recovery (< 30 days).

Methods

A retrospective cohort of 217 athletes (mean age 26.94 years) from the USF Concussion Center (2021–2025) was analyzed. Six ML classifiers were trained on Visit 1 features (n = 48) and combined Visit 1 + Visit 2 features (n = 95). Internal validation was performed using Leave-One-Out Cross-Validation (LOOCV).

Results

Prolonged recovery occurred in 81.1% of the cohort. Adding Visit 2 features improved accuracy in 66% of models, with XGBoost achieving the highest accuracy (0.84, +5% gain over Visit 1). Specificity remained low (0.00–0.34) due to class imbalance. VOR Vertical Headache and its change score were the most frequent predictors of prolonged recovery, present in 81% and 100% of models, respectively. Treatment presence between visits emerged as the strongest predictor of normal recovery.

Conclusions

Longitudinal clinical data modestly improve ML-based SRC recovery predictions. Vestibulo-oculomotor symptoms, particularly headache provoked during vertical VOR testing, are robust prognostic indicators. These findings support the utility of granular VOMS subscores for early risk stratification and targeted rehabilitation. External validation is required before clinical deployment. Code: https://github.com/MeganTran6023/Sport-Related-Concussions_Machine-Learning. IRB: USF STUDY003514.

Keywords

concussion; traumatic brain injury (TBI); machine learning (ML); return to sport

Introduction

Recent trends show an annual increase in sports participation in the United States and, with it, an increased potential for traumatic brain injuries (TBIs) such as concussion.¹ Given this increased exposure, current estimates suggest that over 3.8 million concussions occur in the US annually, a steady increase over previous decades (i.e., from 1.7 concussions per 10,000 athlete exposures in 1988–1989 to 3.4 in 2003–2004² and to 4.47 between 2009 and 2014³). The problem is exacerbated by estimates that up to 50% of concussions go unreported.¹ It is therefore essential that licensed medical practitioners (e.g., physicians, neurologists, neuropsychologists, and emergency medicine specialists) accurately diagnose TBIs following a suspected injury. To do so, they use a combination of tests, such as the Balance Error Scoring System (BESS)⁴ and Vestibular Ocular Motor Screening (VOMS),⁵ to assess cognitive, observational, and visual outcomes related to TBI symptoms.⁶ However, these preliminary assessments are typically sufficient only to diagnose TBIs and cannot adequately predict a player’s time to clearance.⁷ Machine learning (ML) offers a way to improve diagnostic capabilities in healthcare while also improving the prediction of longitudinal prognosis (e.g., time to clearance).⁸ These ML methods can be applied to data collected at the time of the original diagnosis or can combine longitudinal data to improve prognostic capability.⁹

The objective of this study was to use machine learning models to classify recovery duration until medical clearance (i.e., return to sport) following sport-related TBI. This was done by integrating data from gold-standard clinical assessments (e.g., BESS,⁴ VOMS,⁵ ImPACT,¹⁰ etc.) collected across multiple time points by licensed medical practitioners. Specifically, the study aimed to evaluate whether the predictive accuracy of time to clearance improves given longitudinal data, while also identifying and quantifying the specific assessment features that most strongly drive the prediction of time to clearance. By identifying assessment features that demonstrate a predictive signal, this work provides exploratory insights that could inform the future development of evidence-based protocols for clinical and applied contexts. Ultimately, this study supports a framework for longitudinal monitoring of TBI that can inform both individualized return-to-play decisions and broader clinical guidelines. This study is primarily comparative in scope: rather than proposing a single deployable clinical tool, it benchmarks six ML classifiers across longitudinal data to identify which model architectures and clinical features best support future development of a validated prediction tool. External validation and prospective testing are required before any model described here could be considered for clinical deployment. The intended target population for a future validated version of this model consists of athletes diagnosed with sport-related concussion (SRC) presenting to a sports medicine or concussion specialty clinic within one year of injury. The intended users are licensed clinicians experienced in concussion management (e.g., sports medicine physicians, neurologists, athletic trainers), who would use model output to supplement, not replace, clinical judgment regarding return-to-play timelines (TRIPOD+AI Item 3b).

Related works

Gold standard concussion assessments

In sports-related activities, athletes are at risk of the acute effects of concussion.¹¹ This includes decreased verbal/visual memory and processing speed during the acute time period defined as 1 – 14 days post-concussion.¹² Consequently, these impairments translate to declines in both cognitive outcomes and athletic performance.¹¹ Because TBI affects several aspects of an athlete’s mental and physical abilities, multiple cognitive assessments have been used for screening/diagnosing TBIs. These tools are used to evaluate cognitive, observational, and visual outcomes related to TBIs (Appendix Table 10).

While it is recommended that a version of the Sport Concussion Assessment Tool (SCAT) be used in any TBI assessment, given its multi-modal approach to concussion screening,¹³ it is also common for other tests (e.g., King-Devick for oculomotor assessment) to be administered in conjunction with the SCAT to enlarge the overall feature set.¹³ Similarly, many of these assessments have modified versions (e.g., the Balance Error Scoring System (BESS) and modified BESS (mBESS)), which may change sensitivity when assessing different populations or conditions.¹⁴ The ImPACT test battery also incorporates the Post-Concussion Symptom Scale (PCSS), which explains why the two tests are not always administered together for concussion diagnosis and evaluation.^15,16 Neurocognitive testing is incorporated to better identify concussion in athletes after an injury, as relying solely on symptoms is not sufficient for proper diagnosis.¹⁷ PCSS is a computerized neurocognitive test many health care professionals use to determine the number and severity of symptoms an athlete experiences following a concussion.¹⁸ Langevin et al. found only a low to moderate correlation between the frequency of symptoms reported by concussed athletes on the PCSS and scores on the Dizziness Handicap Inventory, Headache Disability Inventory, and Neck Disability Index.¹⁹ Symptom assessments are crucial in concussion testing, as understanding an individual’s specific profile enables clinicians to implement a personalized recovery plan.²⁰ Common physical symptoms, such as headaches, dizziness, and light sensitivity, often appear immediately following the injury. It is imperative to record a patient’s symptoms for multiple reasons, including establishing a baseline and, when recorded longitudinally, tracking their progression.^21–23

Machine learning for health prediction

Given the increased application of ML in healthcare, supervised learning has been extensively deployed as a means of processing neurological disease datasets to produce explainable results.²⁴ This is possible due to the large amount of digitally available data (e.g., from gold standard assessments, digital devices, patient-reported outcomes, etc.) relating to the neuro-cognitive functions of interest (e.g., motor, memory, and executive function).²⁵ The availability of these types of data is highly valuable for standardizing the diagnosis of TBIs.²⁶ However, many of these tools, being focused on diagnosing TBIs, only account for a single time point of data, which makes it difficult to predict future outcomes.²⁷ Further, as there are various types of features (e.g., binary responses, continuous values, etc.) across multiple neuro-cognitive domains, ML is needed to analyze the underlying patterns.²⁴ Combining ML with existing evaluation tools allows clinicians to capture complex relationships between assessment features that are not easily identifiable.²⁸ This integration can improve clinical and patient outcomes by enabling earlier, more accurate diagnosis, personalized prognosis, and data-driven treatment planning that supports timely interventions and informed clinical decision-making.²⁹ Recent methodological literature in the TBI and concussion domain has increasingly focused on enhancing model robustness through advanced deep learning frameworks and validation strategies. For instance, Ref. 30 demonstrated how convolutional neural network (CNN) architectures and robust preprocessing (such as addressing class imbalance through oversampling) can improve classification accuracy across temporal stages of pathology, offering a valuable perspective on the validation required for clinical translation.

Bergeron et al. found that Naive Bayes and Random Forest models were the top performers in predicting concussion resolution in high school athletes within 7, 14, or 28 days.³¹ The top features driving this performance included difficulty concentrating, sensitivity to light/noise, and balance issues.³¹ Chu et al. reported that the CatBoost model outperformed traditional statistical methods in both predictive and discriminative performance when predicting concussion recovery time and protracted recovery from clinical data involving the Vestibular Ocular Motor Screening (VOMS), King-Devick Test, and the C3 Logix Trails Test.³² Thomas and Arnett found that their Random Forest model performed best in classifying concussed college athletes’ recovery timeframes as typical (≤ 28 days) or prolonged (> 28 days).³³ These recent studies add to the existing literature by using ML models to capture nonlinear relationships within complex datasets in order to determine recovery timeframes for concussed patients.³¹ Thomas and Arnett’s study highlights a limitation of human-driven statistical methods that ML models overcome, namely the use of past data for prospective concussion recovery predictions.³³ Furthermore, the models of Chu et al. required fewer features than traditional methods to accurately predict concussion recovery time: 11 features, compared with 25–27 for traditional models.³²

Methodology

This study was reported in accordance with the TRIPOD+AI checklist.³⁴ The completed checklist is provided in the Appendix, with locations indicated by page number.

Cohort

The dataset used in this study was provided by the University of South Florida (USF) Concussion Center via the USF Research Electronic Data Capture (REDCap) server. The study population consisted of 3,038 patients diagnosed with a concussion between 2017 and 2026 at USF facilities. This dataset includes patient data with multiple visits collected from 2021 to 2025. Within this dataset, rows represent each patient with the respective visit information, while columns represent a combination of patient intake and examination data. Within the full USF Concussion Center database, patients were categorized by one of four mechanisms of injury causing concussion (i.e., Sports Related Concussions (this study), Motor Vehicle Accidents, Falls, and Other, represented by assaults, collision with random impediments, etc.). Following data preprocessing (described in the subsequent section), 217 unique patients with concussion remained for ML analysis. Of the 217, 80 (36.9%) are male and 137 (63.1%) are female, with a mean age of 26.94 years. Treatment Present is a binary variable (0/1) indicating whether a patient received any form of treatment following their initial visit. Treatment types span both pharmacological and non-pharmacological approaches: Selective Serotonin Reuptake Inhibitors (SSRI), amantadine, stimulants, preventative headache medication, vestibular therapy, physical therapy, chiropractic care, psychological treatment, neuropsychological treatment, neurology treatment, and cognitive therapy. Of the 217 patients, 168 received at least one form of treatment (Treatment Present = 1) and 49 did not (Treatment Present = 0).

Anxiety and depression are the most common mood disorders in the dataset; in the cited literature, the prevalence of depression ranges from 6% to 34%³⁵ and the prevalence of anxiety is 46.72% in youth athletes.³⁶ The GAD-7 score (Generalized Anxiety Disorder-7) was used to quantify anxiety status at the time of the visit, and the PHQ-9 score (Patient Health Questionnaire-9) was used to quantify depression/mood status at the time of the visit.

To account for the ‘treatment_present’ variable, we included features for Selective Serotonin Reuptake Inhibitors (ssri_tx), psychological treatment (psych_tx), neuropsychological evaluation or treatment (neuropsych_tx), and cognitive therapy (cognitive_tx), each of which was coded as a binary value (0/1) to indicate whether the patient received the treatment during their single intake visit.

Finally, clearance is determined by the patient being either asymptomatic or back to the baseline level of symptoms present pre-injury, in addition to normal functional measures based on VOMS, BESS, CNS Vital Signs, and return to school. Physicians experienced in the diagnosis and management of concussions make that determination.

Data preprocessing

To clean the dataset of 3,038 unique patients, a multi-step iterative process was employed, beginning with the establishment of specific cleaning standards: a column missingness threshold of 0.9, a row missingness threshold of 0.8, and a threshold step of 0.1 for subsequent rounds. During each cleaning round, the percentage of missing values was calculated for every row and column, and any records or features that exceeded the established thresholds were removed. Following each removal cycle, the thresholds were adjusted by the threshold step and the process was repeated until the data stabilized, resulting in an intermediate dataset of 2,338 rows and 49 columns. Finally, outliers were addressed by filtering for clinically relevant timelines, specifically retaining records where the duration to the first visit was between 0 and 365 days and the duration to clearance was at least 1 day. This refinement process produced a final dataset of 1,865 rows and 49 columns with no null values. This process was chosen to avoid data imputation, as imputation of medical data tends to introduce bias and there is still no optimal imputation solution in the medical domain.^37,38 After all preprocessing steps and filtering for patients with two hospital visits, the resulting dataset included 1,201 unique patients, 217 of whom had sports-related concussions.
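
The iterative cleaning described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the text does not state the direction of the threshold adjustment, so lowering both thresholds by the step after each round (and stopping once no missing cell remains) is an assumption, and the function names and toy table are hypothetical.

```python
def missing_fraction(values):
    """Share of entries in a row or column that are None (missing)."""
    return sum(v is None for v in values) / len(values) if values else 0.0


def iterative_clean(rows, col_thr=0.9, row_thr=0.8, step=0.1):
    """Drop sparse columns, then sparse rows, tightening both thresholds each round.

    Assumption: thresholds are lowered by `step` after every round (never below
    zero) and the loop stops once no missing cell is left.
    Returns the cleaned rows and the indices of the columns that were kept.
    """
    kept_cols = list(range(len(rows[0]))) if rows else []
    while rows and kept_cols:
        # Remove columns whose missing share exceeds the column threshold.
        kept_cols = [c for c in kept_cols
                     if missing_fraction([r[c] for r in rows]) <= col_thr]
        # Remove rows whose missing share (over kept columns) exceeds the row threshold.
        rows = [r for r in rows
                if missing_fraction([r[c] for c in kept_cols]) <= row_thr]
        if all(r[c] is not None for r in rows for c in kept_cols):
            break  # data have stabilized: nothing is missing any more
        col_thr = max(0.0, round(col_thr - step, 10))
        row_thr = max(0.0, round(row_thr - step, 10))
    return [[r[c] for c in kept_cols] for r in rows], kept_cols


# Hypothetical 4-patient, 4-feature table; None marks a missing value.
demo = [
    [1.0, None, 3.0, None],
    [2.0, None, None, None],
    [4.0, 5.0, 6.0, None],
    [7.0, 8.0, 9.0, None],
]
cleaned, kept = iterative_clean(demo)
```

In this toy table the fully empty fourth column is removed in the first round, and the two patients with gaps are removed in later rounds as the row threshold tightens, leaving a complete table without any imputation.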

Subsequently, both feature inclusion and feature engineering were performed to capture clinically relevant variables related to single time points and/or longitudinal changes experienced by patients. For feature inclusion, ‘prior head injury’ and ‘history of mood disorders’ were selected for Visit 1, whereas ‘prior head injury’, ‘history of mood disorders’, and ‘treatment presence’ were selected for Visit 2 (i.e., as treatment was not formally administered until after the collection of data in the first visit). In addition, difference features were engineered for Visit 2 by subtracting Visit 1 values from the corresponding Visit 2 values for all shared base features, excluding ‘prior head injury’, ‘history of mood disorders’, and ‘treatment presence’, to provide the models with information on longitudinal changes between visits. This process resulted in 49 features for Visit 1 and 95 features for Visit 2. Data for patients with 2+ visits were included (using only the baseline and second visit), as there were not enough patients with 3 or more visits in the ‘normal’ (under one month) time-to-clearance class.
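
The difference-feature step can be illustrated as follows. This is a sketch under stated assumptions: the feature names are hypothetical placeholders rather than the dataset's real column names, and the returned record (Visit 2 values plus one `_diff` entry per shared base feature) is one plausible reading of the description above.

```python
# Hypothetical names for the features that are carried over but never differenced.
STATIC_FEATURES = {"prior_head_injury", "mood_disorder_history", "treatment_present"}


def add_difference_features(visit1, visit2):
    """Return Visit 2 features plus `<name>_diff` = Visit 2 value - Visit 1 value."""
    combined = dict(visit2)
    for name, v1_value in visit1.items():
        if name in STATIC_FEATURES or name not in visit2:
            continue  # history and treatment flags are not differenced
        combined[f"{name}_diff"] = visit2[name] - v1_value
    return combined


# Toy patient record with made-up scores.
v1 = {"vor_vertical_headache": 6, "pcss_total": 40, "prior_head_injury": 1}
v2 = {"vor_vertical_headache": 2, "pcss_total": 25, "prior_head_injury": 1,
      "treatment_present": 1}
features = add_difference_features(v1, v2)
```

A negative difference here means the score fell between visits, so for symptom scores it indicates improvement.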

Machine learning

Each model was trained separately for each visit using the same hyperparameter tuning, training, and testing pipeline. Visit 1 models used the clinical dataset that included the new features prior head injury and history of mood disorders in addition to the original features remaining from preprocessing. Visit 2 models used the clinical dataset that included the new features prior head injury, history of mood disorders, and treatment presence, the original features remaining from preprocessing, and the difference variants of the original features. The models’ prediction calculations are expressed in this section via Equations 1–9 (TRIPOD+AI Item 12g,22).

Light Gradient Boosting Machine (LGBM) is a gradient boosting decision tree framework that incorporates two techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). This ML model optimizes the number of features to focus on in order to make predictions quickly and accurately using a small dataset while reducing memory usage.³⁹ Similar to other gradient boosting frameworks, LGBM constructs an additive model by minimizing a regularized objective function of the form

L = \sum_{i = 1}^{n} l (y_{i}, {\hat{y}}_{i}) + \sum_{k = 1}^{K} Ω (f_{k})

(1)

In the objective function, n denotes the number of training samples, y_i represents the true label for the ith sample, and ŷ_i is the corresponding model prediction. The function l(·) denotes the loss function measuring prediction error. The term f_k represents the kth decision tree in the ensemble, where K is the total number of trees. The regularization term Ω(f_k) penalizes model complexity by constraining tree depth, number of leaves, and leaf weights, thereby reducing overfitting. By leveraging GOSS to retain instances with large gradients and EFB to bundle mutually exclusive features, LGBM efficiently optimizes this objective, making it well suited for achieving the study objectives when working with limited data and computational resources.

Decision Tree Classifier applies a divide and conquer algorithm for each feature to determine the most optimal order of splits that best capture nonlinear patterns in a dataset.⁴⁰ Partitioning the problems into binary sub-outputs makes it easy to track what order of features leads to the prediction of a patient’s time to clearance that results in the highest predictive accuracy.⁴¹ Three splitting criteria choices used for this model, along with their respective mathematical formulations, are as follows:

a) Gini Impurity

G (S) = 1 - \sum_{k = 1}^{K} p_{k}^{2}

(2)

b) Entropy

H (S) = - \sum_{k = 1}^{K} p_{k} \log (p_{k})

(3)

c) Log Loss

\mathrm{Loss} = - \frac{1}{N} \sum_{i = 1}^{N} [y_{i} \log (p_{i}) + (1 - y_{i}) \log (1 - p_{i})]

(4)

In the splitting criteria equations, S denotes the set of samples at a given node, and K is the total number of classes. The term p_k represents the proportion of samples in S that belong to class k. For log loss, N denotes the total number of samples, y_i is the true binary class label for sample i, and p_i is the predicted probability that the sample belongs to the positive class. These criteria quantify node purity and guide the selection of optimal feature splits. Note that the outputs of the models are the binary class labels 0 (‘normal’ recovery) or 1 (‘prolonged’ recovery).
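
To make Equations 2–4 concrete, the following sketch evaluates each criterion in plain Python. It is an illustration rather than the study's code: the logarithm base is not specified in the text, so base 2 for entropy and the natural logarithm for log loss are assumptions, and the example node simply reuses the cohort's class counts (176 prolonged, 41 normal).

```python
import math
from collections import Counter


def class_proportions(labels):
    """Proportion p_k of each class present in the node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    """Equation 2: one minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Equation 3 with a base-2 logarithm (assumed); the result is in bits."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def log_loss(y_true, p_pred, eps=1e-15):
    """Equation 4 with the natural logarithm (assumed); probabilities are clipped."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# A root node holding the whole cohort: 176 prolonged (1) and 41 normal (0).
root = [1] * 176 + [0] * 41
root_gini = gini(root)        # roughly 0.31
root_entropy = entropy(root)  # roughly 0.70 bits
toy_loss = log_loss([1, 0], [0.9, 0.2])
```

Both impurity measures are zero for a node that contains a single class, which is the state each split tries to approach.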

Random Forest

Random Forest combines multiple decision trees to improve the generalizability of the model’s outputs.⁴¹ Given that our dataset has 48 features and 97 features for the initial and second visits, respectively, combining multiple decision trees provides a holistic understanding of each feature’s influence on model accuracy rather than relying on a single order of splits.

Formally, Random Forest predictions can be analyzed using the margin function, defined as

m (X, Y) = P_{Θ} (h (X, Θ) = Y) - \max_{j \neq Y} P_{Θ} (h (X, Θ) = j),

(5)

which measures how much more strongly the ensemble supports the correct class compared to the most competitive incorrect class, reflecting prediction confidence.

The corresponding generalization error is given by

P E = P_{X, Y} (m (X, Y) < 0),

(6)

which represents the probability that the ensemble misclassifies an input, linking model accuracy to the strength and diversity of the individual trees. In the margin function, X represents the input feature vector, Y denotes the true class label, and h(X, Θ) is the prediction of an individual tree parameterized by random variables Θ, which control feature selection and bootstrapped sampling. The probability P_Θ(·) is taken over the ensemble of trees. The generalization error PE measures the likelihood that the ensemble predicts an incorrect class for a randomly drawn input-output pair (X, Y).
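
For a finite forest, the margin and generalization error above can be approximated from vote counts. The sketch below is illustrative only: vote shares across trees stand in for the probability P_Θ, and the votes and labels are hypothetical.

```python
from collections import Counter


def ensemble_margin(tree_votes, true_label):
    """Vote share for the true class minus the largest vote share of any other class."""
    n_trees = len(tree_votes)
    counts = Counter(tree_votes)
    correct_share = counts.get(true_label, 0) / n_trees
    rival_share = max(
        (count / n_trees for label, count in counts.items() if label != true_label),
        default=0.0,
    )
    return correct_share - rival_share


def empirical_error(vote_table, true_labels):
    """Fraction of samples with a negative margin, i.e. misclassified by majority vote."""
    margins = [ensemble_margin(v, y) for v, y in zip(vote_table, true_labels)]
    return sum(m < 0 for m in margins) / len(margins)


# Hypothetical votes from five trees for three patients (1 = prolonged, 0 = normal).
votes = [[1, 1, 1, 0, 1], [0, 1, 1, 1, 1], [1, 1, 1, 1, 1]]
labels = [1, 0, 1]
margins = [ensemble_margin(v, y) for v, y in zip(votes, labels)]
error_rate = empirical_error(votes, labels)
```

A margin close to 1 indicates a confident correct vote, while a negative margin means the forest's majority vote is wrong for that patient.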

Support Vector Classifiers (SVCs)

SVCs perform well with small datasets since their time and space complexity scales with the size of the input dataset.⁴² SVCs are a specific type of Support Vector Machine (SVM). For classification problems, SVMs attempt to linearly separate the points of a dataset into distinct groups in a higher-dimensional feature space.⁴³ This separation is achieved by solving the following optimization problem:

\min_{w, b, ξ} \frac{1}{2} ∥ w ∥^{2} + C \sum_{i = 1}^{n} ξ_{i}

(7)

In the optimization problem, w represents the weight vector defining the separating hyperplane, and b is the bias term. The slack variables ξ_i quantify the degree of misclassification for the ith sample. The regularization parameter C controls the trade-off between maximizing the margin and minimizing classification error. A larger C places greater emphasis on correctly classifying training samples, while a smaller C allows for a wider margin with increased tolerance for misclassification. Given its practicability with small datasets, this model was utilized as part of this study.
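
The role of C can be seen by evaluating the objective in Equation 7 for a fixed hyperplane. The sketch below assumes the standard soft-margin convention, which the text does not spell out: labels are coded as −1/+1 and each slack is ξ_i = max(0, 1 − y_i(w·x_i + b)). The weights and data points are hypothetical.

```python
def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) for 0.5 * ||w||^2 + C * sum(slack_i).

    Assumes labels in {-1, +1} and hinge-style slacks
    slack_i = max(0, 1 - y_i * (w . x_i + b)).
    """
    slacks = []
    for x_i, y_i in zip(X, y):
        decision = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        slacks.append(max(0.0, 1.0 - y_i * decision))
    objective = 0.5 * sum(w_j * w_j for w_j in w) + C * sum(slacks)
    return objective, slacks


# Hypothetical hyperplane and three points; the second point sits inside the margin.
w, b = [1.0, 0.0], 0.0
X = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
y = [1, 1, -1]
strict_obj, slacks = soft_margin_objective(w, b, X, y, C=10.0)
loose_obj, _ = soft_margin_objective(w, b, X, y, C=0.1)
```

With the same hyperplane, the single margin violation costs 5.0 when C = 10 but only 0.05 when C = 0.1, which is the trade-off described above.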

XGBoost

XGBoost is a widely used implementation of gradient tree boosting that is designed for scalability, efficiency, and strong performance, even when working with sparse datasets. Its effectiveness stems from greedy optimization in which each new tree is trained to model the residual errors left by the previous ensemble of trees. By iteratively reducing residuals, the model progressively improves its predictions, leading to a robust and highly accurate boosting framework.⁴⁴ Formally, XGBoost minimizes the following regularized objective function, where n denotes the number of training samples, y_i is the true label, and ŷ_i is the predicted output for sample i.

L = \sum_{i = 1}^{n} l (y_{i}, {\hat{y}}_{i}) + \sum_{k = 1}^{K} Ω (f_{k})

(8)

The loss function l(·) measures prediction error, while f_k denotes the kth tree in the ensemble, with K being the total number of trees. The regularization term Ω(f_k) penalizes model complexity by incorporating constraints on tree structure and leaf weights, thereby reducing overfitting.

Ridge Regression

Ridge regression is useful for addressing the collinearity problem that occurs in general linear regression analyses without dropping any of the features from the original set of independent variables.⁴⁵ The equation used introduces an ℓ₂ regularization term into the standard least-squares objective and is defined as

\hat{β} = \arg \min_{β} (∥ y - X β ∥^{2} + λ ∥ β ∥^{2})

(9)

In the optimization expression, X denotes the design matrix of input features, y is the vector of observed class labels, and β represents the regression coefficients. The regularization parameter λ controls the strength of the ℓ₂ penalty, with larger values enforcing greater shrinkage of coefficients. This constraint reduces variance and stabilizes coefficient estimates in the presence of multicollinearity. This way, each feature has some impact on the prediction of the binary classes for our dataset which enables us to analyze their individual contribution to the prediction’s accuracy.
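
As a worked step, setting the gradient of the objective in Equation 9 to zero yields the standard closed-form ridge estimate. This is a textbook identity rather than a description of how the study's software computes the fit; I denotes the identity matrix, a symbol introduced here.

```latex
\begin{aligned}
\nabla_{\beta}\left(\lVert y - X\beta \rVert^{2} + \lambda \lVert \beta \rVert^{2}\right)
  &= -2X^{\top}(y - X\beta) + 2\lambda\beta = 0 \\
\Rightarrow\quad (X^{\top}X + \lambda I)\,\hat{\beta} &= X^{\top}y \\
\Rightarrow\quad \hat{\beta} &= (X^{\top}X + \lambda I)^{-1}X^{\top}y
\end{aligned}
```

For λ > 0 the matrix XᵀX + λI is always invertible, which is why the estimate remains stable even when features are strongly collinear.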

Study design

As the ML models are intended to predict whether each patient’s time to clearance is within a month or over a month, we used a binary encoding to classify these states. Specifically, this study treats recovery prognosis as a binary classification task of recovery duration, where the models differentiate between ‘normal’ (< 30 days) and ‘prolonged’ (≥ 30 days) recovery timelines. Class 0 denotes recovery within a month and Class 1 denotes recovery over a month. One month was selected as the threshold since this is the general clinical recovery timeframe for sports-related concussions.⁴⁶ This design was chosen not only to evaluate whether the models can predict the time to clearance, but also to demonstrate any increase in prediction accuracy between the first- and second-visit datasets (TRIPOD+AI Item 8a).
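
The binary encoding amounts to a single threshold comparison, sketched below. The constant and function names are hypothetical, and the example durations are made up.

```python
PROLONGED_THRESHOLD_DAYS = 30  # 'prolonged' recovery is 30 days or more


def recovery_class(days_to_clearance):
    """Return 1 for 'prolonged' recovery (>= 30 days) and 0 for 'normal' (< 30 days)."""
    return int(days_to_clearance >= PROLONGED_THRESHOLD_DAYS)


example_days = [12, 29, 30, 45, 120]
example_labels = [recovery_class(d) for d in example_days]
```

Note that a patient cleared on exactly day 30 falls in Class 1 under this definition.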

Leave-One-Out Cross-Validation (LOOCV) was used to determine each model’s accuracy in predicting the patients’ time to medical clearance for both the initial and second visits. This method was selected due to the dataset’s small size.⁴⁷ For LOOCV with a dataset of size N, one data point is held out as the testing set while the remaining N − 1 data points form the training set. This procedure is repeated until each data point has been used once as the testing instance.⁴⁸
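
A minimal LOOCV loop is sketched below in plain Python, where each patient is held out once for testing and the model is fit on the other N − 1. The majority-class "model" is a hypothetical stand-in used only to keep the example self-contained; it is not one of the six classifiers in the study.

```python
from collections import Counter


def leave_one_out_splits(n_samples):
    """Yield (train_indices, test_index) so that every sample is tested exactly once."""
    for test_idx in range(n_samples):
        yield [i for i in range(n_samples) if i != test_idx], test_idx


def loocv_accuracy(X, y, fit, predict):
    """Average correctness over the N single-sample test folds."""
    correct = 0
    for train_idx, test_idx in leave_one_out_splits(len(y)):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += int(predict(model, X[test_idx]) == y[test_idx])
    return correct / len(y)


def fit_majority(X_train, y_train):
    """Toy stand-in model: remember the most common training label."""
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    """Toy stand-in prediction: always return the remembered label."""
    return model


# Labels matching the cohort's class counts; the features are dummy placeholders.
y = [1] * 176 + [0] * 41
X = [[0.0] for _ in y]
baseline_accuracy = loocv_accuracy(X, y, fit_majority, predict_majority)
```

On labels with the cohort's class balance, this trivial baseline reaches an accuracy of 176/217 (about 0.81) without using any features, which is a useful reference point when reading LOOCV accuracies on imbalanced data.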

No patients or members of the public were involved in the design, conduct, reporting, or dissemination plans of this research.

ML Model Hyperparameter Tuning

LightGBM, Decision Tree, Random Forest, SVC, and XGBoost were tuned using an 80:20 train–test split. For these models, hyperparameters were optimized using randomized search via ParameterSampler,⁴⁹ where 50 distinct parameter combinations were randomly sampled from predefined distributions; the configuration achieving the highest validation accuracy was selected. This number of combinations was selected because performance plateaued at that point. In contrast, the Ridge classifier was tuned using k-fold cross-validation, where the data were split into k folds and the model was trained on k − 1 folds and validated on the remaining fold. This process was repeated k times, and the validation scores were averaged across folds for each candidate alpha (α). Subsequently, the α with the highest average score was selected and the final Ridge model was re-fit on the full dataset using this optimal α. Since the Ridge model had only one hyperparameter to tune (i.e., α), its hyperparameter tuning process differed from those of the other models implemented in the paper.⁵⁰ Details of hyperparameter tuning for each model are given in the following paragraphs.

LightGBM - Hyperparameter tuning used a randomized search with n_iter = 50 over a predefined parameter distribution. The search space included num_leaves in {31, 50, 70}, max_depth in {-1, 10, 20, 30}, learning_rate in {0.01, 0.05, 0.1, 0.2}, n_estimators in {100, 200, 500, 1000}, min_child_samples in {10, 20, 30, 50}, subsample in {0.6, 0.8, 1.0}, and colsample_bytree in {0.6, 0.8, 1.0}. Parameter combinations from ParameterSampler were evaluated in a manual loop to select the best-performing configuration.

Decision Tree - Hyperparameter tuning was performed via randomized sampling over a defined grid with n_iter = 50. The search space included criterion in {gini, entropy, log_loss}, max_depth in {None, 5, 10, 20, 30}, min_samples_split in {2, 5, 10}, and min_samples_leaf in {1, 2, 5}. Sampled parameter sets were evaluated iteratively to identify the best validation accuracy.

Random Forest - Randomized hyperparameter search with n_iter = 50 was applied using sampled combinations from a parameter distribution that included n_estimators in {100, 200, 500}, criterion in {gini, entropy}, max_depth in {None, 10, 20, 30}, min_samples_split in {2, 5, 10}, min_samples_leaf in {1, 2, 5}, and max_features in {sqrt, log2}. Each sampled configuration was trained and evaluated to determine the best-performing model.

XGBoost - Hyperparameter tuning used a randomized search with n_iter = 50. The sampled search space included max_depth in {3, 5, 7}, learning_rate in {0.05, 0.1, 0.2}, n_estimators in {50, 100, 150}, subsample in {0.7, 0.85, 1.0}, colsample_bytree in {0.7, 0.85, 1.0}, gamma in {0, 0.1, 0.3}, min_child_weight in {1, 3, 5}, reg_alpha in {0, 0.1, 0.5}, and reg_lambda in {0.5, 1.0, 1.5}. Parameter sets were sampled and evaluated in a manual random search loop.

Support Vector Classifier (SVC) - Hyperparameter tuning was conducted using randomized sampling over a discrete parameter set. The search space consisted of C in {0.1, 10, 100} and gamma in {1, 0.1, 0.01}. The experiment was restricted to a linear kernel, as this was the only configuration allowing feature importance extraction. Due to limited computational resources, 27 sampled combinations were evaluated iteratively, and the configuration yielding the highest accuracy was selected.

Ridge Regression/Ridge Classifier - Hyperparameter tuning targeted the regularization strength alpha. A candidate set alpha in {0.001, 0.01, 0.1, 1, 5, 10, 15, 20} was used. For some experiments, a random subset of these values (up to n_iter = 50) was sampled and evaluated using k-fold cross-validation, with the best-performing alpha selected based on validation performance.
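
The manual random-search loop described for the tree-based models can be sketched as follows. This is a plain-Python stand-in written for illustration, not the scikit-learn ParameterSampler call used in the study; the search space is the Decision Tree grid listed above, and `toy_score` is a hypothetical placeholder for fitting a model and returning its validation accuracy.

```python
import itertools
import random

DECISION_TREE_SPACE = {
    "criterion": ["gini", "entropy", "log_loss"],
    "max_depth": [None, 5, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}


def sample_configurations(space, n_iter, seed=0):
    """Draw up to n_iter distinct parameter combinations without replacement."""
    keys = sorted(space)
    grid = [dict(zip(keys, values))
            for values in itertools.product(*(space[k] for k in keys))]
    return random.Random(seed).sample(grid, min(n_iter, len(grid)))


def random_search(space, n_iter, score_fn, seed=0):
    """Evaluate each sampled configuration and keep the best-scoring one."""
    best_params, best_score = None, float("-inf")
    for params in sample_configurations(space, n_iter, seed):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical stand-in for validation accuracy; replace with a real fit-and-score."""
    depth = params["max_depth"] or 40
    return 1.0 / (1.0 + abs(depth - 10) + params["min_samples_leaf"])


configs = sample_configurations(DECISION_TREE_SPACE, n_iter=50)
best_params, best_score = random_search(DECISION_TREE_SPACE, 50, toy_score)
```

Sampling without replacement caps the number of evaluated configurations at the size of the grid, so a search space with fewer combinations than `n_iter` is simply enumerated in full.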

Class Imbalance

The dataset exhibits substantial class imbalance: 176 patients (81.1%) belong to the ‘prolonged’ recovery class (Class 1) versus 41 patients (18.9%) in the ‘normal’ recovery class (Class 0). No resampling techniques (e.g., SMOTE, random oversampling) were applied in the primary analysis, as the authors judged that synthetic data generation could misrepresent the true clinical distribution.⁵¹ Instead, class weighting was incorporated into the pipeline to penalize misclassification of the minority class (Class 0) in inverse proportion to class frequency. The effect of this imbalance is reflected in near-zero specificity values in the unweighted pipeline and is discussed as a primary limitation of the reported accuracy and recall metrics (TRIPOD+AI Item 13).
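
One common weighting convention, shown here as an assumption since the exact scheme is not stated, assigns each class the weight N / (K × n_c), where N is the number of samples, K the number of classes, and n_c the class count. Applied to the cohort's counts:

```python
def balanced_class_weights(class_counts):
    """Weight each class by N / (K * n_c), so rarer classes receive larger weights."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * count) for label, count in class_counts.items()}


# Class counts reported for the cohort: 41 normal (0) and 176 prolonged (1).
weights = balanced_class_weights({0: 41, 1: 176})
```

Under this convention an error on a Class 0 patient is weighted about 2.65 and an error on a Class 1 patient about 0.62, so both classes contribute equally to the total weighted loss.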

Feature Importance (FI)

To identify and quantify which assessment features most strongly influence predictions of time to clearance, FI analyses were conducted on a final refitted model using methods appropriate to each model type. For tree-based models (i.e., Decision Tree, Random Forest), importance was derived from the reduction in Gini impurity attributable to each feature. Specifically, these FI values, in addition to their respective Gini impurity values, were extracted from each trained tree-based model.^52,53 Conversely, for gradient-boosted models (i.e., LightGBM, XGBoost), importance was based on gain, defined as the total reduction in loss from splits using a given feature.⁵⁴ For linear SVC, FI was quantified using the absolute magnitude of the learned coefficients, reflecting each feature’s influence on the decision function.⁵⁵ For Ridge, permutation importance was derived by measuring the decrease in predictive accuracy resulting from random permutation of individual features.⁵⁶ Appendix Table 11 denotes formal feature names related to specific features of importance.
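
Permutation importance, as used for the Ridge model, can be sketched in plain Python. The example is illustrative and makes several assumptions: the prediction rule, the data, and the number of repeats are hypothetical, and the study's own implementation may differ in detail.

```python
import random


def accuracy(predict, X, y):
    """Share of rows for which the prediction equals the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the table with only this column permuted.
            X_perm = [row[:col] + [value] + row[col + 1:]
                      for row, value in zip(X, shuffled)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_predict(row):
    """Hypothetical fitted model that relies on the first feature only."""
    return int(row[0] > 0.5)


# Column 0 determines the label; column 1 is noise the model never reads.
X = [[0, 5], [1, 3], [0, 8], [1, 1], [0, 2], [1, 9]]
y = [0, 1, 0, 1, 0, 1]
importances = permutation_importance(toy_predict, X, y)
```

Shuffling a feature the model ignores leaves accuracy unchanged, so its importance is exactly zero, whereas shuffling a feature the model depends on lowers accuracy.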

Results

All six classifiers produce binary class labels as their primary output: Class 0 denotes predicted ‘normal’ recovery (<30 days to medical clearance) and Class 1 denotes prediction of ‘prolonged’ recovery (≥30 days). Predicted class probabilities are available for probabilistic models (LightGBM, XGBoost, Random Forest, Decision Tree, Ridge) and can be inspected via the public repository. Classification thresholds were set at 0.5 for all models; threshold optimization (e.g., via ROC analysis) was not performed at this stage given the exploratory scope of the study (TRIPOD+AI Item 15).
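
The relationship between predicted probabilities, the 0.5 threshold, and the Brier score used later in this section can be made concrete with a short sketch. The example probabilities are invented for illustration, `classify` and `brier_score` are hypothetical helper names, and treating a probability of exactly 0.5 as Class 1 is an assumption. The last lines compute the Brier score of an uninformative reference that assigns every patient the cohort base rate of 176/217.

```python
def classify(probabilities, threshold=0.5):
    """Map P(Class 1) to a label; ties at the threshold go to Class 1 (assumed)."""
    return [1 if p >= threshold else 0 for p in probabilities]


def brier_score(probabilities, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(labels)


# Invented example: four patients with predicted P(prolonged recovery)
probs = [0.9, 0.8, 0.55, 0.3]
labels = [1, 1, 0, 0]
preds = classify(probs)            # the third patient is a false positive
score = brier_score(probs, labels)

# Reference point: predict the base rate for everyone in a 176 / 41 cohort
cohort = [1] * 176 + [0] * 41
base_rate = sum(cohort) / len(cohort)
baseline = brier_score([base_rate] * len(cohort), cohort)
```

The base-rate reference works out to about 0.153, which is a useful yardstick when reading the Brier columns of the performance tables: scores near that value indicate probabilities that are little more informative than the class prevalence.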

ML Model Metrics Time to clearance was classified to gain additional insight into the assessment features that drive predictive accuracy. Tables 1 and 3 present the accuracy (i.e., proportion of all correct predictions), precision (i.e., proportion of predicted positives that are true positives), recall (i.e., proportion of actual positives (Class 1) correctly identified), F1 (i.e., harmonic mean of precision and recall), and specificity (i.e., proportion of actual negatives (Class 0) correctly identified) of each model for Visit 1 and Visit 2, respectively. Model performance was also evaluated using balanced accuracy (the average of sensitivity and specificity, which accounts for class imbalance), MCC (Matthews correlation coefficient, a single-value summary robust to class imbalance), and the Brier score (the mean squared error of the predicted probabilities, reflecting probabilistic accuracy and calibration). No clinical utility analysis (e.g., decision curve analysis) was performed at this stage, as the study is comparative and exploratory rather than proposing a deployment-ready tool. Bootstrap 95% confidence intervals (1,000 resamples) are reported for all metrics to quantify estimation uncertainty (TRIPOD+AI Item 12e). Using only Visit 1 data, SVC and Random Forest achieved the highest accuracy (0.81), recall (1.00), and F1-score (0.90), while LightGBM and XGBoost achieved the highest precision (0.84). With the addition of Visit 2 data, XGBoost provided the highest accuracy (0.84), recall (0.99), and F1-score (0.91) with a precision of 0.84, although Decision Tree had the highest precision (0.86). Specificity was low overall: LightGBM performed best using only Visit 1 data (0.29), whereas Decision Tree performed best once Visit 2 features were included (0.34). This can be attributed to the small, imbalanced dataset, which contained far more Class 1 than Class 0 samples at both visits. As a result, the models are prone to overpredicting the dominant class in order to optimize a single summary metric (e.g., overall error rate).⁵⁷ The events-per-variable (EPV) ratio is 0.84 for the Visit 1 models and 0.43 for the Visit 2 models.
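
To show how the metrics defined above behave under this class imbalance, the following sketch computes them from confusion-matrix counts for a hypothetical classifier that labels every one of the 217 patients as Class 1. The function name and the convention of returning 0 when a denominator is zero are assumptions made for the illustration.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Confusion-matrix metrics with Class 1 ('prolonged') as the positive class."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical classifier that predicts Class 1 for all 176 + 41 patients
majority = binary_metrics(tp=176, fp=41, tn=0, fn=0)
rounded = {name: round(value, 2) for name, value in majority.items()}
```

Rounded to two decimals, this always-Class-1 baseline gives accuracy 0.81, balanced accuracy 0.50, precision 0.81, recall 1.00, F1 0.90, specificity 0.00, and MCC 0.00, the same values reported for Random Forest on Visit 1. An accuracy near 0.81 is therefore attainable without discriminating between the classes, which is why balanced accuracy and MCC are the more informative columns.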

Table 1.

Visit 1 – model performance with 95% bootstrap confidence intervals.

Model	Accuracy	Balanced Accuracy	Precision	Recall	F1	Specificity	MCC	Brier
LightGBM	0.77 [0.77, 0.87]	0.57 [0.51, 0.63]	0.84 [0.78, 0.88]	0.89 [0.94, 0.99]	0.87 [0.86, 0.93]	0.29 [0.05, 0.29]	0.23 [0.04, 0.39]	0.16 [0.12, 0.21]
Dec. Tree	0.77 [0.71, 0.83]	0.61 [0.53, 0.68]	0.82 [0.79, 0.90]	0.85 [0.82, 0.92]	0.83 [0.82, 0.90]	0.17 [0.19, 0.49]	0.23 [0.07, 0.38]	0.17 [0.14, 0.21]
Rand. For.	0.81 [0.76, 0.86]	0.50 [0.50, 0.50]	0.81 [0.76, 0.86]	1.00 [1.00, 1.00]	0.90 [0.86, 0.93]	0.00 [0.00, 0.00]	0.00 [0.00, 0.00]	0.15 [0.12, 0.18]
XGBoost	0.79 [0.77, 0.87]	0.58 [0.52, 0.65]	0.84 [0.78, 0.89]	0.95 [0.94, 0.99]	0.89 [0.86, 0.93]	0.22 [0.08, 0.32]	0.26 [0.08, 0.42]	0.15 [0.12, 0.19]
SVC	0.81 [0.75, 0.86]	0.50 [0.49, 0.50]	0.81 [0.76, 0.86]	1.00 [0.98, 1.00]	0.90 [0.86, 0.92]	0.00 [0.00, 0.00]	-0.03 [-0.06, 0.00]	0.16 [0.13, 0.19]
Ridge	0.80 [0.74, 0.84]	0.51 [0.48, 0.55]	0.81 [0.76, 0.87]	0.97 [0.94, 0.99]	0.88 [0.85, 0.91]	0.02 [0.00, 0.12]	0.03 [-0.09, 0.18]	0.18 [0.16, 0.19]

Note: Values within brackets represent the 95% confidence intervals.

Notably, the integration of class weighting yielded higher specificity across all models than the initial, unweighted machine learning implementation. By assigning greater weight to minority-class observations, this strategy reduced, but did not eliminate, the inherent bias toward the majority class; specificity remained low in absolute terms (Tables 1 and 3).

FI Scores and Average Effect on Predictive Outcome Identifying the top 20 features that influence each model’s prediction indicates which assessments are most useful for future TBI identification and treatment. To further understand the specific effect of each of these features on the model’s predictive output, their average effects for each target class were plotted alongside their FI scores. Results for the top 20 features by model and visit are presented in Tables 6 and 7, where Imp. denotes the importance score. Additionally, to aid in identifying high-yield assessments in longitudinal modeling, FI was quantified by examining both rank order and selection frequency across models (Tables 8 and 9). Several key features stand out. For Visit 1, four features (i.e., NPC Headache, VMST Dizziness, VMST Headache, and VOMS Headache) appear across all six models (Table 8), with their feature importance scores presented in Table 6. Similarly, for the expanded Visit 2 feature set, three features (i.e., VOR Vertical Headache, Treatment Present, and VOR Vertical Headache Difference) appear across all six models (Table 9), with their feature importance scores presented in Table 7.

Table 2.

Counts of specific treatment received.

Treatment column	Count
amantadine_tx	17
stimulant_tx	15
preventative_headache_tx	46
vestibular_tx	86
pt_tx	68
chiro_tx	66
psych_tx	31
neuropsych_tx	21
neurology_tx	5
cognitive_tx	8

Table 3.

Visit 2 – model performance with 95% bootstrap confidence intervals.

Model	Accuracy	Balanced Accuracy	Precision	Recall	F1	Specificity	MCC	Brier
LightGBM	0.82 [0.78, 0.88]	0.58 [0.53, 0.64]	0.84 [0.78, 0.88]	0.91 [0.97, 1.00]	0.88 [0.87, 0.93]	0.24 [0.06, 0.29]	0.31 [0.13, 0.47]	0.14 [0.11, 0.17]
Dec. Tree	0.80 [0.69, 0.80]	0.56 [0.49, 0.64]	0.86 [0.78, 0.88]	0.91 [0.81, 0.91]	0.88 [0.80, 0.88]	0.34 [0.13, 0.42]	0.13 [-0.02, 0.29]	0.17 [0.14, 0.21]
Rand. For.	0.82 [0.77, 0.87]	0.57 [0.51, 0.63]	0.83 [0.78, 0.88]	0.98 [0.96, 1.00]	0.90 [0.87, 0.93]	0.15 [0.05, 0.26]	0.25 [0.07, 0.41]	0.14 [0.11, 0.17]
XGBoost	0.84 [0.76, 0.86]	0.60 [0.53, 0.68]	0.84 [0.79, 0.89]	0.99 [0.90, 0.97]	0.91 [0.86, 0.92]	0.17 [0.14, 0.42]	0.27 [0.09, 0.42]	0.15 [0.12, 0.18]
SVC	0.81 [0.76, 0.86]	0.50 [0.50, 0.50]	0.81 [0.76, 0.86]	0.99 [1.00, 1.00]	0.89 [0.86, 0.93]	0.00 [0.00, 0.00]	0.00 [0.00, 0.00]	0.15 [0.12, 0.19]
Ridge	0.80 [0.71, 0.83]	0.50 [0.46, 0.55]	0.81 [0.76, 0.86]	0.97 [0.89, 0.97]	0.89 [0.83, 0.90]	0.05 [0.00, 0.17]	0.01 [-0.12, 0.17]	0.18 [0.16, 0.20]

Note: Values within brackets represent the 95% confidence intervals.

Table 4.

Statistical significance of accuracy performance gains across visits per model.

Model	Visit 1 accuracy	Visit 2 accuracy	Gain [95% confidence interval]	Significant
XGBoost	0.79	0.84	0.051 [0.023, 0.083]	True
LightGBM	0.77	0.82	0.045 [0.018, 0.074]	True
Dec. Tree	0.77	0.80	0.028 [0.009, 0.051]	True
Rand. For.	0.81	0.82	0.009 [0.000, 0.023]	False
SVC	0.81	0.81	0.000 [0.000, 0.000]	False
Ridge	0.80	0.80	0.000 [0.000, 0.000]	False
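
One way such a gain interval can be obtained is a paired percentile bootstrap over patients, sketched below. This is an assumption about the procedure rather than a description of the authors' code: the function name, the fixed seed, the rule that a gain counts as significant when the interval's lower bound exceeds zero, and the per-patient correctness vectors are all hypothetical. The vectors are constructed only so that the two accuracies round to the 0.79 and 0.84 reported for XGBoost.

```python
import random


def paired_bootstrap_gain(correct_v1, correct_v2, n_resamples=1000, seed=0):
    """Percentile bootstrap CI for the accuracy gain, resampling patients jointly."""
    rng = random.Random(seed)
    n = len(correct_v1)
    gains = []
    for _ in range(n_resamples):
        sample = [rng.randrange(n) for _ in range(n)]  # same patients for both visits
        acc_v1 = sum(correct_v1[i] for i in sample) / n
        acc_v2 = sum(correct_v2[i] for i in sample) / n
        gains.append(acc_v2 - acc_v1)
    gains.sort()
    lower = gains[n_resamples * 25 // 1000]        # 2.5th percentile
    upper = gains[n_resamples * 975 // 1000 - 1]   # 97.5th percentile
    point = (sum(correct_v2) - sum(correct_v1)) / n
    return point, (lower, upper), lower > 0


# Hypothetical per-patient indicators (1 = classified correctly) for 217 patients:
# 171 correct at Visit 1 (0.79) and the same 171 plus 11 more at Visit 2 (0.84)
correct_v1 = [1] * 171 + [0] * 46
correct_v2 = [1] * 182 + [0] * 35

point, interval, significant = paired_bootstrap_gain(correct_v1, correct_v2)
```

Resampling patients jointly keeps the Visit 1 and Visit 2 predictions for the same patient paired, so the interval reflects the variability of the difference rather than of two independent accuracies.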

Further, Figures 2–4 present the average predictive effect of each of the top 20 features per model and visit. In these figures, blue bars predict towards Class 0 and red bars predict towards Class 1. For the features identified at Visit 1 (i.e., NPC Headache, VMST Dizziness, VMST Headache, and VOMS Headache), these figures show a tendency to predict towards Class 1, with 19 of 24 instances (79.17%) shown in red. Similarly, for Visit 2, the VOR Vertical Headache and VOR Vertical Headache Difference features predicted towards Class 1 (i.e., red) in 11 of 12 instances (91.67%). Conversely, for Visit 2, the Treatment Present feature predicted towards Class 0 (i.e., blue) in all six models.

Figure 1.

Data preprocessing flowchart.

Figure 2.

Average effects – Visit 1 vs Visit 2. Blue bars predict towards Class 0 and red bars predict towards Class 1.

Figure 3.

Average effects – Visit 1 vs Visit 2. Blue bars predict towards Class 0 and red bars predict towards Class 1.

Discussion

The primary objective of this study was to evaluate the utility of machine learning (ML) models in predicting time to medical clearance via binary classification following sport-related traumatic brain injury (TBI), with a secondary aim of assessing whether the integration of longitudinal clinical data across two visits improves predictive accuracy compared to a single-visit model. To accomplish these goals, six ML models (LightGBM, Decision Tree, Random Forest, XGBoost, Support Vector Classifier, and Ridge Regression) were trained and evaluated using gold-standard clinical assessment data from the USF Concussion Center. Traditional statistical approaches commonly applied in concussion research, such as linear or logistic regression, are typically constrained by assumptions of linearity, independence, and normally distributed errors, which may not adequately capture the complex, nonlinear interactions inherent in multidimensional clinical datasets.³²,³³ In contrast, ML models are well-suited to handle high-dimensional feature spaces, mixed data types, and nonlinear relationships, while also offering the advantage of feature importance quantification to support interpretable, evidence-based clinical decision-making.³¹ Furthermore, the feasibility of implementing ML in SRC research has been demonstrated across a growing body of work, including studies predicting concussion resolution within 7, 14, and 28 days using clinical symptom profiles,³¹ recovery time using vestibular and oculomotor screening data,³² and recovery trajectories in collegiate athletes.³³ Building upon this foundation, the present study extends prior work by incorporating longitudinal data across two clinical visits and applying feature engineering (specifically, difference features capturing symptom change between visits) to further enhance the predictive framework and identify clinically actionable assessment targets for personalized return-to-play decision-making.

Model Performance Accuracy is emphasized because it informs clinicians’ ability to properly diagnose and treat patients.⁵⁸ The improvement in accuracy after including second-visit features is therefore of particular interest. With Visit 2 features included, 66% of the models demonstrated an increase in accuracy, and XGBoost yielded both the largest gain in accuracy from Visit 1 to Visit 2 and the highest Visit 2 accuracy (84%).

The upward trend in accuracy between the initial and second visits in 66% of the ML models is attributed to domain-specific feature engineering. Deriving new features from the interactions between multiple original features can enhance a model’s predictive performance because the derived features provide additional domain context.⁵⁹ This is evident in the difference features of the second-visit dataset, which represent the difference between the second-visit and initial-visit values of the original features and allow the models to make more correct predictions for both classes (Figures 4–6). Notably, for Visit 2 the models show an increase in correct predictions for Class 0 compared with the initial visit, which may be due to the additional difference features and treatment presence in the dataset. This hints at the potential clinical significance of these added features. The dataset also consisted of 176 patients in Class 1 and 41 patients in Class 0, so a model that overpredicts Class 1 has a high chance of making correct predictions (Figures 4–6). Tables 1 and 3 support this, as recall values are close to 1 while specificity values are close to 0. Although this explains why specificity is close to 0, from a clinical standpoint it is important to accurately identify true negatives (i.e., Class 0) rather than only correctly classifying true positives (i.e., Class 1). Keeping TBI patients under medical treatment longer than needed may lead to declines in cognitive and physical abilities.⁶⁰

Figure 4.

Average effects – Visit 1 vs Visit 2. Blue bars predict towards Class 0 and red bars predict towards Class 1.

Figure 5.

Confusion matrix – Visit 1 vs Visit 2.

Figure 6.

Confusion matrix – Visit 1 vs Visit 2.

Table 5.

Description of dataset between outcome groups.

Metric	‘Normal’ recovery (Class 0)	‘Prolonged’ recovery (Class 1)
Total Patients (N)	41	176
No Treatment	21	28
Treatment Present	20	148
Male Patients	17	63
Female Patients	24	113
Days to Clearance (Mean ± SD)	20.59 ± 5.33	98.18 ± 90.54
Days from Injury to First Visit (Mean ± SD)	6.98 ± 3.24	29.53 ± 50.52
Days from 1st to 2nd visit	12.56 ± 4.58	28.39 ± 19.36

XGBoost shows the largest change in accuracy between Visit 1 and Visit 2 as well as the highest accuracy for Visit 2 alone. This may be attributed to its boosting ensemble method, in which decision trees are fit sequentially and their outputs are combined additively so that each new tree corrects the errors of the previous ones.⁶¹ The Visit 2 accuracy of XGBoost is in line with the 82.5% reported by Thomas et al.⁶² for differentiating concussed from healthy cohorts. In addition, XGBoost includes regularization intended to limit overfitting, which may help it generalize to unseen data, although this was not externally validated here. Together, these model characteristics help XGBoost produce accurate classifications.⁶¹

The lack of improvement in accuracy between visits for SVC and Ridge can be attributed to their shared limitation as linear models. The linear kernel in SVC was chosen for its simple implementation on limited dataset sizes,⁶³ and Ridge regression is inherently a linear method,⁴⁵ so neither is capable of capturing complex, non-linear relationships between the input assessment features and the target classes.⁶⁴,⁶⁵ The SVC accuracy scores observed in this study align with the approximately 81% accuracy reported in prior work⁶⁶ for detecting prior concussions in retired athletes, and the Ridge classifier’s predictive accuracy, measured by Mean Squared Error (MSE) and Mean Absolute Error (MAE), is comparable to that of LASSO, identified as the top-performing model for predicting sports-related TBI in NCAA athletes.⁶⁷ Despite these encouraging benchmarks, the additional difference-based features derived from the second visit provided these models with little new information beyond what was already encoded in the original variables, yielding no meaningful gain in predictive accuracy across visits for either model.

Feature Importance and Average Effect The cross-model agreement observed here (i.e., convergence on similar features in the top 20 importance lists and on their average effect on the predicted time to clearance) is expected in part because the models rely on similar supervised learning methods.⁶⁸ Beyond the shared learning approach, the agreement may also reflect a true relationship between those features and a patient’s time to clearance.⁶⁹

Features that appear across all models should be assessed and incorporated into a TBI patient’s personalized treatment plan, as the models’ results identify them as most important to the time to clearance prediction. Further, if the occurrence counts of a base feature and its difference variant in the second-visit dataset sum to at least 9 (of a possible 12), the pair is deemed to have high clinical significance. Specific features that meet these criteria are explained in detail in the following paragraphs (Tables 6–9).

Table 6.

Top 20 feature scores for machine learning models – Visit 1.

LightGBM		Decision Tree		Random Forest		XGBoost		SVC		Ridge
Feature	Imp.	Feature	Imp.	Feature	Imp.	Feature	Imp.	Feature	Imp.	Feature	Imp.
Cervical Flexion	169.712095	Saccades Horizontal Headache	0.101640	Cervical Flexion	0.104531	VOMS Headache	3.696006	VOMS Fogginess	0.132768	VOMS Fogginess	0.101229
Left Cervical Rotation	137.412696	VOR Horizontal Dizziness	0.075271	Left Cervical Rotation	0.104429	Saccades Horizontal Headache	2.603712	Saccades Horizontal Fogginess	0.066317	VMST Headache	0.075422
PHQ-9 Score	117.477276	Right Cervical Rotation	0.074313	Right Cervical Rotation	0.085136	VOR Horizontal Headache	1.998317	Smooth Pursuits Fogginess	0.066310	Saccades Horizontal Fogginess	0.028571
Left Lateral Flexion	113.081305	PHQ-9 Score	0.069096	PHQ-9 Score	0.081346	Left Cervical Rotation	1.726872	VMST Headache	0.002160	VOR Vertical Headache	0.023502
GAD-7 Score	89.292642	Previous Head Injury	0.067866	Left Lateral Flexion	0.064458	Cervical Flexion	1.720641	VOR Vertical Headache	0.002008	VOMS Headache	0.016436
Right Cervical Rotation	81.066041	Left Lateral Flexion	0.067476	Cervical Extension	0.057082	VMST Dizziness	1.582768	Smooth Pursuits Dizziness	0.001783	Saccades Vertical Dizziness	0.015515
Cervical Extension	61.320671	VOR Vertical Dizziness	0.066531	GAD-7 Score	0.051770	Right Cervical Rotation	1.547318	VOMS Dizziness	0.001707	VMST Dizziness	0.013057
NPC Headache	34.372974	Saccades Vertical Fogginess	0.061698	VOR Horizontal Headache	0.029481	NPC Headache	1.426695	NPC Headache	0.001482	Smooth Pursuits Dizziness	0.009370
VOR Horizontal Dizziness	29.291269	Cervical Flexion	0.057859	NPC Headache	0.025966	Smooth Pursuits Headache	1.311398	Saccades Vertical Dizziness	0.001470	VOR Vertical Dizziness	0.008295
VOR Vertical Dizziness	28.412666	VOMS Headache	0.056673	VOR Vertical Headache	0.025339	PHQ-9 Score	1.248514	NPC Dizziness	0.001383	Saccades Horizontal Dizziness	0.003687
VMST Headache	26.949200	Left Cervical Rotation	0.048231	VOR Vertical Dizziness	0.024689	VOR Horizontal Dizziness	1.074843	Smooth Pursuits Headache	0.001161	NPC Dizziness	0.003226
Smooth Pursuits Fogginess	24.984704	VMST Headache	0.040304	VMST Dizziness	0.024522	Left Lateral Flexion	0.953129	NPC Fogginess	0.001006	Cervical Flexion	0.002304
VOR Horizontal Headache	19.395496	NPC Headache	0.039941	VOR Horizontal Dizziness	0.023149	Previous Head Injury	0.920716	History of Mood Disorder	0.000703	Left Lateral Flexion	0.000614
VMST Dizziness	18.911416	NPC Dizziness	0.038961	VMST Headache	0.022981	VMST Headache	0.869875	VOMS Headache	0.000573	Previous Head Injury	0.000000
Saccades Horizontal Headache	16.961958	VOR Vertical Fogginess	0.032568	Smooth Pursuits Headache	0.021290	VOR Vertical Headache	0.789330	VOR Horizontal Headache	0.000502	Smooth Pursuits Headache	-0.000461
VOMS Headache	16.128770	Cervical Extension	0.026422	Saccades Vertical Headache	0.020215	GAD-7 Score	0.781334	VOR Horizontal Fogginess	0.000365	NPC Headache	-0.000768
VOR Vertical Headache	15.519637	VMST Dizziness	0.026287	Saccades Horizontal Headache	0.018980	Smooth Pursuits Fogginess	0.644567	VOR Vertical Fogginess	0.000329	VOR Horizontal Fogginess	-0.001536
History of Mood Disorder	13.483883	GAD-7 Score	0.011150	VOMS Headache	0.018487	VOMS Dizziness	0.629748	VMST Fogginess	0.000318	Saccades Vertical Headache	-0.001843
Previous Head Injury	12.862027	VOMS Dizziness	0.007611	VMST Fogginess	0.017119	VOR Vertical Dizziness	0.570209	Saccades Vertical Headache	0.000246	VMST Fogginess	-0.001843
Saccades Horizontal Dizziness	12.215913	Saccades Horizontal Dizziness	0.004396	Saccades Horizontal Dizziness	0.016843	Saccades Vertical Headache	0.542436	VMST Dizziness	0.000237	PHQ-9 Score	-0.002151

Bold text and colored cells denote features that show up across all ML models.

Table 7.

Top 20 feature scores for machine learning models – Visit 2.

LightGBM		Decision Tree		Random Forest		XGBoost		SVC		Ridge
Feature	Imp.	Feature	Imp.	Feature	Imp.	Feature	Imp.	Feature	Imp.	Feature	Imp.
Cervical Extension	132.997165	Treatment Present	0.209153	PHQ-9 Score	0.158378	PHQ-9 Score	7.520260	Treatment Present	0.277474	VOR Vertical Headache Difference	0.112289
PHQ-9 Score	132.629271	NPC Headache Difference	0.130095	Treatment Present	0.107146	Treatment Present	6.863630	VMST Headache Difference	0.229089	VMST Headache Difference	0.086329
Treatment Present	102.996456	GAD-7 Score	0.093987	Right Cervical Rotation	0.091488	VOR Horizontal Headache	4.832900	VOR Vertical Headache Difference	0.225829	VOR Vertical Headache	0.035484
GAD-7 Score	99.256491	Right Cervical Rotation	0.080408	VOR Vertical Headache	0.072016	VOR Vertical Headache Difference	4.686990	NPC Headache	0.183861	Saccades Vertical Headache Difference	0.035023
Left Cervical Rotation	94.990231	Smooth Pursuits Fogginess	0.070633	Cervical Extension	0.067849	VMST Headache Difference	4.647623	Smooth Pursuits Headache Difference	0.124145	NPC Headache	0.027957
Right Cervical Rotation	85.222731	VOR Horizontal Dizziness Difference	0.056453	Left Cervical Rotation	0.060300	VMST Dizziness Difference	3.845895	Saccades Vertical Headache Difference	0.114409	VOMS Headache Difference	0.020891
Cervical Flexion	65.853179	VOR Horizontal Dizziness	0.055287	Cervical Flexion	0.054013	Right Cervical Rotation	3.588670	VMST Headache	0.107742	VMST Dizziness Difference	0.018894
VOR Horizontal Dizziness Difference	50.793655	Cervical Extension	0.052546	VOR Horizontal Dizziness	0.047586	VOR Vertical Headache	3.425193	VOR Vertical Headache	0.091372	VOR Horizontal Fogginess	0.014439
VOR Vertical Headache	41.973389	Left Cervical Rotation	0.049496	Left Lateral Flexion	0.040433	VOR Horizontal Dizziness	3.338928	Saccades Vertical Headache	0.090034	Left Cervical Rotation	0.010753
VMST Dizziness Difference	40.289953	VOR Vertical Headache	0.047843	VMST Dizziness Difference	0.040085	Left Cervical Rotation	2.505095	VOR Horizontal Headache Difference	0.088041	PHQ-9 Score	0.009831
Left Lateral Flexion	39.628385	PHQ-9 Score	0.042504	VOR Horizontal Dizziness Difference	0.038589	VOR Horizontal Dizziness Difference	2.120906	VMST Dizziness	0.064442	Treatment Present	0.009677
VOR Vertical Headache Difference	29.793191	Saccades Horizontal Headache Difference	0.032939	GAD-7 Score	0.038330	GAD-7 Score	2.113929	History of Mood Disorder	0.063288	VMST Headache	0.009677
VOR Horizontal Headache Difference	29.516739	Left Lateral Flexion	0.009242	VOR Vertical Headache Difference	0.021061	VOR Horizontal Headache Difference	1.921161	NPC Headache Difference	0.061687	VOR Horizontal Headache	0.008141
VOR Horizontal Dizziness	28.416646	Smooth Pursuits Headache Difference	0.008370	VMST Dizziness	0.018404	Cervical Extension	1.817560	Saccades Horizontal Headache	0.060007	VOR Horizontal Dizziness	0.007527
Saccades Horizontal Headache Difference	25.220808	VOR Horizontal Headache Difference	0.007884	VOR Horizontal Headache Difference	0.015961	VMST Dizziness	1.443005	VOR Horizontal Fogginess	0.058896	Cervical Flexion	0.007066
VMST Headache Difference	21.213691	VMST Dizziness Difference	0.007234	VOMS Headache Difference	0.015678	NPC Headache Difference	1.431498	VOR Vertical Fogginess	0.053299	Saccades Vertical Headache	0.005837
VOR Horizontal Fogginess	13.139492	VOR Vertical Dizziness	0.006691	VOR Horizontal Headache	0.013487	Left Lateral Flexion	1.201151	VMST Fogginess	0.053299	Saccades Horizontal Headache	0.005684
NPC Headache Difference	9.903239	Saccades Horizontal Fogginess	0.006682	Saccades Vertical Headache Difference	0.012717	Cervical Flexion	1.153562	Saccades Horizontal Dizziness	0.052364	VOMS Headache	0.003994
Previous Head Injury	9.243800	VOR Vertical Headache Difference	0.006217	VMST Headache Difference	0.012322	Saccades Horizontal Headache Difference	1.048383	Smooth Pursuits Headache	0.051990	VMST Fogginess	0.003994
VMST Dizziness	9.105162	Saccades Vertical Headache Difference	0.003349	Saccades Horizontal Headache Difference	0.008727	Saccades Horizontal Headache	0.707218	VOMS Headache	0.051990	Smooth Pursuits Dizziness	0.003840

Bold text and colored cells denote features that show up across all ML models. Olive and forest green colored cells denote a base feature and its difference variant whose occurrence totals sum to 9 or more.

Table 8.

Feature frequency across models — Visit 1.

Feature	Occurrence in no. of models
NPC Headache	6
VMST Dizziness	6
VMST Headache	6
VOMS Headache	6
Cervical Flexion	5
Left Cervical Rotation	5
Left Lateral Flexion	5
VOR Vertical Dizziness	5
VOR Vertical Headache	5
GAD-7 Score	4
PHQ-9 Score	4
Previous Head Injury	4
Right Cervical Rotation	4
Saccades Horizontal Dizziness	4
Saccades Horizontal Headache	4
Smooth Pursuits Headache	4
VOR Horizontal Dizziness	4
VOR Horizontal Headache	4
Cervical Extension	3
NPC Dizziness	3
Saccades Vertical Headache	3
Smooth Pursuits Fogginess	3
VMST Fogginess	3
VOMS Dizziness	3
VOR Vertical Fogginess	3
History of Mood Disorder	2
Saccades Horizontal Fogginess	2
Saccades Vertical Dizziness	2
Smooth Pursuits Dizziness	2
VOMS Fogginess	2
VOR Horizontal Fogginess	2
NPC Fogginess	1
Saccades Vertical Fogginess	1

Green colored cells denote features that show up across all models.

Table 9.

Feature frequency across models — Visit 2.

Feature	Occurrence in No. of Models
VOR Vertical Headache	6
Treatment Present	6
VOR Vertical Headache Difference	6
VOR Horizontal Dizziness	5
VMST Headache Difference	5
VMST Dizziness Difference	5
VOR Horizontal Headache Difference	5
Left Cervical Rotation	5
PHQ-9 Score	5
Cervical Extension	4
Right Cervical Rotation	4
GAD-7 Score	4
Left Lateral Flexion	4
NPC Headache Difference	4
VOR Horizontal Dizziness Difference	4
Saccades Horizontal Headache Difference	4
Saccades Vertical Headache Difference	4
Cervical Flexion	4
VMST Dizziness	3
VOR Horizontal Fogginess	3
VOR Horizontal Headache	3
Saccades Horizontal Headache	3
Smooth Pursuits Headache Difference	3
VMST Headache	2
NPC Headache	2
VOMS Headache Difference	2
VOMS Headache	2
Saccades Vertical Headache	2
VMST Fogginess	2
History of Mood Disorder	1
VOR Vertical Dizziness	1
Saccades Horizontal Dizziness	1
Smooth Pursuits Fogginess	1
Previous Head Injury	1
VOR Vertical Fogginess	1
Smooth Pursuits Headache	1
Smooth Pursuits Dizziness	1
Saccades Horizontal Fogginess	1

Green colored cells denote features that show up across all models. Cells in orange and bold identify variable pairs (base feature and difference variant) whose combined occurrences show up at least 75% of the time.
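
The 75% pairing rule in the note above can be checked directly against the occurrence counts in Table 9. The sketch below copies a subset of those counts (the base features whose difference variant also appears among the more frequent rows) and flags pairs whose combined count reaches 75% of the 12 possible appearances, i.e., 9. The helper names and the reliance on a " Difference" name suffix are assumptions made for the illustration.

```python
# Occurrence counts copied from Table 9 (Visit 2, six models); subset of rows
visit2_counts = {
    "VOR Vertical Headache": 6, "VOR Vertical Headache Difference": 6,
    "VOR Horizontal Dizziness": 5, "VOR Horizontal Dizziness Difference": 4,
    "VMST Headache": 2, "VMST Headache Difference": 5,
    "VMST Dizziness": 3, "VMST Dizziness Difference": 5,
    "VOR Horizontal Headache": 3, "VOR Horizontal Headache Difference": 5,
    "NPC Headache": 2, "NPC Headache Difference": 4,
}


def paired_totals(counts, suffix=" Difference"):
    """Sum each base feature's count with its difference variant's count."""
    totals = {}
    for name, count in counts.items():
        if name.endswith(suffix):
            continue  # difference variants are handled via their base feature
        diff_name = name + suffix
        if diff_name in counts:
            totals[name] = count + counts[diff_name]
    return totals


def high_significance_pairs(counts, n_models=6, min_share=0.75):
    """Base features whose paired total reaches min_share of 2 * n_models."""
    threshold = min_share * 2 * n_models  # 0.75 * 12 = 9 appearances
    return sorted(name for name, total in paired_totals(counts).items()
                  if total >= threshold)


totals = paired_totals(visit2_counts)
selected = high_significance_pairs(visit2_counts)
```

With these counts only two pairs qualify: VOR Vertical Headache (12 of 12) and VOR Horizontal Dizziness (9 of 12). The base and difference pairs in Table 9 that are omitted from this subset all total fewer than 9.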

The high frequency of Vestibular Ocular Motor Screening (VOMS)-related features across top-performing models warrants a nuanced discussion regarding potential redundancy. Measures such as Near Point of Convergence (NPC) Headache, Visual Motion Sensitivity Test (VMST) Headache, and Vestibulo-Ocular Reflex (VOR) Vertical Headache appear as primary predictors in nearly all model architectures. While these features originate from the same clinical battery and share variance related to global symptom provocation, their independent selection suggests they capture distinct physiological stressors. Unlike an aggregate VOMS composite score, which serves as a general diagnostic marker, the individual sub-assessments isolate specific vestibular and oculomotor pathways; for instance, the identification of VOR Vertical Headache as more prognostic than horizontal variants highlights the importance of keeping granular data even when features appear collinear.

For the first visit, Near Point of Convergence headache (i.e., npc_headache), Visual Motion Sensitivity Test dizziness (i.e., vmst_dizziness), Visual Motion Sensitivity Test headache (i.e., vmst_headache), and Vestibular Ocular Motor Screening headache (i.e., voms_headache) appear in every ML model’s top 20 features list. These features predicted towards a time to clearance of over a month in 81% of models for npc_headache, vmst_headache, and voms_headache and in 66% of models for vmst_dizziness. The model in which these features did not predict towards a time to clearance of over a month was SVC. This discrepancy is attributed to the kernel used, as a linear kernel does not properly capture complex relationships between the input and the target output.⁶⁴ VOMS is a well-established physical examination tool for concussion, with evidence supporting both its diagnostic sensitivity and prognostic value.⁷⁰,⁷¹ The composite score produced by the VOMS assessment has been identified as the most accurate data point in the diagnosis of concussion, with vertical saccades and horizontal vestibular/ocular reflex testing demonstrating the greatest diagnostic impact.⁷⁰ In addition, 40% of the unique patients in the dataset received vestibular therapy (vestibular_tx), which includes various exercises intended to improve vertigo, gaze, posture, and overall quality of life.⁷² However, dizziness during visual motion sensitivity testing and headache symptoms overall, particularly during NPC and VMST testing at the initial assessment, have not previously been identified as having independent clinical significance beyond the overall VOMS score. Similarly, headache provoked during vertical vestibular ocular reflex testing and its change between the first and second visits have not previously been identified as uniquely significant.

It is important to mention that variants of the same general symptom follow the same increasing or decreasing trend across VOMS sub-assessments. This may appear redundant, as the individual components (Near Point of Convergence (NPC), Visual Motion Sensitivity Test (VMST), horizontal and vertical Vestibulo-Ocular Reflex (VOR), Smooth Pursuits, and Saccades) all contribute to the same overarching symptom domains of headache, dizziness, and fogginess. Because the VOMS is administered as a sequential battery, a patient presenting with an elevated baseline headache prior to testing will likely carry that symptom load across every sub-component, making it difficult to disentangle component-specific provocation from general symptom burden without intra-assessment change-from-baseline scoring.

An attempt to represent inter-visit change is made using a _diff variant of each feature, where the difference value is the score recorded at the second visit minus the score at the initial visit. However, these difference features still capture changes in specific VOMS components between clinical visits rather than isolating symptom provocation within a single assessment. Within the VOMS battery specifically, this distinction is clinically meaningful: VOR Vertical Headache and its difference variant (VOR Vertical Headache Difference) both appeared across all six Visit 2 models, suggesting that headache provoked during vertical vestibulo-ocular reflex testing and its change over time carry independent prognostic signal beyond what is captured by the composite VOMS score or by horizontal-plane VOR testing alone. The directional asymmetry between vertical and horizontal VOR headache frequency across models further supports retaining granular sub-assessment data rather than collapsing to an aggregate score.
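
The construction of the _diff features described above amounts to a per-feature subtraction, which the following sketch illustrates for a single patient. The symptom scores are invented, `difference_features` is a hypothetical helper, and skipping features with a missing value at either visit is an assumption about missing-data handling that the text does not specify.

```python
def difference_features(visit1, visit2, suffix="_diff"):
    """Return {name + suffix: visit2 - visit1} for features scored at both visits."""
    return {
        name + suffix: visit2[name] - visit1[name]
        for name in visit1
        if visit2.get(name) is not None and visit1[name] is not None
    }


# Invented symptom scores for one patient (None marks a missing assessment)
visit1 = {"vor_vert_headache": 6, "vmst_dizziness": 4, "npc_headache": 2}
visit2 = {"vor_vert_headache": 3, "vmst_dizziness": 4, "npc_headache": None}

diffs = difference_features(visit1, visit2)
# {'vor_vert_headache_diff': -3, 'vmst_dizziness_diff': 0}
```

With the second-minus-first convention, a negative value indicates that the provoked symptom score fell between visits and a positive value indicates that it rose.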

To better identify which VOMS components carry genuine clinical significance beyond global symptom burden, a conservative criterion is proposed: a VOMS feature should be considered clinically significant only when both its original variant and its difference variant appear in the top 20 features list across models. For example, the appearance of both VMST Headache and VMST Headache Difference would indicate that the specific physiological stressor isolated by visual motion sensitivity testing, as distinct from the NPC or VOR components, contributes independently to recovery prediction at both the initial and follow-up visit. This paired-feature criterion helps separate true component-level signal from the shared variance attributable to overall symptom severity, and provides a more principled basis for prioritizing which VOMS sub-assessments to emphasize in longitudinal clinical monitoring protocols.

Treatment presence (i.e., treatment_present) appeared in the top 20 features list of all six ML models. It also ranks in the top 3 features for 81% of the models and has the highest average effect for Class 0 (Figures 1–3). Given this, tracking this feature longitudinally during TBI recovery is important, since prescribing some form of treatment after the first visit is clinically relevant to the prediction of time to medical clearance. Treatment presence indicates that a patient received at least one of 11 specific treatment types, which span pharmacological and non-pharmacological options: Selective Serotonin Reuptake Inhibitors (SSRI), amantadine, stimulant, preventative headache, vestibular, physical therapy (pt), chiropractic, psychological, neuropsychological, neurology, and cognitive. A range of approaches to TBI rehabilitation is imperative following diagnosis to restore a patient’s capabilities for an eventual return to play.⁷³ From this interpretation alone, it is unclear whether the presence of treatment is directly related to recovery time or whether a confounding variable produces this finding. Providing treatment after the initial visit reduces a concussed athlete’s perception of pain and improves performance on both cognitive and physical tests measured by gold standard assessments. The improvement in these individual components, rather than treatment alone, would then be linked to the time to medical clearance. Presence of treatment appears alongside VOR Horizontal Dizziness Difference, VOR Vertical Headache Difference, Saccades Horizontal Headache Difference, and VMST Dizziness Difference in the four models that exhibited improvement in accuracy (Table 1). This could mean that the difference values for these features play a role in recovery time along with treatment presence.⁷⁴

Additionally, vestibulo-ocular reflex (VOR) vertical headache (i.e., vor_vert_headache) stands out because it appeared most often across all features when accounting for both initial and second visits: in 81% of initial visit models and 100% of second visit models. Based on this output, it may be considered the most important feature, especially as its difference variant (i.e., vor_vert_headache_diff) also appears in all six second visit models. This pair of features commonly pushes predictions towards Class 1 (i.e., prolonged recovery, time to clearance ≥ 30 days). The addition of difference features gave all models additional information from which to learn the relationship between the two inputs, vor_vert_headache and vor_vert_headache_diff, and the predicted time to clearance.⁵⁹ Given this relationship, it is important to assess this feature longitudinally, as it is clinically relevant to the prediction of time to medical clearance. Disruptions to visual and vestibular signals occur in patients experiencing TBI from sports-related activities.⁷⁵ Headaches have been found to correlate with visual-vestibular mismatch, which prolongs recovery if left untreated.⁷⁶ Thus, medical personnel should create treatment plans that directly address headaches stemming from injury to the vestibulo-ocular reflex to support effective TBI recovery.

The assessment features VOR horizontal dizziness (i.e., vor_horiz_dizziness) and VOR horizontal dizziness difference (i.e., vor_horiz_dizziness_diff) total 9 occurrences based on their respective values in Table 4. For second visits, the base feature drives Class 1 predictions in 81% of models and its difference variant does so in 100% of models. These findings suggest that the ML models hold extensive information about VOR horizontal dizziness once both the base feature and its difference variant are included, which supports a strong relationship between these inputs and the target outcome, time to clearance.⁶⁹ These features should therefore be assessed longitudinally in TBI patients, as they are important to the prediction of time to medical clearance. Dizziness is a core symptom of vestibular migraine,⁷⁶ which can follow events such as concussion. Patients with TBI report headaches more often than dizziness,⁷⁷ which may explain why this feature pair appears less frequently across the models as a clinically important assessment feature than vor_vert_headache and its difference variant.

Features that occur less frequently across ML models than those mentioned earlier still hold some degree of clinical importance (Tables 1 and 2). Because they appeared in a top 20 features list, the ML models treated them as informative for predicting time to medical clearance following concussion.

Thus, personalized treatment plans can involve any combination of the features in the top 20 features scores list (Table 2), as combining multimodal approaches translates to improved sports-related TBI recovery.⁷⁸ It is highly encouraged to first account for VOR Vertical Headache, VOR Vertical Headache Difference, Treatment Present, VOR Horizontal Dizziness, VOR Horizontal Dizziness Difference, VMST Dizziness, and VMST Dizziness Difference before including additional assessment features, as these were found to be of highest clinical significance due to their frequency across ML models. The choice of combination will depend on the patient’s attributes (past records, condition, etc.).

While XGBoost achieved the highest overall accuracy (0.84) with Visit 2 features and statistically significant improvement across visits in the initial method implementation, we do not designate a single final deployable model at this stage. The six models serve as a comparative framework for identifying robust clinical predictors rather than as ready-for-deployment tools. Model specification details and code are available at https://github.com/MeganTran6023/Sport-Related-Concussions_Machine-Learning to support future replication and external validation (TRIPOD+AI Item 22).

Comparison with Previous Studies

Previous research studied concussion diagnosis at a single time point with gold standard assessments using ML models such as XGBoost and Random Forest.⁶²,⁷⁹,⁸⁰ A different study on predicting sports injuries in football uses multiple time points and extracts important features that drive models’ predictions; however, it is not specific to concussions from sports-related activities.⁸¹ The current study not only uses multiple time points for analysis, but also determines the essential assessment features that drive models to demonstrate an improvement in accuracy between initial and second visits.

To extend the detection of clinically important features determined by the models, this study incorporates feature engineering to generate new features, mainly the difference variants of the base original features. These new features provide models more information about the existing features and their interactions to produce informed results.⁸²

Through these novel implementations, this study paves the way to optimize assessments for clinicians to provide high quality personalized rehabilitation protocols for athletes with sports-related TBI injuries.

Clinical Relevance

The intended users of a future validated version of this model are licensed clinicians experienced in concussion management (e.g., sports medicine physicians, neurologists). Input features are derived from standardized clinical assessments (VOMS, BESS, ImPACT) already routinely administered in concussion care. Clinicians would need to input structured assessment data at one or two time points; no specialized computing expertise is required beyond use of a provided interface. However, poor-quality or missing input data, particularly VOMS subscores, should prompt clinical judgment rather than model reliance, as the current model was not trained on imputed data and may underperform when key features are absent (TRIPOD+AI Item 27b).

The increasing trend in concussion incidence can be primarily attributed to increased awareness in the diagnosis of concussions by trained medical professionals.⁸³,⁸⁴ This may have led to more consistent reports of concussion cases that may have gone unnoticed in previous studies/findings.⁸⁵ Clinicians who treat concussion patients may benefit from predictive models in several important ways.⁸⁶ The results explore the feasibility of identifying individuals at risk for prolonged recovery during initial assessment, which could eventually serve as a preliminary signal for investigating targeted interventions. For example, identifying aspects of the VOMS assessment as discussed above early may inform clinicians and result in earlier and/or more targeted physical therapy to address these higher risk features. Early initiation of physical therapy has already been identified as beneficial in recovery⁸⁷ and ML models may serve as a tool to identify patients most likely to benefit. Furthermore, enhancing recovery efficiency not only improves quality of life, but also facilitates a quicker return to normal function or athletic participation.⁸⁸ In sports settings, recognizing athletes at risk for extended recovery can inform individualized treatment strategies and help establish realistic expectations for both the athlete and the team.

Limitations

Although the results indicate a positive direction for concussion screening and treatment, there are limitations to this study. First, the relatively small dataset size represents an important limitation of this work, as the limited number of samples could affect the model’s ability to generalize to TBI patients with diverse demographic backgrounds and clinical profiles. As a result, the findings may not fully reflect the variability observed in broader patient populations.⁸⁹ In addition, having a greater number of data points belonging to Class 0 would be beneficial, as this would allow for a more robust evaluation of the ML model’s performance, particularly in assessing its ability to classify and predict time to medical clearance accurately. This is a common issue in ML problems with clinical data, where classifiers learn best on the class with the most data points because they attempt to optimize a single aggregate metric while overlooking the distribution of the data across the target classes.⁵⁷

From the reported results, the small and uneven balance between outcome classes 0 and 1 (41 patients and 176 patients, respectively) resulted in low specificity scores. Additionally, the inflated accuracy and recall scores could be attributed to overprediction of the dominant class. Balancing techniques such as class weighting, resampling, or a combination of other methods should be implemented to ensure the reported results are not skewed by this limitation. Another aspect to note is that overprediction for a class could result in false positives and/or false negatives. Clinically speaking, predicting a recovery time shorter than the actual time to clearance can lead to undesired effects such as worsening existing concussive symptoms⁹⁰ and reduced responsiveness to brain-derived neurotrophic factor (BDNF).⁹¹ Conversely, predicting a recovery time longer than the actual one would hinder the concussed patient’s physical and psychological state, reducing their quality of life.⁹²
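To make the effect of class imbalance concrete, the sketch below evaluates a trivial rule that labels every patient as Class 1, using the cohort counts reported in this study (176 prolonged, 41 normal). The helper `binary_metrics` is hypothetical, and the "balanced" weighting shown (total samples divided by the number of classes times the class count) is one common heuristic, not necessarily what a future analysis would adopt.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, recall (sensitivity) and specificity for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Cohort composition from the study: 176 Class 1, 41 Class 0.
y_true = [1] * 176 + [0] * 41
y_pred = [1] * len(y_true)  # trivial rule: always predict prolonged recovery
baseline = binary_metrics(y_true, y_pred)

# "Balanced" class weights: n_samples / (n_classes * class_count).
counts = {0: 41, 1: 176}
weights = {c: len(y_true) / (2 * n) for c, n in counts.items()}
```

The trivial rule already reaches an accuracy of about 0.81 with perfect recall and zero specificity, so accuracy values in the low 0.80s should be read against that baseline. Under the balanced heuristic, each Class 0 patient would receive a weight of roughly 2.65 versus roughly 0.62 for Class 1.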

Formal statistical filtering, initially piloted using independent t-tests and Mann-Whitney tests, was excluded from the final preprocessing workflow as it failed to enhance longitudinal predictive performance. Instead, feature selection prioritized iterative cleaning based on missingness thresholds and clinical relevance to maintain a comprehensive feature space for the machine learning models. While the omission of formal statistical testing during preprocessing is acknowledged as a limitation to analytical rigor, future research should investigate alternative statistical frameworks and structured dimensionality reduction to optimize feature utility and address the event-per-variable ratio.

A notable limitation of this analysis is the absence of formal statistical testing to compare performance differences between visits and the lack of reported confidence intervals for key metrics. While 66% of the models showed numerical improvements in accuracy with the addition of Visit 2 data, including a 5% increase for the XGBoost model, the statistical significance of these changes was not formally evaluated. Additionally, without confidence intervals, the precision of performance values, such as the peak accuracy of 0.84, remains unquantified.⁹³ This lack of statistical rigor is underscored by the small cohort of 217 patients and the potential for distributional bias inherent in the Leave-One-Out Cross-Validation (LOOCV) method.
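As an illustration of how the precision of a reported accuracy could be quantified, the sketch below computes a Wilson score interval for a proportion. It was not part of the study's analysis; the function name `wilson_interval` is hypothetical, and the calculation treats the 217 LOOCV predictions as independent trials, which is only an approximation.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives ~95% coverage)."""
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


# Peak accuracy and cohort size reported in the study.
low, high = wilson_interval(0.84, 217)
```

Under these assumptions the interval spans roughly 0.79 to 0.88, which would include the accuracy obtained by always predicting the majority class (176/217, about 0.81).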

A supplementary factor to consider is the gender imbalance in the dataset. Of the 176 patients in Class 1, 113 (64.2%) are female and 63 (35.8%) are male. Of the 41 patients in Class 0, 24 (58.5%) are female and 17 (41.5%) are male. As a result, these findings may be biased toward a female population. More specifically, it is possible that for some or all models, the correctly classified Class 0 and Class 1 predictions are mainly females. This is important to note, as previous work shows that females take longer to recover from sports-related TBI than males across different sports (i.e., basketball, rugby, soccer, and squash)⁹⁴ (TRIPOD+AI Item 3c). As our dataset does not specify the sport associated with each sports-related concussion, the higher proportion of females in Class 1 compared to Class 0 may be due to the nature of the sports in which each patient participated. Another study found that female collegiate soccer athletes who sustained a sports-related TBI experienced a longer recovery time than male collegiate soccer athletes,⁹⁵ but it was likewise limited by a small dataset and a gender imbalance.

Another limitation relates to the features utilized during the ML stage. Although a feature selection method was applied to identify the most essential predictors, some of the retained features were based on self-reported data. Self-reported measures are inherently subject to bias, which can introduce uncertainty into the model.^96,97 This bias may reduce the reliability of the model’s predicted class outcome and perceived clinical importance of certain features, potentially affecting their true relationship with days to medical clearance.

The lack of improvement in accuracy across visits can largely be attributed to the reliance on linear models. In the case of the SVC, the use of a linear kernel, while appropriate given the limited dataset size and its straightforward implementation, likely restricted the model’s ability to capture complex, non-linear relationships between assessment features and class labels.⁶⁴ As a result, the model may have failed to leverage additional information introduced at the second visit, leading to unchanged performance. Similarly, the Ridge regression model’s inherently linear nature limits its capacity to model non-linear interactions in the data.⁴⁵ Consequently, the difference-based features derived from the second visit may not have contributed meaningful new information beyond what was already represented in the first-visit features, resulting in comparable predictive accuracy across visits.

The current prediction method of binary classification oversimplifies the prediction of patients. The assignment of classes ‘under a month’ and ‘equal to or greater than 1 month’ sacrifices granular clinical utility. A reformulation of this method would be to utilize granular, multiclass buckets. To ensure nuanced predictions, the buckets could be divided into acute, typical, and prolonged recovery for both improved methodological accuracy and clinical significance.⁹⁸

Also, the current dataset lacks neurocognitive data outside of symptom reporting. This type of data is useful in concussion assessment. For instance, neurocognitive tests capture specific trends in a patient’s symptoms during recovery, whereas recording only whether a patient is receiving some neurocognitive treatment is not sufficient to conclude that the outcome predictions are of high accuracy.¹⁷

Treatment presence does not account for the varying time each treatment takes to reach its therapeutic effect (e.g., some Selective Serotonin Reuptake Inhibitor (SSRI) treatments take 8 weeks to have a full therapeutic effect⁹⁹). If a patient’s data are recorded before the treatments they received have had time to take effect, the interpretation of which features are most significant to their predicted concussion recovery would be skewed. To resolve this, it would be beneficial to collect data that account for each treatment’s time frame, giving a more accurate depiction of the results.

Although the filtering process did eliminate the majority of features not significant for classifying concussion days to clearance, redundant features (i.e., headache, dizziness, etc.) were still retained. An attempt to differentiate the specific components of the exams was made by calculating the difference between the second and initial visits. Furthermore, the sample is small relative to the number of features in the dataset, which makes the analysis prone to overfitting and instability in feature importance listings.¹⁰⁰ A possible future implementation is recursive feature elimination (RFE) for feature selection. This method should be used alongside the inclusion of more patients and data types collected over time, resulting in a dataset with a more robust, objective set of measures. In this setting, RFE would be able to systematically remove the least important features/types from a dataset to improve model performance, reduce overfitting, and enhance interpretability.

RFE-MF (MissForest) was found to outperform four of the classic data imputation methods (mean/mode imputation, kNN, MICE, and MF) in addressing the critical need for data accuracy in medical research, where it helps mitigate challenges that can impair clinical decision-making and ultimately affect the quality of patient care.¹⁰¹

While LOOCV is recommended for small datasets such as the one used in this study, it introduces distributional bias. In turn, this leaks information about the held-out test item to the model and reduces the performance of commonly used ML models.¹⁰² Therefore, stratified repeated holdout and a modified version of k-fold cross-validation are recommended to avoid this bias.¹⁰³ Although LOOCV is a standard validation strategy for small datasets, its results remain highly contingent upon the specific characteristics of the cohort. LOOCV maximizes data utility to minimize estimation bias, but these outputs do not inherently ensure clinical generalizability. This limitation is often a consequence of the model overfitting to the unique noise within the sample. Therefore, while the methodology may yield stable internal performance metrics, it does not guarantee that the model will demonstrate comparable accuracy across external, heterogeneous datasets.

Despite these limitations, LOOCV is deterministic: every sample serves as the test set exactly once, so there is no random partitioning, and for fixed model settings both the point estimate (accuracy) and its variance are identical across repeated runs. This lack of fluctuation allows for reproducible findings.

The small sample size used is prone to overfitting. This was primarily addressed through strategic model selection and the implementation of regularization techniques. Specifically, the Ridge classifier employs L2 regularization to shrink the coefficients of features with weaker associations to the outcome groups, which reduces variance and stabilizes the model’s estimates.¹⁰⁴
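The shrinkage effect of L2 regularization can be seen in the simplest possible case, a single feature with no intercept, where the ridge coefficient has the closed form w = Σxy / (Σx² + α). The sketch below is purely illustrative: the toy data, the -1/+1 label coding, and the name `ridge_coefficient` are assumptions and are unrelated to the study's fitted models.

```python
def ridge_coefficient(x, y, alpha):
    """Closed-form ridge solution for one feature and no intercept:
    w = sum(x*y) / (sum(x*x) + alpha)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)


# Toy feature values and labels coded as -1/+1 (illustrative only).
x = [-2.0, -1.0, 1.0, 2.0]
y = [-1.0, -1.0, 1.0, 1.0]
coefs = {alpha: ridge_coefficient(x, y, alpha) for alpha in (0.0, 1.0, 10.0)}
```

As α grows, the coefficient moves toward zero (here from 0.6 with no penalty to 0.3 with α = 10), which is the variance-reducing behavior described above.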

Future Work

The focus of this study is to ensure that the ML models accurately predict concussed patients’ time to clearance from sport-related TBI with the integration of longitudinal data and gold-standard assessments per our objectives. While we are optimistic about the acquired results, there are ways to expand upon this current study.

Future work should prioritize increasing the overall dataset size. A larger dataset would allow the ML models to learn more stable and representative patterns,⁸⁹ which in turn would improve their ability to identify the most important features for accurately predicting time to medical clearance. With more data, the influence of noise and bias would be reduced, leading to clearer insights into which variables truly contribute to prediction performance and improving confidence in the model’s results. Moreover, having a balanced dataset between both genders and target Class classifications (Class 0 and Class 1) would allow the results to best describe a generalized set of athletes which will be applicable to patients who experienced other methods of TBI.

Expanding the analysis beyond athletic-based injuries is another important direction. While the current methodology has shown success in TBI cases resulting from athletic concussions, applying this approach to a broader range of TBI patients would improve its clinical relevance.¹⁰⁵ Extending the framework to non-athletic injuries would support the development of more general clinical guidelines and help determine whether the same assessment features and prediction strategies remain effective across different injury mechanisms. This broader application would also motivate further research into TBI individualized treatment and recovery within the wider medical field.

Including patients with more than two clinical visits could further strengthen analysis. Additional visits provide more longitudinal clinical data, allowing models to better capture changes in patients’ symptoms over time.¹⁰⁶ With richer temporal information, ML models would be better positioned to accurately predict a TBI patient’s time to clearance and reflect the progression of recovery more reliably.

Additional features that incorporate objective methods of evaluating concussed patients should also be included. These would allow clinicians not only to diagnose an injured athlete more accurately, but also to provide personalized treatment in a timely manner. This is imperative if the injured athlete sustains the concussion before adolescence, since this is the period of major brain development.¹⁰⁷

Currently, the LOOCV method applied to the current dataset lacks external validation, temporal validation, optimism correction, and calibration assessment, which are necessary for determining generalizability.¹⁰⁸ This matters especially because the small dataset and the distribution of the two outcome classes make the models prone to overfitting. Future work will run the listed elements to evaluate clinical applicability in the concussion domain.

Running the feature selection method prior to the modeling instead of within the modeling’s cross-validation folds increases the risk for data leakage.¹⁰⁹ Future works will repeat the analysis where the variable selections are performed nested within the validation process to account for internal validity.
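The nesting described above can be sketched as follows: feature selection is rerun inside each leave-one-out fold using only the training rows, so the held-out patient never influences which features are kept. The selector (absolute difference in class means) and the nearest-centroid classifier are simple stand-ins chosen to keep the example self-contained; they are not the study's methods, and all names and data below are hypothetical.

```python
def select_top_k(X, y, k):
    """Rank features by absolute difference in class means on the given rows."""
    scores = []
    for j in range(len(X[0])):
        g0 = [row[j] for row, label in zip(X, y) if label == 0]
        g1 = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(sum(g1) / len(g1) - sum(g0) / len(g0)))
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]


def centroid_predict(X_train, y_train, x_test, cols):
    """Nearest-centroid prediction restricted to the selected columns."""
    def centroid(label):
        rows = [r for r, t in zip(X_train, y_train) if t == label]
        return [sum(r[j] for r in rows) / len(rows) for j in cols]

    def dist(c):
        return sum((x_test[j] - cj) ** 2 for j, cj in zip(cols, c))

    return 0 if dist(centroid(0)) <= dist(centroid(1)) else 1


def nested_loocv_accuracy(X, y, k):
    """LOOCV where feature selection sees only each fold's training rows."""
    correct = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        cols = select_top_k(X_train, y_train, k)  # held-out row excluded
        correct += int(centroid_predict(X_train, y_train, X[i], cols) == y[i])
    return correct / len(X)


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [1.0, 4.0], [0.5, 6.0], [4.0, 5.0], [5.0, 4.5], [4.5, 5.5]]
y = [0, 0, 0, 1, 1, 1]
accuracy = nested_loocv_accuracy(X, y, k=1)
```

Selecting features once on the full dataset and then cross-validating would let every held-out sample contribute to the selection step, which is the leakage risk noted above.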

Future work could also focus on incorporating algorithmic modifications to the existing SVC and Ridge models utilized in this study¹¹⁰ or using more nonlinear ML models. This will enable a more informed listing of clinically important assessment features given the model’s demonstration of an increase in accuracy provided longitudinal patient sports-related concussion data. From this, proper personalized protocols can be put together for athletes.

Finally, applications in digital health should be explored. Integrating ML models into technologies such as wearable devices would help translate the study’s findings into real-world clinical use. These tools could allow patients to monitor their condition outside of clinical settings, while simultaneously providing clinicians with real-time data to support decision-making.¹¹¹ Such integration would enhance continuous assessment, improve personalized treatment planning, and extend the practical impact of the proposed methodology.

Conclusion

This study is primarily comparative in scope: rather than proposing a single deployable clinical tool, it benchmarks six ML classifiers across longitudinal data to identify which model architectures and clinical features best support future development of a validated prediction tool. External validation and prospective testing are required before any model described here could be considered for clinical deployment.

Utilizing longitudinal clinical assessments to predict time to medical clearance represents a preliminary approach that offers exploratory insights into diagnosis and return-to-play decision-making. By applying ML models to longitudinal data, patterns not readily detectable by licensed medical practitioners were identified. These models achieved high predictive accuracy and highlighted specific assessment features that significantly influence clearance timelines. Collectively, these exploratory findings highlight assessments that may warrant investigation in future validation studies and suggest the potential for a longitudinal monitoring framework to inform return-to-play decisions and guidelines. The findings can be applied to future return-to-play decisions by enabling clinicians to better anticipate recovery trajectories and focus on the most informative assessments to best formulate personalized sports-related TBI rehabilitation protocols.

Footnotes

Acknowledgements

We acknowledge those at the University of South Florida (USF) Concussion Center for their assistance with data collection and all subjects who were involved in this study.

ORCID iDs

Megan Tran

Byron Moran

Nathan D. Schilaty

John Michael Templeton

Ethical considerations

This study was approved by the University of South Florida (USF) Institutional Review Board (IRB STUDY003514). This approval explicitly covers the retrospective review and analysis of patient data stored within the REDCap database.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study received funding from the Florida Department of State Center for Neuromusculoskeletal Research.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

Dataset is not publicly available as it is part of a clinical database within USF Health. However, code used for preprocessing and ML modeling is made available at .

Guarantor

Megan Tran and John M. Templeton.

Contributorship

The authors confirm contribution to the paper as follows: conception, design, manuscript preparation: M. Tran; review and editing: J. Holler, B. Moran, N. Schilaty, and J. M. Templeton. All authors reviewed and approved the final version of the manuscript.

Appendix

Table 10.

Gold standard tests summary table.

Assessment	Ref.	Admin. time	Scoring	Assessors	What it Tests	Instructions
Vestibular Ocular Motor Screening (VOMS)	5	5–10 minutes	Max score 10	Clinicians	Assesses vestibular and oculomotor symptoms, including dizziness, gaze stabilization, eye-tracking ability, and visual motion sensitivity.	Smooth Pursuits (Horizontal & Vertical): Patient seated 3 ft. from examiner, follows fingertip horizontally (±1.5 ft) and vertically (±1.5 ft), 2 repetitions each. Rate symptoms.
						Saccades (Horizontal & Vertical): Patient moves eyes quickly between two fingertips horizontally or vertically (10 reps each). Rate symptoms.
						Convergence: Patient focuses on target at arm’s length, brings it to nose. Measure distance at diplopia or eye deviation, 3 trials. Rate symptoms.
						Vestibular-Ocular Reflex (VOR) Horizontal & Vertical: Patient rotates head ±20° while focusing on target, 10 reps each at 180 bpm. Rate symptoms.
						Visual Motion Sensitivity (VMS): Patient stands, rotates head, eyes, and trunk ±80° focusing on thumb, 5 reps at 50 bpm. Rate symptoms.
Balance Error Scoring System (BESS)	4	2 minutes	Max score 60	Clinicians and Non-Clinicians	Assesses static postural stability under controlled stance conditions.	Patient performs three 20-second balance stances on various surfaces.
King Devick Test (KD)	13	2 minutes	Tally of errors	Clinicians and Non-Clinicians	Measures rapid eye movements, attention, and language function by timing number-reading performance as an index of saccadic speed and visual tracking.	Read single-digit numbers aloud from three cards, left to right, as quickly and accurately as possible. Includes one demonstration card.
Gait Initiation (GI)	14	10–15 minutes	Max score 14	Clinicians and Non-Clinicians	Evaluates gait initiation from a stationary position to assess balance control, coordination, and lower-limb motor planning.	Patient walks across room with therapist support (if needed) at usual pace, then rapid pace.
Sensory Organization Test (SOT)	15	2 minutes	Max score 100	Clinicians	Measures vertical reaction forces generated as the body’s center of gravity moves over a fixed base of support.	Patient completes three 20-second trials under varying visual (eyes open, eyes closed, sway-referenced) and surface (fixed, sway-referenced) conditions, standing shoulder-width apart.
						Stay as motionless as possible.
Tandem Gait (TG)	16	Time taken for patient to walk to end of line and back	Recorded time to complete test	Clinicians	Tests dynamic balance and coordination by having the individual walk heel-to-toe in a straight line.	Patient stands with feet together at a line, walks heel-to-toe to end line and back, 4 trials. Record fastest time.
Gait Termination (GT)	14	10–15 minutes	Max score 14	Clinicians and Non-Clinicians	Assesses anticipatory postural control and balance during gait termination.	Patient walks across room with therapist support (if needed) at usual pace, then rapid pace.
Mobile Universal Lexicon Evaluation System (MULES)	17	Duration taken to recite all images	Recorded duration to name all pictures	Clinicians and Non-Clinicians	Measures rapid picture naming to assess visual processing speed, attention, and language function.	Name pictures aloud from left to right, top to bottom, as quickly as possible without errors. Record total duration.
Head Injury Assessment Version 1 (HIA01)	18	12–17 minutes	Composite scoring includes Immediate Memory (max 30), Maddocks questions (5 items), digits backward performance, balance-error counts, symptoms (9-item checklist), clinical signs (3 items), and Delayed Memory (max 10)	Clinicians and Non-Clinicians	Structured sideline concussion assessment used in rugby, evaluating symptoms, balance, cognition, and coordination immediately after injury.	Assessment via observation, video, or instrumented mouthguard. Includes Criteria 1 indications, head acceleration data, off-field assessment, pitch-side video review, and clinical evaluation.
Pitch-Side Concussion Assessment Version 1 (PSCA1)	19	5 minutes	Symptom Checklist: Presence/absence (0/1). Maddocks Questions: Correct/incorrect (0/1). Balance: Number of errors recorded.	Clinicians and Non-Clinicians	Early version of the Head Injury Assessment (HIA1) used for sideline concussion screening.	Complete symptom checklist, Maddocks Questions, and tandem stance in medical room or agreed location. Temporary 5-min replacement allowed if Criteria 2 indications met.
Pitch-Side Concussion Assessment Version 2 (PSCA2)	19	5 minutes	Max score 132	Clinicians and Non-Clinicians	Updated pitch-side concussion assessment including refined symptom and cognitive measures to improve diagnostic sensitivity.	Same as PSCA1, updated with 5 Criteria 1 indicators including suspected loss of consciousness and obvious ataxia. 5-min temporary replacement retained.
Motor Cognitive Test Battery (MotCoTe)	20	30 minutes	Recorded reaction time	Clinicians and Non-Clinicians	Measures multilimb reaction times and tapping speed, integrating motor and cognitive demands that progress from simple to complex tasks for concussion assessment.	Reaction Time Tests: Press arrow-indicated switch as quickly as possible under six conditions (Simple, Choice, Inhibition, Conflict, Single/Double Limb). Tapping Speed Tests: Tap switches as fast as possible for 10 sec under Single/Double Limb conditions.
Sport Concussion Assessment Tool Version 2 (SCAT2)	16	10–15 minutes	Max score 100	Clinicians	Evaluates seven domains including symptoms, physical signs, Glasgow Coma Scale, Maddocks questions, cognition, balance, and coordination.	Symptom Evaluation: The participant reports the presence and severity of 22 common concussion symptoms. Physical Signs: The examiner observes and records any overt signs of concussion. Glasgow Coma Scale (GCS): Standard assessment of eye, verbal, and motor responses. Orientation (Maddocks Questions): Participants answer five standardized questions to assess orientation and memory. Cognition: Immediate memory is tested by asking participants to recall a list of five words over three trials. Concentration is evaluated using number sequences and backward recitation tasks. Delayed recall is assessed after a short interval. Balance and Coordination: Balance is tested via the modified Balance Error Scoring System (mBESS), and coordination is assessed with simple physical tasks such as finger-to-nose.
Sport Concussion Assessment Tool Version 3 (SCAT3)	21	15–25 minutes	Max score 132	Clinicians	Updated SCAT version incorporating expanded cognitive and balance assessments for tracking concussion recovery.	Symptom Evaluation: Participant reports presence and severity of 22 common concussion symptoms. Physical Signs and GCS: Examiner records observable signs of concussion; Glasgow Coma Scale assessed immediately after suspected injury. Orientation (Maddocks Questions): Five standardized questions to assess orientation and immediate memory at the time of injury. Cognition: Immediate memory assessed with three trials of a five-word list; concentration and delayed recall evaluated using standard tasks. Balance and Coordination: Modified BESS including foam surfaces; coordination assessed with simple physical tasks.
Sport Concussion Assessment Tool Version 5 (SCAT5)	22	10 minutes	Symptom Number: score out of 22. Symptom Severity: score out of 132. Orientation: score out of 5. Immediate Memory: score out of 15 (trial 1) + 30 (trials 2–3), total 45. Concentration: score out of 5.	Clinicians	Standardized concussion assessment incorporating symptom scoring, cognitive screening, balance testing, and coordination measures.	Symptom Evaluation: Participant reports presence and severity of 22 symptoms. Cognitive Assessment: Orientation (Maddocks questions), immediate memory (three trials), concentration, and delayed recall are measured. Balance and Coordination: Balance tested via modified BESS (including foam stances) and simple coordination tasks.
Immediate Post-Concussion and Cognitive Testing (ImPACT)	23	20–25 minutes	Max score 100	Clinicians	Computerized neurocognitive test assessing memory, attention, reaction time, and processing speed following concussion.	Six computerized modules assessing memory, attention, reaction time, and processing speed. Follow instructions for each module.
Standardized Assessment of Concussion (SAC)	24	5 minutes	Max score 30	Clinicians and Non-clinicians	Brief sideline cognitive assessment measuring orientation, immediate memory, concentration, and delayed recall.	Orientation (Maddocks Questions): Participant answers a set of standardized questions to assess awareness of time, place, and event. Immediate Memory: Examiner reads a list of five words; participant recalls as many as possible immediately, repeated over three trials. Concentration: Participant completes number sequence tasks backward and recites months of the year in reverse order. Delayed Recall: After a short interval, participant is asked to recall the same five words from the immediate memory task. Neurologic Function: Examiner notes any clinical signs relevant to concussion.
Post-Concussion Symptom Scale (PCSS)	25	Not specified, but noted as ‘Relatively short time to administer’	Max score 132	Clinicians and Non-clinicians	Self-report checklist quantifying the severity of common concussion symptoms such as headache, dizziness, fatigue, and irritability.	Self-report severity of each symptom using 7-point Likert scale.
Modified Balance Error Scoring System (mBESS)	26	1 minute	Max score 30	Clinicians and Non-clinicians	Abbreviated version of the BESS incorporating simplified balance tasks for rapid field-based assessment.	Perform three 20-second balance stances; count errors after starting.
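
The SCAT5 symptom scores listed in the table (Symptom Number out of 22, Symptom Severity out of 132) can be illustrated with a short sketch. It assumes each of the 22 symptoms is rated on a 0–6 scale, which is consistent with the stated maximum of 132 but is not spelled out in the table; the function name `score_symptoms` and the example ratings are hypothetical and are not taken from the study code.

```python
from typing import Sequence, Tuple

N_SYMPTOMS = 22   # symptoms on the checklist, as listed in the table
MAX_RATING = 6    # assumed 0-6 rating per symptom (22 * 6 = 132)


def score_symptoms(ratings: Sequence[int]) -> Tuple[int, int]:
    """Return (symptom number, symptom severity) for one checklist."""
    if len(ratings) != N_SYMPTOMS:
        raise ValueError(f"expected {N_SYMPTOMS} ratings, got {len(ratings)}")
    if any(not 0 <= r <= MAX_RATING for r in ratings):
        raise ValueError(f"each rating must lie between 0 and {MAX_RATING}")
    number = sum(1 for r in ratings if r > 0)  # how many symptoms are present
    severity = sum(ratings)                    # summed ratings
    return number, severity


# Hypothetical checklist: three symptoms endorsed, the rest absent.
example = [3, 2, 0, 0, 1] + [0] * 17
print(score_symptoms(example))  # (3, 6)
```

Under this assumption, a checklist with every symptom rated 6 reaches the table's maxima of 22 and 132.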

Table 11.

Feature naming mapping.

Feature (Underscore format)	Formatted feature name
bess_double_ec	Balance Error Scoring System - Double-Leg Stance with Eyes Closed
bess_single_ec	Balance Error Scoring System - Single-Leg Stance with Eyes Closed
bess_tandem_ec	Balance Error Scoring System - Tandem Stance with Eyes Closed
cerv_ext	Cervical Extension
cerv_flex	Cervical Flexion
hx_mood_disorder	History of Mood Disorder
import_gad7_score	Generalized Anxiety Disorder-7 Score
import_phq9_score	Patient Health Questionnaire-9 Score
l_cerv_rot	Left Cervical Rotation
l_lat_flex	Left Lateral Flexion
npc_dizziness	Near Point of Convergence Test - Dizziness
npc_fogginess	Near Point of Convergence Test - Fogginess
npc_headache	Near Point of Convergence Test - Headache
npc_measure	Near Point of Convergence Test - Measurement
npc_nausea	Near Point of Convergence Test - Nausea
prev_head_injury	Presence of a Previous Head Injury
r_cerv_rot	Right Cervical Rotation
r_lat_flex	Right Lateral Flexion
saccades_horiz_dizziness	Saccades Horizontal Dizziness
saccades_horiz_fogginess	Saccades Horizontal Fogginess
saccades_horiz_headache	Saccades Horizontal Headache
saccades_horiz_nausea	Saccades Horizontal Nausea
saccades_vert_dizziness	Saccades Vertical Dizziness
saccades_vert_fogginess	Saccades Vertical Fogginess
saccades_vert_headache	Saccades Vertical Headache
saccades_vert_nausea	Saccades Vertical Nausea
smoothpursuits_dizziness	Smooth Pursuits Dizziness
smoothpursuits_fogginess	Smooth Pursuits Fogginess
smoothpursuits_headache	Smooth Pursuits Headache
smoothpursuits_nausea	Smooth Pursuits Nausea
subocc_ext	Suboccipital Extension
subocc_flex	Suboccipital Flexion
vmst_dizziness	Visual Motion Sensitivity Test - Dizziness
vmst_fogginess	Visual Motion Sensitivity Test - Fogginess
vmst_headache	Visual Motion Sensitivity Test - Headache
vmst_nausea	Visual Motion Sensitivity Test - Nausea
voms_dizziness	Vestibular Ocular Motor Screening - Dizziness
voms_fogginess	Vestibular Ocular Motor Screening - Fogginess
voms_headache	Vestibular Ocular Motor Screening - Headache
voms_nausea	Vestibular Ocular Motor Screening - Nausea
vor_horiz_dizziness	Vestibulo-Ocular Reflex - Horizontal Dizziness
vor_horiz_fogginess	Vestibulo-Ocular Reflex - Horizontal Fogginess
vor_horiz_headache	Vestibulo-Ocular Reflex - Horizontal Headache
vor_horiz_nausea	Vestibulo-Ocular Reflex - Horizontal Nausea
vor_vert_dizziness	Vestibulo-Ocular Reflex - Vertical Dizziness
vor_vert_fogginess	Vestibulo-Ocular Reflex - Vertical Fogginess
vor_vert_headache	Vestibulo-Ocular Reflex - Vertical Headache
vor_vert_nausea	Vestibulo-Ocular Reflex - Vertical Nausea
Treatment_present	Presence of Administered Treatment (pharmacological treatments such as selective serotonin reuptake inhibitors, amantadine, or stimulants, and/or non-pharmacological treatments such as physical therapy, chiropractic, psychological, neuropsychological, neurology, and cognitive therapies)
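
A mapping like Table 11 is typically applied as a simple lookup when labelling plots or tables. The sketch below shows one way this might be done, using four entries copied from the table; the names `FEATURE_LABELS` and `format_feature` are hypothetical, and the fallback behaviour for unmapped columns is an assumption rather than the study's documented approach.

```python
# Subset of Table 11, copied verbatim; extend with the remaining rows as needed.
FEATURE_LABELS = {
    "bess_double_ec": "Balance Error Scoring System - Double-Leg Stance with Eyes Closed",
    "import_phq9_score": "Patient Health Questionnaire-9 Score",
    "npc_measure": "Near Point of Convergence Test - Measurement",
    "vor_vert_headache": "Vestibulo-Ocular Reflex - Vertical Headache",
}


def format_feature(name: str) -> str:
    """Return the formatted label, or the raw name if no mapping exists."""
    return FEATURE_LABELS.get(name, name)


print(format_feature("vor_vert_headache"))
print(format_feature("unmapped_column"))  # falls back to the raw name
```

Returning the raw name for unmapped columns keeps downstream plotting code from failing when a new feature is added before the table is updated.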

Figure 7.

Confusion matrix – Visit 1 vs Visit 2.

Table 12.

TRIPOD+AI checklist.

Item	Dev/Eval	Checklist item	Reported on page	Notes from manuscript
TITLE
1	D;E	Identify the study as developing or evaluating the performance of a multivariable prediction model, the target population, and the outcome to be predicted	p. 1 (Title)	Title states: ‘Predicting Time to Clearance of Sport-Related Concussions Using Machine Learning’. Identifies target population (athletes with SRC) and outcome (time to medical clearance).
ABSTRACT
2	D;E	See TRIPOD+AI for Abstracts checklist	p. 1 (Abstract)	Abstract reports objective, methods (217 athletes, 6 ML classifiers, LOOCV), results (XGBoost 0.84 accuracy), and conclusions including external validation caveat.
INTRODUCTION – Background
3a	D;E	Explain the healthcare context and rationale for developing or evaluating the prediction model, including references to existing models	pp. 1–3	Introduction describes rising SRC rates, clinical assessment limitations, and prior ML studies (Bergeron et al., Chu et al., Thomas & Arnett) that motivate the present work.
3b	D;E	Describe the target population and intended purpose of the prediction model in the context of the care pathway, including its intended users	p. 2	Explicitly states (TRIPOD+AI Item 3b): target population = athletes with SRC presenting to sports medicine or concussion clinic; intended users = licensed clinicians (sports medicine physicians, neurologists, athletic trainers) to supplement clinical judgment.
3c	D;E	Describe any known health inequalities between sociodemographic groups	p. 18	Discusses gender imbalance in dataset (64.2% female in Class 1); cites literature that females take longer to recover from SRC than males. Noted as limitation (TRIPOD+AI Item 3c).
INTRODUCTION – Objectives
4	D;E	Specify the study objectives, including whether the study describes the development or validation of a prediction model (or both)	pp. 1–2	Objectives state: (1) evaluate whether longitudinal data improves ML accuracy; (2) identify features most strongly associated with prolonged vs. normal recovery. Study is model development with internal validation only; explicitly states external validation is required.
METHODS – Data
5a	D;E	Describe the sources of data separately for the development and evaluation datasets, the rationale for using these data, and representativeness of the data	p. 3	Data from USF Concussion Center via REDCap (2021–2025). Single-site retrospective cohort. No separate evaluation dataset; internal validation via LOOCV. Rationale for data source described.
5b	D;E	Specify the dates of the collected participant data, including start and end of participant accrual; and, if applicable, end of follow-up	p. 3	Multi-visit data collected 2021–2025; original database spans 2017–2026. Clearance date used as end of follow-up per patient.
METHODS – Participants
6a	D;E	Specify key elements of the study setting including the number and location of centres	p. 3	Single centre: USF Concussion Center, University of South Florida. Secondary care/concussion specialty clinic setting.
6b	D;E	Describe the eligibility criteria for study participants	pp. 3–4	Inclusion: sports-related concussion diagnosis, ≥ 2 clinical visits, first visit within 0–365 days of injury, clearance ≥ 1 day. Exclusion: non-sports mechanisms (MVA, falls, other), missing data exceeding thresholds.
6c	D;E	Give details of any treatments received, and how they were handled during model development or evaluation, if relevant	pp. 3–4	Treatment types detailed (11 categories, pharmacological and non-pharmacological). ‘Treatment present’ binary variable included only in Visit 2 feature set, as treatment was not administered until after Visit 1 data collection.
METHODS – Data Preparation
7	D;E	Describe any data pre-processing and quality checking, including whether this was similar across relevant sociodemographic groups	pp. 3–4	Multi-step iterative cleaning: column missingness threshold 0.9, row missingness 0.8, step 0.1. Outlier filtering for clinically relevant timelines. No imputation used. Figure 1 shows preprocessing flowchart. Sociodemographic-stratified preprocessing not reported.
METHODS – Outcome
8a	D;E	Clearly define the outcome that is being predicted and the time horizon, including how and when assessed, the rationale for choosing this outcome, and whether the method of outcome assessment is consistent across sociodemographic groups	pp. 5–6	Outcome: binary classification of time to medical clearance (< 30 days = ‘normal’; ≥ 30 days = ‘prolonged’). Threshold rationale: general clinical recovery timeframe for SRC. Clearance determined by experienced concussion physicians assessing symptoms and functional measures (VOMS, BESS, CNS vital signs, return to school). (TRIPOD+AI Item 8a)
8b	D;E	If outcome assessment requires subjective interpretation, describe the qualifications and demographic characteristics of the outcome assessors	p. 3	Clearance determined by physicians experienced in diagnosis and management of concussions. Demographic characteristics of assessors not reported.
8c	D;E	Report any actions to blind assessment of the outcome to be predicted	N/A	Not reported. Retrospective study design; blinding of outcome assessors not described.
METHODS – Predictors
9a	D	Describe the choice of initial predictors and any pre-selection of predictors before model building	pp. 3–4	Predictors retained from preprocessing based on missingness thresholds and clinical relevance. Feature inclusion of ‘prior head injury’, ‘history of mood disorders’ (Visit 1) and ‘treatment presence’ (Visit 2) explicitly justified. No formal statistical pre-selection.
9b	D;E	Clearly define all predictors, including how and when they were measured	pp. 3–4, Appendix Tables 10 and 11	All predictors defined with feature naming mapping (Table 11). Assessment tools described in Appendix Table 10 (VOMS, BESS, ImPACT, etc.) with administration time, scoring, and assessors. Visit 1 vs Visit 2 collection timing specified.
9c	D;E	If predictor measurement requires subjective interpretation, describe the qualifications and demographic characteristics of the predictor assessors	Appendix Table 10	Appendix Table 10 lists assessors (Clinicians vs. Clinicians and Non-Clinicians) per assessment. Demographic characteristics of assessors not reported.
METHODS – Sample Size
10	D;E	Explain how the study size was arrived at and justify that the study size was sufficient to answer the research question	pp. 3–4	Final N=217 after preprocessing (from 3,038). LOOCV selected due to small dataset size. Event-per-variable (EPV) ratio reported: 0.84 (Visit 1), 0.43 (Visit 2). No formal a priori sample size calculation; small sample acknowledged as primary limitation.
METHODS – Missing Data
11	D;E	Describe how missing data were handled. Provide reasons for omitting any data	pp. 3–4	No imputation used; rationale given (imputation in medical data leads to bias; no optimal solutions exist). Iterative row/column removal based on missingness thresholds. Final dataset contains no null values.
METHODS – Analytical Methods
12a	D	Describe how the data were used in the analysis, including whether the data were partitioned, considering any sample size requirements	pp. 5–6	No train/test partition for final LOOCV evaluation. Hyperparameter tuning used 80:20 split prior to LOOCV. LOOCV justified by small dataset size.
12b	D	Describe how predictors were handled in the analyses (functional form, rescaling, transformation, or standardisation)	pp. 4–6	Binary features coded 0/1. Difference features engineered (Visit 2 - Visit 1). Continuous features (PHQ-9, GAD-7, cervical range of motion) used as-is. No explicit standardization/normalization reported.
12c	D	Specify the type of model, rationale, all model-building steps, including any hyperparameter tuning, and method for internal validation	pp. 4–7	Six models: LightGBM, Decision Tree, Random Forest, XGBoost, SVC, Ridge regression. Mathematical formulations provided (Equations (1)–(9)). Hyperparameter tuning via randomized search (n_iter=50) for all except Ridge (k-fold CV over alpha candidates). Internal validation: LOOCV.
12d	D;E	Describe if and how any heterogeneity in estimates of model parameter values and model performance was handled across clusters	N/A	Single-centre study; no clustering or multi-site analysis. Not applicable.
12e	D;E	Specify all measures and plots used to evaluate model performance	pp. 7–8	Metrics: accuracy, balanced accuracy, precision, recall, F1, specificity, MCC, Brier score. Bootstrap 95% CI (1,000 resamples) reported for all metrics. Confusion matrices (Figures 5–7), average effect plots (Figures 2–4). No decision curve analysis (exploratory scope). (TRIPOD+AI Item 12e)
12f	E	Describe any model updating arising from the model evaluation	N/A	No external validation performed; model updating not applicable in this development study.
12g	E	For model evaluation, describe how the model predictions were calculated	p. 4	Model prediction calculations expressed via Equations (1)–(9); code available at GitHub repository (TRIPOD+AI Item 12g, 22).
METHODS – Class Imbalance
13	D;E	If class imbalance methods were used, state why and how this was done, and any subsequent methods to recalibrate the model or predictions	p. 6	Class imbalance: 176 (81.1%) prolonged vs. 41 (18.9%) normal recovery. No SMOTE/resampling (rationale: synthetic data misrepresents clinical distribution). Class weighting applied to penalize minority class misclassification. Effect reflected in near-zero specificity; discussed as primary limitation. (TRIPOD+AI Item 13)
METHODS – Fairness
14	D;E	Describe any approaches that were used to address model fairness and their rationale	p. 18	Gender imbalance acknowledged (63.1% female). No formal fairness-aware algorithms implemented. Discusses potential female-biased predictions and cites literature on sex differences in SRC recovery. Identified as limitation requiring future work.
METHODS – Model Output
15	D	Specify the output of the prediction model. Provide details and rationale for any classification and how thresholds were identified	pp. 5, 7	Output: binary class labels (0 = ‘normal’ recovery < 30 days; 1 = ‘prolonged’ recovery ≥ 30 days). Threshold 0.5 for all probabilistic models; ROC-based threshold optimization not performed (exploratory scope). (TRIPOD+AI Item 15)
METHODS – Training vs. Evaluation
16	D;E	Identify any differences between the development and evaluation data in healthcare setting, eligibility criteria, outcome, and predictors	N/A	Internal validation only (LOOCV on same dataset). No separate external evaluation dataset. Difference between Visit 1 and Visit 2 feature sets described (pp. 4–5).
METHODS – Ethical Approval
17	D;E	Name the institutional research board or ethics committee that approved the study and describe participant-informed consent or ethics committee waiver	p. 20	IRB approval: USF STUDY003514, University of South Florida Institutional Review Board. Explicitly covers retrospective review and analysis of patient data in REDCap database.
OPEN SCIENCE
18a	D;E	Give the source of funding and the role of the funders for the present study	p. 20	Funded by the Florida Department of State Center for Neuromusculoskeletal Research. Role of funders not explicitly described.
18b	D;E	Declare any conflicts of interest and financial disclosures for all authors	p. 20	All authors declare no conflicts of interest in the authorship nor publication of this contribution.
18c	D;E	Indicate where the study protocol can be accessed or state that a protocol was not prepared	p. 1	GitHub code repository included
18d	D;E	Provide registration information for the study, including register name and registration number, or state that the study was not registered	p. 20	In section “Ethical approval” - IRB STUDY003514
18e	D;E	Provide details of the availability of the study data	p. 20	Dataset not publicly available (part of clinical database within USF Health). Statement provided in Data Availability section.
18f	D;E	Provide details of the availability of the analytical code	pp. 1, 17, 20	Code available at: https://github.com/MeganTran6023/Sport-Related-Concussions_Machine-Learning
PATIENT & PUBLIC INVOLVEMENT
19	D;E	Provide details of any patient and public involvement during the design, conduct, reporting, interpretation, or dissemination of the study or state no involvement	p. 6	Explicitly stated: ‘No patients or members of the public were involved in the design, conduct, reporting, or dissemination plans of this research.’
RESULTS – Participants
20a	D;E	Describe the flow of participants through the study, including the number of participants with and without the outcome and, if appli-cable, a summary of the follow-up time	pp. 3–4	Figure 1 (Data Preprocessing Flowchart) shows participant flow: 3,038 → 2,338 → 1,865 → 1,201 (2 visits) → 217 (sports-related). Table 5 shows outcome group breakdown: 41 normal (Class 0), 176 prolonged (Class 1). Mean days to clearance reported per group.
20b	D;E	Report the characteristics overall and, where applicable, for each data source or setting, including key dates, key predictors, treatments received, sample size, number of outcome events, follow-up time, and amount of missing data	pp. 3–4, Tables 2 and 5	Table 5 reports characteristics by outcome group (sex, treatment, days to clearance, days from injury to first visit). Table 2 reports treatment counts. Demographics: 80 male (36.9%), 137 female (63.1%), mean age 26.94 years.
20c	E	For model evaluation, show a comparison with the development data of the distribution of important predictors	N/A	Internal validation only; no separate evaluation dataset to compare against.
RESULTS – Model Development
21	D;E	Specify the number of participants and outcome events in each analysis	pp. 3, 7	N=217 total; 41 Class 0, 176 Class 1 for both Visit 1 and Visit 2 analyses. LOOCV uses N-1 samples per fold.
RESULTS – Model Specification
22	D	Provide details of the full prediction model to allow predictions in new individuals and to enable third-party evaluation and implementation	pp. 4–7, GitHub	Mathematical formulations for all 6 models provided (Equations (1)–(9)). Code and model objects available at GitHub repository. (TRIPOD+AI Item 22)
RESULTS – Model Performance
23a	D;E	Report model performance estimates with confidence intervals, including for any key subgroups	pp. 7–8, Tables 1–4	Tables 1 and 3 report accuracy, balanced accuracy, precision, recall, F1, specificity, MCC, Brier score for all 6 models at both visits. Table 4 reports statistical significance of accuracy gains across visits. No subgroup analysis by demographics.
23b	D;E	If examined, report results of any heterogeneity in model performance across clusters	N/A	Single-centre study; no cluster analysis performed.
RESULTS – Model Updating
24	E	Report the results from any model updating, including the updated model and subsequent performance	N/A	No external validation or model updating performed.
DISCUSSION – Interpretation
25	D;E	Give an overall interpretation of the main results, including issues of fairness in the context of the objectives and previous studies	pp. 14–17	Discussion interprets accuracy gains, feature importance findings (VOR Vertical Headache, treatment presence), and compares to prior studies. Gender imbalance and potential bias toward female population discussed as fairness concern.
DISCUSSION – Limitations
26	D;E	Discuss any limitations of the study and their effects on any biases, statistical uncertainty, and generalizability	pp. 17–19	Extensive limitations section: small/imbalanced dataset, low specificity from class imbalance, gender imbalance, self-reported features, linear model limitations, binary outcome oversimplification, no neurocognitive data, LOOCV distributional bias, overfitting risk, lack of external validation, feature selection leakage risk.
DISCUSSION – Usability
27a	D	Describe how poor quality or unavailable input data should be assessed and handled when implementing the prediction model	p. 17	States that poor-quality or missing VOMS subscores should prompt clinical judgment rather than model reliance, as model was not trained on imputed data. (TRIPOD+AI Item 27a)
27b	D	Specify whether users will be required to interact in the handling of the input data or use of the model, and what level of expertise is required	p. 17	Intended users are licensed clinicians experienced in concussion management. Input requires structured assessment data at 1–2 time points from standardized assessments already routinely administered. No specialized computing expertise required beyond use of a provided interface. (TRIPOD+AI Item 27b)
27c	D;E	Discuss any next steps for future research, with a specific view to applicability and generalizability of the model	pp. 19–20	Future work: larger/balanced dataset, non-athletic TBI populations, 3+ visit longitudinal data, objective biomarkers, external/temporal validation, optimism correction, nested feature selection, wearable integration, nonlinear modifications to SVC/Ridge.
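
The link between class imbalance and low specificity noted under Items 12e and 13 can be made concrete with a small worked example. The sketch below uses the class counts reported in the checklist (176 prolonged, 41 normal) and a hypothetical classifier that labels every athlete as ‘prolonged’; this baseline is an illustration only and is not one of the six models in the study. The function name `binary_metrics` is likewise an assumption.

```python
def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Basic metrics from confusion-matrix counts (positive class = prolonged).

    Assumes both classes are present, so no denominator is zero.
    """
    sensitivity = tp / (tp + fn)   # recall on the prolonged class
    specificity = tn / (tn + fp)   # recall on the normal class
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical majority-class baseline: all 217 athletes predicted 'prolonged'.
baseline = binary_metrics(tp=176, fn=0, tn=0, fp=41)
print(f"accuracy          = {baseline['accuracy']:.3f}")           # 0.811
print(f"specificity       = {baseline['specificity']:.3f}")        # 0.000
print(f"balanced accuracy = {baseline['balanced_accuracy']:.3f}")  # 0.500
```

Because such a baseline already reaches an accuracy of about 0.81 with zero specificity, accuracy alone says little about performance on the minority class, which is why balanced accuracy, specificity, and MCC are reported alongside it.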

References

Hallock

Mantwill

Vajkoczy

, et al. Sport-Related Concussion: A Cognitive Perspective. Neurology. Clinical practice 2023; 13(2): e200123. https://doi.org/10.1212/CPJ.0000000000200123

Hootman

Dick

Agel

. Epidemiology of collegiate injuries for 15 sports: summary and recommendations for injury prevention initiatives. Journal of athletic training 2007; 42(2): 311–319.

Zuckerman

Kerr

Yengo-Kahn

, et al. Epidemiology of sports-related concussion in ncaa athletes from 2009-2010 to 2013-2014: incidence, recurrence, and mechanisms. The American journal of sports medicine 2015; 43(11): 2654–2662. https://doi.org/10.1177/0363546515599634

Guskiewicz

. University of North Carolina Sports Medicine Research Laboratory. Balance Error Scoring System (BESS). Clinical assessment manual for static postural stability following mild head injury.

Mucha

Collins

Elbin

, et al. A brief vestibular/ocular motor screening (voms) assessment to evaluate concussions: preliminary findings. The American journal of sports medicine 2014; 42(10): 2479–2486. https://doi.org/10.1177/0363546514543775

Daly

Pearce

Finnegan

, et al. An assessment of current concussion identification and diagnosis methods in sports settings: a systematic review. BMC Sports Science, Medicine and Rehabilitation 2022; 14(1): 125. https://doi.org/10.1186/s13102-022-00514-1

Taylor

Cameron

DeMatteo

. Examining how time from sport-related concussion to initial assessment predicts return-to-play clearance. The Physician and Sportsmedicine 2022; 50(2): 132–140.

Khalili

Rismani

Ali Nematollahi

, et al. Prognosis prediction in traumatic brain injury patients using machine learning algorithms. Scientific reports 2023; 13(1): 960. https://doi.org/10.1038/s41598-023-28188-w

Cascarano

Mur-Petit

Hernandez-Gonzalez

, et al. Machine and deep learning for longitudinal biomedical data: a review of methods and applications. Artificial Intelligence Review 2023; 56(Suppl 2): 1711–1771. https://doi.org/10.1007/s10462-023-10561-w

10.

Lovell

. Impact test administration and interpretation manual. Impact applications. Inc, 2015.

11.

Khan

Talley

. Beyond the hit: The hidden costs of repetitive head trauma. Neuroscience Insights 2025; 20: 26331055251316315. https://doi.org/10.1177/26331055251316315

12.

Collins

Ofa

Miskimin

, et al. Cognitive deficits following concussion: A systematic review. Journal of Orthopaedic Experience & Innovation 2023; 4(1): https://doi.org/10.60118/001c.68393

13.

Echemendia

Broglio

Davis

, et al. What tests and measures should be added to the scat3 and related tests to improve their reliability, sensitivity and/or specificity in sideline concussion diagnosis? a systematic review. British journal of sports medicine 2017; 51(11): 895–901. https://doi.org/10.1136/bjsports-2016-097466

14.

Buckley

Munkasy

Clouse

. Sensitivity and specificity of the modified balance error scoring system in concussed collegiate student athletes. Clinical journal of sport medicine 2018; 28(2): 174–176. https://doi.org/10.1097/JSM.0000000000000426

15.

ImPACT Applications, Inc . ImPACT: Administration and Interpretation Manual. ImPACT Applications, Inc, 2016.

16.

Schatz

Pardini

Lovell

, et al. Sensitivity and specificity of the impact test battery for concussion in athletes. Archives of clinical neuropsychology 2006; 21(1): 91–99. https://doi.org/10.1016/j.acn.2005.08.001

17.

Van Kampen

Lovell

Pardini

, et al. The “value added” of neurocognitive testing after sports-related concussion. The American journal of sports medicine 2006; 34(10): 1630–1635. https://doi.org/10.1177/0363546506288677

18.

Lovell

Iverson

Collins

, et al. Measurement of symptoms following sports-related concussion: reliability and normative data for the post-concussion scale. Applied neuropsychology 2006; 13(3): 166–174. https://doi.org/10.1207/s15324826an1303_4

19.

Langevin

Frémont

Fait

, et al. Responsiveness of the post-concussion symptom scale to monitor clinical recovery after concussion or mild traumatic brain injury. Orthopaedic journal of sports medicine 2022; 10(10): 23259671221127049. https://doi.org/10.1177/23259671221127049

20.

Alsalaheen

Almeida

, et al. Factor structure for the sport concussion assessment tool symptom scale in adolescents after concussion. Clinical journal of sport medicine 2022; 32(4): 400–407. https://doi.org/10.1097/JSM.0000000000000959

21.

Duhaime

A-C

Beckwith

Maerlender

, et al. Spectrum of acute clinical characteristics of diagnosed concussions in college athletes wearing instrumented helmets. Journal of neurosurgery 2012; 117(6): 1092–1099. https://doi.org/10.3171/2012.8.JNS112298

22.

Hoffman

Lucas

Dikmen

, et al. Natural history of headache after traumatic brain injury. Journal of neurotrauma 2011; 28(9): 1719–1725. https://doi.org/10.1089/neu.2011.1914

23.

Keatley

Bechtold

Psoter

, et al. Longitudinal trajectories of post-concussive symptoms following mild traumatic brain injury. Brain Injury 2023; 37(8): 737–745. https://doi.org/10.1080/02699052.2023.2172612

24.

Michael Templeton

Poellabauer

Schneider

. Classification of parkinson’s disease and its stages using machine learning. Scientific reports 2022; 12(1): 14036. https://doi.org/10.1038/s41598-022-18015-z

25.

Michael Templeton

Poellabauer

Schneider

. Enhancement of neurocognitive assessments using smartphone capabilities: Systematic review. JMIR mHealth and uHealth 2020; 8(6): e15517. https://doi.org/10.2196/15517

26.

Gómez-Río

Caballero

Saez

JMG

, et al. Diagnosis of neurodegenerative diseases: the clinical approach. Current Alzheimer research 2016; 13(5): 469–474. https://doi.org/10.2174/1567205013666151116141603

27.

Peters

Schnell

Saugstad

, et al. Longitudinal course of traumatic brain injury biomarkers for the prediction of clinical outcomes: a review. Journal of neurotrauma 2021; 38(18): 2490-2501.

28.

Michael Templeton

Poellabauer

Schneider

. Design of a neurocognitive digital health system (ndhs) for neurodegenerative diseases. In: Proceedings of the 2021 Workshop on Future of Digital Biomarkers, 2021, pp. 26–33.

29.

Sharma

Singh

Tripathi

, et al. Machine learning in medical diagnosis and treatment planning. In: 2024 1st International Conference on Advances in Computing, Communication and Networking (ICAC2N). IEEE, 2024, pp. 1554–1559.

30.

José de Antunes e Sousa

Afonso Sá

Gomes

MASM

, et al. Deep learning-based classification of temporal stages of at8-labeled tau pathology after experimental traumatic brain injury. Neuroinformatics 2026; 24(1): 7. https://doi.org/10.1007/s12021-025-09763-0

31.

Bergeron

Landset

Maugans

, et al. Machine learning in modeling high school sport concussion symptom resolve. Medicine & Science in Sports & Exercise 2019; 51(7): 1362–1371. https://doi.org/10.1249/MSS.0000000000001903

32.

Chu

Knell

Brayton

, et al. Machine learning to predict sports-related concussion recovery using clinical data. Annals of physical and rehabilitation medicine 2022; 65(4): 101626. https://doi.org/10.1016/j.rehab.2021.101626

33.

Thomas

Arnett

. Get your brain in the game: Using machine learning to predict recovery timelines following sports-related concussion. Archives of Clinical Neuropsychology 2025; 40(8): 1533–1545. https://doi.org/10.1093/arclin/acaf066

34.

Collins

Moons

KGM

Dhiman

, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 2024; 385: e078378. https://doi.org/10.1136/bmj-2023-078378

35.

Golding

Giles Gillingham

Perera

NKP

. The prevalence of depressive symptoms in high-performance athletes: a systematic review. The Physician and Sportsmedicine 2020; 48(3): 247–258. https://doi.org/10.1080/00913847.2020.1713708

36.

Beidler

Bretzin

Schmitt

. A-09 youth athlete concussion-related anxiety perceptions. Archives of Clinical Neuropsychology 2023; 38(5): 811. https://doi.org/10.1093/arclin/acad042.09

37.

Ryan

Ozturk

Fearnley

, et al. Does not impute! performance and ethical implications of missing data for an ai-based diabetes co-morbidity predictor. In: International Conference on Computer Safety, Reliability, and Security. Springer, 2025, pp. 511–523.

38.

Chen

Cummings

. To impute or not to impute: How machine learning modelers treat missing data. arXiv preprint arXiv:2503.16644, 2025.

39.

Shang

Hao

. A LightGBM-based pricing method for healthcare data. Procedia Computer Science 2025; 266: 1102–1108. https://doi.org/10.1016/j.procs.2025.08.136

40.

PPY

Toohey

, et al. Next generation models for subsequent sports injuries. Applied Stochastic Models in Business and Industry 2025; 41(4): e70034. https://doi.org/10.1002/asmb.70034

41.

Belle

Papantonis

. Principles and practice of explainable machine learning, 2020.

42.

Dai

. Research on SVM improved algorithm for large data classification. In: 2018 IEEE 3rd International Conference on Big Data Analysis (ICBDA). IEEE, 2018, pp. 181–185.

43.

Dai

Yang

Qin

, et al. Physical layer authentication algorithm based on SVM. In: 2016 2nd IEEE International Conference on Computer and Communications (ICCC). 2016, pp. 1597–1601.

44.

Chen

Guestrin

. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, pp. 785–794.

45.

McDonald

. Ridge regression. Wiley Interdisciplinary Reviews: Computational Statistics 2009; 1(1): 93–100. https://doi.org/10.1002/wics.14

46.

Iverson

Gardner

Douglas

, et al. Predictors of clinical recovery from concussion: a systematic review. British journal of sports medicine 2017; 51(12): 941–948. https://doi.org/10.1136/bjsports-2017-097729

47.

Sreedharan

Prajapati

Engineer

, et al. Leave-one-out cross-validation in machine learning. In: Ethical Issues in AI for Bioinformatics and Chemoinformatics. CRC Press, 2023, pp. 56–71.

48.

Linden

. Looclass: Stata module for generating classification statistics of leave-one-out cross-validation for binary outcomes. Statistical Software Components, 2015.

49.

Pedregosa

Varoquaux

Gramfort

, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 2011; 12: 2825–2830.

50.

Rokem

Kay

. Fractional ridge regression: a fast, interpretable reparameterization of ridge regression. GigaScience 2020; 9(12): giaa133. https://doi.org/10.1093/gigascience/giaa133

51.

Serra

Piersanti

Mastro

, et al. Synthetic data generation for addressing class imbalance in medical datasets: A case study on mitral regurgitation post-neochord procedure. In: 2025 IEEE International Conference on Metrology for eXtended Reality, Artificial Intelligence and Neural Engineering (MetroXRAINE). IEEE, 2025, pp. 907–912.

52.

Boulesteix

A-L

Bender

Bermejo

, et al. Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations. Briefings in Bioinformatics 2012; 13(3): 292–304. https://doi.org/10.1093/bib/bbr053

53.

Tangirala

. Evaluating the impact of Gini index and information gain on classification using decision tree classifier algorithm. International Journal of Advanced Computer Science and Applications 2020; 11(2): 612–619. https://doi.org/10.14569/ijacsa.2020.0110277

54.

MGK

Finley

Wang

, et al. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst 2017; 30.

55.

Gupta Mudiyanur

Sravya Popuri

Vallamkonda

. Machine learning-based stroke prediction with efficient feature importance analysis. International Journal of Electrical and Computer Engineering Research 2025; 5(3): 1–6. https://doi.org/10.53375/ijecer.2025.462

56.

Liu

Han

, et al. Interpretable machine learning to identify important predictors of birth weight: A prospective cohort study. Frontiers in Pediatrics 2022; 10: 899954. https://doi.org/10.3389/fped.2022.899954

57.

Poolsawad

Kambhampati

Cleland

JGF

. Balancing class for performance of classification with a clinical dataset. Proceedings of the World Congress on Engineering 2014; 1: 1–6.

58.

Prakash Lohani

Thirunavukkarasan

. A review: Application of machine learning algorithm in medical diagnosis. In: 2021 International Conference on Technological Advancements and Innovations (ICTAI). 2021, pp. 378–381.

59.

Verdonck

Baesens

Óskarsdóttir

, et al. Special issue on feature engineering editorial. Machine learning 2024; 113(7): 3917–3928. https://doi.org/10.1007/s10994-021-06042-2

60.

Marc

PDF

Silverberg

Kirkwood

, et al. Prolonged activity restriction after concussion. Clinical Pediatrics 2016; 55: 443–451.

61.

Domor Mienye

Sun

. A survey of ensemble learning: Concepts, algorithms, applications, and prospects. IEEE Access 2022; 10: 99129–99149. https://doi.org/10.1109/access.2022.3207287

62.

Jacob

Hall

Bliss

, et al. A review: Application of machine learning algorithm in medical diagnosis. Medical Engineering & Physics 2025; 104402.

63.

Klassen

Kim

Liu

. Empirical study of support vector machine kernels with applications to microarray data. In: International Conference on Computers and Their Applications. 2010.

64.

Zhang

Lan

Wang

, et al. Scaling up kernel svm on limited resources: A low-rank linearization approach. In: Artificial intelligence and statistics. PMLR, 2012, pp. 1425–1434.

65.

Fan

Deng

Qiu

, et al. Well logging curve reconstruction based on kernel ridge regression. Arabian Journal of Geosciences 2021; 14(16): 1559. https://doi.org/10.1007/s12517-021-07792-y

66.

Boshra

Dhindsa

Boursalie

, et al. From group-level statistics to single-subject prediction: Machine learning detection of concussion in retired athletes. IEEE Transactions on Neural Systems and Rehabilitation Engineering 2019; 27(7): 1492–1501. https://doi.org/10.1109/TNSRE.2019.2922553

67.

Czerniak

Garcia

G-GP

Genthe

, et al. Prediction of symptom burden, cognitive status, and risk of psychological distress in NCAA athletes with sport-related concussion(s): findings from the NCAA-DoD CARE Consortium. Annals of Biomedical Engineering 2025; 53(11): 3172–3189. https://doi.org/10.1007/s10439-025-03824-w

68.

Lai

Cai

Tan

. Many faces of feature importance: Comparing built-in and post-hoc feature importance in text classification. ArXiv, abs/1910.08534, 2019.

69.

Gifi

. Nonlinear multivariate analysis. Vol. 1. Wiley, 1990.

70.

Kontos

Eagle

Marchetti

, et al. Discriminative validity of vestibular ocular motor screening in identifying concussion among collegiate athletes: A national collegiate athletic association-department of defense concussion assessment, research, and education consortium study. The American Journal of Sports Medicine 2021; 49(8): 2211–2217. https://doi.org/10.1177/03635465211012359

71.

Morgan

McAllister-Deitrick

Marchetti

, et al. Risk factors for vestibular and oculomotor outcomes after sport-related concussion. Clinical Journal of Sport Medicine 2021; 31(4): e193–e199.

72.

Han

. Vestibular rehabilitation therapy: review of indications, mechanisms, and key exercises. Simplified vestibular rehabilitation therapy 2021; 1–16.

73.

Cernich

Kurtz

Mordecai

, et al. Cognitive rehabilitation in traumatic brain injury. Current Treatment Options in Neurology 2010; 12(5): 412–423. https://doi.org/10.1007/s11940-010-0085-6

74.

Kakavas

Tsaklis

. 8.8 Neck strengthening versus visual tracking speed rehabilitation following sports related concussion (SRC) in women footballers: a randomized controlled trial. Second Round Abstract Submissions 2024; A127–A127.

75.

Wallace

Lifshitz

. Traumatic brain injury and vestibulo-ocular function: current challenges and future prospects. Eye and Brain 2016; 8: 153–164. https://doi.org/10.2147/EB.S82670

76.

Al-Sharif

Roehm

Logan Lindemann

, et al. Visual-vestibular mismatch correlates with headache. Journal of Vestibular Research 2021; 31: 173–180. https://doi.org/10.3233/VES-201539

77.

Tsang

Marcus

Paine

, et al. Tp1-9 vestibular dysfunction in acute traumatic brain injury. Journal of Neurology, Neurosurgery and Psychiatry 2019; 90(3): e12. https://doi.org/10.1136/jnnp-2019-abn.38

78.

Kline

Jacob

Radabaugh

, et al. Combination therapies for neurobehavioral and cognitive recovery after experimental traumatic brain injury: is more better? Progress in neurobiology 2016; 142: 45–67. https://doi.org/10.1016/j.pneurobio.2016.05.002

79.

Hope

Vashisth

Parker

, et al. Phybrata sensors and machine learning for enhanced neurophysiological diagnosis and treatment. Sensors 2021; 21(21): 7417. https://doi.org/10.3390/s21217417

80.

Yates

, et al. Developing a multivariate model for the prediction of concussion recovery in sportspeople: a machine learning approach. BMJ Open Sport & Exercise Medicine 2025; 11(1): e002090. https://doi.org/10.1136/bmjsem-2024-002090

81.

Zumeta-Olaskoaga

Weigert

Larruskain

, et al. Prediction of sports injuries in football: a recurrent time-to-event approach using regularized cox models. AStA Advances in Statistical Analysis 2023; 107(1): 101–126. https://doi.org/10.1007/s10182-021-00428-2

82.

Dabek

Hoover

Jorgensen-Wagers

, et al. Evaluation of machine learning techniques to predict the likelihood of mental health conditions following a first mTBI. Frontiers in neurology 2022; 12: 769819. https://doi.org/10.3389/fneur.2021.769819

83.

Hanson

Stracciolini

Mannix

, et al. Management and prevention of sport-related concussion. Clinical Pediatrics 2014; 53(13): 1221–1230. https://doi.org/10.1177/0009922813518429

84.

Colvin

Thurm

Pate

, et al. Diagnosis and acute management of patients with concussion at children’s hospitals. Archives of disease in childhood 2013; 98(12): 934–938. https://doi.org/10.1136/archdischild-2012-303588

85.

Amoo-Achampong

Rosas

Schmoke

, et al. Trends in sports-related concussion diagnoses in the USA: a population-based analysis using a private-payor database. The Physician and Sportsmedicine 2017; 45(3): 239–244. https://doi.org/10.1080/00913847.2017.1327304

86.

Singh

Rakhra

. From imaging to outcomes: machine learning and AI for TBI detection. In: 2025 International Conference on Networks and Cryptology (NETCRYPT). IEEE, 2025, pp. 1162–1168.

87.

Anderson

McCorkle

Hammonds

, et al. Early vestibular rehabilitation initiation is associated with faster recovery after sport-related concussion. Journal of Science and Medicine in Sport 2025; 28(3): 222–227.

88.

Misiura

Ruban

Honcharov

, et al. The results of the corrective rehabilitation program on the gait of amateur athletes with long-term consequences of brain injury. Wiadomosci lekarskie 2024; 77(2): 233–240. https://doi.org/10.36740/WLek202402107

89.

Riley

Ensor

Snell

KIE

, et al. Importance of sample size on the quality and utility of AI-based prediction models for healthcare. The Lancet Digital Health 2025; 7(6).

90.

Grace

. Exercise after traumatic brain injury: is it a double-edged sword? PM&R 2011; 3(6): S64–S72.

91.

Griesbach

Hovda

Molteni

, et al. Voluntary exercise following traumatic brain injury: brain-derived neurotrophic factor upregulation and recovery of function. Neuroscience 2004; 125(1): 129–139. https://doi.org/10.1016/j.neuroscience.2004.01.030

92.

DiFazio

Silverberg

Kirkwood

, et al. Prolonged activity restriction after concussion: are we worsening outcomes? Clinical pediatrics 2016; 55(5): 443–451. https://doi.org/10.1177/0009922815589914

93.

Sim

Reid

. Statistical inference by confidence intervals: issues of interpretation and utilization. Physical Therapy 1999; 79(2): 186–195.

94.

Bretzin

Esopenko

D’Alonzo

, et al. Clinical recovery timelines after sport-related concussion in men’s and women’s collegiate sports. Journal of athletic training 2022; 57(7): 678–687. https://doi.org/10.4085/601-20

95.

Covassin

Moran

Elbin

. Sex differences in reported concussion injury rates and time loss from participation: an update of the national collegiate athletic association injury surveillance program from 2004–2005 through 2008–2009. Journal of athletic training 2016; 51(3): 189–194. https://doi.org/10.4085/1062-6050-51.3.05

96.

Mehrabi

Morstatter

Saxena

, et al. A survey on bias and fairness in machine learning. ACM computing surveys (CSUR) 2021; 54(6): 1–35. https://doi.org/10.1145/3457607

97.

Tempelaar

Rienties

Nguyen

. Subjective data, objective data and the role of bias in predictive modelling: Lessons from a dispositional learning analytics application. PloS one 2020; 15(6): e0233977. https://doi.org/10.1371/journal.pone.0233977

98.

Frewing

Gibson

Robertson

, et al. Don’t fear the artificial intelligence: a systematic review of machine learning for prostate cancer detection in pathology. Archives of Pathology & Laboratory Medicine 2024; 148(5): 603–612. https://doi.org/10.5858/arpa.2022-0460-RA

99.

Briley

Moret

. Neurobiological mechanisms involved in antidepressant therapies. Clinical neuropharmacology 1993; 16(5): 387–400. https://doi.org/10.1097/00002826-199310000-00002

100.

Riley

Snell

KIE

Ensor

, et al. Minimum sample size for developing a multivariable prediction model: Part ii-binary and time-to-event outcomes. Statistics in medicine 2019; 38(7): 1276–1296. https://doi.org/10.1002/sim.7992

101.

Y-H

R-Y

Lin

Y-C

, et al. A novel MissForest-based missing values imputation approach with recursive feature elimination in medical applications. BMC Medical Research Methodology 2024; 24(1): 269. https://doi.org/10.1186/s12874-024-02392-2

102.

George

Pe’er

Korem

. Distributional bias compromises leave-one-out cross-validation. Science Advances 2025; 11(48): eadx6976.

103.

Parker

Günter

Bedo

. Stratification bias in low signal microarray studies. BMC bioinformatics 2007; 8(1): 326. https://doi.org/10.1186/1471-2105-8-326

104.

Hastie

. Ridge regularization: An essential concept in data science. Technometrics 2020; 62(4): 426–433. https://doi.org/10.1080/00401706.2020.1791959

105.

Brady

Hume

Mahon

, et al. What is the evidence on natural recovery over the year following sports-related and non-sports-related mild traumatic brain injury: a scoping review. Frontiers in neurology 2022; 12: 756700. https://doi.org/10.3389/fneur.2021.756700

106.

Etemad

Yue

Barber

, et al. Longitudinal recovery following repetitive traumatic brain injury. JAMA network open 2023; 6(9): e2335804. https://doi.org/10.1001/jamanetworkopen.2023.35804

107.

Jeffrey

Elbin

Casa

, et al. Validation of a machine learning brain electrical activity–based index to aid in diagnosing concussion among athletes. JAMA network open 2021; 4(2): e2037349.

108.

König

Malley

Weimar

, et al. Practical experiences on the necessity of external validation. Statistics in medicine 2007; 26(30): 5499–5511. https://doi.org/10.1002/sim.3069

109.

Shim

Lee

S-H

Hwang

H-J

. Inflated prediction accuracy of neuropsychiatric biomarkers caused by data leakage in feature selection. Scientific Reports 2021; 11(1): 7980. https://doi.org/10.1038/s41598-021-87157-3

110.

Mahadi

Ballal

Moinuddin

, et al. Regularized linear discriminant analysis using a nonlinear covariance matrix estimator. IEEE Transactions on Signal Processing 2024; 72: 1049–1064. https://doi.org/10.1109/tsp.2024.3361715

111.

Adans-Dester

Hankov

O’Brien

, et al. Enabling precision rehabilitation interventions using wearable sensors and machine learning to track motor recovery. NPJ digital medicine 2020; 3(1): 121. https://doi.org/10.1038/s41746-020-00328-w

112.

Oride

MKH

Marutani

Rouse

, et al. Reliability study of the Pierce and King-Devick saccade tests. Optometry and Vision Science 1986; 63(6): 419–424. https://doi.org/10.1097/00006324-198606000-00005

113.

Tinetti

. Performance-oriented assessment of mobility problems in elderly patients. J Am Geriatr Soc 1986; 34(2): 119–126.

114.

Guskiewicz

Riemann

Perrin

, et al. Alternative approaches to the assessment of mild head injury in athletes. Medicine and science in sports and exercise 1997; 29(7): S213–S221. https://doi.org/10.1097/00005768-199707001-00003

115.

Paul

Meeuwisse

Johnston

, et al. Consensus statement on concussion in sport–the 3rd international conference on concussion in sport held in Zurich, November 2008. South African Journal of Sports Medicine 2009; 21(2).

116.

Cobbs

Hasanaj

Amorapanth

, et al. Mobile universal lexicon evaluation system (MULES) test: a new measure of rapid picture naming for concussion. Journal of the neurological sciences 2017; 372: 393–398. https://doi.org/10.1016/j.jns.2016.10.044

117.

World Rugby. Head Injury Assessment (HIA) Protocol: three-stage protocol (HIA1, HIA2, HIA3) for the identification, diagnosis, and management of concussion in elite adult rugby. Version 4. World Rugby, 2022.

118.

Raftery

Tucker

. Implementing a worldwide concussion programme. Aspetar Sports Medicine Journal 2016; 5, n.pag (Volume 5 – Targeted Topic: International Sports Federations; Sports Medicine category).

119.

Vartiainen

Holm

Lukander

, et al. A novel approach to sports concussion assessment: Computerized multilimb reaction times and balance control testing. Journal of Clinical and Experimental Neuropsychology 2016; 38(3): 293–307. https://doi.org/10.1080/13803395.2015.1107031

120.

Paul

Meeuwisse

Aubry

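. Consensus statement on concussion in sport–the 4th international conference on concussion in sport held in Zurich, November 2012. British journal of sports medicine 2013; 47(5): 250–258.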

121.

Paul

Meeuwisse

Dvorak

Jiří

, et al. Consensus statement on concussion in sport–the 5th international conference on concussion in sport held in Berlin, October 2016. British journal of sports medicine 2017; 51(11): 838–847.

122.

McCrea

Kelly

Randolph

. Standardized assessment of concussion (SAC): Manual for administration, scoring, and interpretation. 2nd ed. 2000.

123.

Lovell

Collins

. Neuropsychological assessment of the college football player. The Journal of head trauma rehabilitation 1998; 13(2): 9–26. https://doi.org/10.1097/00001199-199804000-00004

124.

Elizabeth Sandel

Wang

Terdiman

, et al. Disparities in stroke rehabilitation: results of a study in an integrated health system in Northern California. PM&R 2009; 1(1): 29–40. https://doi.org/10.1016/j.pmrj.2008.10.012