Abstract
Objective
To evaluate whether integrating longitudinal clinical data improves machine learning (ML)-based prediction of time to medical clearance following sport-related concussion (SRC) and to identify clinical features most strongly associated with classification of either ‘prolonged’ recovery (≥ 30 days) or ‘normal’ recovery (< 30 days).
Methods
A retrospective cohort of 217 athletes (mean age 26.94 years) from the USF Concussion Center (2021–2025) was analyzed. Six ML classifiers were trained on Visit 1 features (n = 48) and combined Visit 1 + Visit 2 features (n = 95). Internal validation was performed using Leave-One-Out Cross-Validation (LOOCV).
Results
Prolonged recovery occurred in 81.1% of the cohort. Adding Visit 2 features improved accuracy in 66% of models, with XGBoost achieving the highest accuracy (0.84, +5% gain over Visit 1). Specificity remained low (0.00–0.34) due to class imbalance. VOR Vertical Headache and its change score were the most frequent predictors of prolonged recovery, present in 81% and 100% of models, respectively. Treatment presence between visits emerged as the strongest predictor of normal recovery.
Conclusions
Longitudinal clinical data modestly improves ML-based SRC recovery predictions. Vestibulo-oculomotor symptoms, particularly headache provoked during vertical VOR testing, are robust prognostic indicators. These findings support the utility of granular VOMS subscores for early risk stratification and targeted rehabilitation. External validation is required before clinical deployment. Code: https://github.com/MeganTran6023/Sport-Related-Concussions_Machine-Learning. IRB: USF STUDY003514.
Introduction
Recent trends show an annual increase in sports participation in the United States and, consequently, an increased potential for concussion, a form of traumatic brain injury (TBI). 1 Specifically, given the increased exposure from sports participation, current estimates suggest that over 3.8 million concussions occur in the US annually, a steady increase over previous decades (i.e., from 1.7 concussions per 10,000 athlete exposures in 1988–1989 to 3.4 in 2003–2004 2 and to 4.47 in the period between 2009 and 2014). 3 This problem is further exacerbated by estimates that up to 50% of concussions go unreported. 1 As such, it is essential that licensed medical practitioners (e.g., physicians, neurologists, neuropsychologists, and emergency medicine specialists) accurately diagnose TBIs following suspected incidence. To do so, they utilize a combination of tests, such as the Balance Error Scoring System (BESS) 4 and Vestibular Ocular Motor Screening (VOMS), 5 to assess cognitive, observational, and visual outcomes related to TBI symptoms. 6 However, these preliminary assessments are typically sufficient only to diagnose TBIs and cannot adequately predict a player’s time to clearance. 7 Fortunately, the use of machine learning (ML) in healthcare can improve diagnostic capabilities while also improving the prediction of longitudinal prognosis (e.g., time to clearance). 8 These ML methods can be implemented using data collected at the time of the original diagnosis or can combine longitudinal data to improve prognosis capabilities. 9
The objective of this study was to use machine learning models to classify recovery duration until medical clearance (i.e., return to sport) following sport-related TBI. This is done by integrating data from gold-standard clinical assessments (e.g., BESS, 4 VOMS, 5 ImPACT, 10 etc.) collected across multiple time points by licensed medical practitioners. Specifically, the study aimed to evaluate whether the predictive accuracy of time to clearance improves given longitudinal data, while also identifying and quantifying the specific assessment features that most strongly drive the prediction of time to clearance. By identifying assessment features that demonstrate a predictive signal, this work provides exploratory insights that could inform the future development of evidence-based protocols for clinical and applied contexts. Ultimately, this study will support a framework for longitudinal monitoring of TBI related to both individualized return-to-play decisions and broader clinical guidelines. This study is primarily comparative in scope: rather than proposing a single deployable clinical tool, it benchmarks six ML classifiers across longitudinal data to identify which model architectures and clinical features best support future development of a validated prediction tool. External validation and prospective testing are required before any model described here could be considered for clinical deployment. The intended target population for a future validated version of this model consists of athletes diagnosed with sport-related concussion (SRC) presenting to a sports medicine or concussion specialty clinic within one year of injury. The intended users are licensed clinicians experienced in concussion management (e.g., sports medicine physicians, neurologists, athletic trainers), who would use model output to supplement, not replace, clinical judgment regarding return-to-play timelines (TRIPOD+AI Item 3b).
Related works
Gold standard concussion assessments
In sports-related activities, athletes are at risk of the acute effects of concussion. 11 This includes decreased verbal/visual memory and processing speed during the acute time period defined as 1 – 14 days post-concussion. 12 Consequently, these impairments translate to declines in both cognitive outcomes and athletic performance. 11 Because TBI affects several aspects of an athlete’s mental and physical abilities, multiple cognitive assessments have been used for screening/diagnosing TBIs. These tools are used to evaluate cognitive, observational, and visual outcomes related to TBIs (Appendix Table 10).
While it is recommended that a variation of the Sport Concussion Assessment Tool (SCAT) be used in any TBI assessment, given its multi-modal approach to concussion screening, 13 it is also common for other tests (e.g., King-Devick for oculomotor assessment) to be administered in conjunction with the SCAT to increase the overall feature set. 13 Similarly, several of these assessments have modified versions (e.g., the Balance Error Scoring System (BESS) and modified BESS (mBESS)), which may change sensitivity when assessing different populations/conditions. 14 Interestingly, the ImPACT test battery also uses the Post-Concussion Symptom Scale (PCSS), which explains why some administration cases do not use both tests together for concussion diagnosis and evaluation. 15,16 Neurocognitive testing is incorporated to better identify concussion in athletes after an injury, as depending solely on symptoms is not sufficient for proper diagnosis. 17 PCSS is a computerized neurocognitive test many health care professionals use to determine the number and severity of symptoms an athlete experiences following a concussion. 18 Langevin et al. found only a low to moderate degree of correlation between the frequency of symptoms reported by a concussed athlete using the PCSS and the Dizziness Handicap Inventory, Headache Disability Inventory, and Neck Disability Index. 19 Symptom assessments are crucial in concussion testing, as understanding an individual’s specific profile enables clinicians to implement a personalized recovery plan. 20 Common physical symptoms, such as headaches, dizziness, and light sensitivity, often appear immediately following the injury. It is imperative to record a patient’s symptoms for multiple reasons, including establishing a baseline and tracking their progression if recorded longitudinally. 21–23
Machine learning for health prediction
Given the increased application of ML in healthcare, supervised learning has been extensively deployed as a means for processing neurological disease-based datasets to output explainable results. 24 This is possible due to the large amount of digitally available data (e.g., from gold standard assessments, digital devices, patient reported outcomes, etc.) which relate to different neuro-cognitive functions (e.g., motor, memory, and executive function) of interest. 25 The presence of these types of data is highly impactful for the standardization of TBI diagnosis. 26 However, many of these tools, focused on diagnosing TBIs, only account for a single time point of data, which makes it difficult to predict future outcomes. 27 Further, as there are various types of features (e.g., binary responses, continuous values, etc.) across multiple neuro-cognitive domains, ML is needed to analyze the underlying patterns. 24 Combining ML with existing evaluation tools allows clinicians to capture complex relationships between various assessment features that are not easily identifiable. 28 This integration can improve clinical and patient outcomes by enabling earlier, more accurate diagnosis, personalized prognosis, and data-driven treatment planning that supports timely interventions and informed clinical decision-making. 29 Recent methodological literature in the TBI and concussion domain has increasingly focused on enhancing model robustness through advanced deep learning frameworks and validation strategies. For instance, Ref. 30 demonstrated how convolutional neural network (CNN) architectures and robust preprocessing, such as addressing class imbalance through oversampling, can improve classification accuracy across temporal stages of pathology, offering a valuable perspective on the validation required for clinical translation.
Bergeron et al. found that Naive Bayes and Random Forest models were top performers in predicting concussion resolution in high school athletes within 7, 14, or 28 days. 31 The top features that drove this performance include difficulty concentrating, sensitivity to light/noise, and balance issues. 31 Chu et al. reported that the CatBoost model outperformed traditional statistical methods in both predictive and discriminative aspects when predicting concussion recovery time and protracted recovery using clinical data involving the Vestibular Ocular Motor Screening (VOMS), King-Devick Test, and the C3 Logix Trails Test. 32 Thomas and Arnett found that their Random Forest model performed best in classifying concussed college athletes’ recovery timeframes as typical (≤ 28 days) or prolonged (> 28 days). 33 These recent studies add to the existing literature by using ML models to capture nonlinear relationships in complex datasets to determine recovery timeframes for concussed patients. 31 Thomas and Arnett’s study highlights a limitation of human-driven statistical methods that ML models overcome, namely the use of past data for prospective concussion recovery predictions. 33 Furthermore, Chu et al.’s models required fewer features than traditional methods to accurately predict concussion recovery time: they used 11 features, whereas traditional models used 25–27 features. 32
Methodology
This study was reported in accordance with the TRIPOD+AI checklist. 34 The completed checklist is provided in the Appendix, with locations indicated by page number.
Cohort
The dataset utilized in this study was provided by the University of South Florida (USF) Concussion Center via the USF Research Electronic Data Capture (REDCap) server. The study population consisted of 3,038 patients diagnosed with a concussion between 2017 and 2026 at USF facilities. This dataset includes patient data with multiple visits collected from 2021 to 2025. Within this dataset, rows represent each patient with respective visit information, while columns represent a combination of patient intake and examination data. Within the full USF Concussion Center database, patients were categorized based on one of four mechanisms of injury causing concussion (i.e., Sports Related Concussions (this study), Motor Vehicle Accidents, Falls, and Other, represented by assaults, collision with random impediments, etc.). Initially, the full dataset included 3,038 unique patient records. Following data preprocessing (described in the subsequent section), 217 unique patients with concussion remained for ML analysis. Of the 217, 80 (36.9%) are male and 137 (63.1%) are female, with an average age of 26.94 years. Treatment Present is a binary variable (0/1) indicating whether a patient received any form of treatment following their initial visit. Treatment types span both pharmacological and non-pharmacological approaches: Selective Serotonin Reuptake Inhibitors (SSRI), amantadine, stimulants, preventative headache medication, vestibular therapy, physical therapy, chiropractic care, psychological treatment, neuropsychological treatment, neurology treatment, and cognitive therapy. Of the 217 patients, 168 received at least one form of treatment (Treatment Present = 1) and 49 did not (Treatment Present = 0). Anxiety and depression are the most common mood disorders in the dataset, where the prevalence of depression ranges from 6% to 34% 35 and the prevalence of anxiety is 46.72% in youth athletes. 36
GAD-7 Score (Generalized Anxiety Disorder-7): This was used to quantify anxiety status at the time of the visit. PHQ-9 Score (Patient Health Questionnaire-9): This was used to quantify depression/mood status at the time of the visit.
To account for the ‘treatment_present’ variable, we included features for Selective Serotonin Reuptake Inhibitors (ssri_tx), psychological treatment (psych_tx), neuropsychological evaluation or treatment (neuropsych_tx), and cognitive therapy (cognitive_tx), each of which was coded as a binary value (0/1) to indicate whether the patient received the treatment during their single intake visit.
Finally, clearance is determined by the patient being either asymptomatic or back to the baseline level of symptoms present pre-injury, in addition to normal functional measures based on VOMS, BESS, CNS vital signs, and return to school. Physicians experienced in the diagnosis and management of concussions make that determination.
Data preprocessing
To clean the dataset of 3,038 unique patients, a multi-step iterative process was employed, beginning with the establishment of specific cleaning standards: a column missingness threshold of 0.9, a row missingness threshold of 0.8, and a threshold step of 0.1 for subsequent rounds. During the step-by-step cleaning phase, the percentage of missing values was calculated for every row and column, leading to the removal of any records or features that exceeded the established thresholds. Following each removal cycle, the standards were adjusted by the threshold step and the process was repeated until the data stabilized, resulting in an intermediate dataset of 2,338 rows and 49 columns. Finally, outliers were addressed by filtering for clinically relevant timelines, specifically retaining records where the duration to the first visit was between 0 and 365 days and the duration to clearance was at least 1 day. This refinement process produced a final dataset of 1,865 rows and 49 columns, such that the full data frame includes no null values. This process was chosen to avoid data imputation, as imputation in medical data tends to introduce bias and there are still no optimal imputation solutions in the medical domain. 37,38 After all preprocessing steps, including filtering for patients with two visits, the resulting dataset included 1,201 unique patients, 217 of whom had sports-related concussions.
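The iterative cleaning procedure described above can be sketched as follows. This is a minimal illustration rather than the study's code: it assumes that both thresholds are lowered by the step after every round and that the loop stops once no missing values remain, and the column names `days_to_first_visit` and `days_to_clearance` are hypothetical.

```python
import numpy as np
import pandas as pd


def iterative_missingness_clean(df, col_thresh=0.9, row_thresh=0.8, step=0.1):
    """Drop sparse columns and rows, tightening the thresholds each round.

    Assumed stopping rule: repeat until no missing values remain.
    """
    out = df.copy()
    while out.isna().any().any():
        # Remove features whose missing fraction exceeds the column threshold.
        out = out.loc[:, out.isna().mean(axis=0) <= col_thresh]
        # Remove records whose missing fraction exceeds the row threshold.
        out = out.loc[out.isna().mean(axis=1) <= row_thresh]
        # Assumption: thresholds decrease by `step`, floored at zero.
        col_thresh = max(round(col_thresh - step, 10), 0.0)
        row_thresh = max(round(row_thresh - step, 10), 0.0)
    return out


def filter_clinical_timelines(df):
    """Keep first visits within 0-365 days and clearance after at least 1 day."""
    keep = df["days_to_first_visit"].between(0, 365) & (df["days_to_clearance"] >= 1)
    return df.loc[keep]


# Toy example with hypothetical columns.
raw = pd.DataFrame({
    "days_to_first_visit": [3, 10, 400, 7],
    "days_to_clearance": [21, 45, 60, 0],
    "symptom_score": [12.0, np.nan, 8.0, 5.0],
    "unused_field": [np.nan, np.nan, np.nan, np.nan],
})
cleaned = filter_clinical_timelines(iterative_missingness_clean(raw))
```

Because the thresholds eventually reach zero, the loop is guaranteed to end with a frame that contains no missing values, which matches the stated goal of avoiding imputation.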
Subsequently, both feature inclusion and feature engineering were performed to capture clinically relevant variables related to time instances and/or longitudinal changes experienced by patients. For feature inclusion, ‘prior head injury’ and ‘history of mood disorders’ were selected for Visit 1, whereas ‘prior head injury’, ‘history of mood disorders’, and ‘treatment presence’ were selected for Visit 2 (i.e., as treatment was not formally administered until after the collection of data in the first visit). In addition, difference features were engineered for Visit 2 by subtracting Visit 1 values from corresponding Visit 2 values for all shared base features, excluding ‘prior head injury’, ‘history of mood disorders’, and ‘treatment presence’, to provide the models with information related to longitudinal changes between visits. This process resulted in 49 features for Visit 1 and 95 features for Visit 2. Data for patients with two or more visits were included, using only the baseline and second visit, as there were not enough patients with three or more visits within the normal (one-month) time-to-clearance class.
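The difference-feature step can be illustrated with a short sketch. It assumes two data frames indexed by patient, one per visit; the column names and the `_diff` suffix are hypothetical and stand in for the study's actual feature names.

```python
import pandas as pd

# Hypothetical names for the three features excluded from differencing.
EXCLUDED = ["prior_head_injury", "mood_disorder_history", "treatment_present"]


def add_difference_features(visit1, visit2, excluded=EXCLUDED):
    """Return Visit 2 features plus (Visit 2 - Visit 1) change scores."""
    shared = [c for c in visit2.columns if c in visit1.columns and c not in excluded]
    # Subtraction aligns on the patient index, so each row is one patient's change.
    diffs = (visit2[shared] - visit1[shared]).add_suffix("_diff")
    return pd.concat([visit2, diffs], axis=1)


visit1 = pd.DataFrame(
    {"vor_vertical_headache": [2, 5], "prior_head_injury": [0, 1]},
    index=["p1", "p2"],
)
visit2 = pd.DataFrame(
    {"vor_vertical_headache": [4, 1], "prior_head_injury": [0, 1], "treatment_present": [1, 0]},
    index=["p1", "p2"],
)
visit2_features = add_difference_features(visit1, visit2)
```

A positive change score means the symptom rating rose between visits, and a negative one means it fell.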
Machine learning
Each model was trained separately for each visit using the same hyperparameter tuning, training, and testing pipeline. Visit 1 models used the clinical dataset that included the new features prior head injury and history of mood disorders in addition to the original features remaining from preprocessing. Visit 2 models used the clinical dataset that included the new features prior head injury, history of mood disorders, and treatment presence, the original features remaining from preprocessing, and the difference variants of the original features. The models’ prediction calculations are expressed in this section via Equations 1–9 (TRIPOD+AI Item 12g,22).
Light Gradient Boosting Machine (LGBM) is a Gradient Boosting Decision Tree that incorporates two techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). This ML model optimizes the number of features to focus on to quickly and accurately make predictions using a small dataset while reducing memory usage. 39 Similar to other gradient boosting frameworks, LGBM constructs an additive model by minimizing a regularized objective function of the form

$$\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right)$$

In the objective function, $n$ denotes the number of training samples, $y_i$ represents the true label for the $i$th sample, and $\hat{y}_i$ is the corresponding model prediction; $l(\cdot)$ is the loss function, $f_k$ is the $k$th of $K$ trees, and $\Omega(f_k)$ is the regularization term.
Decision Tree Classifier applies a divide and conquer algorithm for each feature to determine the optimal order of splits that best captures nonlinear patterns in a dataset. 40 Partitioning the problem into binary sub-outputs makes it easy to track which order of features leads to the prediction of a patient’s time to clearance with the highest predictive accuracy. 41 Three splitting criteria choices used for this model, along with their respective mathematical formulations, are as follows: a) b) c)
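To make the idea of a splitting criterion concrete, the sketch below computes two common node impurity measures. It assumes, without confirmation from the text, that the criteria are among those offered by scikit-learn's `DecisionTreeClassifier` (`gini`, `entropy`, and `log_loss`, where the last two share the Shannon entropy formula). The function names are illustrative.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Fraction of samples in each class at a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini_impurity(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


# Root-node impurity for the study's class counts (176 prolonged, 41 normal).
cohort_labels = [1] * 176 + [0] * 41
cohort_gini = gini_impurity(cohort_labels)      # about 0.31
cohort_entropy = entropy(cohort_labels)         # about 0.70 bits
```

A split is preferred when the weighted impurity of the child nodes is lower than the impurity of the parent, so a tree starts from the root value above and tries to drive it toward zero.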
Random Forest
Random Forest combines multiple decision trees to improve the generalizability of the model’s outputs. 41 Given that our dataset has 48 features and 95 features for initial and second visits, respectively, combining multiple decision trees provides a holistic understanding of feature influence on the model’s accuracy rather than relying on a single order of splits. Formally, Random Forest predictions can be analyzed using the margin function, defined as

$$mg(X, Y) = P_{\Theta}\left(h(X, \Theta) = Y\right) - \max_{j \neq Y} P_{\Theta}\left(h(X, \Theta) = j\right)$$

The corresponding generalization error is given by

$$PE = P_{X,Y}\left(mg(X, Y) < 0\right)$$

which represents the probability that the ensemble misclassifies an input, linking model accuracy to the strength and diversity of the individual trees. In the margin function, $X$ represents the input feature vector, $Y$ denotes the true class label, and $h(X, \Theta)$ is the prediction of an individual tree parameterized by random variables $\Theta$, which control feature selection and bootstrapped sampling. The probability $P_{\Theta}(\cdot)$ is taken over the ensemble of trees. The generalization error $PE$ measures the likelihood that the ensemble predicts an incorrect class for a randomly drawn input-output pair $(X, Y)$.
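Support Vector Classifiers (SVCs)
SVCs perform well with small datasets since their time and space complexity is proportional to the input dataset size. 42 SVCs are a specific type of Support Vector Machine (SVM). For classification problems, SVMs attempt to linearly separate points from a dataset into distinct groups in a feature space of higher dimension. 43 This separation is achieved by solving the following optimization problem:

$$\min_{w, b, \xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i\left(w^{\top}\phi(x_i) + b\right) \geq 1 - \xi_i, \;\; \xi_i \geq 0$$

In the optimization problem, $w$ and $b$ are the weight vector and bias of the separating hyperplane, $\phi(x_i)$ maps the $i$th sample into the feature space, $y_i \in \{-1, +1\}$ is its class label, $\xi_i$ are slack variables that permit margin violations, and $C$ controls the trade-off between margin width and the misclassification penalty.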
XGBoost
This model is a widely used implementation of gradient tree boosting that is designed for scalability, efficiency, and strong performance, even when working with sparse datasets. Its effectiveness stems from greedy optimization in which each new tree is trained to model the residual errors left by the previous ensemble of trees. By iteratively reducing residuals, the model progressively improves its predictions, leading to a robust and highly accurate boosting framework. 44 Formally, XGBoost minimizes the following regularized objective function

$$\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right)$$

where $n$ denotes the number of training samples, $y_i$ is the true label, and $\hat{y}_i$ is the predicted label for the $i$th sample. The loss function $l(\cdot)$ measures prediction error, while $f_k$ denotes the $k$th tree in the ensemble, with $K$ being the total number of trees. The regularization term $\Omega(f_k)$ penalizes model complexity by incorporating constraints on tree structure and leaf weights, improving performance.
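Ridge Regression
This is useful for addressing the collinearity problem that occurs in general linear regression analyses without dropping any of the features from the original set of independent variables. 45 The equation used introduces an ℓ2 regularization term into the standard least-squares objective and is defined as

$$\hat{\beta} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_2^2$$

In the optimization expression, $y$ is the vector of observed outcomes, $X$ is the feature matrix, $\beta$ is the vector of regression coefficients, and $\alpha \geq 0$ is the regularization strength that controls the amount of shrinkage.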
Study design
As ML models are intended to predict whether each patient’s time to clearance is within a month or over a month we used a binary encoding method to classify these states. Specifically, this study treats recovery prognosis as a binary classification task of recovery duration, where the models differentiate between ‘normal’ (< 30 days) and ‘prolonged’ (≥ 30 days) recovery timelines. Class 0 denotes recovery within a month and Class 1 denotes recovery over a month. One month was selected as the threshold since this is the general clinical recovery timeframe for sports-related concussions. 46 This was completed to not only evaluate if the models can predict the time to clearance, but also demonstrate increases in prediction accuracy comparing between first and second visit datasets (TRIPOD+AI Item 8a).
Leave-One-Out Cross-Validation (LOOCV) was utilized to determine each model’s accuracy in predicting the patients’ time to medical clearance for both initial and second visits. This method was selected due to the dataset’s small size. 47 For LOOCV with a dataset of size N, one data point is used as the testing set while the remaining N − 1 data points form the training set. This procedure is repeated until each data point has been used once as the testing instance. 48
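A minimal sketch of the LOOCV loop is given below, using scikit-learn's `LeaveOneOut` splitter. The synthetic data and the choice of a decision tree are placeholders and not the study's configuration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier


def loocv_accuracy(model, X, y):
    """Fit on N - 1 samples, test on the held-out sample, and repeat N times."""
    hits = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        model.fit(X[train_idx], y[train_idx])
        hits += int(model.predict(X[test_idx])[0] == y[test_idx][0])
    return hits / len(y)


# Synthetic stand-in data (30 samples, 4 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = (X[:, 0] > 0).astype(int)
acc = loocv_accuracy(DecisionTreeClassifier(random_state=0), X, y)
```

Each sample is predicted exactly once by a model that never saw it, so the returned value is the fraction of the N held-out predictions that were correct.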
No patients or members of the public were involved in the design, conduct, reporting, or dissemination plans of this research.
ML Model Hyperparameter Tuning
LightGBM, Decision Tree, Random Forest, SVC, and XGBoost were hyperparameter tuned using an 80:20 train–test split. For these models, hyperparameters were optimized using randomized search via ParameterSampler, 49 where 50 distinct parameter combinations were randomly sampled from predefined distributions; the configuration achieving the highest validation accuracy was selected. This number of combinations was selected because performance plateaued beyond it. In contrast, the Ridge classifier was tuned using k-fold cross-validation, where the data were split into k folds and the model was trained on k − 1 folds and validated on the remaining fold. This process was repeated k times, and the validation scores were averaged across folds for each candidate alpha (α). Subsequently, the α with the highest average score was selected and the final Ridge model was re-fit on the full dataset using this optimal α. Since the Ridge model had only one hyperparameter to tune (i.e., α), its hyperparameter tuning process differed from those of the other models implemented in the paper. 50 Details for hyperparameter tuning for each model are explained in the following paragraphs.
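The randomized search can be sketched with scikit-learn's `ParameterSampler`. The search space, the synthetic data, the decision tree, and the stratified split are all assumptions made for illustration; the study's actual distributions are not listed in this section.

```python
import numpy as np
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler, train_test_split
from sklearn.tree import DecisionTreeClassifier


def randomized_search(X, y, param_distributions, n_iter=50, seed=0):
    """Return the sampled configuration with the best held-out accuracy."""
    # 80:20 split; stratification is an assumption made here for imbalanced classes.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    best_params, best_acc = None, -1.0
    for params in ParameterSampler(param_distributions, n_iter=n_iter, random_state=seed):
        model = DecisionTreeClassifier(random_state=seed, **params)
        acc = model.fit(X_train, y_train).score(X_val, y_val)
        if acc > best_acc:
            best_params, best_acc = params, acc
    return best_params, best_acc


# Synthetic stand-in data; the search space below is illustrative only.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
space = {
    "max_depth": randint(1, 10),
    "min_samples_leaf": randint(1, 20),
    "criterion": ["gini", "entropy"],
}
best_params, best_acc = randomized_search(X, y, space)
```

Note that when any entry in the space is a distribution, `ParameterSampler` samples with replacement, so the 50 draws are not guaranteed to be distinct unless duplicates are filtered out.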
Class Imbalance
The dataset exhibits substantial class imbalance: 176 patients (81.1%) belong to the ‘prolonged’ recovery class (Class 1) versus 41 patients (18.9%) in the ‘normal’ recovery class (Class 0). No resampling techniques (e.g., SMOTE, random oversampling) were applied in the primary analysis, as the authors judged that synthetic data generation could misrepresent the true clinical distribution. 51 Instead, class weighting was incorporated into the pipeline to penalize misclassification of the minority class (Class 0) in inverse proportion to class frequency. The effect of this imbalance is reflected in near-zero specificity values in the unweighted pipeline and is discussed as a primary limitation of the reported accuracy and recall metrics (TRIPOD+AI Item 13).
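As a worked example of class weighting, the sketch below applies the "balanced" heuristic, in which each class weight equals n_samples / (n_classes × class count). This is the formula behind scikit-learn's `class_weight="balanced"` option; whether the study used exactly this scheme is an assumption.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count) for label, count in class_counts.items()}


# Class counts reported for the cohort: 41 normal (Class 0), 176 prolonged (Class 1).
weights = balanced_class_weights({0: 41, 1: 176})
# Class 0 receives a weight of about 2.65 and Class 1 about 0.62.
```

Under these weights, misclassifying one Class 0 patient costs roughly as much as misclassifying four Class 1 patients, which offsets the 176:41 imbalance.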
Feature Importance (FI)
To identify and quantify which assessment features most strongly influence predictions of time to clearance, FI analyses were conducted using methods appropriate to each model type on a final refitted model. For tree-based models (i.e., Decision Tree, Random Forest), importance was derived from the reduction in Gini impurity attributable to each feature. Specifically, these FI values, in addition to their respective Gini impurity values, were extracted from each trained tree-based model. 52,53 Conversely, for gradient-boosted models (i.e., LightGBM, XGBoost), importance was based on gain, defined as the total reduction in loss from splits using a given feature. 54 For linear SVC, FI was quantified using the absolute magnitude of the learned coefficients, reflecting each feature’s influence on the decision function. 55 For Ridge, permutation importance was derived by measuring the decrease in predictive accuracy resulting from random permutation of individual features. 56 Appendix Table 11 denotes formal feature names related to specific features of importance.
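Three of the importance measures described above can be sketched with scikit-learn alone. The synthetic data (in which only the first feature carries signal) and the model settings are assumptions for illustration; gain-based importances for LightGBM and XGBoost are omitted so the example depends only on scikit-learn.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends only on feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

# Tree-based: impurity-based importances (mean decrease in impurity).
tree_fi = DecisionTreeClassifier(random_state=0).fit(X, y).feature_importances_

# Linear SVC: absolute value of the learned coefficients.
svc_fi = np.abs(SVC(kernel="linear").fit(X, y).coef_).ravel()

# Ridge: mean drop in accuracy when each feature is randomly permuted.
ridge = RidgeClassifier(alpha=1.0).fit(X, y)
ridge_fi = permutation_importance(
    ridge, X, y, scoring="accuracy", n_repeats=20, random_state=0
).importances_mean
```

The three vectors are on different scales, so they are best compared by rank (for example, each model's top 20 features) rather than by raw magnitude.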
Results
All six classifiers produce binary class labels as their primary output: Class 0 denotes predicted ‘normal’ recovery (<30 days to medical clearance) and Class 1 denotes prediction of ‘prolonged’ recovery (≥30 days). Predicted class probabilities are available for probabilistic models (LightGBM, XGBoost, Random Forest, Decision Tree, Ridge) and can be inspected via the public repository. Classification thresholds were set at 0.5 for all models; threshold optimization (e.g., via ROC analysis) was not performed at this stage given the exploratory scope of the study (TRIPOD+AI Item 15).
Visit 1 – model performance with 95% bootstrap confidence intervals.
Note: Values within brackets represent the 95% confidence intervals.
Notably, the integration of class weighting yielded superior specificity across all models compared to the initial machine learning implementation. By assigning greater weights to minority class observations, this strategy mitigated the inherent bias toward the majority class, ensuring that class imbalance did not impede the predictive integrity of the models.
Counts of specific treatment received.
Visit 2 – model performance with 95% bootstrap confidence intervals.
Note: Values within brackets represent the 95% confidence intervals.
Statistical significance of accuracy performance gains across visits per model.
Further, Figures 1–3 present the average predictive effect for each of the top 20 features per model and visit. In these figures, the blue bars predict towards Class 0 and red bars predict towards Class 1. For the identified features from Visit 1 (i.e., NPC Headache, VMST Dizziness, VMST Headache, and VOMS Headache), Figures 1–3 highlight that these features tend to predict towards Class 1, with 19 of 24 instances (79.17%) of these features being highlighted in red. Similarly, for Visit 2, the VOR Vertical Headache and VOR Vertical Headache Difference features predicted towards Class 1 (i.e., red) across 11 of 12 instances (91.67%). Conversely, for Visit 2, the treatment present feature predicted towards Class 0 (i.e., blue) in all 6 models.
Data preprocessing flowchart.
Average effects – Visit 1 vs Visit 2. Blue bars predict towards Class 0 and red bars predict towards Class 1.
Average effects – Visit 1 vs Visit 2. Blue bars predict towards Class 0 and red bars predict towards Class 1.


Discussion
The primary objective of this study was to evaluate the utility of machine learning (ML) models in predicting time to medical clearance via binary classification following sport-related traumatic brain injury (TBI), with a secondary aim of assessing whether the integration of longitudinal clinical data across two visits improves predictive accuracy compared to a single-visit model. To accomplish these goals, six ML models (LightGBM, Decision Tree, Random Forest, XGBoost, Support Vector Classifier, and Ridge Regression) were trained and evaluated using gold-standard clinical assessment data from the USF Concussion Center. Traditional statistical approaches commonly applied in concussion research, such as linear or logistic regression, are typically constrained by assumptions of linearity, independence, and normally distributed errors, which may not adequately capture the complex, nonlinear interactions inherent in multidimensional clinical datasets. 32,33 In contrast, ML models are well-suited to handle high-dimensional feature spaces, mixed data types, and nonlinear relationships, while also offering the advantage of feature importance quantification to support interpretable, evidence-based clinical decision-making. 31 Furthermore, the feasibility of implementing ML in SRC research has been demonstrated across a growing body of work, including studies predicting concussion resolution within 7, 14, and 28 days using clinical symptom profiles, 31 recovery time using vestibular and oculomotor screening data, 32 and recovery trajectories in collegiate athletes. 33 Building upon this foundation, the present study extends prior work by incorporating longitudinal data across two clinical visits and applying feature engineering (specifically, difference features capturing symptom change between visits) to further enhance the predictive framework and identify clinically actionable assessment targets for personalized return-to-play decision-making.
Model Performance
Focusing on accuracy is important because it informs clinicians on properly diagnosing and treating patients. 58 As such, the improvement in accuracy from including second-visit features is of interest. With Visit 2 features included, 66% of the models demonstrated an increase in accuracy, with XGBoost yielding both the largest change in accuracy (i.e., from Visit 1 to Visit 2 prediction) and the highest second-visit accuracy (84%).
One reason for the upward trend in accuracy between the initial and second visits in 66% of the ML models is the application of domain-specific feature engineering. Deriving new features that account for interactions between multiple original features can enhance the models’ predictive performance because the new features provide additional context from the clinical domain. 59 This is evident in the difference features of the second-visit dataset, which represent the second-visit value minus the initial-visit value of each original feature and allow the models to make more correct predictions for both classes (Figures 4–6). It is interesting to note that for Visit 2, the models correctly predict more Class 0 cases than at the initial visit, which may be due to the additional difference features and treatment presence in the dataset. This hints that these added features may be clinically significant. The dataset consisted of 176 patients in Class 1 and 41 patients in Class 0. With this imbalance, a model can reach high accuracy simply by overpredicting Class 1 (Figures 4–6). Table 5 supports this, as the recall values are close to 1 while the specificity values are close to 0. Although the imbalance explains the low specificity, from a clinical standpoint it is important to correctly identify true negatives (i.e., Class 0) rather than only true positives (i.e., Class 1). Keeping TBI patients under medical treatment longer than needed may lead to declines in cognitive and physical abilities. 60
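The effect of the 176 versus 41 class split on the headline metrics can be checked with a short calculation. The sketch below is illustrative only: it scores a hypothetical baseline that labels every patient as Class 1, using the class counts reported above. The function name `confusion_metrics` is an assumption introduced here and is not part of the study code.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Return (accuracy, recall, specificity) from confusion-matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, recall, specificity


# Class counts reported in the text: 176 prolonged (Class 1), 41 normal (Class 0).
N_CLASS1, N_CLASS0 = 176, 41

# Hypothetical baseline that predicts Class 1 for every patient.
baseline = confusion_metrics(tp=N_CLASS1, fn=0, tn=0, fp=N_CLASS0)
print("accuracy=%.3f recall=%.3f specificity=%.3f" % baseline)
# prints: accuracy=0.811 recall=1.000 specificity=0.000
```

Under these counts, a classifier with no discriminative ability already reaches about 0.81 accuracy, which is why accuracy is best read alongside specificity.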
Average effects – Visit 1 vs Visit 2. Blue bars predict towards Class 0 and red bars predict towards Class 1.
Confusion matrix – Visit 1 vs Visit 2.
Confusion matrix – Visit 1 vs Visit 2.
Description of dataset between outcome groups.


XGBoost shows the largest accuracy change between Visit 1 and Visit 2 as well as the highest Visit 2 accuracy. This is attributed to its boosting ensemble learning method, which combines many base models (decision trees) built sequentially so that each new tree corrects the errors of the ensemble so far. 61 XGBoost’s Visit 2 accuracy is in line with the 82.5% reported by Thomas et al. 62 for differentiating concussed from healthy cohorts. XGBoost also includes built-in regularization that limits overfitting, which may help it generalize to unseen data, although this was not tested on an external dataset in this study. These model characteristics help XGBoost produce accurate classifications. 61
The lack of improvement in accuracy across both visits for SVC and Ridge can be attributed to their shared limitation as linear models. While the linear kernel in SVC was applied due to its simple implementation on limited dataset sizes, 63 and Ridge regression is inherently a linear method, 45 neither is capable of capturing the complex, non-linear relationships between the input assessment features and the target classes.64,65 The SVC accuracy scores observed in this study align with the approximately 81% accuracy reported in prior work 66 for detecting prior concussions in retired athletes, and the Ridge classifier’s predictive accuracy, measured by Mean Squared Error (MSE) and Mean Absolute Error (MAE), is comparable to LASSO, identified as the top-performing model for predicting sports-related TBI in NCAA athletes. 67 Despite these encouraging benchmarks, the additional difference-based features derived from the second visit provided little new information beyond what was already encoded in the original variables, yielding no meaningful gain in predictive accuracy across visits for either model.
Feature Importance and Average Effect
The cross-model agreement (i.e., convergence on similar features in the top 20 importance lists and in their average effect on predicted days to clearance) is expected in part because the models rely on similar supervised learning methods. 68 Beyond the shared learning approach, the agreement may also reflect a true relationship between those features and a patient’s time to clearance. 69
Top 20 feature scores for machine learning models – Visit 1.
Bold text and colored cells denote features that show up across all ML models.
Top 20 feature scores for machine learning models – Visit 2.
Bold text and colored cells denote features that show up across all ML models. Olive and forest green colored cells denote a base feature and its difference variant whose occurrence totals sum to 9 or more.
Feature frequency across models – Visit 1.
Green colored cells denote features that show up across all models.
Feature frequency across models – Visit 2.
Green colored cells denote features that show up across all models. Cells in orange and bold identify variable pairs (base feature and difference variant) whose combined occurrences show up at least 75% of the time.
The high frequency of Vestibular Ocular Motor Screening (VOMS)-related features across top-performing models warrants a nuanced discussion regarding potential redundancy. Measures such as Near Point of Convergence (NPC) Headache, Visual Motion Sensitivity Test (VMST) Headache, and Vestibulo-Ocular Reflex (VOR) Vertical Headache appear as primary predictors in nearly all model architectures. While these features originate from the same clinical battery and share variance related to global symptom provocation, their independent selection suggests they capture distinct physiological stressors. Unlike an aggregate VOMS composite score, which serves as a general diagnostic marker, the individual sub-assessments isolate specific vestibular and oculomotor pathways; for instance, the identification of VOR Vertical Headache as more prognostic than horizontal variants highlights the importance of keeping granular data even when features appear collinear.
For the first visit, Near Point of Convergence headache (i.e., npc_headache), Visual Motion Sensitivity Test dizziness (i.e., vmst_dizziness), Visual Motion Sensitivity Test headache (i.e., vmst_headache), and Vestibular Ocular Motor Screening headache (i.e., voms_headache) appear in every ML model’s top 20 features list. These features pushed predictions towards a time to clearance of over a month in 81% of models for npc_headache, vmst_headache, and voms_headache, and in 66% of models for vmst_dizziness. The model in which these features did not have an average effect towards a prediction of over a month was SVC. This discrepancy is attributed to the kernel used, as a linear kernel does not properly capture complex relationships between the inputs and the target output. 64 Vestibular Ocular Motor Screening (VOMS) is a well-established physical examination tool for concussion, with evidence supporting both its diagnostic sensitivity and prognostic value.70,71 The composite score produced by the VOMS assessment has been identified as the most accurate data point in the diagnosis of concussion, with vertical saccades and horizontal vestibulo-ocular reflex testing demonstrating the greatest diagnostic impact. 70 In addition, 40% of the unique patients in the data received vestibular therapy (vestibular_tx), which includes various exercises that improve overall quality of life, vertigo, gaze, and posture. 72 However, dizziness during visual motion sensitivity testing and headache symptoms overall, particularly during near point of convergence (NPC) and visual motion sensitivity (VMST) testing at the initial assessment, have not previously been identified as having independent clinical significance beyond the overall VOMS score. Similarly, headache provoked during vertical vestibulo-ocular reflex testing, and its change between the first and second visit, has not previously been identified as uniquely significant.
It is important to mention that variants of the general symptom follow the same trend of increasing or decreasing across VOMS sub-assessments. This may appear redundant, as the individual components, namely Near Point of Convergence (NPC), Visual Motion Sensitivity Test (VMST), Vestibulo-Ocular Reflex (VOR) horizontal and vertical, Smooth Pursuits, and Saccades, all contribute to the same overarching symptom domains of headache, dizziness, and fogginess. Because the VOMS is administered as a sequential battery, a patient presenting with an elevated baseline headache prior to testing will likely carry that symptom load across every sub-component, making it difficult to disentangle component-specific provocation from general symptom burden without intra-assessment change-from-baseline scoring.
Inter-visit change is represented using a _diff variant of each feature, where the difference value is the recorded score at the second visit minus the value at the initial visit. These difference features capture changes in specific VOMS components between clinical visits rather than isolating symptom provocation within a single assessment. Within the VOMS battery specifically, this distinction is clinically meaningful: VOR Vertical Headache and its difference variant (VOR Vertical Headache Difference) both appeared across all six Visit 2 models, suggesting that headache provoked during vertical vestibulo-ocular reflex testing, together with its change over time, carries independent prognostic signal beyond what is captured by the composite VOMS score or by horizontal-plane VOR testing alone. The directional asymmetry between vertical and horizontal VOR headache frequency across models further supports retaining granular sub-assessment data rather than collapsing to an aggregate score.
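The difference features described above can be sketched in a few lines of pandas. This is a minimal illustration under stated assumptions, not the study’s preprocessing code: the column name `patient_id`, the helper `add_diff_features`, and the toy values are hypothetical, and the sketch keeps both visits’ raw columns, which may differ from how the study assembled its 95 combined features.

```python
import pandas as pd


def add_diff_features(visit1, visit2, id_col="patient_id"):
    """Join two visits on the patient ID and add <feature>_diff = Visit 2 - Visit 1."""
    merged = visit1.merge(visit2, on=id_col, suffixes=("_v1", "_v2"))
    # Only features recorded at both visits can have a difference variant.
    shared = [c for c in visit1.columns if c != id_col and c in visit2.columns]
    for name in shared:
        merged[f"{name}_diff"] = merged[f"{name}_v2"] - merged[f"{name}_v1"]
    return merged


# Toy records (hypothetical values, not study data).
visit1 = pd.DataFrame({"patient_id": [1, 2], "vor_vert_headache": [4, 2]})
visit2 = pd.DataFrame({"patient_id": [1, 2], "vor_vert_headache": [1, 3]})

combined = add_diff_features(visit1, visit2)
print(combined[["patient_id", "vor_vert_headache_diff"]])
```

With this sign convention, a negative difference means the symptom score fell between visits and a positive difference means it rose.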
To better identify which VOMS components carry genuine clinical significance beyond global symptom burden, a conservative criterion is proposed: a VOMS feature should be considered clinically significant only when both its original variant and its difference variant appear in the Top 20 Features list across models. For example, VMST Headache and VMST Headache Difference both appearing would indicate that the specific physiological stressor isolated by visual motion sensitivity testing, as distinct from the NPC or VOR components, contributes independently to recovery prediction at both the initial and follow-up visit. This paired-feature criterion helps separate true component-level signal from the shared variance attributable to overall symptom severity, and provides a more principled basis for prioritizing which VOMS sub-assessments to emphasize in longitudinal clinical monitoring protocols.
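One way to operationalize the paired-feature criterion is sketched below. It is a hedged illustration rather than the study’s implementation: the helper name `paired_features`, the `min_models` threshold, and the toy top-k lists are assumptions introduced here, and how many models a feature must appear in is left as a parameter because the text does not fix it.

```python
def paired_features(top_lists, suffix="_diff", min_models=1):
    """Return base features whose base AND difference variants each appear
    in at least `min_models` of the per-model top-k lists."""
    counts = {}
    for features in top_lists.values():
        for name in set(features):  # count each feature once per model
            counts[name] = counts.get(name, 0) + 1

    selected = {}
    for name, n_base in counts.items():
        if name.endswith(suffix):
            continue  # difference variants are looked up via their base name
        n_diff = counts.get(name + suffix, 0)
        if n_base >= min_models and n_diff >= min_models:
            selected[name] = (n_base, n_diff)
    return selected


# Toy top-k lists for two hypothetical models (not the study's results).
toy_lists = {
    "model_a": ["vor_vert_headache", "vor_vert_headache_diff", "npc_headache"],
    "model_b": ["vor_vert_headache", "vor_vert_headache_diff", "vmst_headache"],
}
print(paired_features(toy_lists, min_models=2))
# prints: {'vor_vert_headache': (2, 2)}
```

Setting `min_models` to the number of models gives the strictest reading of the criterion, while lower values allow a feature pair to qualify on partial agreement.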
Treatment presence (i.e., treatment_present) appeared in all six ML models’ top 20 features lists. It also ranks among the top 3 features for 81% of the models and has the highest average effect towards Class 0 (Figures 1–3). Given this, observing TBI recovery longitudinally with this feature is important, since prescribing some form of treatment after the first visit also holds clinical importance for predicting time to medical clearance. Treatment presence indicates that the patient received at least one of 11 specific treatment types, a mix of pharmacological and non-pharmacological treatments: Selective Serotonin Reuptake Inhibitors (SSRI), amantadine, stimulant, preventative headache, vestibular, physical therapy (pt), chiropractic, psychological, neuropsychological, neurology, and cognitive. A variety of approaches to TBI rehabilitation is imperative after diagnosis to restore a patient’s capabilities and eventually allow return to play. 73 It remains unclear, however, whether the presence of treatment is directly related to recovery time or whether a confounding variable produces this finding. Providing treatment after the initial visit reduces a concussed athlete’s perception of pain and improves performance on both cognitive and physical tests measured by gold-standard assessments. Improvement in these individual components would then be linked to the time to medical clearance, rather than the treatment alone. Treatment presence appears together with VOR Horizontal Dizziness Difference, VOR Vertical Headache Difference, Saccades Horizontal Headache Difference, and VMST Dizziness Difference in the four models that exhibited improved accuracy (Table 1). This could mean that the difference values for these features play a role in recovery time along with treatment presence. 74
Additionally, there is a strong indication of importance for vestibulo-ocular reflex (VOR) vertical headache (i.e., vor_vert_headache), as it appeared most often across all features when accounting for both initial and second visits: in 81% of initial-visit models and 100% of second-visit models. Based on this output, this feature may be considered the most important, especially as its difference variant (i.e., vor_vert_headache_diff) also appears in all six second-visit models. This pair of features commonly pushes predictions towards Class 1 (i.e., time to clearance greater than a month). The addition of difference features gave all models additional information to learn from and may reflect a true relationship between the two inputs, vor_vert_headache and vor_vert_headache_diff, and the predicted time to clearance. 59 Given this relationship, it is important to assess this feature longitudinally, as it appears to be clinically important for predicting time to medical clearance. Disruptions to visual and vestibular signals occur in patients experiencing TBI from sports-related activities. 75 Headaches have been found to correlate with visual-vestibular mismatch, which prolongs recovery if left untreated. 76 Thus, medical personnel should create treatment plans that directly address headaches stemming from injury to the vestibulo-ocular reflex to support effective TBI recovery.
The assessment features vestibulo-ocular reflex (VOR) horizontal dizziness (i.e., vor_horiz_dizziness) and its difference variant (i.e., vor_horiz_dizziness_diff) sum to 9 total occurrences based on their respective values in Table 4. For second visits, the base feature drives Class 1 predictions in 81% of models and its difference variant does so in 100% of models. From these findings, it can be inferred that the ML models have extensive information about VOR horizontal dizziness once both the base feature and its difference variant are included, which allows for a strong relationship between these inputs and the target outcome, time to clearance. 69 These features should therefore be assessed longitudinally in TBI patients, as they are important to the prediction of time to medical clearance. Dizziness is a core symptom of vestibular migraine, 76 which can follow injuries such as concussion. Patients with TBI report more instances of headache than of dizziness, 77 which explains why this feature pair appears less frequently across the models as a clinically important assessment feature than vor_vert_headache and its difference variant.
Features that occur less frequently across ML models than those mentioned earlier may still hold some degree of clinical importance (Tables 1 and 2). Because they appeared in a top 20 features list, the ML models flag them as informative for predicting time to medical clearance following concussion.
Thus, personalized treatment plans can involve any combination of the features in the top 20 feature scores list (Table 2), as combining multimodal approaches translates to improved sports-related TBI recovery. 78 It is highly encouraged to first account for VOR Vertical Headache, VOR Vertical Headache Difference, Treatment Present, VOR Horizontal Dizziness, VOR Horizontal Dizziness Difference, VMST Dizziness, and VMST Dizziness Difference before including additional assessment features, as these were found to be of highest clinical significance due to their frequency across ML models. The choice of combination will depend on the patient’s attributes (past records, condition, etc.).
While XGBoost achieved the highest overall accuracy (0.84) with Visit 2 features and the largest numerical improvement across visits, we do not designate a single final deployable model at this stage. The six models serve as a comparative framework for identifying robust clinical predictors rather than as ready-for-deployment tools. Model specification details and code are available at https://github.com/MeganTran6023/Sport-Related-Concussions_Machine-Learning to support future replication and external validation (TRIPOD+AI Item 22).
Comparison with Previous Studies
Previous research studied concussion diagnosis at a single time point with gold standard assessments using ML models such as XGBoost and Random Forest.62,79,80 A different study on predicting sports injuries in football used multiple time points and extracted important features that drive models’ predictions; however, it was not specific to concussions from sports-related activities. 81 The current study not only uses multiple time points for analysis, but also determines the assessment features that drive the improvement in model accuracy between initial and second visits.
To extend the detection of clinically important features determined by the models, this study incorporates feature engineering to generate new features, mainly the difference variants of the base original features. These new features provide models more information about the existing features and their interactions to produce informed results. 82
Through these implementations, this study is a step towards optimizing assessments so that clinicians can provide high-quality personalized rehabilitation protocols for athletes with sports-related TBI.
Clinical Relevance
The intended users of a future validated version of this model are licensed clinicians experienced in concussion management (e.g., sports medicine physicians, neurologists). Input features are derived from standardized clinical assessments (VOMS, BESS, ImPACT) already routinely administered in concussion care. Clinicians would need to input structured assessment data at one or two time points; no specialized computing expertise is required beyond use of a provided interface. However, poor-quality or missing input data, particularly VOMS subscores, should prompt clinical judgment rather than model reliance, as the current model was not trained on imputed data and may underperform when key features are absent (TRIPOD+AI Item 27b).
The upward trend in concussion incidence can be primarily attributed to increased awareness and diagnosis of concussions by trained medical professionals.83,84 This may have led to more consistent reporting of concussion cases that would have gone unnoticed in earlier studies. 85 Clinicians who treat concussion patients may benefit from predictive models in several important ways. 86 The results explore the feasibility of identifying individuals at risk for prolonged recovery during initial assessment, which could eventually serve as a preliminary signal for investigating targeted interventions. For example, identifying the VOMS features discussed above early may inform clinicians and lead to earlier and/or more targeted physical therapy to address these higher-risk features. Early initiation of physical therapy has already been identified as beneficial in recovery, 87 and ML models may serve as a tool to identify patients most likely to benefit. Furthermore, enhancing recovery efficiency not only improves quality of life, but also facilitates a quicker return to normal function or athletic participation. 88 In sports settings, recognizing athletes at risk for extended recovery can inform individualized treatment strategies and help establish realistic expectations for both the athlete and the team.
Limitations
Although the results indicate a positive direction for concussion screening and treatment, there are limitations to this study. First, the relatively small dataset is an important limitation, as the limited number of samples could affect the models’ ability to generalize to TBI patients with diverse demographic backgrounds and clinical profiles. As a result, the findings may not fully reflect the variability observed in broader patient populations. 89 In addition, a greater number of data points belonging to Class 0 would be beneficial, as this would allow a more robust evaluation of the ML models’ performance, particularly their ability to classify and predict time to medical clearance accurately. This is a common issue in ML problems with clinical data, where classifiers learn best on the class with the most data points because they optimize a single aggregate metric while overlooking the distribution of the data across the target classes. 57
From the reported results, the small and uneven balance between outcome classes 1 and 0 (176 patients and 41 patients, respectively) resulted in low specificity scores. Additionally, the inflated accuracy and recall scores could be attributed to overprediction of the dominant class. Balancing techniques such as class weighting, resampling, or a combination of methods should be implemented to ensure the reported results are not skewed by this limitation. Overprediction of one class can also produce false positives and/or false negatives. Clinically, predicting that a patient will recover earlier than the actual time to clearance can lead to undesired effects such as worsening of existing concussive symptoms 90 and reduced responsiveness to brain-derived neurotrophic factor (BDNF). 91 Conversely, mistakenly predicting a longer-than-normal recovery would hinder the concussed patient’s physical and psychological state, reducing their quality of life. 92
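As an illustration of the class-weighting option mentioned above, the following sketch compares an unweighted and a class-weighted classifier on synthetic data that only mimics the cohort’s 176 to 41 class split. Everything in it is an assumption made for demonstration: the data are random, logistic regression stands in for the study’s six models, and the helper `recall_and_specificity` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def recall_and_specificity(y_true, y_pred):
    """Recall for Class 1 and specificity for Class 0 from a 2x2 confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)


# Synthetic stand-in data with the cohort's class counts; not the study data.
rng = np.random.default_rng(0)
y = np.array([1] * 176 + [0] * 41)
X = rng.normal(size=(y.size, 5))
X[:, 0] += 0.8 * y  # one weakly informative feature

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for label, weight in [("unweighted", None), ("balanced", "balanced")]:
    # class_weight="balanced" reweights samples inversely to class frequency.
    model = LogisticRegression(class_weight=weight, max_iter=1000)
    y_pred = cross_val_predict(model, X, y, cv=cv)
    results[label] = recall_and_specificity(y, y_pred)
    print(label, "recall=%.2f specificity=%.2f" % results[label])
```

Class weighting typically trades some recall for specificity; whether that trade is acceptable is a clinical judgment rather than a modeling one.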
Formal statistical filtering, initially piloted using independent t-tests and Mann-Whitney tests, was excluded from the final preprocessing workflow as it failed to enhance longitudinal predictive performance. Instead, feature selection prioritized iterative cleaning based on missingness thresholds and clinical relevance to maintain a comprehensive feature space for the machine learning models. While the omission of formal statistical testing during preprocessing is acknowledged as a limitation to analytical rigor, future research should investigate alternative statistical frameworks and structured dimensionality reduction to optimize feature utility and address the event-per-variable ratio.
A notable limitation of this analysis is the absence of formal statistical testing to compare performance differences between visits and the lack of reported confidence intervals for key metrics. While 66% of the models showed numerical improvements in accuracy with the addition of Visit 2 data, including a 5% increase for the XGBoost model, the statistical significance of these changes was not formally evaluated. Additionally, without confidence intervals, the precision of performance values, such as the peak accuracy of 0.84, remains unquantified. 93 This lack of statistical rigor is underscored by the small cohort of 217 patients and the potential for distributional bias inherent in the Leave-One-Out Cross-Validation (LOOCV) method.
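Both gaps could be addressed with standard tools. The sketch below shows one possible approach, assuming a Wilson score interval for accuracy and an exact McNemar test for paired Visit 1 versus Visit 2 predictions; neither was used in the study. The count of 182 correct predictions is back-computed from the reported 0.84 accuracy and the discordant-pair counts are hypothetical, so the printed numbers are illustrative only.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: patients classified correctly by model A only
    c: patients classified correctly by model B only
    """
    n = b + c
    if n == 0:
        return 1.0
    # Under the null hypothesis, b follows Binomial(n, 0.5).
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * tail)


# Illustrative: 182 of 217 correct is back-computed from the reported 0.84.
low, high = wilson_interval(182, 217)
print("95%% CI for accuracy: %.3f to %.3f" % (low, high))

# Hypothetical discordant pairs between Visit 1 and Visit 2 predictions.
print("McNemar p = %.4f" % mcnemar_exact(b=5, c=16))
```

McNemar's test uses only the patients on whom the two models disagree, which suits a paired design in which the same 217 patients are scored by both feature sets.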
A further factor to consider is the gender imbalance in the dataset. Of the 176 patients in Class 1, 113 (64.2%) are female and 63 (35.8%) are male. Of the 41 patients in Class 0, 24 are female and 17 are male. As a result, these findings may be biased towards a female population. More specifically, it is possible that for some or all models, the correctly classified Class 0 and Class 1 predictions are mainly females. This is important to note, as previous work shows that females take longer to recover from sports-related TBI than males across different sports (i.e., basketball, rugby, soccer, and squash) 94 (TRIPOD+AI Item 3c). As our dataset does not specify the sport for each sport-related concussion, the higher number of females in Class 1 compared with Class 0 may be due to the nature of the sports in which each patient participated. Another study found that female collegiate soccer athletes who sustained a sports-related TBI experienced a longer recovery time than male collegiate soccer athletes, 95 but it was likewise limited by a small dataset and a gender imbalance.
Another limitation relates to the features utilized during the ML stage. Although a feature selection method was applied to identify the most essential predictors, some of the retained features were based on self-reported data. Self-reported measures are inherently subject to bias, which can introduce uncertainty into the model.96,97 This bias may reduce the reliability of the model’s predicted class outcome and perceived clinical importance of certain features, potentially affecting their true relationship with days to medical clearance.
The lack of improvement in accuracy across visits for SVC and Ridge can largely be attributed to the reliance on linear models. In the case of the SVC, the use of a linear kernel, while appropriate given the limited dataset size and its straightforward implementation, likely restricted the model’s ability to capture complex, non-linear relationships between assessment features and class labels. 64 As a result, the model may have failed to leverage additional information introduced at the second visit, leading to unchanged performance. Similarly, the Ridge regression model’s inherently linear nature limits its capacity to model non-linear interactions in the data. 45 Consequently, the difference-based features derived from the second visit may not have contributed meaningful new information beyond what was already represented in the first-visit features, resulting in comparable predictive accuracy across visits.
The current binary classification oversimplifies the prediction of patient recovery. Assigning patients to ‘under a month’ and ‘equal to or greater than 1 month’ sacrifices granular clinical utility. One reformulation would be to use more granular, multiclass buckets. To support nuanced predictions, the buckets could be divided into acute, typical, and prolonged recovery for both improved methodological accuracy and clinical significance. 98
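A three-class relabeling of this kind could look like the sketch below. The 30-day boundary for prolonged recovery follows the study’s definition; the 14-day boundary between acute and typical recovery is a hypothetical placeholder chosen only for illustration, and the function name `recovery_bucket` is likewise an assumption.

```python
def recovery_bucket(days, acute_max=14, prolonged_min=30):
    """Map days to clearance onto acute / typical / prolonged labels.

    prolonged_min=30 follows the study's >= 30-day definition;
    acute_max=14 is a hypothetical cut-point, not taken from the study.
    """
    if days < 0:
        raise ValueError("days to clearance cannot be negative")
    if days < acute_max:
        return "acute"
    if days < prolonged_min:
        return "typical"
    return "prolonged"


print([recovery_bucket(d) for d in (7, 21, 30, 45)])
# prints: ['acute', 'typical', 'prolonged', 'prolonged']
```

Any such cut-points would need clinical justification, and finer buckets reduce the number of patients per class, which matters in a cohort of 217.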
Also, the current dataset lacks neurocognitive data beyond symptom reporting. Such data are useful in concussion assessment. For instance, neurocognitive tests can track specific trends in a patient’s symptoms during recovery, whereas recording only whether a patient is receiving some neurocognitive treatment is not sufficient to conclude that the outcome predictions are highly accurate. 17
Treatment presence does not account for the varying time each treatment needs to reach its therapeutic effect (e.g., some Selective Serotonin Reuptake Inhibitor (SSRI) treatments take 8 weeks to reach full therapeutic effect). 99 If patient data are recorded before a treatment has had time to take effect, the interpretation of which features are most significant for predicted concussion recovery would be skewed. To resolve this, it would be beneficial to collect data that account for each treatment’s time frame to obtain a more accurate depiction of the results.
Although the filtering process eliminated the majority of features not significant for classifying days to clearance, redundant features (e.g., headache, dizziness) were still retained. An attempt to differentiate the specific components of the exams was made by calculating the difference between the second and initial visits. Furthermore, the sample is small relative to the number of features in the dataset, which makes the models prone to overfitting and to instability in the feature importance listings. 100 A possible future implementation is recursive feature elimination (RFE) for feature selection. This method should be used together with the inclusion of more patients and data types collected over time, resulting in a dataset with a more robust, objective set of measures. In this setting, RFE would systematically remove the least important features from the dataset to improve model performance, reduce overfitting, and enhance interpretability.
RFE-MF (MissForest) was found to outperform four of the classic data imputation methods (mean/mode imputation, kNN, MICE, and MF) in addressing the critical need for data accuracy in medical research, where it helps mitigate challenges that can impair clinical decision-making and ultimately affect the quality of patient care. 101
While LOOCV is recommended for small datasets such as the one used in this study, it leads to distributional bias. In turn, information about the held-out test item leaks to the model through the composition of the training set, which reduces the performance of commonly used ML models. 102 Therefore, stratified repeated holdout and a modified version of k-fold cross-validation are recommended to avoid this bias. 103
Although LOOCV is a standard validation strategy for small datasets, its results remain highly contingent upon the specific characteristics of the cohort. LOOCV maximizes data utility to minimize estimation bias, but these estimates do not inherently ensure clinical generalizability. This limitation is often a consequence of the model overfitting to the unique noise within the sample. Therefore, while the methodology may yield stable internal performance metrics, it does not guarantee that the model will demonstrate comparable accuracy across external, heterogeneous datasets.
Despite its limitations, LOOCV involves no random partitioning of the data, so both the point estimate (accuracy) and its variance are the same on every run. This lack of fluctuation allows for reproducible and deterministic findings for analysis.
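As a point of comparison with LOOCV, the sketch below runs repeated stratified k-fold cross-validation, a widely available relative of the stratified schemes recommended above rather than the exact procedure from the cited work. The data are synthetic and only mimic the 176 to 41 class split, and the choice of `RidgeClassifier` with `alpha=1.0`, 5 folds, and 10 repeats is an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in data with the cohort's class counts; not the study data.
rng = np.random.default_rng(0)
y = np.array([1] * 176 + [0] * 41)
X = rng.normal(size=(y.size, 5))
X[:, 0] += 0.8 * y  # one weakly informative feature

# 5 folds x 10 repeats: every fold keeps roughly the 176:41 class ratio.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(
    RidgeClassifier(alpha=1.0), X, y, cv=cv, scoring="balanced_accuracy"
)
print(
    "balanced accuracy: mean=%.3f sd=%.3f over %d folds"
    % (scores.mean(), scores.std(), scores.size)
)
```

Unlike LOOCV, this scheme yields a distribution of fold scores, so the spread of performance can be reported alongside the mean.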
Models trained on a small sample are prone to overfitting. This was primarily addressed through strategic model selection and the implementation of regularization techniques. Specifically, the Ridge classifier employs L2 regularization to shrink the coefficients of features with weaker associations to the outcome groups, which reduces variance and stabilizes the model’s estimates. 104
Future Work
The focus of this study is to ensure that the ML models accurately predict concussed patients’ time to clearance from sport-related TBI with the integration of longitudinal data and gold-standard assessments per our objectives. While we are optimistic about the acquired results, there are ways to expand upon this current study.
Future work should prioritize increasing the overall dataset size. A larger dataset would allow the ML models to learn more stable and representative patterns, 89 which in turn would improve their ability to identify the most important features for accurately predicting time to medical clearance. With more data, the influence of noise and bias would be reduced, leading to clearer insights into which variables truly contribute to prediction performance and improving confidence in the models’ results. Moreover, a dataset balanced across both genders and both target classes (Class 0 and Class 1) would allow the results to better describe a general set of athletes and may make them more applicable to patients who sustained TBI through other mechanisms.
Expanding the analysis beyond athletic-based injuries is another important direction. While the current methodology has shown success in TBI cases resulting from athletic concussions, applying this approach to a broader range of TBI patients would improve its clinical relevance. 105 Extending the framework to non-athletic injuries would support the development of more general clinical guidelines and help determine whether the same assessment features and prediction strategies remain effective across different injury mechanisms. This broader application would also motivate further research into TBI individualized treatment and recovery within the wider medical field.
Including patients with more than two clinical visits could further strengthen analysis. Additional visits provide more longitudinal clinical data, allowing models to better capture changes in patients’ symptoms over time. 106 With richer temporal information, ML models would be better positioned to accurately predict a TBI patient’s time to clearance and reflect the progression of recovery more reliably.
Additional features that incorporate objective methods of evaluating concussed patients should also be included. These would allow clinicians not only to diagnose an injured athlete more accurately but also to provide personalized treatment in a timely manner. This is especially important if the injured athlete sustains the concussion before adolescence, since this is a period of major brain development. 107
Currently, the LOOCV analysis of this dataset lacks external validation, temporal validation, optimism correction, and calibration assessment, which are necessary for determining generalizability. 108 This is especially relevant because the small dataset and the uneven distribution of the two outcome classes make the models prone to overfitting. Future work will carry out these evaluations to assess clinical applicability in the concussion domain.
Running feature selection prior to modeling, rather than within the cross-validation folds, increases the risk of data leakage. 109 Future work will repeat the analysis with variable selection nested within the validation process to strengthen internal validity.
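A minimal sketch of nesting feature selection within cross-validation is shown below, assuming scikit-learn. Wrapping the selector and the classifier in a single `Pipeline` means the selector is refit on the training portion of every LOOCV fold, so the held-out patient never influences which features are kept. The synthetic data, the univariate `SelectKBest` selector, and `k=10` are illustrative assumptions; the selection method used in this study may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the combined Visit 1 + Visit 2 feature set (assumption).
X, y = make_classification(n_samples=217, n_features=95, n_informative=5,
                           weights=[0.19, 0.81], random_state=0)

# Scaling, selection, and the classifier are all fit inside each training fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),  # hypothetical selector
    ("model", RidgeClassifier(alpha=1.0, class_weight="balanced")),
])

# LOOCV: each patient is predicted by a pipeline that never saw that patient.
y_pred = cross_val_predict(pipeline, X, y, cv=LeaveOneOut())
print(f"balanced accuracy={balanced_accuracy_score(y, y_pred):.3f}")
```

Selecting features on the full dataset first and then cross-validating only the classifier would let information from the held-out sample leak into training, which is the risk described above.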
Future work could also focus on incorporating algorithmic modifications to the SVC and Ridge models used in this study 110 or on using more nonlinear ML models. Given that the models showed an increase in accuracy when provided with longitudinal sport-related concussion data, this would enable a more informed ranking of clinically important assessment features. From this, appropriate personalized protocols could be developed for athletes.
Finally, applications in digital health should be explored. Integrating ML models into technologies such as wearable devices would help translate the study’s findings into real-world clinical use. These tools could allow patients to monitor their condition outside of clinical settings, while simultaneously providing clinicians with real-time data to support decision-making. 111 Such integration would enhance continuous assessment, improve personalized treatment planning, and extend the practical impact of the proposed methodology.
Conclusion
This study is primarily comparative in scope: rather than proposing a single deployable clinical tool, it benchmarks six ML classifiers across longitudinal data to identify which model architectures and clinical features best support future development of a validated prediction tool. External validation and prospective testing are required before any model described here could be considered for clinical deployment.
Utilizing longitudinal clinical assessments to predict time to medical clearance represents a preliminary approach that offers exploratory insights into diagnosis and return-to-play decision-making. By applying ML models to longitudinal data, patterns not readily detectable by licensed medical practitioners were identified. These models achieved high predictive accuracy and highlighted specific assessment features that significantly influence clearance timelines. Collectively, these exploratory findings highlight assessments that may warrant investigation in future validation studies and suggest the potential for a longitudinal monitoring framework to inform return-to-play decisions and guidelines. With further validation, the findings could be applied to future return-to-play decisions by enabling clinicians to better anticipate recovery trajectories and to focus on the most informative assessments when formulating personalized sports-related TBI rehabilitation protocols.
Footnotes
Acknowledgements
We acknowledge those at the University of South Florida (USF) Concussion Center for their assistance with data collection and all subjects who were involved in this study.
Ethical considerations
This study was approved by the University of South Florida (USF) Institutional Review Board (IRB STUDY003514). This approval explicitly covers the retrospective review and analysis of patient data stored within the REDCap database.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study received funding from the Florida Department of State Center for Neuromusculoskeletal Research.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
Guarantor
Megan Tran and John M. Templeton.
Contributorship
The authors confirm contribution to the paper as follows: conception, design, manuscript preparation: M. Tran; review and editing: J. Holler, B. Moran, N. Schilaty, and J. M. Templeton. All authors reviewed and approved the final version of the manuscript.
Appendix
Table 10. Gold standard tests summary table. Table 11. Feature naming mapping.
Assessment
Ref.
Admin. time
Scoring
Assessors
What it Tests
Instructions
Vestibular Ocular Motor Screening (VOMS)
5
5–10 minutes
Max score 10
Clinicians
Assesses vestibular and oculomotor symptoms, including dizziness, gaze stabilization, eye-tracking ability, and visual motion sensitivity.
Smooth Pursuits (Horizontal & Vertical): Patient seated 3 ft. from examiner, follows fingertip horizontally (±1.5 ft) and vertically (±1.5 ft), 2 repetitions each. Rate symptoms.
Saccades (Horizontal & Vertical): Patient moves eyes quickly between two fingertips horizontally or vertically (10 reps each). Rate symptoms.
Convergence: Patient focuses on target at arm’s length, brings it to nose. Measure distance at diplopia or eye deviation, 3 trials. Rate symptoms.
Vestibular-Ocular Reflex (VOR) Horizontal & Vertical: Patient rotates head ±20° while focusing on target, 10 reps each at 180 bpm. Rate symptoms.
Visual Motion Sensitivity (VMS): Patient stands, rotates head, eyes, and trunk ±80° focusing on thumb, 5 reps at 50 bpm. Rate symptoms.
Balance Error Scoring System (BESS)
4
2 minutes
Max score 60
Clinicians and Non-Clinicians
Assesses static postural stability under controlled stance conditions.
Patient performs three 20-second balance stances on various surfaces.
King-Devick Test (KD)
13
2 minutes
Tally of errors
Clinicians and Non-Clinicians
Measures rapid eye movements, attention, and language function by timing number-reading performance as an index of saccadic speed and visual tracking.
Read single-digit numbers aloud from three cards, left to right, as quickly and accurately as possible.
Includes one demonstration card.
Gait Initiation (GI)
14
10–15 minutes
Max score 14
Clinicians and Non-Clinicians
Evaluates gait initiation from a stationary position to assess balance control, coordination, and lower-limb motor planning.
Patient walks across room with therapist support (if needed) at usual pace, then rapid pace.
Sensory Organization Test (SOT)
15
2 minutes
Max score 100
Clinicians
Measures vertical reaction forces generated as the body’s center of gravity moves over a fixed base of support.
Patient completes three 20-second trials under varying visual (eyes open, eyes closed, sway-referenced) and surface (fixed, sway-referenced) conditions, standing shoulder-width apart.
Stay as motionless as possible.
Tandem Gait (TG)
16
Time taken for patient to walk to end of line and back
Recorded time to complete test
Clinicians
Tests dynamic balance and coordination by having the individual walk heel-to-toe in a straight line.
Patient stands with feet together at a line, walks heel-to-toe to end line and back, 4 trials.
Record fastest time.
Gait Termination (GT)
14
10–15 minutes
Max score 14
Clinicians and Non-Clinicians
Assesses anticipatory postural control and balance during gait termination.
Patient walks across room with therapist support (if needed) at usual pace, then rapid pace.
Mobile Universal Lexicon Evaluation System (MULES)
17
Duration taken to recite all images
Recorded duration to name all pictures
Clinicians and Non-Clinicians
Measures rapid picture naming to assess visual processing speed, attention, and language function.
Name pictures aloud from left to right, top to bottom, as quickly as possible without errors.
Record total duration.
Head Injury Assessment Version 1 (HIA01)
18
12–17 minutes
Composite scoring includes Immediate Memory (max 30), Maddock’s questions (5 items), digits backward performance, balance-error counts, symptoms (9-item checklist), clinical signs (3 items), and Delayed Memory (max 10)
Clinicians and Non-Clinicians
Structured sideline concussion assessment used in rugby, evaluating symptoms, balance, cognition, and coordination immediately after injury.
Assessment via observation, video, or instrumented mouthguard.
Includes Criteria 1 indications, head acceleration data, off-field assessment, pitch-side video review, and clinical evaluation.
Pitch-Side Concussion Assessment Version 1 (PSCA1)
19
5 minutes
Symptom Checklist: Presence/absence (0/1).
Maddocks Questions: Correct/incorrect (0/1).
Balance: Number of errors recorded.
Clinicians and Non-Clinicians
Early version of the Head Injury Assessment (HIA1) used for sideline concussion screening.
Complete symptom checklist, Maddocks Questions, and tandem stance in medical room or agreed location.
Temporary 5-min replacement allowed if Criteria 2 indications met.
Pitch-Side Concussion Assessment Version 2 (PSCA2)
19
5 minutes
Max score 132
Clinicians and Non-Clinicians
Updated pitch-side concussion assessment including refined symptom and cognitive measures to improve diagnostic sensitivity.
Same as PSCA1, updated with 5 Criteria 1 indicators including suspected loss of consciousness and obvious ataxia.
5-min temporary replacement retained.
Motor Cognitive Test Battery (MotCoTe)
20
30 minutes
Recorded reaction time
Clinicians and Non-Clinicians
Measures multilimb reaction times and tapping speed, integrating motor and cognitive demands that progress from simple to complex tasks for concussion assessment.
Reaction Time Tests: Press arrow-indicated switch as quickly as possible under six conditions (Simple, Choice, Inhibition, Conflict, Single/Double Limb).
Tapping Speed Tests: Tap switches as fast as possible for 10 sec under Single/Double Limb conditions.
Sport Concussion Assessment Tool Version 2 (SCAT2)
16
10–15 minutes
Max score 100
Clinicians
Evaluates seven domains including symptoms, physical signs, Glasgow Coma Scale, Maddocks questions, cognition, balance, and coordination.
Symptom Evaluation: The participant reports the presence and severity of 22 common concussion symptoms.
Physical Signs: The examiner observes and records any overt signs of concussion.
Glasgow Coma Scale (GCS): Standard assessment of eye, verbal, and motor responses.
Orientation (Maddocks Questions): Participants answer five standardized questions to assess orientation and memory.
Cognition: Immediate memory is tested by asking participants to recall a list of five words over three trials. Concentration is evaluated using number sequences and backward recitation tasks. Delayed recall is assessed after a short interval.
Balance and Coordination: Balance is tested via the modified Balance Error Scoring System (mBESS), and coordination is assessed with simple physical tasks such as finger-to-nose.
Sport Concussion Assessment Tool Version 3 (SCAT3)
21
15–25 minutes
Max score 132
Clinicians
Updated SCAT version incorporating expanded cognitive and balance assessments for tracking concussion recovery.
Symptom Evaluation: Participant reports presence and severity of 22 common concussion symptoms.
Physical Signs and GCS: Examiner records observable signs of concussion; Glasgow Coma Scale assessed immediately after suspected injury.
Orientation (Maddocks Questions): Five standardized questions to assess orientation and immediate memory at the time of injury.
Cognition: Immediate memory assessed with three trials of a five-word list; concentration and delayed recall evaluated using standard tasks.
Balance and Coordination: Modified BESS including foam surfaces; coordination assessed with simple physical tasks.
Sport Concussion Assessment Tool Version 5 (SCAT5)
22
10 minutes
Symptom Number: score out of 22. Symptom Severity: score out of 132.
Orientation: score out of 5. Immediate Memory: score out of 15 (trial 1) + 30 (trials 2–3), total 45. Concentration: score out of 5.
Clinicians
Standardized concussion assessment incorporating symptom scoring, cognitive screening, balance testing, and coordination measures.
Symptom Evaluation: Participant reports presence and severity of 22 symptoms.
Cognitive Assessment: Orientation (Maddocks questions), immediate memory (three trials), concentration, and delayed recall are measured.
Balance and Coordination: Balance tested via modified BESS (including foam stances) and simple coordination tasks.
Immediate Post-Concussion Assessment and Cognitive Testing (ImPACT)
23
20–25 minutes
Max score 100
Clinicians
Computerized neurocognitive test assessing memory, attention, reaction time, and processing speed following concussion.
Six computerized modules assessing memory, attention, reaction time, and processing speed. Follow instructions for each module.
Standardized Assessment of Concussion (SAC)
24
5 minutes
Max score 30
Clinicians and Non-clinicians
Brief sideline cognitive assessment measuring orientation, immediate memory, concentration, and delayed recall.
Orientation (Maddocks Questions): Participant answers a set of standardized questions to assess awareness of time, place, and event.
Immediate Memory: Examiner reads a list of five words; participant recalls as many as possible immediately, repeated over three trials.
Concentration: Participant completes number sequence tasks backward and recites months of the year in reverse order.
Delayed Recall: After a short interval, participant is asked to recall the same five words from the immediate memory task.
Neurologic Function: Examiner notes any clinical signs relevant to concussion.
Post-Concussion Symptom Scale (PCSS)
25
Not specified, but noted as ‘Relatively short time to administer’
Max score 132
Clinicians and Non-clinicians
Self-report checklist quantifying the severity of common concussion symptoms such as headache, dizziness, fatigue, and irritability.
Self-report severity of each symptom using 7-point Likert scale.
Modified Balance Error Scoring System (mBESS)
26
1 minute
Max score 30
Clinicians and Non-clinicians
Abbreviated version of the BESS incorporating simplified balance tasks for rapid field-based assessment.
Perform three 20-second balance stances; count errors after starting.
Feature (underscore format) | Formatted feature name
bess_double_ec | Balance Error Scoring System - Double-Leg Stance with Eyes Closed
bess_single_ec | Balance Error Scoring System - Single-Leg Stance with Eyes Closed
bess_tandem_ec | Balance Error Scoring System - Tandem Stance with Eyes Closed
cerv_ext | Cervical Extension
cerv_flex | Cervical Flexion
hx_mood_disorder | History of Mood Disorder
import_gad7_score | Generalized Anxiety Disorder-7 Score
import_phq9_score | Patient Health Questionnaire-9 Score
l_cerv_rot | Left Cervical Rotation
l_lat_flex | Left Lateral Flexion
npc_dizziness | Near Point of Convergence Test - Dizziness
npc_fogginess | Near Point of Convergence Test - Fogginess
npc_headache | Near Point of Convergence Test - Headache
npc_measure | Near Point of Convergence Test - Measurement
npc_nausea | Near Point of Convergence Test - Nausea
prev_head_injury | Presence of a Previous Head Injury
r_cerv_rot | Right Cervical Rotation
r_lat_flex | Right Lateral Flexion
saccades_horiz_dizziness | Saccades Horizontal Dizziness
saccades_horiz_fogginess | Saccades Horizontal Fogginess
saccades_horiz_headache | Saccades Horizontal Headache
saccades_horiz_nausea | Saccades Horizontal Nausea
saccades_vert_dizziness | Saccades Vertical Dizziness
saccades_vert_fogginess | Saccades Vertical Fogginess
saccades_vert_headache | Saccades Vertical Headache
saccades_vert_nausea | Saccades Vertical Nausea
smoothpursuits_dizziness | Smooth Pursuits Dizziness
smoothpursuits_fogginess | Smooth Pursuits Fogginess
smoothpursuits_headache | Smooth Pursuits Headache
smoothpursuits_nausea | Smooth Pursuits Nausea
subocc_ext | Suboccipital Extension
subocc_flex | Suboccipital Flexion
vmst_dizziness | Visual Motion Sensitivity Test - Dizziness
vmst_fogginess | Visual Motion Sensitivity Test - Fogginess
vmst_headache | Visual Motion Sensitivity Test - Headache
vmst_nausea | Visual Motion Sensitivity Test - Nausea
voms_dizziness | Vestibular Ocular Motor Screening - Dizziness
voms_fogginess | Vestibular Ocular Motor Screening - Fogginess
voms_headache | Vestibular Ocular Motor Screening - Headache
voms_nausea | Vestibular Ocular Motor Screening - Nausea
vor_horiz_dizziness | Vestibulo-Ocular Reflex - Horizontal Dizziness
vor_horiz_fogginess | Vestibulo-Ocular Reflex - Horizontal Fogginess
vor_horiz_headache | Vestibulo-Ocular Reflex - Horizontal Headache
vor_horiz_nausea | Vestibulo-Ocular Reflex - Horizontal Nausea
vor_vert_dizziness | Vestibulo-Ocular Reflex - Vertical Dizziness
vor_vert_fogginess | Vestibulo-Ocular Reflex - Vertical Fogginess
vor_vert_headache | Vestibulo-Ocular Reflex - Vertical Headache
vor_vert_nausea | Vestibulo-Ocular Reflex - Vertical Nausea
Treatment_present | Presence of Administered Treatment (e.g., pharmacological treatments such as selective serotonin reuptake inhibitors, amantadine, or stimulants, and/or non-pharmacological treatments such as physical therapy, chiropractic, psychological, neuropsychological, neurology, and cognitive therapies)
Confusion matrix – Visit 1 vs Visit 2.
TRIPOD+AI checklist.
Item
Dev/Eval
Checklist item
Reported on page
Notes from manuscript
1
D;E
Identify the study as developing or evaluating the performance of a multivariable prediction model, the target population, and the outcome to be predicted
p. 1 (Title)
Title states: ’Predicting Time to Clearance of Sport-Related Concussions Using Machine Learning’. Identifies target population (athletes with SRC) and outcome (time to medical clearance).
2
D;E
See TRIPOD+AI for Abstracts checklist
p. 1 (Abstract)
Abstract reports objective, methods (217 athletes, 6 ML classifiers, LOOCV), results (XGBoost 0.84 accuracy), and conclusions including external validation caveat.
3a
D;E
Explain the healthcare context and rationale for developing or evaluating the prediction model, including references to existing models
pp. 1–3
Introduction describes rising SRC rates, clinical assessment limitations, and prior ML studies (Bergeron et al., Chu et al., Thomas & Arnett) that motivate the present work.
3b
D;E
Describe the target population and intended purpose of the prediction model in the context of the care pathway, including its intended users
p. 2
Explicitly states (TRIPOD+AI Item 3b): target population = athletes with SRC presenting to sports medicine or concussion clinic; intended users = licensed clinicians (sports medicine physicians, neurologists, athletic trainers) to supplement clinical judgment.
3c
D;E
Describe any known health inequalities between sociodemographic groups
p. 18
Discusses gender imbalance in dataset (64.2% female in Class 1); cites literature that females take longer to recover from SRC than males. Noted as limitation (TRIPOD+AI Item 3c).
4
D;E
Specify the study objectives, including whether the study describes the development or validation of a prediction model (or both)
pp. 1–2
Objectives state: (1) evaluate whether longitudinal data improves ML accuracy; (2) identify features most strongly associated with prolonged vs. normal recovery. Study is model development with internal validation only; explicitly states external validation is required.
5a
D;E
Describe the sources of data separately for the development and evaluation datasets, the rationale for using these data, and representativeness of the data
p. 3
Data from USF Concussion Center via REDCap (2021–2025). Single-site retrospective cohort. No separate evaluation dataset; internal validation via LOOCV. Rationale for data source described.
5b
D;E
Specify the dates of the collected participant data, including start and end of participant accrual; and, if applicable, end of follow-up
p. 3
Multi-visit data collected 2021–2025; original database spans 2017–2026. Clearance date used as end of follow-up per patient.
6a
D;E
Specify key elements of the study setting including the number and location of centres
p. 3
Single centre: USF Concussion Center, University of South Florida. Secondary care/concussion specialty clinic setting.
6b
D;E
Describe the eligibility criteria for study participants
pp. 3–4
Inclusion: sports-related concussion diagnosis, ≥ 2 clinical visits, first visit within 0–365 days of injury, clearance ≥ 1 day. Exclusion: non-sports mechanisms (MVA, falls, other), missing data exceeding thresholds.
6c
D;E
Give details of any treatments received, and how they were handled during model development or evaluation, if relevant
pp. 3–4
Treatment types detailed (11 categories, pharmacological and non-pharmacological). ’Treatment present’ binary variable included only in Visit 2 feature set, as treatment was not administered until after Visit 1 data collection.
7
D;E
Describe any data pre-processing and quality checking, including whether this was similar across relevant sociodemographic groups
pp. 3–4
Multi-step iterative cleaning: column missingness threshold 0.9, row missingness 0.8, step 0.1. Outlier filtering for clinically relevant timelines. No imputation used. Figure 1 shows preprocessing flowchart. Sociodemographic-stratified preprocessing
8a
D;E
Clearly define the outcome that is being predicted and the time horizon, including how and when assessed, the rationale for choosing this outcome, and whether the method of outcome assessment is consistent across sociodemographic groups
pp. 5–6
Outcome: binary classification of time to medical clearance (< 30 days = ’normal’; ≥ 30 days = ’prolonged’). Threshold rationale: general clinical recovery timeframe for SRC. Clearance determined by experienced concussion physicians assessing symptoms and functional measures (VOMS, BESS, CNS vital signs, return to school). (TRIPOD+AI Item 8a)
8b
D;E
If outcome assessment requires subjective interpretation, describe the qualifications and demographic characteristics of the outcome assessors
p. 3
Clearance determined by physicians experienced in diagnosis and management of concussions. Demographic characteristics of assessors
8c
D;E
Report any actions to blind assessment of the outcome to be predicted
N/A
Not reported. Retrospective study design; blinding of outcome assessors not described.
9a
D
Describe the choice of initial predictors and any pre-selection of predictors before model building
pp. 3–4
Predictors retained from preprocessing based on missingness thresholds and clinical relevance. Feature inclusion of ’prior head injury’, ’history of mood disorders’ (Visit 1) and ’treatment presence’ (Visit 2) explicitly justified. No formal statistical pre-selection.
9b
D;E
Clearly define all predictors, including how and when they were measured
pp. 3–4, Appendix Tables 10 and 11
All predictors defined with feature naming mapping (Table 11). Assessment tools described in Appendix Table 10 (VOMS, BESS, ImPACT, etc.) with administration time, scoring, and assessors. Visit 1 vs Visit 2 collection timing specified.
9c
D;E
If predictor measurement requires subjective interpretation, describe the qualifications and demographic characteristics of the predictor assessors
Appendix Table 10
Appendix Table 10 lists assessors (Clinicians vs. Clinicians and Non-Clinicians) per assessment. Demographic characteristics of assessors
10
D;E
Explain how the study size was arrived at and justify that the study size was sufficient to answer the research question
pp. 3–4
Final N=217 after preprocessing (from 3,038). LOOCV selected due to small dataset size. Event-per-variable (EPV) ratio reported: 0.84 (Visit 1), 0.43 (Visit 2). No formal a priori sample size calculation; small sample acknowledged as primary limitation.
11
D;E
Describe how missing data were handled. Provide reasons for omitting any data
pp. 3–4
No imputation used; rationale given (imputation in medical data leads to bias; no optimal solutions exist). Iterative row/column removal based on missingness thresholds. Final dataset contains no null values.
12a
D
Describe how the data were used in the analysis, including whether the data were partitioned, considering any sample size requirements
pp. 5–6
No train/test partition for final LOOCV evaluation. Hyperparameter tuning used 80:20 split prior to LOOCV. LOOCV justified by small dataset size.
12b
D
Describe how predictors were handled in the analyses (functional form, rescaling, transformation, or standardisation)
pp. 4–6
Binary features coded 0/1. Difference features engineered (Visit 2 - Visit 1). Continuous features (PHQ-9, GAD-7, cervical range of motion) used as-is. No explicit standardization/normalization reported.
12c
D
Specify the type of model, rationale, all model-building steps, including any hyperparameter tuning, and method for internal validation
pp. 4–7
Six models: LightGBM, Decision Tree, Random Forest, XGBoost, SVC, Ridge regression. Mathematical formulations provided (Equations (1)–(9)). Hyperparameter tuning via randomized search (n_iter=50) for all except Ridge (k-fold CV over alpha candidates). Internal validation: LOOCV.
12d
D;E
Describe if and how any heterogeneity in estimates of model parameter values and model performance was handled across clusters
N/A
Single-centre study; no clustering or multi-site analysis. Not applicable.
12e
D;E
Specify all measures and plots used to evaluate model performance
pp. 7–8
Metrics: accuracy, balanced accuracy, precision, recall, F1, specificity, MCC, Brier score. Bootstrap 95% CI (1,000 resamples) reported for all metrics. Confusion matrices (Figures 5–7), average effect plots (Figures 2–4). No decision curve analysis (exploratory scope). (TRIPOD+AI Item 12e)
12f
E
Describe any model updating arising from the model evaluation
N/A
No external validation performed; model updating not applicable in this development study.
12g
E
For model evaluation, describe how the model predictions were calculated
p. 4
Model prediction calculations expressed via Equations (1)–(9); code available at GitHub repository (TRIPOD+AI Item 12g, 22).
13
D;E
If class imbalance methods were used, state why and how this was done, and any subsequent methods to recalibrate the model or predictions
p. 6
Class imbalance: 176 (81.1%) prolonged vs. 41 (18.9%) normal recovery. No SMOTE/resampling (rationale: synthetic data misrepresents clinical distribution). Class weighting applied to penalize minority class misclassification. Effect reflected in near-zero specificity; discussed as primary limitation. (TRIPOD+AI Item 13)
14
D;E
Describe any approaches that were used to address model fairness and their rationale
p. 18
Gender imbalance acknowledged (63.1% female). No formal fairness-aware algorithms implemented. Discusses potential female-biased predictions and cites literature on sex differences in SRC recovery. Identified as limitation requiring future work.
15
D
Specify the output of the prediction model. Provide details and rationale for any classification and how thresholds were identified
pp. 5, 7
Output: binary class labels (0 = ’normal’ recovery <30 days; 1 = ’prolonged’ recovery ≥ 30 days). Threshold 0.5 for all probabilistic models; ROC-based threshold optimization not performed (exploratory scope). (TRIPOD+AI Item 15)
16
D;E
Identify any differences between the development and evaluation data in healthcare setting, eligibility criteria, outcome, and predictors
N/A
Internal validation only (LOOCV on same dataset). No separate external evaluation dataset. Difference between Visit 1 and Visit 2 feature sets described (pp. 4–5).
17
D;E
Name the institutional research board or ethics committee that approved the study and describe participant-informed consent or ethics committee waiver
p. 20
IRB approval: USF STUDY003514, University of South Florida Institutional Review Board. Explicitly covers retrospective review and analysis of patient data in REDCap database.
18a
D;E
Give the source of funding and the role of the funders for the present study
p. 20
Funded by the Florida Department of State Center for Neuromusculoskeletal Research. Role of funders not explicitly described.
18b
D;E
Declare any conflicts of interest and financial disclosures for all authors
p. 20
All authors declare no conflicts of interest in the authorship or publication of this contribution.
18c
D;E
Indicate where the study protocol can be accessed or state that a protocol was not prepared
p. 1
GitHub code repository included
18d
D;E
Provide registration information for the study, including register name and registration number, or state that the study was not registered
p. 20
In section “Ethical considerations” - IRB STUDY003514
18e
D;E
Provide details of the availability of the study data
p. 20
Dataset not publicly available (part of clinical database within USF Health). Statement provided in Data Availability section.
18f
D;E
Provide details of the availability of the analytical code
pp. 1, 17, 20
Code available at: https://github.com/MeganTran6023/Sport-Related-Concussions_Machine-Learning
19
D;E
Provide details of any patient and public involvement during the design, conduct, reporting, interpretation, or dissemination of the study or state no involvement
p. 6
Explicitly stated: ’No patients or members of the public were involved in the design, conduct, reporting, or dissemination plans of this research.’
20a
D;E
Describe the flow of participants through the study, including the number of participants with and without the outcome and, if applicable, a summary of the follow-up time
pp. 3–4
Figure 1 (Data Preprocessing Flowchart) shows participant flow: 3,038 → 2,338 → 1,865 → 1,201 (2 visits) → 217 (sports-related). Table 5 shows outcome group breakdown: 41 normal (Class 0), 176 prolonged (Class 1). Mean days to clearance reported per group.
20b
D;E
Report the characteristics overall and, where applicable, for each data source or setting, including key dates, key predictors, treatments received, sample size, number of outcome events, follow-up time, and amount of missing data
pp. 3–4, Tables 2 and 5
Table 5 reports characteristics by outcome group (sex, treatment, days to clearance, days from injury to first visit). Table 2 reports treatment counts. Demographics: 80 male (36.9%), 137 female (63.1%), mean age 26.94 years.
20c
E
For model evaluation, show a comparison with the development data of the distribution of important predictors
N/A
Internal validation only; no separate evaluation dataset to compare against.
21
D;E
Specify the number of participants and outcome events in each analysis
pp. 3, 7
N=217 total; 41 Class 0, 176 Class 1 for both Visit 1 and Visit 2 analyses. LOOCV uses N-1 samples per fold.
22
D
Provide details of the full prediction model to allow predictions in new individuals and to enable third-party evaluation and implementation
pp. 4–7, GitHub
Mathematical formulations for all 6 models provided (Equations (1)–(9)). Code and model objects available at GitHub repository. (TRIPOD+AI Item 22)
23a
D;E
Report model performance estimates with confidence intervals, including for any key subgroups
pp. 7–8, Tables 1–4
Tables 1 and 3 report accuracy, balanced accuracy, precision, recall, F1, specificity, MCC, Brier score for all 6 models at both visits. Table 4 reports statistical significance of accuracy gains across visits. No subgroup analysis by demographics.
23b
D;E
If examined, report results of any heterogeneity in model performance across clusters
N/A
Single-centre study; no cluster analysis performed.
24
E
Report the results from any model updating, including the updated model and subsequent performance
N/A
No external validation or model updating performed.
25
D;E
Give an overall interpretation of the main results, including issues of fairness in the context of the objectives and previous studies
pp. 14–17
Discussion interprets accuracy gains, feature importance findings (VOR Vertical Headache, treatment presence), and compares to prior studies. Gender imbalance and potential bias toward female population discussed as fairness concern.
26
D;E
Discuss any limitations of the study and their effects on any biases, statistical uncertainty, and generalizability
pp. 17–19
Extensive limitations section: small/imbalanced dataset, low specificity from class imbalance, gender imbalance, self-reported features, linear model limitations, binary outcome oversimplification, no neurocognitive data, LOOCV distributional bias, overfitting risk, lack of external validation, feature selection leakage risk.
27a
D
Describe how poor quality or unavailable input data should be assessed and handled when implementing the prediction model
p. 17
States that poor-quality or missing VOMS subscores should prompt clinical judgment rather than model reliance, as model was not trained on imputed data. (TRIPOD+AI Item 27a)
27b
D
Specify whether users will be required to interact in the handling of the input data or use of the model, and what level of expertise is required
p. 17
Intended users are licensed clinicians experienced in concussion management. Input requires structured assessment data at 1–2 time points from standardized assessments already routinely administered. No specialized computing expertise required beyond use of a provided interface. (TRIPOD+AI Item 27b)
27c
D;E
Discuss any next steps for future research, with a specific view to applicability and generalizability of the model
pp. 19–20
Future work: larger/balanced dataset, non-athletic TBI populations, 3+ visit longitudinal data, objective biomarkers, external/temporal validation, optimism correction, nested feature selection, wearable integration, nonlinear modifications to SVC/Ridge.
