Abstract
Background
Rehabilitation is crucial for restoring patients’ function, improving their quality of life, and promoting an early return to family and society. In China, most patients in rehabilitation units are transferred from neurology, neurosurgery, and orthopedics, and most of them face problems such as prolonged bed rest or varying degrees of limb dysfunction, all of which are risk factors for deep venous thrombosis. Deep venous thrombosis can delay recovery and result in significant morbidity, mortality, and higher healthcare costs, so early detection and individualized treatment are needed. Machine learning algorithms can help develop more precise prognostic models, which can be of great value when designing rehabilitation training programs. In this study, we aimed to develop a prediction model of deep venous thrombosis for inpatients in the Department of Rehabilitation Medicine at the Affiliated Hospital of Nantong University using machine learning methods.
Methods
We analyzed and compared 801 patients in the Department of Rehabilitation Medicine using machine learning. Support vector machine, logistic regression, decision tree, random forest classifier, and artificial neural network were used to build models.
Results
The artificial neural network was a better predictor than the traditional machine learning models. D-dimer levels, bedridden time, Barthel Index, and fibrinogen degradation products were common predictors of adverse outcomes in these models.
Conclusions
Risk stratification can help healthcare practitioners to achieve improvements in clinical efficiency and specify appropriate rehabilitation training programs.
Introduction
Deep vein thrombosis (DVT), characterized by clot formation inside blood vessels in the leg, restricts daily life by causing pain or swelling and can lead to fatal pulmonary embolism (PE).1–4 Together, these two conditions constitute venous thromboembolism (VTE). The annual global incidence of DVT in the leg is 1.6 per 1000. 5 Approximately 6% of patients with DVT die within 30 days, mostly from PE, as do about 13% of patients with PE. 6 The manifestations of DVT vary among inpatients, which makes DVT more difficult to identify clinically. 7 Therefore, it is important to identify suspected DVT accurately and quickly so that patients can be treated promptly to prevent thrombus extension or embolization.
Risk prognostication is important in the process of clinical decision-making. 8 Effective risk stratification is beneficial to early diagnosis and treatment and may reduce invasive testing. Unfortunately, standard scoring systems for DVT risk stratification often do not adequately stratify inpatients and cannot accurately predict which inpatients are most likely to develop DVT. Compression ultrasonography of the leg can accurately confirm or refute the diagnosis, but requires expertise, is time-consuming and has associated financial costs. 9 With the development of big data and machine learning (ML) techniques, disease prediction is becoming increasingly accurate. Consequently, there is a clear need to develop new tools that can predict DVT in hospitalized patients.
Machine learning algorithms can help us develop more precise prognostication models and are currently used in many medical fields.10,11 ML has previously been explored in the context of VTE, and several studies have shown the potential for ML clinical decision support systems to add incremental value in improving the VTE risk stratification of patients. 12 In recent years, the era of big data has brought an explosion of data, and traditional ML methods can struggle with large data sets. As a type of artificial intelligence (AI), artificial neural networks (ANNs), inspired by the neural architecture of the brain, have a strong ability to handle large amounts of uncertain information and have gained popularity in the medical field. 13
In China, most patients admitted to rehabilitation departments are transferred from neurology, neurosurgery, and orthopedics, and rehabilitation medicine is increasingly focused on critical illness and early rehabilitation. As a result, almost all hospitalized patients in rehabilitation units have some dysfunction; most are on long-term bed rest and have at least one DVT risk factor. Previous studies confirmed that major surgery, motor weakness, immobilization, and trauma promote the occurrence of thromboembolic events.14,15 Engber et al. 16 demonstrated that impairment in activities of daily living (ADL), mobility impairment, a sedentary lifestyle, and low handgrip strength were all associated with a two- to four-fold higher risk of thrombosis. Furthermore, the diagnostic dilemma is pertinent in rehabilitation patients, who may under-report limb pain, swelling, or dyspnea because of aphasia, neglect, cognitive impairment, or altered conscious states. 17 Therefore, there is a pressing need to evaluate the risk of DVT in the inpatients of rehabilitation medicine departments.
This study aimed to develop a prediction model for inpatients in the Department of Rehabilitation Medicine by applying ML methods. The predictive ability of the ANN model was compared with that of four traditional ML models. We compared and analyzed the predicted results from the five models, investigated which factors might help to predict the risk of DVT, and developed ML models to predict the risk of DVT in patients in the Department of Rehabilitation Medicine at the Affiliated Hospital of Nantong University.
Methods
Study Design and Patients
Patient data were extracted from patient electronic medical records at the Department of Rehabilitation Medicine at the Affiliated Hospital of Nantong University (Nantong, Jiangsu Province, China). We included data extracted from inpatients seen between July 2019 and December 2021 and diagnosed with stroke, traumatic brain injury, and spinal cord injury. Patients with multiple admissions, a long course (>180 days), or who were pregnant were excluded. The study framework is shown in Figure 1. The study was approved by the Medical ethics committee of the Affiliated Hospital of Nantong University (Approval number: 2022-L017). The research was performed in accordance with the Declaration of Helsinki. The ethics committee explicitly stated that informed consent was not required as part of this study.

Study flow chart diagram. DVT, deep vein thrombosis; TBI, traumatic brain injury; SCI, spinal cord injury.
Covariates
Data on basic sociodemographic characteristics, past medical history, physical findings, clinical test scores, laboratory results, and medication were retrieved from the hospital's medical records. We collated the following variables: age, sex, bedridden time, heart rate, history of smoking, history of surgery, history of VTE, central venous catheter (CVC) or peripherally inserted central catheter (PICC), hypertension, diabetes, active cancer, Glasgow Coma Scale (GCS), Barthel Index (BI), sitting and standing balance scores, the Wells score, prophylactic anticoagulants, prophylactic antiplatelet agents, triglyceride, total cholesterol, low-density lipoprotein, high-density lipoprotein, D-dimer levels, activated partial thromboplastin time, prothrombin time, thrombin time, fibrinogen degradation products (FDPs), platelet count, and hematocrit. For all patients, we used the first laboratory values after admission. For categorical variables, values were assigned according to the medical record where the information was available; where it was not, the patient was considered normal for that variable.
Ascertainment of Outcomes
DVT was confirmed by positive duplex compression ultrasonography of the lower limbs. All patients were screened by ultrasonography on the first day of admission.
Statistical Analyses
Data from 801 subjects were analyzed in Python, an open-source programming language. The data were read and preprocessed with the NumPy and pandas libraries. P-values were calculated with the homogeneity of variance test and the t-test in the scipy.stats library. Missing continuous and polytomous values were replaced with the mean and the mode, respectively, using the Imputer method in the sklearn library.
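As a minimal sketch of the imputation rule described above, the following standard-library example fills missing continuous values with the column mean and missing categorical values with the column mode. It is not the sklearn routine used in the study; the function name impute_column and the toy columns are hypothetical and serve only as illustration.

```python
from statistics import mean, mode

def impute_column(values, strategy):
    """Replace None entries with the column mean or the most frequent value."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [fill if v is None else v for v in values]

# Hypothetical toy columns (not study data); None marks a missing entry.
d_dimer = [0.5, None, 1.5, 4.0]   # continuous variable -> mean
smoking = [0, 1, None, 0]         # categorical variable -> mode

d_dimer_filled = impute_column(d_dimer, "mean")
smoking_filled = impute_column(smoking, "most_frequent")
print(d_dimer_filled, smoking_filled)
```

The fill value is computed from the observed entries only, so the missing entries themselves never influence the mean or the mode.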
The data used in this paper showed a clear class imbalance. We used the SMOTE over-sampling method in the imblearn library to up-sample the data prior to analysis. The subjects were then randomly assigned at a ratio of 7:3, using the train_test_split method of the model_selection module in the sklearn library, into a training set for variable determination and model construction and a test set for evaluating model performance. Four algorithms, namely logistic regression (LR), decision tree (DT), support vector machine (SVM), and random forest classifier (RFC), were used to train the models. These models are based on LogisticRegression, DecisionTreeClassifier, SVM, and RandomForestClassifier, respectively, in the sklearn library. To obtain the best performance, each model was tuned with the GridSearchCV method of the model_selection module, using 5-fold cross-validation. The penalty parameter C was tuned for LR, max_depth and min_samples_leaf for DT, penalty parameters C and sigma for SVM, and n_estimators and max_depth for RFC.
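The over-sampling and 7:3 split described above can be sketched with the standard library alone. This is a simplified illustration of the SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), not the imblearn implementation; the names smote_like and split_70_30, the neighbour count k=2, the seeds, and the toy data are all assumptions.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

def split_70_30(rows, seed=0):
    """Shuffle the rows and split them 7:3 into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy data: two features per patient, label 1 = DVT.
dvt = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.5]]
no_dvt = [[0.1 * i, 0.2 * i] for i in range(7)]

synthetic = smote_like(dvt, n_new=len(no_dvt) - len(dvt))
rows = [(x, 1) for x in dvt + synthetic] + [(x, 0) for x in no_dvt]
train_rows, test_rows = split_70_30(rows)
print(len(synthetic), len(train_rows), len(test_rows))
```

The sketch keeps the order given in the text (over-sample, then split). A commonly recommended variant over-samples only the training portion after splitting, so that no synthetic points appear in the test set.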
In addition, as data volumes grow rapidly, traditional machine learning models can perform less well on large samples. We therefore also analyzed the data with an ANN model, built with the torch library. An ANN is a complex network formed by the interconnection of a large number of processing units (neurons). It is an abstraction, simplification, and simulation of the organizational structure and operating mechanism of the human brain. A network may have a single layer or multiple layers, each containing several neurons. Neurons are connected by weights, and together they form a linear or nonlinear classifier. The basic architecture of the ANN model is shown in Figure 2.

Structure of an artificial neural network.
The activation function is a key component of the model. The activation function, also known as the excitation function, introduces nonlinearity between the input and output of a neuron. The simple neuron model and the single-layer perceptron are linear models that can only separate linearly separable data. For data sets with a complex distribution, a linear model cannot fit the data well. Therefore, nonlinear activation functions must be added to enhance the expressive power of the network. Without an activation function, a neural network can only represent linear transformations, regardless of its depth. In this article, we used the ReLU activation function, shown in Figure 3.

The ReLU activation function.
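For illustration, ReLU returns its input when the input is positive and zero otherwise. The sketch below applies it to the weighted sum computed by a single neuron; the weights, bias, and inputs are hypothetical values chosen for the example, not parameters of the study's network.

```python
def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through ReLU."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return relu(z)

# Hypothetical inputs and weights.
suppressed = neuron([1.0, 2.0], [0.5, -1.0], 0.25)  # z = -1.25, clipped to 0
passed = neuron([1.0, 2.0], [0.5, 1.0], 0.25)       # z = 2.75, unchanged
print(suppressed, passed)
```

Because negative pre-activations are clipped to zero, the output is no longer a linear function of the inputs, which is the property the text relies on.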
To prevent the model from over-fitting during training, we introduced the Dropout method. Dropout is also called random inactivation. In simple terms, during forward propagation in the training stage, each neuron's activation is set to zero with a certain probability, as shown in Figure 4, which helps the model generalize better.

Neural network with dropout.
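The behaviour of dropout can be sketched as follows. This is the "inverted dropout" variant, in which surviving activations are scaled by 1/(1 - p) during training so that no rescaling is needed at inference; torch.nn.Dropout follows the same convention. The drop probability p = 0.5 and the seed are assumptions, since the text does not state the rate used.

```python
import random

def dropout(activations, p=0.5, training=True, rng=None):
    """Zero each activation with probability p during training and scale the
    survivors by 1 / (1 - p); return the input unchanged at inference."""
    if not training or p == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

# Hypothetical layer output of eight neurons.
acts = [1.0] * 8
train_out = dropout(acts, p=0.5, rng=random.Random(0))
eval_out = dropout(acts, p=0.5, training=False)
print(train_out, eval_out)
```

Each training pass sees a different random subset of active neurons, which discourages the network from relying on any single unit.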
Model Comparisons
A consensus receiver operating characteristic (ROC) curve for each model was generated using the metrics.roc_curve and metrics.auc methods in the sklearn library. Confusion matrices of each model on the test set were also used to evaluate the accuracy of the models.
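To make the ROC and AUC computation concrete, the following standard-library sketch sweeps the decision threshold over the predicted scores, collects (false positive rate, true positive rate) points, and integrates them with the trapezoidal rule. It illustrates the idea behind roc_curve and auc rather than reproducing the sklearn functions; the function names and the four toy scores are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold stepwise."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels (1 = DVT) and predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

curve = roc_points(y_true, scores)
area = auc(curve)
print(curve, area)  # area is 0.75 for this toy example
```

An AUC of 0.75 here means that three of the four possible positive-negative pairs are ranked correctly by the scores.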
Results
Patient Characteristics and Outcomes
Of the 801 inpatients, 71 were confirmed by imaging to have DVT and 730 did not have DVT (Fig. 1). Among the 71 patients, 14 had proximal deep vein thrombosis (nine in the popliteal vein and five in the femoral vein) and 57 had distal deep vein thrombosis (37 in the intramuscular veins, two in the anterior tibial vein, six in the posterior tibial vein, and three in the peroneal vein).
Selection of the Best Model
The performance of the different machine learning models was evaluated on the test set with ROC analysis by calculating the area under the curve (AUC). Figure 5 shows that the ANN and SVM prediction models had a larger AUC (0.97 and 0.96, respectively) than the other three methods. The confusion matrix is another widely used method to evaluate classification results. Confusion matrix analysis found that the ANN model had the best performance (in terms of accuracy, sensitivity, specificity, precision, recall, and F1-score) (Table 1). Based on the results of the two evaluation methods, ANN was the most stable and accurate method for predicting the risk of DVT in patients residing in rehabilitation medicine departments.

Consensus ROC curves generated with different models.
Confusion Matrices in Different Models.
TN, True negative; FP, False positive; FN, False negative; TP, True positive; Sens refers to the sensitivity of detecting a composite outcome; Spec refers to the specificity of excluding a composite outcome; Acc refers to the accuracy of the assignment; P, precision; R, recall; SVM, support vector machine; LR, logistic regression; DT, decision tree; RFC, random forest classifier; ANN, artificial neural networks.
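The metrics named in the table legend follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions; the counts are hypothetical and are not taken from Table 1.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Derive the usual classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # identical to recall
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "Sens": sensitivity,
        "Spec": specificity,
        "Acc": accuracy,
        "P": precision,
        "R": sensitivity,
        "F1": f1,
    }

# Hypothetical counts for a 200-sample test set.
metrics = confusion_metrics(tn=90, fp=10, fn=5, tp=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and recall are the same quantity, so a model that ranks first on one necessarily ranks first on the other.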
Variable Rankings of the Models
To analyze which features were most important, we ranked the features in three of the models (RFC, DT, and LR). The recursive feature elimination (RFE) method in the sklearn library was used to calculate the importance of the variables in each model. The top five variables in these three models are shown in Table 2. D-dimer levels, bedridden time, BI, and FDP were more consistently featured by the different models (≥ 2) as top predictors for adverse outcomes. Figure 6 shows the importance scores of these four feature variables in the different models.

Importance scores of the four feature variables in the different models.
Top Five Important Variables in the Different Models.
RFC, random forest classifier; DT, decision tree; LR, logistic regression; BI, Barthel Index; APTT, activated partial thromboplastin time; TT, thrombin time; FDPs, fibrinogen degradation products.
Discussion
The timely diagnosis of DVT can lead to earlier and more targeted rehabilitation training and is essential to minimize the risk of thromboembolic complications and avert the exposure of patients without thrombosis to the risks of anticoagulant therapy. The use of ML can improve the ability of health professionals to establish an accurate prognosis. 18 Compared to statistical models, ML methods have the following advantages: they require fewer assumptions, use more predictors, use an agnostic approach instead of a priori hypotheses, incorporate multidimensional correlations that contain prognostic information, and produce more flexible relationships among predictor variables and outcomes.19,20 In this study, we attempted to develop a preliminary ML model for predicting DVT in hospitalized patients in rehabilitation medicine departments.
ML algorithms aim to identify a linear or non-linear function for classification or prediction. 21 In this study, we considered not only the accuracy of each model but also its recall, that is, the proportion of actual positive samples that the model correctly predicts as positive. ROC curves and AUCs were also used as criteria for evaluating model performance. ROC curves use the false positive rate as the horizontal axis and the true positive rate as the vertical axis, and the AUC is a widely used summary measure of performance for a classification model. Conventionally, the AUC value ranges from 0.5 to 1.0; values between 0.5 and 0.7 indicate low discrimination ability, values between 0.7 and 0.9 indicate moderate discrimination ability, and values > 0.9 indicate high discrimination ability. 22 In this study, we used 70% of the data as the training set and 30% as the test set, with SVM, LR, DT, RFC, and ANN for learning and analysis. The cvAUC values of 0.97, 0.96, and 0.95 in the test set for ANN, SVM, and RFC, respectively, indicate that these three models have high discrimination ability. The cvAUC values of 0.87 and 0.85 for LR and DT, respectively, indicate that these two models have moderate discrimination ability. Therefore, of the five models, ANN achieved the best performance (the highest cvAUC value). The confusion matrix is another widely used method to evaluate classification results. 23 Confusion matrix analysis found that the ANN model had the best performance (in terms of accuracy, sensitivity, specificity, precision, recall, and F1-score). Based on the results of the two evaluation methods, ANN is the most stable and accurate method for predicting the risk of DVT in hospitalized patients in rehabilitation medicine departments.
An ANN is an information processing system that simulates the principles of the biological nervous system and can realize strongly non-linear mapping functions. 24 It is characterized by parallel distributed processing, self-learning, self-organization, and good fault tolerance. The ANN model can capture complex relationships between variables that simpler models may miss. 25 The selection of the training sample data and the way the data are represented have an important impact on the design of the neural network, and the preparation of sample data is the basis of network design and training. Because a neural network learns from examples, the more comprehensive the data sample, the better the trained network performs. The training set should therefore cover all relevant patterns, and the input variables should be as uncorrelated as possible; otherwise, the network generalizes poorly. 26
In this study, we further found that the top predictors of adverse outcomes consistently featured by the different models were D-dimer levels, FDP, bedridden time, and BI. Several of these risk factors have been confirmed to be related to DVT; for example, D-dimer levels and FDP have been studied extensively as risk factors for DVT.14,27–30 With regard to bedridden time, studies have shown that the risk of DVT might increase by 2- to 11-fold for bedridden patients when compared to mobile patients. 31 Immobilization represents a significant risk for DVT among hospitalized patients with longer hospital stays and is associated with high medical costs. 32 A lower BI has been shown to be a risk factor for DVT; BI is currently the most widely used clinical assessment method for ADL and is an important evaluation index in rehabilitation departments. The lower the score, the lower the patient's ability to perform ADL and the more severe the functional impairment. Functional impairment is related to increased pro-coagulation, reduced limb function, reduced mobility, increased blood viscosity, and increased blood stasis; these factors can all increase the risk of DVT.1,16 Therefore, it is very meaningful to identify rehabilitation inpatients who are at risk of DVT but do not meet the conditions for anticoagulation, and to carry out rehabilitation exercises promptly to reduce bedridden time and improve ADL ability.
In the clinic, predicting DVT with ANN or SVM is of great significance. Patients judged as “DVT” by this algorithm should undergo further imaging examinations to confirm the diagnosis and then be referred to a specialist for further management in a timely manner. If patients are judged as “non-DVT” by this algorithm, then appropriate rehabilitation exercises should be applied, especially for those at risk of DVT. Clinical trials have demonstrated the effectiveness and safety of pharmacological prevention using low, fixed doses of anticoagulant drugs. Mechanical prophylactic measures, such as graduated compression stockings and intermittent pneumatic compression devices, should be considered in at-risk patients who are not candidates for pharmacological thromboprophylaxis. 15 Patients may also be encouraged to plantarflex and dorsiflex actively throughout the day while lying down. For all rehabilitation patients, active and passive range-of-motion activity in the lower extremities during ADLs, along with frequent turning and positioning, is extremely important. These measures should be taught to both patients and family members to prevent venous pooling, a known cause of DVT. Furthermore, the therapist should assist the patient with early and frequent ambulation. 33
The predictions and risk factors described here are consistent with the results of many previous studies, which supports the identified risk factors for DVT. We developed a DVT prediction model tailored to patients in rehabilitation medicine departments. It can support early detection and treatment and help clinicians select the most appropriate drug therapy, physical therapy, surgery, and nursing methods according to the risk of thrombosis.
This will allow patients to maximize their training effect, recover quickly, and return to their family and society sooner. At the same time, ANNs can support sophisticated algorithms that predict risk for each individual patient, enabling healthcare systems to concentrate scarce resources on the highest-risk patients. Such models could also help under-equipped primary care hospitals, so that patients at high risk of thrombosis can be screened and treated in a timely manner.
Limitations
This study has certain limitations that need to be considered. First, our data collection only included patients with stroke, traumatic brain injury, and spinal cord injury. Although these are the main diseases treated by rehabilitation medicine departments, there are still patients with other diseases that were not considered herein. Second, our sample size was relatively small; future research should involve larger sample sizes. Third, this is a retrospective study; our data only include the first laboratory and physical examination results. The risk of DVT may differ during the treatment process 34 and this was not taken into account in our study. It was impossible to explore the changes in DVT risk of patients at risk of DVT but who did not meet the criteria for anticoagulant treatment after a period of targeted rehabilitation training, such as graduated compression stockings, intermittent pneumatic compression treatment, and functional exercises to reduce immobilization. More comprehensive, long-term prospective studies are needed.
Conclusion
Many existing DVT scores used in the clinic are insufficient for the risk stratification of hospitalized patients. Here, we adopted a predictive model to estimate the probability of developing DVT in rehabilitation inpatients, so that patients at high risk of thrombosis are identified early and treated promptly. To our knowledge, this is the first study to use ML techniques to estimate the risk of DVT in rehabilitation inpatients. We found that the ANN model predicted the risk most efficiently and effectively. This model can help healthcare practitioners to achieve improvements in clinical efficiency and specify appropriate rehabilitation training programs.
Footnotes
Authors’ Note
Su Liu, Li Sun, Qi Gu, Tingting Hou, Wei Qiao, and Sijin Song conceived and designed the research. Tingting Hou, Wei Qiao, Sijin Song, Yingchao Guan, and Qing Yang collected data. Tingting Hou and Chunyang Zhu analyzed the data. Tingting Hou wrote the paper. All authors are accountable for all aspects of the study and attest to the accuracy and integrity of the results. All authors have read and approved the final manuscript as submitted.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (grant number 81702223).
