Abstract
Background
Ischemic stroke (IS) accounts for a large proportion of stroke incidence. The aim of this study was to identify the factors affecting the risk and prognosis of IS in hypertensive patients.
Method
Study data were obtained from the Medical Information Mart for Intensive Care (MIMIC)-IV database. To avoid a biased factor selection process, several approaches were combined, including logistic regression, elastic net regression, random forest, correlation analysis, and multifactor logistic regression. Seven machine-learning methods were used to construct predictive models. Model performance was evaluated using AUC (Area Under the Curve), prediction accuracy, precision, recall, F1 score, PPV (Positive Predictive Value), and NPV (Negative Predictive Value). Interaction analysis was conducted to explore potential relationships between influential factors.
Results
The study included 92,514 hypertensive patients, of whom 1746 experienced IS. The Gradient Boosted Decision Tree (GBDT) model outperformed the other prediction models in terms of prediction accuracy and AUC for both IS risk and IS prognosis. Using SHapley Additive exPlanations (SHAP), we found that a range of factors, and the interactions between them, are important for the risk of IS and its prognosis in hypertensive patients.
Conclusion
The study identified factors that increase the risk of IS and poor prognosis in hypertensive patients, which may provide guidance for clinical diagnosis and treatment.
Introduction
The global disease burden of stroke over the past decades has not been favorable. Stroke has been reported to be the second leading cause of death and the third leading cause of combined death and disability. 1 A published systematic analysis revealed that, among the 10 diseases with the greatest number of neurological disability-adjusted life years (DALYs) in 2021, stroke accounted for the largest share globally and in 19 of the 21 GBD regions. 2 The global burden of stroke increased substantially from 1990 to 2019, by 70.0% in stroke events and 143.0% in DALYs. 3 A previously published study showed that approximately 15% to 30% of stroke survivors experience lifelong disability, while 20% require at least three months of hospital care after stroke. 4 Stroke also imposes significant costs of care and costs associated with lost productivity on patients. 5
Ischemic stroke (IS) is a general term for necrosis of brain tissue caused by insufficient blood supply to the brain due to narrowing or occlusion of the arteries supplying it (the carotid and vertebral arteries). 6 The vast majority of strokes are IS. 5 The prevalence of IS in China has also increased significantly. 7 Data from the Hospital Quality Inspection System (HQMS) in 2019 showed that, across 1672 tertiary hospitals, IS accounted for 82.6% of stroke admissions. 8
Hypertension, one of the major comorbidities of stroke, is prevalent in the stroke population and is the most important modifiable risk factor for stroke. 9 Accordingly, a large body of research has examined the factors influencing the onset and prognosis of IS in hypertensive patients. Currently, studies on the occurrence and prognosis of IS are often based on hospital follow-up data and public databases. Among those databases, the Medical Information Mart for Intensive Care (MIMIC) database contains an adequate number of patients with IS, with comprehensive and complete records of the relevant indicators, which makes it suitable for the purposes of this study.
A number of machine-learning prediction studies on stroke have identified meaningful influencing factors as well as effective machine-learning models, providing a basis for stroke prevention and treatment.10,11 Existing applications of the MIMIC database to IS fall mainly into two types. The first type analyzes all-cause mortality and risk of death in IS patients using traditional statistical models.12,13 These studies focused on a single factor and only on patients admitted to the intensive care unit (ICU). However, a large proportion of IS patients are not admitted to the ICU on their initial admission. Excluding these patients may discard meaningful information and bias the results. The second type develops predictive models.14,15 However, such studies often suffer from an incomplete set of study factors, a lack of robustness in factor screening and machine-learning-based modeling, and a lack of interpretability. 16
Meanwhile, relevant experimental studies have found that age is a risk factor for IS in hypertensive patients, but that the effect of age differs between males and females. 17 It was also found that changes in BMI were associated with a high prevalence of IS and diabetes mellitus, while diabetic patients had an increased risk of IS. 18 This literature inspired us to explore the interactions between the factors that influence the development of IS in hypertensive patients. Exploring these interactions can provide a more comprehensive understanding of the pathogenesis of IS and support the development of more effective preventive and therapeutic strategies. However, few data-driven studies have explored the interactions between variables.
Previous studies have focused on healthy and diabetic populations, and few have looked at stroke risk in people with hypertension. 19 Therefore, in this study, we included as many potential factors related to the risk and prognosis of IS in hypertensive patients as possible and used a more robust factor screening strategy. To address potential heterogeneity, an interpretable machine-learning approach was used to explore the effects of interactions on outcomes, in order to better support clinical diagnosis and treatment.
Methods
Sources of data
Research data were obtained from the MIMIC-IV database (version 2.0, https://physionet.org/content/mimiciv/2.0/). The database is a high-quality, publicly available dataset consisting of clinical information on patients admitted to Beth Israel Deaconess Medical Center (BIDMC) between 2008 and 2019. 20
Study population
Patients diagnosed with hypertension and IS were included according to ICD-9 and ICD-10 codes; the specific codes selected are available in the Supplementary Materials. The exclusion criteria were as follows: (a) age <18 years; (b) absence of hypertension at the time of first admission; (c) IS occurring before hypertension; and (d) death occurring before IS.
Data collection and study outcome
This was a retrospective observational study based on a public database, conducted at the Xi'an Jiaotong University Department of Epidemiology and Biostatistics from January 2023 to April 2024.
Data were extracted from the MIMIC-IV database using Structured Query Language (SQL) in the Navicat Premium software platform (version 16.0.11, PremiumSoft, Hong Kong, China).
The list of extracted factors is also provided in the Supplementary Materials. The study endpoints were the occurrence of IS in hypertensive patients and the prognosis of hypertensive patients with IS.
Data pre-processing
The data extracted from the database contained missing values. Factors with ≥50% missingness in the extracted data were excluded from subsequent analyses. Other missing data were filled in using multiple imputation. We also transformed the continuous factor BMI into an ordinal factor. Use of the three medications simvastatin, rosuvastatin, and atorvastatin was combined into a single statin-use factor, which was recorded as 1 whenever any one of these medications was used.
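The pre-processing rules described above can be illustrated with a minimal sketch. The records, column names, and BMI cut-offs below are hypothetical (the cut-offs follow the common WHO-style categories, which the text does not confirm), and multiple imputation is deliberately left out.

```python
# Toy records; column names and values are hypothetical, not taken from MIMIC-IV.
records = [
    {"bmi": 22.0, "simvastatin": 0, "rosuvastatin": 0, "atorvastatin": 1, "lab_x": None},
    {"bmi": 31.5, "simvastatin": 0, "rosuvastatin": 0, "atorvastatin": 0, "lab_x": None},
    {"bmi": None, "simvastatin": 1, "rosuvastatin": 0, "atorvastatin": 0, "lab_x": 4.2},
    {"bmi": 27.0, "simvastatin": 0, "rosuvastatin": 0, "atorvastatin": 0, "lab_x": None},
]
STATINS = ("simvastatin", "rosuvastatin", "atorvastatin")


def missing_fraction(rows, column):
    """Share of rows in which the column is missing (None)."""
    return sum(1 for row in rows if row[column] is None) / len(rows)


def drop_sparse_columns(rows, threshold=0.5):
    """Drop every column whose missing fraction is >= threshold."""
    keep = [c for c in rows[0] if missing_fraction(rows, c) < threshold]
    return [{c: row[c] for c in keep} for row in rows]


def bmi_category(bmi):
    """Map BMI to an ordinal code; cut-offs are an assumption, not the paper's."""
    if bmi is None:
        return None  # left for a later imputation step
    if bmi < 18.5:
        return 0
    if bmi < 25:
        return 1
    if bmi < 30:
        return 2
    return 3


def add_statin_flag(row):
    """Replace the three statin columns with one flag that is 1 if any was used."""
    out = {k: v for k, v in row.items() if k not in STATINS}
    out["statins"] = int(any(row[s] == 1 for s in STATINS))
    return out


cleaned = [add_statin_flag(row) for row in drop_sparse_columns(records)]
for row in cleaned:
    row["bmi_cat"] = bmi_category(row["bmi"])
```

In this toy data, `lab_x` is missing in three of four rows and is therefore dropped, while `bmi` (one missing value) is kept for imputation.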
When features differ in their orders of magnitude, features with larger values may dominate the learning process, resulting in a model that is less sensitive to features with smaller values. Standardization addresses this problem by bringing all features into a similar range, so we applied it for the deep neural network (DNN) model, which relies on gradient descent. 21 Decision trees (DT) and their derived models usually do not require standardization or normalization of the data, as their splits are based on comparisons of feature values against thresholds rather than on numerical magnitudes.
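As a small illustration of standardization, the sketch below applies a z-score transform, z = (x − mean) / SD, to two toy features on very different scales. The feature values are made up, and the choice of the z-score (rather than another scaling) is an assumption about what "standardization" means here.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a list of numbers: subtract the mean, divide by the population SD."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant feature carries no scale
    return [(v - mu) / sigma for v in values]


# Hypothetical features with different orders of magnitude.
age = [40.0, 60.0, 80.0]
glucose = [90.0, 180.0, 270.0]

age_z = standardize(age)
glucose_z = standardize(glucose)
```

After the transform both features have mean 0 and SD 1, so neither dominates a gradient-based learner purely because of its units. In a train/test setting the mean and SD would normally be estimated on the training set only and then reused for the test set.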
Factor selection
In order to capture a broader range of factors relevant to the outcome and to ensure that the included factors are well represented and generalizable, we used three different methods for factor screening in the training set 22 and took the union of the three sets of results. (a) Univariate logistic regression (LR): a univariate LR analysis was performed for each candidate factor against the outcome, selecting factors with two-tailed
The screened factors were then subjected to correlation analysis using Pearson and Spearman correlation coefficients. A correlation coefficient >0.6 was considered a strong correlation.
The last step was a multivariable LR analysis. To avoid multicollinearity, factors with a correlation greater than 0.6 were entered into the model separately. The intersection of the results was taken, and the factors obtained were carried forward to the next step of the analysis.
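The correlation screen can be sketched as follows. This is an illustration only: it uses the Pearson coefficient alone (the study also used Spearman), compares the absolute value of the coefficient with 0.6 (an assumption, since the text states only ">0.6"), and runs on invented feature values.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def strongly_correlated_pairs(features, threshold=0.6):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    pairs = []
    for a, b in combinations(sorted(features), 2):
        r = pearson(features[a], features[b])
        if abs(r) > threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Hypothetical values for three candidate factors.
features = {
    "age": [50, 60, 70, 80, 90],
    "cci": [2, 3, 5, 6, 8],
    "glucose_max": [140, 95, 180, 100, 150],
}
pairs = strongly_correlated_pairs(features)
```

Pairs flagged this way would not be entered into the same multivariable model; in the toy data only `age` and `cci` exceed the threshold.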
Statistical analysis
We performed descriptive analyses of all individuals found in the database who met the study requirements. Mean ± standard deviation (SD) was used to describe normally distributed continuous factors, and median and IQR (interquartile range) were used to describe skewed continuous factors. Categorical factors were statistically described using frequencies (component ratios). Differences in continuous factors were tested using Student's
The dataset was randomly divided into a training set and a test set at a sample size ratio of 7:3. The training set was used to select factors and train the models, and the test set was used to validate model performance. Following past studies, SMOTENC was applied to the training set to address class imbalance and thereby improve model performance. 24 Based on the related literature,25,26 and considering both model complexity and the interpretability of results, seven commonly used prediction models with high reported accuracy were selected, covering both classical machine-learning and deep learning models: DT, random forest (RF), Gradient Boosted Decision Tree (GBDT), Extreme Gradient Boosting (XGBoost), LR, DNN, and Oblique Decision Random Forest (ODRF). The training and testing process involved 5-fold cross-validation. Prediction accuracy, AUC, precision, recall, F1 score, Positive Predictive Value (PPV), and Negative Predictive Value (NPV) were then computed and compared to assess model performance and to identify the best model for predicting the risk and prognosis of IS in hypertensive patients. Finally, the best model was interpreted using SHapley Additive exPlanations (SHAP).
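For reference, the evaluation metrics listed above can be computed from a confusion matrix as in the sketch below. The labels, scores, and the 0.5 decision threshold are invented for illustration. Note that precision and PPV are the same quantity, TP / (TP + FP). The AUC is computed here through its rank interpretation (the probability that a random positive case scores higher than a random negative case), which is one of several equivalent ways to obtain it.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision (= PPV), recall, F1, and NPV from hard predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "ppv": precision,  # identical to precision by definition
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
        "npv": tn / (tn + fn) if tn + fn else 0.0,
    }


def auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test-set labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
```

With a rare outcome such as IS in this cohort, accuracy and NPV can be high even for a weak model, which is why precision, recall, F1, and AUC are reported alongside them.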
The descriptive analysis and factor selection were performed using the R programming language (version 4.1.3, R Core Team, Vienna, Austria) and RStudio software (version 2022.02.1, RStudio Team, Boston, MA, USA). The modeling and model interpretation were performed using the Python programming language (version 3.7, Python Core Development Team, Virginia, USA) and PyCharm software (version 2020.3.4, JetBrains, Czech Republic). The statistical significance level was 0.05 (two-tailed).
Results
Baseline characteristics
Table 1 shows the demographic features of the participants in the IS risk study: 90,768 hypertensive patients who did not suffer from IS and 1746 who did. The mean age was 68.1 ± 14.7 years. The difference in BMI between the two groups was not statistically significant. Hyperlipidemia was the most common comorbidity (49.1%); 45.6% of patients were taking statins and 42.6% were taking aspirin.
Table 1. Baseline characteristics of participants.
In the IS prognosis study, 907 hypertensive patients with IS survived and 839 died; their demographic characteristics are also shown in Table 1. The mean age was 75.3 ± 12.5 years. The difference in BMI between the two groups was not statistically significant. Hyperlipidemia was the most common comorbidity (62.6%); 74.6% of patients were taking aspirin and 67.4% were taking statins. Descriptive analyses of the laboratory results and medications are provided in Supplementary Table S1.
Factor selection
For the IS risk study, the predictors most closely related to the outcome were age, sex, peripheral vascular disease, renal disease, aspirin, glucose maximum, COPD, hyperlipidemia, amlodipine, sepsis, neutrophil maximum, statins, and BMI. For the IS prognosis study, the predictors most strongly associated with the outcome were age, CCI, RDW maximum, BMI, sepsis, triglyceride, rivaroxaban, respiratory failure, statins, and acetaminophen.
Model development and validation
Because of the imbalance between hypertensive patients with and without IS, the SMOTENC balancing technique was applied to the training dataset before modeling. We trained the machine-learning algorithms on the training dataset and validated the models on the test dataset (parameter settings are shown in Supplementary Table S2).
The prediction results of the seven machine-learning algorithms are shown in Supplementary Table S3. The comparison of AUC of the seven machine-learning algorithms is shown in Figure 1A and B. From Supplementary Table S3, it can be seen that the model with the highest accuracy and AUC is the GBDT model.

Figure 1. Performance evaluation of seven machine-learning algorithms with ROC curves. (A) ROC curves of seven models for predicting risk of stroke in hypertensive patients. (B) ROC curves of seven models for predicting prognosis of stroke in hypertensive patients.
Model interpretation
Our study found that the GBDT model performed the best in risk prediction and prognosis prediction.
The basic idea of GBDT is to combine many weak base classifiers into one strong classifier. 27 The advantages of GBDT are good training results, less overfitting, and flexibility in handling various data types, including continuous and discrete values. 28 GBDT is a model with strong generalization capability, and a number of medical studies have used it. 29
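The idea of combining weak learners can be shown with a deliberately simplified sketch: one-split trees (stumps) are fitted in sequence to the residuals of the current ensemble under squared-error loss. This is not the configuration used in the study, which is a classifier whose settings are given in Supplementary Table S2; the data, number of rounds, and learning rate are all hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, split, low, high = best
    return lambda x: low if x <= split else high


def fit_gradient_boosting(xs, ys, n_rounds=20, learning_rate=0.3):
    """Additive model: mean prediction plus shrunken stumps fitted to residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical one-feature data with a noisy step in the outcome.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

overall_mean = sum(ys) / len(ys)
baseline_mse = mean_squared_error(lambda x: overall_mean, xs, ys)
boosted = fit_gradient_boosting(xs, ys)
boosted_mse = mean_squared_error(boosted, xs, ys)
```

Each stump on its own is a weak predictor, but the sum of many shrunken stumps fits the training data much more closely than the constant baseline. The learning rate controls how much each stump contributes, which is one of the levers used to limit overfitting.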
We then performed a SHAP analysis of the GBDT model to reveal the distribution of the effects of each selected factor.
First, to determine the importance of each feature to the predictive model, the SHAP summary of the GBDT model was plotted (Figure 2).

Figure 2. Summary of SHAP for the GBDT model. (A) The higher the SHAP value of a feature, the more likely IS is to occur. Each point represents the SHAP value of one feature for one patient, so each patient contributes one point to the row of each feature. Points are colored according to the feature value of the corresponding patient and stacked vertically to depict density. Red indicates higher feature values and blue indicates lower feature values. (B) Feature importance, measured as the mean of the absolute SHAP values for each feature. (C) The higher the SHAP value of a feature, the more likely it is that a poor prognosis for IS will occur. (D) Feature importance, measured as the mean of the absolute SHAP values for each feature.
Red dots indicate high feature values and blue dots indicate low feature values. 30 As shown in Figure 2A and B, high values of age, BMI, neutrophil maximum, and glucose maximum correspond to SHAP values greater than zero, suggesting that these characteristics are important risk factors.
As shown in Figure 2C and D, high values for age, CCI, RDW maximum, and triglyceride corresponded to SHAP values greater than zero. This suggests that these features are important factors for the prognosis.
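To make the meaning of a SHAP value concrete, the sketch below computes exact Shapley values for a two-feature toy score by averaging each feature's marginal contribution over all feature orderings. The scoring function, the reference patient, and the feature names are hypothetical. SHAP libraries for tree models use faster algorithms and average over a background dataset rather than a single reference point, so this is a conceptual illustration only.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values of `instance` relative to a single `baseline` point."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = instance[name]  # switch this feature to its actual value
            value = model(current)
            totals[name] += value - previous  # marginal contribution in this ordering
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_risk(f):
    """Hypothetical risk score with an age x glucose interaction term."""
    return 0.02 * f["age"] + 0.5 * f["high_glucose"] + 0.01 * f["age"] * f["high_glucose"]


baseline = {"age": 60, "high_glucose": 0}  # hypothetical reference patient
patient = {"age": 80, "high_glucose": 1}

phi = shapley_values(toy_risk, patient, baseline)
```

The contributions add up to the difference between the patient's score and the reference score, which is the additivity property that lets a SHAP value above zero be read as pushing the prediction toward the outcome. The interaction term is shared between the two features.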
Second, the practical application of the model takes the form shown in Supplementary Figure S2. Red areas indicate that the feature value increases the probability of the outcome occurring, while blue areas indicate that the feature value decreases it.
The importance ranking of factor interactions was plotted as shown in Figure 3A and B. The interactions between age and BMI, glucose maximum and aspirin, age and glucose maximum, and sex and age were significant for IS risk prediction. As shown in Figure 3B, the interactions between age and CCI, age and RDW maximum, age and respiratory failure, CCI and triglyceride, age and triglyceride, CCI and RDW maximum were significant for the prognostic prediction.

Figure 3. Ranking of the importance of variable interactions. (A) Important risk factors in the predictive model of the risk of ischemic stroke in hypertensive patients. (B) Important risk factors in the predictive model of the prognosis of ischemic stroke in hypertensive patients.
As shown in Figure 4A, with increasing age, the risk of IS is higher in hypertensive patients with higher BMI, especially those between 40 and 90 years of age.

Figure 4. Variable interaction dependency plots. (A) Interaction between age and BMI on the risk of having an ischemic stroke. (B) Interaction between blood glucose maximum and aspirin on the risk of having an ischemic stroke. (C) Interaction between age and blood glucose maximum on the risk of having an ischemic stroke. (D) Interaction between sex and age on the risk of having an ischemic stroke. (E) Interaction between age and the CCI on the prognosis of having an ischemic stroke. (F) Interaction between age and RDW maximum on the prognosis of having an ischemic stroke. (G) Interaction between age and respiratory failure on the prognosis of having an ischemic stroke. (H) Interaction between the CCI and triglyceride on the prognosis of having an ischemic stroke. (I) Interaction between age and triglyceride on the prognosis of having an ischemic stroke. (J) Interaction between the CCI and RDW maximum on the prognosis of having an ischemic stroke.
As shown in Figure 4B, among hypertensive patients with higher-than-normal blood glucose (the normal range of fasting blood glucose is 3.9–6.1 mmol/L; the database records mg/dL, and 6.1 mmol/L is approximately equal to 109.8 mg/dL), there are more red dots below SHAP value = 0 than above it, indicating that taking aspirin is more effective in reducing the risk of IS in hypertensive patients with high blood glucose values.
As shown in Figure 4C, with increasing age, blood glucose values outside the normal range increase the risk, especially in patients between 40 and 90 years of age.
As shown in Figure 4D, females (gender = 0) should pay attention to IS prevention after 60 years of age, and males (gender = 1) from around 45 years of age.
As shown in Figure 4E, the CCI increases with age, and a higher CCI at older ages is detrimental to the prognosis.
The normal range of RDW is 11.5% to 14.5%. As shown in Figure 4F, high RDW values are detrimental to the prognosis as age increases.
As shown in Figure 4G, with increasing age, having respiratory failure is detrimental to the prognosis.
The normal range for triglyceride is 0.45 to 1.69 mmol/L. The database records mg/dL (mg/dL × 0.01129 ≈ mmol/L), so values exceeding 149.7 mg/dL are outside the normal range. As shown in Figure 4H, an increasing share of patients had triglycerides outside the normal range as the CCI increased. Both comorbidity with other diseases and triglycerides exceeding the normal range are detrimental to the prognosis.
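The unit conversions behind the glucose and triglyceride thresholds can be checked with a short sketch. The conversion factors are assumptions based on commonly used values (about 18 mg/dL per mmol/L for glucose and about 88.57 mg/dL per mmol/L for triglycerides); with these factors the upper limits quoted in the text, 109.8 mg/dL and 149.7 mg/dL, are reproduced.

```python
# Assumed conversion factors (mg/dL per mmol/L); not stated explicitly in the text.
GLUCOSE_MG_DL_PER_MMOL_L = 18.0
TRIGLYCERIDE_MG_DL_PER_MMOL_L = 88.57


def mmol_l_to_mg_dl(value_mmol_l, factor):
    """Convert a concentration from mmol/L to mg/dL."""
    return value_mmol_l * factor


def mg_dl_to_mmol_l(value_mg_dl, factor):
    """Convert a concentration from mg/dL to mmol/L."""
    return value_mg_dl / factor


# Upper limits of the normal ranges quoted in the text, expressed in mg/dL.
glucose_upper_mg_dl = mmol_l_to_mg_dl(6.1, GLUCOSE_MG_DL_PER_MMOL_L)
triglyceride_upper_mg_dl = mmol_l_to_mg_dl(1.69, TRIGLYCERIDE_MG_DL_PER_MMOL_L)


def exceeds_upper_limit(value_mg_dl, upper_mg_dl):
    """Flag a laboratory value recorded in mg/dL as above the normal range."""
    return value_mg_dl > upper_mg_dl
```

For example, a recorded triglyceride value of 200 mg/dL would be flagged as above the normal range, while 120 mg/dL would not.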
As shown in Figure 4I, as age increases, the more triglycerides exceed the normal range, the less favorable the prognosis.
As shown in Figure 4J, as the CCI increases, the number of patients with RDW exceeding the normal range increases, and having other comorbidities together with RDW exceeding the normal range is detrimental to the prognosis.
Discussion
Over the past few decades, the global burden of stroke has increased dramatically, especially in low- and middle-income countries, because of increases in aging populations and in modifiable stroke risk factors. 3 The association between hypertension and increased risk of stroke has long been among the strongest and most widely recognized. 32
Our study, in line with other related studies, identified a number of factors that affect the risk and prognosis of IS in hypertensive patients. The importance of our study lies in the fact that we used statistical methods and machine-learning models to determine the factors affecting the onset and prognosis of IS in patients with hypertension and identified specific high-risk subgroups with above-average responses to specific risk factors.
Data imbalance is a common problem in many real-world datasets and can seriously affect machine-learning model performance, as many models are sensitive to the class distribution; we therefore balanced the training set. Feature importance is an important aspect of understanding the predictive power of a model. The GBDT model provides built-in methods for calculating feature importance, which help identify the features with the most impact on prediction. Taking all performance metrics together, the GBDT model achieved the best performance on the internal validation set compared with the other models. We used the SHAP method to interpret the GBDT model and explore the factors influencing the risk and prognosis of IS in hypertensive patients.
Both global and regional changes in brain tissue volume occur with age. 33 Aging is the most important factor influencing the incidence and prevalence of stroke. 34 The incidence of IS increases with age, especially for those aged 50 to 69 years or older. 35 Although age is non-modifiable, it is important to pay more attention to stroke prevention as age increases.
Previously published studies have suggested that the risk of IS is higher for those with obesity 36 or high blood glucose, 37 while RDW is associated with poor prognosis 38 and higher triglyceride indices are associated with a higher risk of poor functional prognosis and in-hospital mortality. 39 Our results are consistent with these findings, which supports the reliability of our analysis.
A published study found that neutrophils are the first cells in the peripheral blood to reach the infarcted area of the brain after the onset of IS. 40 Previous studies have found that the number of neutrophils in the area of cerebral infarction increases over time. 41 High neutrophil counts and neutrophil ratios were found to be associated with mild IS or transient ischemic attack, as well as an increased risk of IS. 42 Similarly, our study found that a high neutrophil ratio can be a risk factor and can serve as a diagnostic reference for clinicians.
According to our results, comorbidities are an unfavorable prognostic factor. Treatment after the first IS occurs should be accompanied by attention to the treatment and prevention of other comorbidities.
The relationships between the factors affecting IS in hypertensive patients are complex, and the effect of each factor is simultaneously influenced by the others. Statistically, an interaction represents the moderating effect between factors. 43 To better understand the relationships among factors associated with IS risk and prognosis in hypertensive patients, we also conducted an interaction effect analysis.
Gender differences in IS depend on the age of the patient with hypertension, as the impact of gender on stroke risk and prognosis changes throughout the life cycle. 44 This is consistent with our findings on the interaction of sex and age on the risk of developing IS. Our findings suggest that more attention should be given to the prevention of IS in men in middle age, whereas the incidence of IS begins to increase in women at older ages. This may be related to hormones. A large meta-analysis reported that the risk of stroke associated with metabolic syndrome is significantly higher in women than in men. 45 One study found that women with early menopause had a twofold increased risk of IS. 46
The interaction of age and BMI also influenced the risk. Changes in important risk factors for thromboembolic stroke during aging may be related to weight gain, with particular attention to elevated blood pressure or the acute phase of hypertension. 47 Therefore, it is important to focus on weight management for older hypertensive patients.
In the analysis, the interaction between aspirin and blood glucose was significant. Hyperglycemia increases the size of cerebral infarcts and the permeability of the blood–brain barrier, 48 so hypertensive patients need to pay attention to glycemic control in order to reduce the risk. 49 Meanwhile, antiplatelet drugs can be used to reduce the long-term risk of non-cardiogenic embolic IS, 50 so hypertensive patients may take aspirin as appropriate to prevent IS.
The interaction effect of age and blood glucose was also significant, suggesting that blood glucose control becomes more important in hypertensive patients as age increases.
Increased mortality in the elderly may be related to the effects of aging and late complications. 33 Results of a mixed-gender clinical study showed that older stroke patients appeared to be more likely to develop infectious complications, such as pneumonia and urinary tract infections. 51 In our study, the interaction of age and CCI, as well as the interaction between age and respiratory failure, were found to have some effect on the prognosis.
The interactions of CCI and RDW, as well as of age and RDW, also affected the prognosis, as suggested by our findings. A growing number of studies have shown that RDW is strongly associated with the incidence and prognosis of many diseases, such as myocardial infarction. 52 The higher the CCI, the less favorable the patient's prognosis. 53
Studies have shown that both inflammatory response and oxidative stress are associated with RDW. 54 Oxidative stress affects the life span of red blood cells, damages the red blood cell membrane, and increases the osmotic fragility of red blood cells as well as their ability to adhere and aggregate. 54 All these factors lead to changes in RDW. In addition, these pathological processes can promote coagulation and thrombosis. 55 Meanwhile, several studies have identified potential mechanisms for the effects of aging, which include the promotion of oxidative stress and inflammatory responses. 56 Both RDW and age are relatively simple and readily available indicators that can help physicians better identify and assess the prognosis.
In our study, we found that the interaction between CCI and triglyceride, as well as the interaction of age and triglyceride, also influenced the prognosis. Several studies have shown that plasma triglyceride concentrations increase with age. 57 Triglyceride accumulation leads to the formation of atherosclerotic plaques and is also a risk factor for diseases including peripheral arterial disease. 58 Control of triglyceride reduces the incidence of comorbid diseases and benefits the prognosis.
There are also some limitations to this study. First, this was a retrospective study based on a public database, so further prospective studies are needed to verify our findings. Second, several important measurements in IS research, such as NIHSS scores and the results of cranial CT and MRI, are not available in the MIMIC-IV database, which may cause loss of information and potential bias in the analysis. Third, missing data occurred for almost all factors; although we performed missing data imputation to minimize its influence, potential bias may not be totally avoided.
Conclusions
Taking all performance metrics together, the GBDT model achieved the best performance on the internal validation set compared with the other models. Subgroup identification analysis suggests that hypertensive patients with specific characteristics have a higher risk of IS and a poor prognosis. Our findings may provide a reference for the prevention of IS and the improvement of prognosis after disease onset in the hypertensive population.
Supplemental Material
Supplemental material, sj-docx-1-dhj-10.1177_20552076241288833 for Designing machine learning for big data: A study to identify factors that increase the risk of ischemic stroke and prognosis in hypertensive patients by Lingmin Gong, Shiyu Chen, Yuhui Yang, Weiwei Hu, Jiaxin Cai, Sitong Liu, Yaling Zhao, Leilei Pei, Jiaojiao Ma and Fangyao Chen in DIGITAL HEALTH
Footnotes
Availability of data and materials
Contributors
F.C. and J.M. contributed to the conception of the study, supervised the analysis, and conducted critical revision of the manuscript. L.G. conducted data curation and management, the formal analysis, the presentation of results and graphs, drafted the original manuscript, and the revision of the manuscript. S.C. and Y.Y. conducted data curation and the draft of the manuscript. W.H., J.C., and S.L. assisted in the analysis. Y.Z. and L.P. assisted in the revision of the manuscript.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical approval
Not applicable, because this article does not contain any studies with human or animal subjects.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Social Science Fund of China (21CTJ009), the Natural Science Basic Research Program of Shaanxi Province, China (2022JQ-769), and the National Natural Science Foundation of China (81703325).
Guarantor
FC.
Supplemental material
Supplemental material for this article is available online.
References
