
Abstract

Objective

This study aims to identify the most suitable machine-learning model for early heart disease risk screening in diabetic populations.

Methods

This retrospective cohort study utilized data from the China Health and Retirement Longitudinal Study, with baseline data from 2011 and follow-up data from 2020. Using features selected by Least Absolute Shrinkage and Selection Operator (LASSO) regression, we systematically constructed 16 distinct machine-learning models. Model performance was evaluated using a comprehensive set of metrics, including the area under the receiver operating characteristic curve, F1-score, sensitivity, specificity, precision, accuracy, and balanced accuracy. To interpret the decision-making process of the best-performing model, we conducted Shapley additive explanations (SHAP) analysis.

Results

After the 9-year follow-up period concluding in 2020, 157 of the 819 patients with diabetes at baseline (2011) developed heart disease. From the available features, LASSO regression selected 19 core features for model construction. Among the models developed, the K-Nearest Neighbors (KNN) model demonstrated optimal performance across key metrics, achieving the highest F1 score, balanced accuracy, and precision. The SHAP analysis identified body mass index, systolic blood pressure, and waist circumference as the three most important predictive features within the diabetic cohort. The contribution patterns of these features in the KNN model align closely with clinical expertise, achieving a strong balance between predictive power and interpretability.

Conclusion

This study developed a machine-learning model to predict heart disease risk in patients with diabetes. Although the model exhibited only modest predictive performance, it provides a valuable empirical foundation and clear direction for constructing more reliable and clinically useful prediction tools in this field.

Keywords

Diabetes, heart disease, machine learning, risk prediction

Introduction

Diabetes has become a major challenge in global public health, with its prevalence continuing to rise. In 2021, the number of people with diabetes worldwide reached 529 million, with an age-standardized total prevalence of 6.1%.¹ It is projected that by 2050, the number of patients will exceed 1.31 billion.¹ This increase is primarily driven by multiple factors such as global population aging, economic growth, rapid urbanization, and nutritional transitions in various countries.² Particularly in Asia, the rise in diabetes prevalence in China is significant,³ where diabetes and its preconditions are widespread among adults aged 45 and above.⁴

Diabetes is not merely a disease characterized by abnormal glucose metabolism; it is also a major contributor to various cardiovascular conditions, including diabetic cardiomyopathy, atherosclerosis, myocardial infarction, and heart failure.⁵ In fact, cardiovascular diseases have become the leading cause of morbidity and mortality among diabetic patients.⁶ The risk of major cardiovascular events in individuals with diabetes is twice as high as that in nondiabetic individuals of the same age and gender.⁷ Among these cardiovascular complications, heart failure represents one of the most serious prognostic challenges associated with diabetes. The rising prevalence of diabetes, combined with an aging population, further intensifies the epidemic of diabetes-related heart failure.⁸ There is a strong bidirectional relationship between diabetes and heart failure: diabetes increases the risk of developing heart failure by two- to fourfold, and the prognosis of heart failure patients with coexisting diabetes is significantly worse.⁹,¹⁰ Given the high disability rate and multiple complications linked to diabetes, developing effective preventive strategies is essential.² Importantly, diabetes is largely preventable, and early identification and intervention may even reverse the disease in some cases.¹ Therefore, exploring variations in risk factor profiles and diabetes burden across different populations is crucial for successfully managing diabetes risk factors within a complex and evolving set of drivers.¹

Currently, artificial intelligence and machine learning technologies are revolutionizing the medical field by providing data-driven personalized solutions for managing diabetes and its cardiovascular risks.¹¹ However, research on long-term heart disease risk prediction for diabetic patients remains relatively scarce, particularly in the development of machine learning models tailored to this population. This gap underscores an urgent need for in-depth exploration. Therefore, constructing a predictive model for heart disease risk in diabetic individuals is of great significance, as it can enable early and accurate identification of high-risk patients, thereby providing a solid basis for implementing personalized interventions and improving patient outcomes. This study aims to offer scientific insights for selecting the optimal model for early screening of heart disease risk in diabetic populations.

Materials and methods

Data source

This was a retrospective cohort study. The data for this study were derived from the China Health and Retirement Longitudinal Study (CHARLS), a high-quality longitudinal cohort study targeting the Chinese population. The baseline data were from 2011, and the follow-up data were from 2020. This study was conducted in strict accordance with the ethical principles set forth in the Declaration of Helsinki, as revised in 2024. As part of this commitment, all personally identifiable information of patients was removed to protect participant confidentiality. The reporting of this study follows the Strengthening the Reporting of Observational Studies in Epidemiology guidelines.¹²

The inclusion criteria for the study subjects were: (1) fasting state in 2011; (2) no history of heart disease in 2011; (3) fasting blood glucose ≥126 mg/dL in 2011. The exclusion criteria were defined correspondingly: (1) nonfasting state or missing related data in 2011; (2) a history of heart disease or missing such data in 2011; (3) fasting blood glucose <126 mg/dL or missing data in 2011; (4) missing data on heart disease status at the 2020 follow-up. The flowchart depicting the results of the eligibility assessment is shown in Figure 1.

Figure 1.

Flowchart of participant eligibility assessment.

In this study, participants with a fasting blood glucose value ≥126 mg/dL in 2011 were defined as having diabetes. Concurrently, within the CHARLS cohort, the occurrence of heart disease was determined through self-reporting, based on the question: “Has a doctor ever diagnosed you with heart disease, angina, coronary heart disease, heart failure, or other heart problems?” Participants who answered “yes” to this question were accordingly classified as having heart disease.

Variable selection

This study analyzed variables collected at baseline, which were categorized into two major groups. Demographic and anthropometric variables included gender, age, systolic blood pressure, diastolic blood pressure, pulse, waist circumference, height, and weight. Blood test indicators comprised white blood cell count (WBC), platelets, hemoglobin (HGB), hematocrit (HCT), blood urea nitrogen (BUN), creatinine (CREA), uric acid (UA), blood glucose (GLU), total cholesterol (CHOL), triglycerides (TG), high-density lipoprotein cholesterol (HDL), low-density lipoprotein cholesterol (LDL), and C-reactive protein (CRP).

Data cleaning

The missing rate for each variable was first calculated, as shown in Table 1. To handle these missing values, a random forest-based nonparametric imputation algorithm was applied (missForest package). The performance of this imputation model was internally validated using the Out-of-Bag error; specifically, the Normalized Root Mean Square Error was used for numeric variables and the Proportion of False Classifications for categorical variables. Prior to imputation, Winsorization was employed on continuous variables to mitigate the influence of extreme outliers. Finally, body mass index (BMI) was calculated from the imputed height and weight data, after which the original height and weight variables were removed from the dataset to prevent multicollinearity in subsequent analyses.
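
The Winsorization and BMI steps described above can be sketched as follows. This is a minimal illustration in Python using only the standard library; the study itself was run in R, and the source does not state the Winsorization percentiles or the measurement units, so the 1%/99% defaults and the assumption of weight in kilograms and height in centimetres are placeholders. The function names and sample values are hypothetical.

```python
import math


def winsorize(values, lower_q=0.01, upper_q=0.99):
    """Clip observed values to the [lower_q, upper_q] empirical quantiles.

    Missing entries (None) pass through unchanged so they can be imputed later.
    The 1%/99% defaults are illustrative assumptions, not the study's cutoffs.
    """
    observed = sorted(v for v in values if v is not None)

    def quantile(q):
        # Linear interpolation between neighbouring order statistics.
        pos = q * (len(observed) - 1)
        lo, hi = math.floor(pos), math.ceil(pos)
        return observed[lo] + (observed[hi] - observed[lo]) * (pos - lo)

    low, high = quantile(lower_q), quantile(upper_q)
    return [v if v is None else min(max(v, low), high) for v in values]


def body_mass_index(weight_kg, height_cm):
    """BMI in kg/m^2, assuming weight in kilograms and height in centimetres."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


# Hypothetical weights (kg) with one extreme value and one missing entry.
weights = [52.0, 60.5, None, 71.0, 66.0, 180.0]
print(winsorize(weights, 0.10, 0.90))
print(round(body_mass_index(70.0, 170.0), 1))  # 24.2
```

Because missing entries are left untouched, the clipping can run before imputation, which matches the order described in the text.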

Table 1.

Sample missing data overview.

Variables	Valid samples	Missing samples	Missing ratios
Gender	819	0	0.00%
Age	819	0	0.00%
Diastolic blood pressure	709	110	13.43%
Height	705	114	13.92%
Waist circumference	711	108	13.19%
Weight	710	109	13.31%
Pulse	709	110	13.43%
Systolic blood pressure	709	110	13.43%
White blood cell count	816	3	0.37%
Platelets	816	3	0.37%
Blood urea nitrogen	818	1	0.12%
Blood glucose	819	0	0.00%
Creatinine	816	3	0.37%
Total cholesterol	819	0	0.00%
Triglycerides	819	0	0.00%
High-density lipoprotein cholesterol	819	0	0.00%
Low-density lipoprotein cholesterol	809	10	1.22%
C-reactive protein	819	0	0.00%
Uric acid	819	0	0.00%
Hematocrit	818	1	0.12%
Hemoglobin	816	3	0.37%

Dataset partitioning and standardization

The complete, preprocessed dataset was partitioned using a stratified random sampling approach, allocating 70% of the data to a training set and 30% to a test set. To ensure comparability, all numeric variables subsequently underwent Z-score standardization. The success of the partitioning in creating comparable groups was evaluated by comparing baseline characteristics between the training and test sets. Continuous variables were analyzed using Student's t-test or the Wilcoxon rank-sum test, while categorical variables were analyzed using the chi-square test or Fisher's exact test, as appropriate.
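
A stratified 70/30 split followed by Z-score standardization might look like the sketch below. It is written in standard-library Python rather than the R tooling the study used, and it makes two assumptions the source does not confirm: the mean and standard deviation are estimated on the training set only and then applied to the test set (a common way to avoid information leakage), and per-class counts are rounded to the nearest integer. The function names and the seed are hypothetical.

```python
import random
import statistics


def stratified_split(labels, train_frac=0.7, seed=42):
    """Return (train_idx, test_idx), preserving the class ratio in each part."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))  # per-class training count
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return sorted(train_idx), sorted(test_idx)


def zscore_fit(train_values):
    """Estimate mean and sample standard deviation on the training data only."""
    return statistics.mean(train_values), statistics.stdev(train_values)


def zscore_apply(values, mean, sd):
    """Standardize values with parameters estimated elsewhere."""
    return [(v - mean) / sd for v in values]


# Outcome vector with the cohort's class sizes: 157 events, 662 non-events.
labels = [1] * 157 + [0] * 662
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), "events in the test set")
```

The exact test-set size depends on the rounding rule, so a different partitioning routine can differ from this sketch by a record or two.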

Feature selection

A Least Absolute Shrinkage and Selection Operator (LASSO) regression model was constructed on the training set. The optimal regularization parameter (λ) was identified through tenfold cross-validation, and features with non-zero regression coefficients at this optimal lambda.min value were selected. Subsequently, multicollinearity among these selected features was evaluated using the variance inflation factor (VIF).

Machine-learning model construction

Using the feature subset identified by LASSO, 16 distinct machine-learning models were systematically constructed with the caret package. The models included: Logistic Regression, Elastic Net, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Random Forest, Gradient Boosting Machine (GBM), XGBoost, LightGBM, AdaBoost, C5.0 Decision Tree, Bagged CART, Support Vector Machine with radial basis kernel (SVM Radial), K-Nearest Neighbors (KNN), Neural Network (average), Multi-Layer Perceptron (MLP), and Naive Bayes. All models were trained and their hyperparameters were tuned using repeated fivefold cross-validation (with 5 repeats) combined with a grid search. To address class imbalance in the training set, a class-weighting strategy—inversely proportional to class frequency—was implemented to increase the penalty for misclassifying the minority class.
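
To make the class-weighting idea concrete, the sketch below computes weights inversely proportional to class frequency. The normalization n / (k × n_c) is one common convention and is an assumption here; the source states only that the weights were inversely proportional to class frequency. For illustration the weights are computed from the full-cohort counts (157 events, 662 non-events), whereas the study applied the weighting to the training set.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * n_c), so rarer classes get larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


labels = [1] * 157 + [0] * 662
weights = inverse_frequency_weights(labels)
# Under this convention, each class contributes the same total weight.
print({cls: round(w, 3) for cls, w in weights.items()})
```

With these counts, an error on a heart disease case is penalized roughly 4.2 times as heavily as an error on a non-case (662/157).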

Model performance assessment and validation

The performance of all trained models was evaluated on the independent test set. Prior to this evaluation, an optimal probability cutoff threshold for each model was determined on the training set by maximizing the F1-Score, thereby converting predicted probabilities into binary classifications. A comprehensive set of metrics was then employed for assessment, including the area under the receiver operating characteristic curve (ROC-AUC), F1-Score, sensitivity, specificity, precision, accuracy, and balanced accuracy. To ensure robust estimation, point estimates for all performance metrics, along with their 95% confidence intervals (CIs), were calculated using the bias-corrected and accelerated bootstrap method.
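
The threshold-selection step can be illustrated with a small grid search. The sketch below scans cutoffs from 0.01 to 0.99 and keeps the one with the highest F1-score on the training predictions. The step size of 0.01, the rule that a probability at or above the cutoff counts as positive, and the tie-breaking toward the lowest cutoff are assumptions; the probabilities and labels are hypothetical.

```python
def f1_at(threshold, probs, labels):
    """F1-score when probabilities >= threshold are classified as positive."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(probs, labels):
    """Scan cutoffs 0.01, 0.02, ..., 0.99; ties resolve to the lowest cutoff."""
    grid = [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda t: f1_at(t, probs, labels))


# Hypothetical training-set probabilities and true labels.
probs = [0.10, 0.20, 0.35, 0.40, 0.80, 0.90]
labels = [0, 0, 1, 0, 1, 1]
cutoff = best_f1_threshold(probs, labels)
print(cutoff, round(f1_at(cutoff, probs, labels), 3))
```

The cutoff is chosen on training data and then held fixed for the test set, so the reported test metrics are not tuned on the data they are measured on.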

Model interpretability analysis

To elucidate the decision-making process of the best-performing model, a Shapley additive explanations (SHAP) analysis was conducted. This interpretability framework provides in-depth explanations by quantifying the contribution of each feature to individual predictions (local interpretability) and across the entire dataset (global interpretability).
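
As a rough illustration of what a Shapley value measures, the sketch below computes exact Shapley values for a toy three-feature risk score by averaging each feature's marginal contribution over all feature orderings. It is not the explainer used in the study, which the source does not specify; practical SHAP implementations approximate this quantity against background data. The toy scoring function, its coefficients, and the patient and baseline values are all hypothetical.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values: mean marginal contribution over all orderings.

    Features not yet 'switched on' are held at their baseline value.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = predict(current)
        for j in order:
            current[j] = x[j]          # reveal feature j
            new = predict(current)
            phi[j] += new - prev       # marginal contribution of feature j
            prev = new
    return [p / len(orderings) for p in phi]


def toy_risk(v):
    """Hypothetical score over (BMI, systolic BP, waist); coefficients are made up."""
    bmi, sbp, waist = v
    return (0.02 * max(0.0, bmi - 25)
            + 0.005 * max(0.0, sbp - 125)
            + 0.01 * max(0.0, waist - 90))


patient = [27.0, 140.0, 95.0]    # hypothetical individual
reference = [24.0, 120.0, 88.0]  # hypothetical baseline profile
phi = shapley_values(toy_risk, patient, reference)
print([round(p, 3) for p in phi])
```

The contributions sum to the difference between the patient's score and the baseline score. This additivity is what lets per-individual (local) explanations be aggregated into a global importance ranking.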

Statistical methods

All statistical analyses and model construction procedures were performed using R software (version 4.4.1). Unless otherwise specified, all statistical tests were two-sided, and a p-value of less than 0.05 was defined as the threshold for statistical significance.

Results

Baseline characteristics of study participants

After a 9-year follow-up period concluding in 2020, 157 of the 819 patients with diabetes at baseline (2011) developed heart disease. A comparison of baseline characteristics between the heart disease group and the non-heart disease group is detailed in Table 2. The two groups differed significantly in gender, age, waist circumference, systolic blood pressure, platelets, BUN, blood glucose, UA, and BMI, suggesting that these baseline factors are associated with the subsequent development of heart disease in this cohort.

Table 2.

Comparison of baseline characteristics.

Characteristic	Overall (N = 819)	Heart disease (N = 157)	Non-heart disease (N = 662)	p value
Gender				0.003
Female	419 (51%)	97 (62%)	322 (49%)
Male	400 (49%)	60 (38%)	340 (51%)
Age (years)				0.006
Median (Q1, Q3)	58.0 (52.0, 64.0)	60.0 (53.0, 64.0)	57.0 (52.0, 63.0)
Diastolic blood pressure (mmHg)				0.051
Median (Q1, Q3)	77.5 (71.0, 83.0)	78.5 (72.5, 84.0)	77.1 (71.0, 83.0)
Waist circumference (cm)	88.0 ± 9.7	90.0 ± 9.0	87.5 ± 9.7	<0.001
Pulse (beats/min)				0.259
Median (Q1, Q3)	73.5 (68.0, 79.5)	73.8 (69.0, 80.5)	73.5 (68.0, 79.0)
Systolic blood pressure (mmHg)				<0.001
Median (Q1, Q3)	131.4 (119.7, 142.0)	136.0 (126.0, 146.0)	130.3 (119.0, 141.0)
White blood cell count (×10⁹/L)				0.050
Median (Q1, Q3)	6.2 (5.2, 7.5)	6.4 (5.3, 8.1)	6.1 (5.2, 7.4)
Platelets (×10⁹/L)				0.016
Median (Q1, Q3)	205.0 (161.0, 253.0)	216.0 (168.0, 272.0)	202.5 (160.0, 249.0)
Blood urea nitrogen (mg/dL)				0.025
Median (Q1, Q3)	15.4 (12.9, 18.5)	15.1 (12.7, 17.3)	15.5 (13.1, 18.8)
Blood glucose (mg/dL)				0.021
Median (Q1, Q3)	146.7 (132.5, 184.0)	155.5 (135.5, 196.4)	145.6 (132.1, 180.4)
Creatinine (mg/dL)				0.092
Median (Q1, Q3)	0.8 (0.7, 0.9)	0.7 (0.7, 0.9)	0.8 (0.7, 0.9)
Total cholesterol (mg/dL)				0.536
Median (Q1, Q3)	200.3 (174.4, 228.9)	199.5 (175.1, 229.6)	200.3 (174.0, 228.5)
Triglycerides (mg/dL)				0.614
Median (Q1, Q3)	146.9 (94.7, 245.1)	158.4 (98.2, 235.4)	145.6 (93.8, 246.9)
High-density lipoprotein cholesterol (mg/dL)				0.181
Median (Q1, Q3)	43.3 (34.8, 54.1)	42.9 (35.2, 49.9)	44.1 (34.8, 54.9)
Low-density lipoprotein cholesterol (mg/dL)				0.224
Median (Q1, Q3)	113.3 (85.4, 138.8)	116.0 (93.2, 144.6)	112.9 (84.3, 136.5)
C-reactive protein (mg/L)				0.167
Median (Q1, Q3)	1.3 (0.7, 2.8)	1.4 (0.8, 2.9)	1.2 (0.6, 2.7)
Uric acid (mg/dL)				0.011
Median (Q1, Q3)	4.4 (3.6, 5.3)	4.1 (3.5, 5.0)	4.5 (3.7, 5.4)
Hematocrit (%)				0.511
Median (Q1, Q3)	42.3 (38.6, 45.9)	42.4 (39.6, 45.8)	42.3 (38.3, 46.0)
Hemoglobin (g/dL)				0.418
Median (Q1, Q3)	14.5 (13.3, 15.9)	14.7 (13.4, 16.2)	14.5 (13.3, 15.8)
Body mass index (kg/m²)				0.006
Median (Q1, Q3)	24.4 (22.3, 26.7)	25.3 (22.8, 27.2)	24.2 (22.2, 26.5)

Feature selection results

The LASSO regression analysis selected 19 core features for subsequent modeling: TG, CHOL, LDL, waist circumference, BMI, HDL, HGB, HCT, UA, CREA, gender, BUN, systolic blood pressure, age, CRP, WBC, platelets, GLU, and pulse. Multicollinearity assessment confirmed that the VIF for all selected features was below the common threshold of 10, indicating no significant multicollinearity concerns. These 19 features were therefore retained for the construction and evaluation of all machine learning models.

Performance validation of machine learning models

The optimal hyperparameter configurations for the sixteen machine learning models are provided in Supplemental Table 1. All models were evaluated using a comprehensive set of performance metrics, including ROC-AUC, F1-Score, sensitivity, specificity, precision, accuracy, and balanced accuracy. A heatmap visualizing these performance metrics across all datasets is presented in Figure 2, and a comprehensive performance comparison is shown in Figure 3. The 95% CIs for the test set performance metrics are presented in Table 3. The performance of the KNN model on the independent test set was as follows: ROC-AUC (95% CI) = 0.613 (0.505–0.707), balanced accuracy (95% CI) = 0.606 (0.524–0.686), accuracy (95% CI) = 0.678 (0.596–0.722), F1-score (95% CI) = 0.368 (0.255–0.476), sensitivity (95% CI) = 0.489 (0.354–0.667), specificity (95% CI) = 0.722 (0.648–0.772), and precision (95% CI) = 0.295 (0.189–0.400).

Figure 2.

Performance metrics of the 16 machine-learning models evaluated. C5.0, C5.0 Decision Tree; GBM, Gradient Boosting Machine; KNN, K-Nearest Neighbors; LDA, Linear Discriminant Analysis; MLP, Multi-Layer Perceptron; QDA, Quadratic Discriminant Analysis; SVM Radial, Support Vector Machine with radial basis kernel.

Figure 3.

Comprehensive performance comparison. C5.0, C5.0 Decision Tree; GBM, Gradient Boosting Machine; KNN, K-Nearest Neighbors; LDA, Linear Discriminant Analysis; MLP, Multi-Layer Perceptron; QDA, Quadratic Discriminant Analysis; SVM Radial, Support Vector Machine with radial basis kernel.

Table 3.

Performance of 16 machine learning models in the test set.

Model	ROC-AUC (95%CI)	F1 score (95%CI)	Sensitivity (95%CI)	Specificity (95%CI)	Precision (95%CI)	Accuracy (95%CI)	Balanced accuracy (95%CI)
SVM Radial	0.654 (0.554–0.739)	0.322 (0.244–0.383)	1	0	0.192 (0.139–0.237)	0.192 (0.139–0.237)	0.5
MLP	0.630 (0.544–0.707)	0.358 (0.257–0.462)	0.468 (0.320–0.608)	0.727 (0.651–0.780)	0.289 (0.203–0.408)	0.678 (0.607–0.731)	0.598 (0.523–0.675)
LDA	0.617 (0.539–0.703)	0.348 (0.250–0.456)	0.511 (0.355–0.667)	0.662 (0.601–0.734)	0.264 (0.180–0.366)	0.633 (0.576–0.697)	0.586 (0.505–0.671)
Neural Network	0.616 (0.519–0.703)	0.355 (0.271–0.484)	0.532 (0.368–0.658)	0.652 (0.579–0.716)	0.266 (0.195–0.385)	0.629 (0.560–0.690)	0.592 (0.509–0.669)
Logistic Regression	0.613 (0.518–0.699)	0.347 (0.244–0.446)	0.553 (0.395–0.690)	0.611 (0.541–0.672)	0.252 (0.171–0.347)	0.600 (0.527–0.653)	0.582 (0.500–0.661)
KNN	0.613 (0.505–0.707)	0.368 (0.255–0.476)	0.489 (0.354–0.667)	0.722 (0.648–0.772)	0.295 (0.189–0.400)	0.678 (0.596–0.722)	0.606 (0.524–0.686)
Naive Bayes	0.609 (0.507–0.700)	0.342 (0.243–0.450)	0.426 (0.295–0.571)	0.747 (0.687–0.811)	0.286 (0.196–0.397)	0.686 (0.624–0.739)	0.587 (0.515–0.663)
AdaBoost	0.579 (0.473–0.653)	0.296 (0.207–0.397)	0.426 (0.292–0.602)	0.657 (0.589–0.719)	0.227 (0.149–0.322)	0.612 (0.551–0.661)	0.541 (0.462–0.616)
Bagged CART	0.571 (0.473–0.664)	0.272 (0.156–0.387)	0.298 (0.163–0.435)	0.788 (0.721–0.842)	0.250 (0.137–0.374)	0.694 (0.629–0.751)	0.543 (0.474–0.614)
LightGBM	0.563 (0.480–0.650)	0.235 (0.127–0.357)	0.255 (0.136–0.394)	0.783 (0.726–0.837)	0.218 (0.113–0.335)	0.682 (0.620–0.731)	0.519 (0.454–0.591)
Random Forest	0.548 (0.458–0.643)	0.264 (0.161–0.400)	0.255 (0.151–0.399)	0.838 (0.788–0.884)	0.273 (0.147–0.411)	0.727 (0.661–0.776)	0.547 (0.487–0.628)
GBM	0.540 (0.451–0.645)	0.224 (0.129–0.324)	0.298 (0.162–0.451)	0.677 (0.610–0.741)	0.179 (0.097–0.272)	0.604 (0.538–0.669)	0.487 (0.414–0.581)
QDA	0.524 (0.434–0.631)	0.288 (0.182–0.405)	0.340 (0.210–0.488)	0.758 (0.688–0.812)	0.250 (0.145–0.355)	0.678 (0.612–0.731)	0.549 (0.485–0.636)
XGBoost	0.485 (0.397–0.575)	0.322 (0.250–0.383)	1	0	0.192 (0.143–0.237)	0.192 (0.143–0.237)	0.5
C5.0	0.454 (0.371–0.545)	0.135 (0.050–0.293)	0.106 (0.034–0.250)	0.889 (0.827–0.923)	0.185 (0.059–0.367)	0.739 (0.661–0.784)	0.498 (0.455–0.557)
Elastic Net	0.5	0.322 (0.250–0.383)	1	0	0.192 (0.143–0.237)	0.192 (0.143–0.237)	0.5
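
As a consistency check on Table 3, the sketch below recomputes the threshold-based metrics from confusion-matrix counts. The counts are not reported in the source. They are inferred here from the published rates, which are consistent with a test set of 245 participants containing 47 heart disease cases, and should be read as an assumption. The second call shows what happens when a model labels every participant as positive, which is the pattern seen in the SVM Radial, XGBoost, and Elastic Net rows (sensitivity 1, specificity 0).

```python
def classification_metrics(tp, fp, fn, tn):
    """Threshold-based metrics from the four confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Inferred (not reported) counts for the KNN model on the test set.
knn = classification_metrics(tp=23, fp=55, fn=24, tn=143)
# A degenerate model that predicts "heart disease" for everyone.
all_positive = classification_metrics(tp=47, fp=198, fn=0, tn=0)

for name, m in (("KNN", knn), ("all-positive", all_positive)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Under these inferred counts, the KNN values round to the figures in Table 3. The all-positive case reproduces the F1-score of 0.322 and precision of 0.192 reported for the three degenerate models, which shows that an F1-score near 0.32 is attainable in this test set without any discrimination.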

Shapley additive explanation interpretability analysis

To interpret the predictions of the KNN model, the SHAP framework was applied. The analysis identified BMI, systolic blood pressure, and waist circumference as the three most important predictive features, as shown in Figure 4. This ranking is visualized in the SHAP summary plot (Figure 5), where features are ordered by importance on the Y-axis and the corresponding SHAP values on the X-axis represent their impact on the model's output. The plot demonstrates that higher values of BMI, systolic blood pressure, and waist circumference are associated with increased SHAP values, meaning they significantly elevate the model's predicted probability of heart disease. The SHAP dependence plots for the top three most important features (Figure 6) reveal consistent positive trends in their contributions to the model's prediction beyond specific thresholds. Once BMI exceeds 25 kg/m², its influence turns positive and increases with further elevation. Similarly, systolic blood pressure exerts a consistently positive effect after surpassing 125 mmHg. Waist circumference also demonstrates an almost linear upward trend, generating a systematically positive contribution once it exceeds 90 cm.

Figure 4.

Feature importance ranking based on SHAP analysis. BMI, body mass index; CREA, creatinine; BUN, blood urea nitrogen; WBC, white blood cell count; UA, uric acid; CRP, C-reactive protein; CHOL, total cholesterol; HCT, hematocrit; HDL, high-density lipoprotein cholesterol; TG, triglycerides; HGB, hemoglobin; LDL, low-density lipoprotein cholesterol; SHAP, Shapley additive explanation.

Figure 5.

Shapley additive explanation (SHAP) summary plot. BMI, body mass index; CREA, creatinine; BUN, blood urea nitrogen; WBC, white blood cell count; UA, uric acid; CRP, C-reactive protein; CHOL, total cholesterol; HCT, hematocrit; HDL, high-density lipoprotein cholesterol; TG, triglycerides; HGB, hemoglobin; LDL, low-density lipoprotein cholesterol.

Figure 6.

Shapley additive explanation (SHAP) dependence plots for the top three predictive features: (A) BMI, (B) systolic blood pressure, and (C) waist circumference. BMI, body mass index.

Among the 16 models, the KNN model achieved the highest F1 score (0.368), balanced accuracy (0.606), and precision (0.295) on the test set. Moreover, the SHAP analysis indicates that its feature contribution patterns align closely with clinical expertise, offering a reasonable balance between predictive power and interpretability. Therefore, the KNN model was selected as the best candidate for further application.

Discussion

In recent years, machine-learning models have been widely applied to heart-disease prediction. For example, several studies have utilized the diabetes complications screening research initiative (DiScRi) dataset to build models for predicting the co-occurrence of diabetes and cardiovascular disease¹³; another proposed an artificial intelligence model to assess coronary heart disease risk specifically in patients with type 2 diabetes.¹⁴ Additionally, an XGBoost model was constructed and validated for its predictive value in elderly patients with both diabetes and coronary heart disease,¹⁵ while other researchers used U.S. National Health and Nutrition Examination Survey data to build models for cardiovascular risk prediction in individuals with type 2 diabetes.¹⁶ Although these models have demonstrated relatively good performance, they all rely on cross-sectional data, where exposure factors and outcomes were collected simultaneously. This design lacks long-term follow-up, consequently limiting their ability to predict long-term heart-disease risk—a pronounced limitation. In contrast, some studies based on the CHARLS cohort have employed machine learning to effectively identify heart-disease risk among elderly hypertensive patients.¹⁷ Further research utilizing CHARLS follow-up data has analyzed the 9-year incidence of cardiovascular disease among middle-aged and older adults in China and constructed corresponding prediction models.¹⁸ These studies benefit from long-term follow-up information but do not specifically target populations with diabetes.

This study aims to develop a machine-learning model that utilizes baseline characteristics to predict the long-term risk of heart disease in a diabetic population. The distinctive feature of this research, compared with the aforementioned studies, is its specific focus on patients with diabetes and its use of follow-up data spanning up to 9 years, thereby providing evidence for early risk identification. Furthermore, the study is designed to be cost-effective, as its predictors are easily obtainable, objective, and reliable measures (routine physical measurements and hematological indicators), which avoids the potential biases associated with subjective predictor data such as self-reports.

During the study, of the 16 machine-learning prediction models built upon 19 variables selected by the LASSO method, the KNN model demonstrated the best performance. The SHAP analysis identified BMI, systolic blood pressure, and waist circumference as the three most important predictive features. The SHAP dependence plots revealed that all three features were positively associated with SHAP values, indicating that higher BMI, elevated systolic blood pressure, and larger waist circumference all correspond to increased heart-disease risk—a finding fully consistent with established clinical knowledge.

This alignment is well-supported by existing literature. Previous studies have shown that obesity in early adulthood significantly increases cardiovascular disease risk,¹⁹ while elevated systolic blood pressure variability is associated with higher cardiovascular risk.²⁰,²¹ Interventions targeting hypertension may mitigate the coronary heart disease risk associated with high BMI,²² and maintaining systolic blood pressure below 130 mmHg can substantially reduce the risk of major cardiovascular events and all-cause mortality.²³ Furthermore, central obesity has been positively correlated with cardiovascular disease events,²⁴ with waist circumference serving as a cardiovascular risk marker independent of BMI,²⁵ a relationship confirmed across multiple studies.²⁶ In diabetic populations, where cardiovascular risk factors such as obesity, hypertension, and dyslipidemia frequently coexist,⁶ these associations become particularly relevant. Blood pressure changes represent a shared risk factor for both diabetes and cardiovascular disease,¹³ while hypertension significantly influences diabetes prevalence.³ Therefore, the positive association between these anthropometric and hemodynamic measures and heart disease risk in diabetic patients underscores the clinical importance of early intervention and ongoing management of these modifiable risk factors.

This study has many limitations. Firstly, we strictly adopted “fasting blood glucose ≥126 mg/dL” as the sole criterion for defining diabetes, whereas other studies often employ additional methods.^18,27 We acknowledge that this approach may lead to an underestimation of the overall diabetes prevalence. However, this trade-off was necessary to prioritize definitional purity for the purpose of this study. The decision to rely on a single biomarker, rather than integrating multiple indicators as is common in other studies, was primarily driven by the need to adhere to international authoritative guidelines from the World Health Organization and American Diabetes Association, thereby maximizing diagnostic accuracy and ensuring the international comparability of our findings. This approach also safeguards objectivity and comparability, as the use of a standardized fasting blood glucose test effectively avoids the subjective influences—such as recall bias and diagnostic variation—inherent in self-reported data, ensuring all cases are identified by a single, objective criterion. Furthermore, this specific definition enables a precise focus on undiagnosed diabetes, allowing for the accurate identification of cases not yet clinically diagnosed, which constitutes a central aim of this research. Secondly, this study could not distinguish between type 1 and type 2 diabetes. This limitation stems from the absence of key clinical indicators in the CHARLS database, such as islet autoantibodies and C-peptide levels, which are necessary for a definitive differentiation. However, given that the study participants were a community-based population aged 45 years and above—a demographic in which type 2 diabetes constitutes the vast majority—this lack of classification is expected to have a limited impact on the overall results. 
Nevertheless, the distinct pathophysiological mechanisms of the two diabetes types remain a consideration that should be acknowledged when interpreting the study's findings. Thirdly, the assessment of heart disease events in this study relied on participants’ self-report of a prior physician diagnosis within the CHARLS cohort. While this method is operationally feasible and commonly employed in large epidemiological studies,^18,28 it is subject to several important limitations. Self-reported data are susceptible to recall bias, as participants may inaccurately report their disease history due to memory lapses or misunderstandings of complex medical terminology. Furthermore, the broad diagnostic category of “heart disease” amalgamates several heterogeneous conditions—including angina, heart failure, and coronary heart disease—which differ substantially in their severity, diagnostic criteria, and clinical implications. The inability to distinguish between these specific cardiovascular outcomes in our analysis may thus introduce heterogeneity into the results. An additional constraint is that this approach cannot identify cases of heart disease that remain undiagnosed in the community, potentially leading to an underestimation of the true prevalence. Despite these limitations, the use of physician-based self-report remains a practical necessity for large-scale cohort studies like CHARLS, as it provides a standardized ascertainment method that balances feasibility with a degree of clinical validity, largely avoiding mere subjective speculation. Fourth, we acknowledge the limitations of the feature-selection method employed in this study. During the feature-selection stage, we utilized LASSO regression. While this method offers efficiency and stability for high-dimensional data reduction and handles multicollinearity well, it is based on a linear assumption and may therefore fail to fully capture complex nonlinear relationships between predictors and the outcome. 
Given these considerations, we applied LASSO as a preliminary screening tool. To evaluate the appropriateness of this choice, we conducted a sensitivity analysis using a wrapper-style cumulative-feature approach, progressively adding features based on importance ranking from the optimal KNN model and computing the corresponding cumulative ROC-AUC. The results demonstrated that the feature subset selected by LASSO was highly consistent with that obtained from the wrapper method (Supplemental Figures 1 and 2). This indicates that, despite its theoretical limitations, LASSO provided a robust and effective feature subset within the specific data context of this study, supporting its reasonableness for dimensionality reduction. Nevertheless, future work could adopt more sophisticated nonlinear feature-selection techniques to further optimize the model. Fifth, the predictive power of the model is limited. Although the KNN model was identified as the best-performing model in this study, its key performance indicators remain modest. Specifically, the ROC-AUC value—the core evaluation metric—was only 0.613, reflecting a relatively limited overall predictive ability. This performance constraint likely stems from a combination of factors related to data and methodology, rather than any single cause. One contributing factor is the sample size limitation, as this study was based on the CHARLS database and included 819 patients, among whom only 157 experienced a positive heart disease event. The relatively small sample size, particularly of positive cases, restricts the ability of machine-learning algorithms to fully leverage their potential with high-dimensional data. A second interrelated issue is the significant class imbalance problem. In the 9-year follow-up data, the ratio of positive to negative cases was less than 1:4, and this imbalance was even more pronounced in the 2-year follow-up data where only 33 of 1001 diabetic patients developed heart disease. 
Although we applied class weighting inversely proportional to class frequency and conducted an exhaustive search of classification thresholds from 0.01 to 0.99 to maximize the F1 score, the underlying imbalance in the data distribution persisted, constraining further improvement in model performance. Challenges inherent to long-term follow-up impose further constraints. As with most longitudinal studies, this research is subject to selection bias, since the cohort consisted of individuals who had no heart disease at baseline in 2011, had elevated fasting glucose (≥126 mg/dL), and completed endpoint follow-up in 2020. The exclusion of individuals lost to follow-up may have contributed to this selection bias and potentially led to an underestimation of heart disease incidence. Consequently, the findings primarily apply to this specific subpopulation: those without baseline heart disease, with elevated fasting glucose, and who completed long-term follow-up. Furthermore, over such an extended period, numerous unmeasured confounders such as lifestyle changes, treatment adjustments, or other complications could influence outcomes. These factors inherently increase prediction difficulty and inevitably affect model accuracy. Another consideration is the potential for threshold-selection overfitting. Although optimizing the F1 score in small, imbalanced datasets may yield unstable thresholds,²⁹ this approach demonstrated satisfactory applicability in our empirical analysis. We systematically compared multiple threshold-determination strategies, including the default 0.5 cutoff, Youden's index, and a high-sensitivity guarantee (Supplemental Table 2). We found that the models selected under these alternative strategies, such as MLP or SVM Radial, consistently exhibited notable drawbacks.
Some showed conspicuous overfitting on the training set, while others produced SHAP-based feature-importance rankings that sharply contradicted established clinical knowledge, thereby undermining their clinical credibility. In contrast, the KNN model, whose threshold was set by F1-maximization, not only ranked first on key metrics such as F1 score, balanced accuracy, and precision but also yielded a SHAP-derived feature-contribution pattern that aligned closely with clinical experience. This approach thus combined predictive performance with interpretability. Because the F1 score is the harmonic mean of precision and recall, maximizing it seeks a clinically acceptable compromise between detecting target cases, which requires high recall, and controlling false alarms, which requires high precision. This makes it especially suitable for imbalanced scenarios like ours, where the costs of both types of error must be weighed. Taken together, through comprehensive performance comparisons and interpretability verification, we conclude that, within the present study context, adopting the F1-maximization strategy and selecting the KNN model represents a methodologically prudent and clinically logical decision. Sixth, the study lacks external validation: while internal validation was performed, the model has not yet been evaluated in an independent cohort or broader population. This limitation affects the generalizability and robustness of the findings. Consequently, future research should prioritize obtaining larger, multicenter samples to verify the model's effectiveness and pursue prospective independent validation.
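
The threshold search described above, which scans cutoffs from 0.01 to 0.99 and keeps the one that maximizes the F1 score, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's code: the function names, the 0.01 step size, the tie-breaking rule (the lowest threshold wins), and the toy labels and scores are all hypothetical.

```python
from typing import Sequence, Tuple


def f1_at_threshold(y_true: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), i.e. the harmonic mean of precision and recall."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true: Sequence[int], scores: Sequence[float]) -> Tuple[float, float]:
    """Scan thresholds 0.01, 0.02, ..., 0.99 and return (threshold, F1) with maximal F1."""
    best_t, best_f1 = 0.01, -1.0
    for i in range(1, 100):
        t = i / 100
        f1 = f1_at_threshold(y_true, scores, t)
        if f1 > best_f1:  # strict '>' keeps the lowest threshold on ties
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical toy data: 1 = heart disease event, scores = predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.15, 0.42, 0.37, 0.83, 0.21, 0.74]

threshold, f1 = best_f1_threshold(y_true, scores)
f1_default = f1_at_threshold(y_true, scores, 0.5)
print(f"best threshold = {threshold:.2f}, F1 = {f1:.3f}; F1 at 0.5 = {f1_default:.3f}")
```

On this toy data the search settles on 0.22 (F1 ≈ 0.857), against F1 = 0.800 at the default 0.5 cutoff. With so few observations, a gain of this kind may not replicate, which is why the cutoff is normally tuned on training or validation folds rather than on the evaluation data.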

In summary, this study developed a machine-learning model to predict heart disease risk in patients with diabetes. Although the model exhibited only modest predictive performance, it provides a valuable empirical foundation and clear direction for constructing more reliable and clinically useful prediction tools in this field.

Supplemental Material

Supplemental material, sj-pdf-1-sci-10.1177_00368504261424391 for A study on the risk prediction of heart disease in diabetes patients based on machine learning by Tinghua Zhang and Huan Liu in Science Progress

Supplemental material, sj-pdf-2-sci-10.1177_00368504261424391 for A study on the risk prediction of heart disease in diabetes patients based on machine learning by Tinghua Zhang and Huan Liu in Science Progress

Supplemental material, sj-docx-3-sci-10.1177_00368504261424391 for A study on the risk prediction of heart disease in diabetes patients based on machine learning by Tinghua Zhang and Huan Liu in Science Progress

Supplemental material, sj-docx-4-sci-10.1177_00368504261424391 for A study on the risk prediction of heart disease in diabetes patients based on machine learning by Tinghua Zhang and Huan Liu in Science Progress

Footnotes

Acknowledgements

The authors have reviewed and edited the content as needed and take full responsibility for the content of the publication. The authors acknowledge the use of an AI language model (DeepSeek V3.2-Exp) to assist in improving the English language and clarity of the manuscript during the writing process.

ORCID iDs

Tinghua Zhang

Huan Liu

Ethics approval and consent to participate

The study was approved by the Biomedical Ethics Committee of Peking University (Approval No: IRB00001052-11015; IRB00001052-11014) and strictly adhered to the ethical principles outlined in the Declaration of Helsinki. All participants provided written informed consent, ensuring the ethical compliance of the data collection process and the protection of participants’ rights. Therefore, the Ethics Committee of the Central Hospital of Huaihua City exempted this study from ethical review. The data used in this study came from the CHARLS database between 2011 and 2020.

Author contributions

Conceptualization: Tinghua Zhang and Huan Liu; Methodology: Tinghua Zhang; Formal analysis and investigation: Tinghua Zhang; Writing—original draft, review, and editing: Tinghua Zhang and Huan Liu.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

This study uses publicly available data from the China Health and Retirement Longitudinal Study. The raw data can be accessed through the following link: .

Clinical trial number

This study was based on a publicly available longitudinal survey database, the CHARLS, and is not a clinical trial; therefore, no clinical trial registration number was obtained.

Supplemental material

Supplemental material for this article is available online.

References

1. Ong, Stafford, McLaughlin, et al. Global, Regional, and National Burden of Diabetes from 1990 to 2021, with projections of prevalence to 2050: a systematic analysis for the global burden of disease study 2021. Lancet 2023; 402: 203–234.

2. Liu, Ren Z-H, Qiang, et al. Trends in the incidence of diabetes mellitus: results from the global burden of disease study 2017 and implications for diabetes mellitus prevention. BMC Public Health 2020; 20: 1415.

3. Yang, Tian, et al. Increased prevalence of diabetes mellitus and its metabolic risk factors from 2002 to 2017 in Shanghai, China. J Diabetes 2024; 16: e70003.

4. Bai, Tao, et al. Prevalence and risk factors of diabetes among adults aged 45 years or older in China: a national cross-sectional study. Endocrinol Diabetes Metab 2021; 4: e00265.

5. Heather, Hafstad, Halade, et al. Guidelines on models of diabetic heart disease. Am J Physiol Heart Circ Physiol 2022; 323: H176–H200.

6. Leon. Diabetes and cardiovascular disease: epidemiology, biological mechanisms, treatment recommendations and future research. World J Diabetes 2015; 6: 1246–1258.

7. Conning-Rowland, Cubbon. Molecular mechanisms of diabetic heart disease: insights from transcriptomic technologies. Diabetes Vasc Dis Res 2023; 20: 14791641231205428.

8. Ritchie, Abel. Basic mechanisms of diabetic heart disease. Circ Res 2020; 126: 1501–1525.

9. Nakamura, Miyoshi, Yoshida, et al. Pathophysiology and treatment of diabetic cardiomyopathy and heart failure in patients with diabetes mellitus. Int J Mol Sci 2022; 23: 3587.

10. Park. Epidemiology, pathophysiology, diagnosis and treatment of heart failure in diabetes. Diabetes Metab J 2021; 45: 146–157.

11. Oikonomou, Khera. Machine learning in precision diabetes care and cardiovascular risk prediction. Cardiovasc Diabetol 2023; 22: 259.

12. von Elm, Altman, Egger, et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. J Clin Epidemiol 2008; 61: 344–349. 2008/03/04.

13. Abdalrada, Abawajy, Al-Quraishi, et al. Machine learning models for prediction of co-occurrence of diabetes and cardiovascular diseases: a retrospective cohort study. J Diabetes Metab Disord 2022; 21: 251–261.

14. Fan, Zhang, Yang, et al. AI-based prediction for the risk of coronary heart disease among patients with type 2 diabetes mellitus. Sci Rep 2020; 10: 14457.

15. Cao, Bai, et al. Establishment of a diagnostic model of coronary heart disease in elderly patients with diabetes mellitus based on machine learning algorithms. J Geriatr Cardiol 2022; 19: 445–455. 2022/07/19.

16. Shi, Ding, et al. Development and validation of a machine learning model for cardiovascular disease risk prediction in type 2 diabetes patients. Sci Rep 2025; 15: 32818.

17. Liu. A prediction study on the occurrence risk of heart disease in older hypertensive patients based on machine learning. BMC Geriatr 2025; 25: 27.

18. Huang, Jiang, Shi, et al. Characterisation of cardiovascular disease (CVD) incidence and machine learning risk prediction in middle-aged and elderly populations: data from the China Health and Retirement Longitudinal Study (CHARLS). BMC Public Health 2025; 25: 518.

19. Chen, et al. Early adulthood BMI and cardiovascular disease: a prospective cohort study from the China Kadoorie biobank. Lancet Public Health 2024; 9: e1005–e1013. 2024/06/18.

20. Cheng, Song, Ouyang, et al. Systolic blood pressure variability: risk of cardiovascular events, chronic kidney disease, dementia, and death. Eur Heart J 2025; 46: 2673–2687. 2025/04/18.

21. Stevens, Wood, Koshiaris, et al. Blood pressure variability and cardiovascular disease: systematic review and meta-analysis. Br Med J 2016; 354: i4098. 2016/08/12.

22. Hajifathalian, Ezzati, et al. Metabolic mediators of the effects of body-mass index, overweight, and obesity on coronary heart disease and stroke: a pooled analysis of 97 prospective cohorts with 1·8 million participants. Lancet 2014; 383: 970–983. 2013/11/26.

23. Whelton, O'Connell, Mills, et al. Optimal antihypertensive systolic blood pressure: a systematic review and meta-analysis. Hypertension 2024; 81: 2329–2339. 2024/09/12.

24. Song, Hong, Sung, et al. Waist circumference and mortality or cardiovascular events in a general Korean population. PLoS One 2022; 17: e0267597. 2022/04/28.

25. Powell-Wiley, Poirier, Burke, et al. Obesity and cardiovascular disease: a scientific statement from the American Heart Association. Circulation 2021; 143: e984–e1010. 2021/04/23.

26. Wang, Lee, et al. A prospective study of waist circumference trajectories and incident cardiovascular disease in China: the Kailuan cohort study. Am J Clin Nutr 2021; 113: 338–347. 2020/12/18.

27. Bai, Zhang, et al. Association of gait speed with risk of diabetes mellitus among older adults: findings from the China Health and Retirement Longitudinal Study. BMC Geriatr 2025; 25: 806.

28. Zhang, et al. Cholesterol, high-density lipoprotein, and glucose index versus triglyceride–glucose index in predicting cardiovascular disease risk: a cohort study. Cardiovasc Diabetol 2025; 24: 116.

29. Lipton, Elkan, Naryanaswamy. Optimal thresholding of classifiers to maximize F1 measure. Mach Learn Knowl Discov Databases 2014; 8725: 225–239. 2014/01/01.
