Abstract
Objectives
Heart failure (HF) patients admitted to intensive care units are prone to early readmission, which leads to adverse outcomes and increased healthcare costs. Existing prediction models often suffer from data heterogeneity, class imbalance, and limited interpretability. This study aimed to develop an interpretable ensemble learning framework to predict 30-day ICU readmission in adult patients with HF and to compare its performance with conventional single-classifier approaches.
Methods
This retrospective study analyzed 5414 adult HF patients from the MIMIC-III database. Clinical and demographic variables collected within the first 24 h of the index ICU admission were used to predict 30-day ICU readmission (return to the ICU). A two-stage ensemble model was developed using stratified sampling and grid-search optimization, with top learners integrated via a soft-voting mechanism. Additionally, SHapley Additive exPlanation (SHAP) analysis was employed to ensure model interpretability and quantify variable contributions to the predictions.
Results
The KNN-imputed Voting (3 Models) ensemble emerged as the optimal framework, achieving an accuracy of 0.8413, F1-score of 0.8195, and AUROC of 0.6718. Despite moderate AUROC, the model achieved strong recall and reliable calibration, making it suitable for risk stratification in post-ICU care transitions. The SHAP analysis identified Glucose, hemodynamic parameters (e.g., blood pressure, heart rate), and inflammatory indicators as key predictors, aligning with established clinical understanding of stress hyperglycemia and hemodynamic instability in HF.
Conclusion
This interpretable ensemble framework predicts 30-day ICU readmission in HF patients with robust performance, effectively balancing sensitivity and discrimination. It supports electronic health record–based risk stratification and timely intervention. Future work should focus on external validation across diverse populations to ensure generalizability.
Introduction
Over the past decade, the integration of artificial intelligence (AI) and machine learning (ML) into healthcare has been increasingly adopted to support clinical decision-making and improve operational efficiency.1–3 These technologies support the analysis of complex datasets by combining physiological indicators with detailed patient demographics and have shown promise in enhancing the accuracy of outcome predictions such as 30-day ICU readmission risk. Here, “30-day readmission” refers specifically to unplanned ICU readmission (i.e., return to the ICU) within 30 days after the index ICU discharge, rather than all-cause hospital readmission after hospital discharge. Through early identification of high-risk individuals, AI and ML facilitate timely interventions and more efficient resource allocation, aligning with international health priorities such as Sustainable Development Goal 3, which advocates for technological innovation to improve health services.4,5 In high-risk cardiovascular populations such as patients with heart failure (HF), existing ML-based 30-day readmission prediction models have already demonstrated promising discriminative performance while exposing important methodological and implementation gaps that motivate further refinement.6 Despite advances in predictive analytics, ICU readmissions continue to present significant clinical and financial burdens. Preventable readmissions are estimated to add substantially to healthcare costs, placing avoidable strain on health systems.7,8 In response, the U.S. Affordable Care Act introduced the Hospital Readmissions Reduction Program (HRRP) in 2010, which imposes penalties on hospitals with excessive readmission rates. Although the HRRP has demonstrated modest success, persistent challenges—including care coordination gaps, disease complexity, and socioeconomic disparities—underscore the need for more effective and scalable prediction strategies.9,10
Among high-risk cardiovascular conditions, HF is a major driver of hospital admissions and rehospitalization. Recent estimates indicate that ∼64.3 million people were living with HF worldwide in 2017, including ∼6.0 million adults in the United States (2015–2018).11 ICU patients with HF are particularly vulnerable due to clinical instability and complex postdischarge care needs.12 Consequently, timely prediction of 30-day ICU readmissions in this population is critical for reducing adverse outcomes and managing hospital resources.
Conventional approaches, however, often fall short due to their limited capacity to model high-dimensional, heterogeneous data and their inability to effectively address class imbalance. Moreover, many prior models for ICU readmission rely on small, proprietary single-center datasets with restricted external validity.13,14 Although MIMIC-III is single-institutional, its scale (over 53,000 adult ICU admissions) and open access make it a valuable resource for transparent, reproducible modeling while also enabling rigorous internal transportability checks to mitigate single-center bias.15
In summary, while existing research has revealed the limitations of small-scale and single-center models, broader challenges remain in balancing predictive accuracy with interpretability to ensure effective clinical implementation. To address these issues, ensemble learning methods—such as those combining Random Forest and XGBoost—offer a compelling solution due to their robustness against overfitting and their capacity to synthesize multiple data signals. When coupled with explainable AI approaches such as SHapley Additive exPlanation (SHAP), these methods hold promise for both accurate prediction and clinically interpretable insights in critical care.16
This study focused on adult patients with a primary diagnosis of HF, considering only the first ICU admission per patient. The outcome of interest was unplanned 30-day ICU readmission. Using structured variables from the MIMIC-III database (with detailed definitions provided in the Appendix), we developed and validated an ensemble classifier optimized for imbalanced clinical data. The framework first evaluated diverse base learners and then integrated top-performing models through a soft-voting mechanism. To enhance clinical interpretability, SHAP-based explanations were incorporated, facilitating clinician trust and decision support. Beyond readmission forecasting, the design can extend to broader applications, such as ICU bed demand prediction, dynamic staffing allocation, or identifying patients requiring extended rehabilitation and follow-up support.1,2,17 The key contributions of this study are threefold: (1) developing a robust ensemble learning model for 30-day ICU readmission risk in HF patients, (2) incorporating SHAP-based interpretability to strengthen transparency and applicability, and (3) presenting a reproducible methodology adaptable to other disease contexts or healthcare settings.
Materials and methods
Study population
This retrospective cohort study utilized data from the publicly available MIMIC-III (v1.4) database, which includes de-identified electronic health records (EHRs) for over 40,000 critical care patients at the Beth Israel Deaconess Medical Center between 2001 and 2012. Ethical access to the MIMIC-III database was obtained by completing a recognized human research participants protection course and securing approval through the formal PhysioNet access process (Certificate No. 35628530).15 From an initial 61,532 ICU stays, we restricted the sample to the first ICU stay per patient, applied predefined inclusion and exclusion criteria, and handled missing data as described in the “Data Cleaning” subsection, yielding a final analytic cohort of 5414 adult HF ICU stays with complete predictor and outcome data.
An initial set of 20 clinically relevant predictors was defined a priori, based on their established use in HF mortality and readmission models and their availability across ICU stays.16,18,19 Variables that are not known or not actionable at the time of ICU discharge—such as postdischarge mortality during follow-up—and sensitive sociodemographic attributes such as religion were explicitly excluded from the candidate predictor set to reflect real-world decision points and to avoid potential fairness concerns. The final list of predictors (e.g., vital signs, common laboratory parameters, Glasgow Coma Scale components, age, and sex) is provided in the Appendix.
Research process
The overall research workflow for this study is illustrated in Figure 1 and depicts the sequential pipeline used to develop and compare single and ensemble models for predicting 30-day ICU readmission among HF patients using MIMIC-III data.15,16 The flowchart comprises four main stages: (1) Data Input, (2) Data Preprocessing, (3) Model Development, and (4) Model Evaluation.

Research process.
The extraction of MIMIC-III records and the definition of the target HF population are described in “Study population” section. The subsequent data preprocessing procedures—including data cleaning, cohort selection, predictor construction, and missing-data handling—are detailed in “Data cleaning” section. To address class imbalance in 30-day readmission outcomes, imbalance-correction techniques are applied only to the training set, as described in “Imbalanced data handling” section.18,20,21 “Model development” section presents the model development stage, in which multiple baseline single classifiers are trained and compared, followed by the construction of ensemble models to enable a systematic single-model versus ensemble-model comparison. Finally, “Model evaluation” section summarizes the model-evaluation procedures, including discrimination and calibration metrics as well as SHAP-based explainability analyses, which are used to quantify predictive performance and to enhance the transparency and reproducibility of the proposed models.18,22–24
Data cleaning
The initial dataset comprised 61,532 ICU stays; restricting the sample to each patient's first ICU stay yielded 46,520 unique admissions.15,19 We then excluded ICU stays shorter than 24 h to ensure that all predictors could be consistently derived from the first 24 h after ICU admission, and we removed patients younger than 16 years. Heart failure cases were identified based on ICD-9-CM codes. Specifically, adult patients (aged ≥16 years) with HF were identified using ICD-9-CM codes 398.91, 402.01, 402.11, 402.91, 404.01, 404.03, 404.11, 404.13, 404.91, 404.93, and 428.xx, consistent with prior HF ICU studies,16,19 yielding 7278 eligible HF ICU stays (Figure 2).

Data selection process.
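As an illustrative sketch (not the study's extraction code), the ICD-9-CM cohort filter described above can be expressed in Python. The helper names and record layout are hypothetical; note that MIMIC-III stores ICD-9 codes without decimal points (e.g., "4280" for 428.0).

```python
# Hypothetical sketch of the ICD-9-CM heart-failure cohort filter.
HF_CODES = {"39891", "40201", "40211", "40291",
            "40401", "40403", "40411", "40413", "40491", "40493"}

def is_hf_diagnosis(icd9_code: str) -> bool:
    """Return True if an ICD-9-CM code marks heart failure (428.xx or a listed code)."""
    code = icd9_code.replace(".", "")
    return code in HF_CODES or code.startswith("428")

def select_hf_stays(stays):
    """Keep first ICU stays of at least 24 h for adults (>= 16 y) with an HF code."""
    return [s for s in stays
            if s["los_hours"] >= 24
            and s["age"] >= 16
            and any(is_hf_diagnosis(c) for c in s["icd9_codes"])]
```

Applied to the full MIMIC-III extract, a filter of this shape yields the 7278 eligible HF ICU stays reported above.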
Missing data were processed using a three-stage procedure adapted from the “30–40–20” strategy proposed in prior MIMIC-III work.18,19 First, patient records with more than 30% missing values across the candidate predictors were excluded.18,19 Second, predictors with more than 40% missingness were removed from the feature set.18,19 Third, among the remaining variables, features with more than 20% missingness were excluded. This procedure resulted in a final KNN-imputed modeling set of 5414 HF ICU admissions with predictors that satisfied the predefined missing-data thresholds and were used for subsequent model development.
For the residual sporadic missing entries (<20%), we constructed two parallel analysis datasets to handle missingness: one using listwise deletion of records with any remaining missing values and one using k-nearest neighbors (KNN) imputation to preserve multivariate relationships.18,19 Both strategies were carried forward to the modeling stage and comparatively evaluated in the experimental analysis.
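A minimal sketch of the two parallel strategies, on illustrative toy data (the study used the 20 clinical predictors): scikit-learn's `KNNImputer` implements the neighbor-based filling described above.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy feature matrix with sporadic missing entries (np.nan).
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan],
              [5.0, 6.0]])

# Strategy 1: listwise deletion — drop every row containing a missing value.
X_listwise = X[~np.isnan(X).any(axis=1)]

# Strategy 2: KNN imputation — fill each gap from the k most similar rows,
# preserving multivariate relationships and the full sample size.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```

The trade-off is visible even here: listwise deletion shrinks the sample, while KNN imputation retains all rows at the cost of estimated values.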
The outcome of interest was unplanned 30-day ICU readmission after index ICU discharge. This outcome is conceptually distinct from general hospital readmission, which refers to rehospitalization after hospital discharge. Admissions without complete 30-day readmission information or those exceeding the missing-data thresholds were excluded. After applying these eligibility criteria and the missing-data procedure, the final analytic cohort comprised 5414 ICU stays for adult HF patients with complete outcome and predictor data, as summarized in the patient selection flowchart.16,18,19
Imbalanced data handling
The naturally imbalanced distribution between patients who were readmitted and those who were not presents a significant challenge for predictive models. This imbalance can bias models toward the majority class, leading to reduced sensitivity in identifying true readmission cases, which are of high clinical importance. To mitigate this issue, data balancing techniques were applied exclusively to the training dataset.
A hybrid resampling approach was employed. The Synthetic Minority Oversampling Technique (SMOTE) was used to generate synthetic samples for the minority class, thereby enhancing its representation without duplicating existing records. Concurrently, Tomek Links were utilized to reduce class overlap and eliminate ambiguous instances, thus improving the separation between classes. These techniques were applied only to the training dataset to prevent data leakage into the validation and testing stages.20,21
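In practice this hybrid step is typically implemented with imbalanced-learn's `SMOTE`/`SMOTETomek`. The core SMOTE idea — creating a synthetic minority sample by interpolating between a real sample and one of its k nearest minority neighbors — can be sketched in plain Python (illustrative only; the Tomek-link cleaning step is omitted here):

```python
import random

def smote_oversample(minority, n_new, k=3, seed=0):
    """Minimal sketch of SMOTE: synthesize n_new minority samples by linear
    interpolation between a sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest minority neighbors of a (excluding a itself), by squared distance
        neighbors = sorted((p for p in minority if p is not a),
                           key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)))[:k]
        b = rng.choice(neighbors)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(x + lam * (y - x) for x, y in zip(a, b)))
    return synthetic
```

Because each synthetic point lies on a segment between two real minority samples, the minority region is densified without duplicating records — the property the text relies on.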
This strategy improved the models' ability to detect both readmitted and nonreadmitted patients, contributing to more stable and accurate classification outcomes.
Model development
Building on the preprocessed cohort described in “Data cleaning” and “Imbalanced data handling” sections, the final analytic dataset was randomly split into training, validation, and testing subsets using stratified sampling to preserve the observed 30-day ICU readmission rate. The training set was used for model fitting and, where necessary, for class-imbalance handling; the validation set was reserved for hyperparameter tuning and model selection; and the held-out test set was used exclusively for the final performance evaluation.
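Assuming the 8:1:1 ratio reported in the Results, the stratified three-way split can be sketched with scikit-learn (the seed and function name are illustrative):

```python
from sklearn.model_selection import train_test_split

def stratified_811_split(X, y, seed=42):
    """Sketch of a stratified 80/10/10 train/validation/test split that preserves
    the observed 30-day readmission rate in every subset."""
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```

Stratifying both splits keeps the minority (readmitted) class proportion stable across all three subsets, which matters for the imbalance handling applied afterward to the training set only.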
Model construction followed the two-experiment framework illustrated in Figure 3. In Experiment 1 (single-model comparison), a unified missing-value procedure produced two versions of the data: one constructed via listwise deletion and the other via KNN imputation. These two strategies were chosen because they represent widely used and conceptually distinct approaches in clinical ML applications: listwise deletion provides a simple, transparent complete-case analysis,25 whereas KNN imputation offers a more flexible, data-driven method that retains all observations by borrowing information from similar patients.26 Under each imputation strategy, 14 candidate machine-learning classifiers—including logistic regression, decision tree, random forest, gradient boosting, XGBoost, LightGBM, CatBoost, bagging, extra trees, support-vector machine, k-nearest neighbors, and neural-network-based models—were trained on the (balanced) training set and evaluated using the validation and test sets.13,18,27,28 This yielded two families of single-model predictors corresponding to listwise deletion and KNN imputation.

Model development framework.
Experiment 2 (ensemble-model comparison) then focused on the best-performing single model selected under each imputation strategy. These two single models were contrasted with ensemble counterparts trained on the same feature representations. For each imputation strategy, soft-voting ensembles were constructed by aggregating the highest-ranked base learners; specifically, two ensemble configurations were explored that combined the top three and the top five classifiers, respectively, according to validation accuracy. All four models—two ensembles and two best single models—were finally compared on the held-out test set to assess the added value of ensemble learning under different missing-data handling strategies.
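A soft-voting ensemble of this kind can be sketched with scikit-learn's `VotingClassifier`. The base learners below are illustrative stand-ins on synthetic data — the study's top-3 configuration combined the Bagging Classifier, CatBoost, and an LSTM, which are not all reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Illustrative binary-classification data standing in for the HF cohort.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Soft voting averages the predicted class probabilities of the base learners.
voting3 = VotingClassifier(
    estimators=[("bag", BaggingClassifier(random_state=0)),
                ("rf", RandomForestClassifier(random_state=0)),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="soft")
voting3.fit(X, y)
proba = voting3.predict_proba(X)  # averaged probabilities, shape (n_samples, 2)
```

Ranking candidate base learners by validation accuracy and passing the top three (or five) into `estimators` reproduces the two configurations compared in Experiment 2.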
Model evaluation
Model performance was assessed using accuracy, precision, recall, F1-score, and AUROC. Since the dataset was imbalanced, recall and F1-score were given particular attention when evaluating the model's sensitivity to correctly identify patients at risk of 30-day ICU readmission.14,29 The AUROC was used to assess discrimination between classes, with a higher value indicating stronger classification ability.30
Accuracy (A): Accuracy was defined as the ratio of correctly predicted observations to the total observations.
F1 Score: The F1 score represented the harmonic mean of precision and recall, useful for imbalanced classification tasks.
Precision (P): Precision measured the proportion of true positives among predicted positives.
Recall (R): Recall measured the proportion of actual positive instances that were correctly identified by the model.
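For concreteness, the four metrics defined above reduce to simple functions of the confusion-matrix counts (TP, FP, FN, TN):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # true positives among predicted positives
    recall = tp / (tp + fn)             # true positives among actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

With an imbalanced outcome such as 30-day readmission, accuracy alone can look strong while recall is poor, which is why the F1-score and recall receive particular attention here.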
To evaluate model calibration, we computed Brier scores and generated calibration plots comparing predicted and observed 30-day readmission probabilities across risk strata, thereby assessing the agreement between estimated risks and empirical event rates. In addition, we summarized threshold-dependent behavior using ROC curves and precision–recall (PR) curves. These graphical assessments complement scalar summary statistics by providing visual insight into prediction reliability and potential regions of miscalibration. For post hoc interpretability, we further applied SHAP and visualized feature attributions.16,31
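The Brier score used here is simply the mean squared difference between predicted probabilities and observed binary outcomes; scikit-learn's `brier_score_loss` and `calibration_curve` provide equivalent functionality for the score and the calibration plots. A minimal sketch:

```python
def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and binary outcome.
    Ranges from 0 (perfect) to 1 (worst); lower values indicate better calibration."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)
```

Unlike AUROC, which depends only on the ranking of predictions, the Brier score penalizes miscalibrated probabilities directly, which is why both are reported.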
Results
Population description
The study cohort included 5414 adult patients with a diagnosis of HF based on the above ICD-9-CM discharge codes, extracted from the MIMIC-III database following preprocessing. A total of 12.5% of these patients were readmitted to the ICU within 30 days. Readmitted patients were older on average (73.6 vs. 69.5 years), while sex distribution showed no substantial difference between the groups. Admission type, insurance type, and discharge location differed markedly between the two groups. Readmitted patients exhibited a higher proportion of emergency admissions and greater reliance on Medicare coverage. They were also less frequently discharged home and exhibited a significantly higher mortality rate (23.01%) compared to the nonreadmitted group (1.88%). Physiologically, readmitted patients presented higher blood urea nitrogen and creatinine levels and lower systolic and diastolic blood pressure (SBP/DBP). Table 1 presents these demographic and clinical differences, emphasizing the predictive relevance of baseline characteristics.
Selected patient demographic information.
Model development and evaluation
According to the Research Framework illustrated in Figure 3, this study was conducted in two experiments: Experiment 1 – Single Model Comparison and Experiment 2 – Ensemble Model Comparison. To ensure robust model evaluation, we conducted multiple experiments on data partitioning strategies and identified that an 8:1:1 ratio for the training, validation, and testing sets, respectively, yielded the optimal configuration. Consequently, the experimental results reported in this section are derived exclusively from the held-out test set. As summarized in the framework, Experiment 1 first evaluated the performance of 14 single classifiers on the balanced training set under two distinct missing-data handling strategies. The test-set results utilizing the KNN imputation strategy are presented in Table 2, while the results derived using the listwise deletion method are summarized in Table 3. All implementation details are provided in the Appendix.
The predictive performance of single models using KNN imputation.
The predictive performance of single models using listwise deletion.
Under the KNN imputation strategy (Table 2), the Bagging Classifier achieved the highest accuracy (0.8229), followed closely by LightGBM (0.8210) and CatBoost (0.8173). Regarding discrimination, Random Forest yielded the highest AUROC (0.6670), with LSTM (0.6547) and Extra Trees (0.6453) also showing comparatively strong performance in separating classes. In contrast, when applying the listwise deletion strategy (Table 3), although AdaBoost achieved the highest accuracy (0.8306) in this subset, followed by the Bagging Classifier (0.8250), their discriminatory capabilities were notably compromised. Specifically, the top-performing AdaBoost model yielded an AUROC of only 0.5381, indicating poor predictive reliability despite its high accuracy. Comparing the top-performing single models from each approach, the Bagging Classifier from the KNN group demonstrated superior overall stability compared to the AdaBoost model from the listwise deletion group. While AdaBoost showed marginally higher accuracy (0.8306 vs. 0.8229), the Bagging Classifier achieved a significantly higher AUROC (0.6166 vs. 0.5381), reflecting a better balance between sensitivity and specificity. This comparison suggests that preserving sample size and multivariate relationships through KNN imputation offers a significant advantage in building robust models over removing incomplete records.
Subsequently, following the identification of top-performing base learners in Experiment 1, Experiment 2 (Table 4) evaluated the efficacy of ensemble integration: soft-voting ensembles combining the top three and top five classifiers were developed under both missing-data handling strategies.
Predictive performance of best ensemble models.
Under the KNN imputation strategy (Table 4), the Voting (5 Models) ensemble achieved the highest overall accuracy of 0.8432; however, this configuration showed a reduction in discriminatory power with an AUROC of 0.6110. In contrast, the Voting (3 Models) ensemble—integrating Bagging Classifier, CatBoost, and LSTM—demonstrated a more favorable balance between metrics, yielding a comparable accuracy of 0.8413 while achieving the highest AUROC of 0.6718. This indicates that adding weaker learners to the ensemble (as in the 5-model configuration) diluted the discriminatory capability without providing a meaningful gain in accuracy.
In the listwise deletion cohort (Table 4), the Voting (3 Models) ensemble consistently outperformed the 5-model ensemble, achieving both higher accuracy (0.8389 vs. 0.8333) and AUROC (0.6207 vs. 0.6019). When comparing the optimal frameworks from both strategies, the KNN-based Voting (3 Models) ensemble emerged as the definitive optimal predictive model. It not only surpassed the best listwise deletion ensemble in discriminatory capability (AUROC 0.6718 vs. 0.6207) but also confirmed that preserving multivariate data structure through imputation, combined with selective ensemble integration, yields the most reliable performance for predicting 30-day readmission.
In the second stage of Experiment 2, we conducted an in-depth comparison of four representative models to validate our selection. Although the KNN-based Voting (5 Models) ensemble achieved the highest nominal accuracy (0.8432), the Voting (3 Models) ensemble was selected as the optimal framework. The difference in accuracy was negligible (<0.2%), whereas the Voting (3 Models) ensemble demonstrated significantly higher discriminatory power (AUROC 0.6718 vs. 0.6110), offering a more robust balance for clinical application. Consequently, for the detailed performance analysis (Figure 4), we selected: (1) AdaBoost (Listwise); (2) Bagging Classifier (KNN) as the best single-model reference; (3) the Listwise Deletion-based Voting (3 Models) ensemble; and (4) the KNN-based Voting (3 Models) ensemble as the proposed optimal model for comparison.

Comparison of ROC curves, precision–recall (PR) curves, and calibration curves for top-performing models across different missing-data handling strategies.
We evaluated these four models using ROC curves, PR curves, and calibration curves. First, the ROC Curve analysis assessed the discriminatory capacity across threshold settings. The KNN-based Voting (3 Models) ensemble exhibited robust performance, confirming its superior AUROC of 0.6718 compared to the listwise counterparts. Second, the PR Curve was analyzed to evaluate model stability given the imbalanced outcome. While the Listwise Deletion ensemble achieved a marginally higher area under the PR curve (0.19) compared to the KNN-based ensemble (0.18), the KNN approach maintained a competitive performance that was superior or comparable to single-model baselines. Third, we examined the Calibration Curve to visualize the agreement between predicted probabilities and observed outcome rates. A perfectly calibrated model would follow the diagonal ideal line. Visual inspection indicated that the KNN-based Voting (3 Models) ensemble aligned most closely with the ideal diagonal, demonstrating superior calibration and reliability in risk estimation compared to the listwise deletion models.
Finally, to quantify this calibration quality, we computed the Brier score, where a lower value indicates superior probabilistic accuracy. The results, summarized in the Brier score comparison table, confirmed the visual trends. The KNN-based Voting (3 Models) ensemble achieved the lowest (best) Brier score of 0.1305, outperforming the KNN-based Voting (5 Models) ensemble (0.1338) and demonstrating the highest reliability in risk estimation. This was followed by the KNN-based Bagging Classifier (0.1398). In contrast, the listwise deletion strategy resulted in poorer calibration, with its Voting (3 Models) ensemble scoring 0.1635 and the single AdaBoost model yielding the worst score of 0.1824. In conclusion, the convergence of the highest discriminatory power (AUROC 0.6718) and the most accurate probability calibration (Brier score 0.1305) identifies the KNN imputation strategy combined with the Voting (3 Models) ensemble as the definitive optimal modeling approach for predicting 30-day ICU readmission.
Model interpretation
The SHAP values were used to interpret the decision logic of the optimal ensemble models derived from the two missing-data handling strategies. The results are visualized as shown in Figure 5, which includes SHAP summary plots highlighting the most influential features in predicting 30-day ICU readmission for both the Listwise Deletion (Voting 3 Models) and KNN Imputation (Voting 3 Models) frameworks. The y-axis lists features in descending order of importance, and the x-axis represents the SHAP value impact. Both ensemble strategies consistently identified glucose, heart rate (HR), SBP, platelets, white blood cell (WBC), temperature, respiratory rate, DBP, and oxygen saturation (OS) as the most important features.

SHapley Additive exPlanation (SHAP) summary plots illustrating feature importance and directional impact for the optimal ensemble models.
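For reference, SHAP assigns each feature $i$ its Shapley value — the weighted average of the feature's marginal contribution over all subsets $S$ of the feature set $F$:

$$\phi_i(f, x) = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f_x(S \cup \{i\}) - f_x(S) \right]$$

where $f_x(S)$ denotes the model's expected output when only the features in $S$ are known. The summary plots in Figure 5 display these per-patient attributions $\phi_i$ for every feature.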
Discussion
The ensemble learning algorithm proposed in this study was designed to address the challenge of class imbalance in HF readmission prediction and was validated through a rigorous experimental framework. Across our experiments, ensemble models generally achieved more favorable performance than single classifiers on key metrics (F1-score, recall, and AUROC). This consistency affirms the algorithm's robustness and supports its potential applicability in clinical settings. This also demonstrates the feasibility of integrating predictive accuracy with interpretability through SHAP, aligning with the goal of explainable AI in healthcare.16,31
Building on this foundation, the following sections present a structured discussion of our findings across four dimensions: (1) model performance and description, (2) comparison with existing literature, (3) interpretability in a clinical context, and (4) clinical perspective and practical impact.
Model performance and description
Heart failure remains a critical determinant of readmission, particularly within the ICU population, necessitating robust predictive tools to manage elevated mortality risks and resource allocation. Given the imbalanced outcome distribution described above, we first benchmarked 14 single classifiers under two missing-data strategies—KNN imputation and listwise deletion. Under the KNN imputation setting, tree-based ensemble learners (e.g., Bagging, LightGBM, CatBoost, Random Forest, Extra Trees) tended to yield stronger discrimination and classification performance than linear and distance-based methods in our dataset. In particular, the Bagging Classifier attained the highest test accuracy, while Random Forest and CatBoost provided competitive AUROC values, indicating a stronger ability to separate readmitted from nonreadmitted patients compared with deep neural networks, logistic regression, SVM, and KNN. In contrast, although listwise deletion yielded marginally higher peak accuracy for AdaBoost, its AUROC remained markedly lower than the top-performing KNN-based models in our experiments, revealing that simply discarding incomplete records can inflate accuracy at the expense of reliable discrimination and clinical utility.
After identifying high-performing base learners in Experiment 1, Experiment 2 was organized into two stages:
In the first stage, we focused on the construction and evaluation of ensemble models under two missing-data strategies: KNN imputation and listwise deletion. The results indicate that the KNN-imputed Voting (3 Models) ensemble showed a favorable balance across discrimination and calibration metrics compared with other tested configurations in our experiments. Specifically, this optimal ensemble—comprising Bagging Classifier, CatBoost, and LSTM—achieved an accuracy of 0.8413 and the highest AUROC of 0.6718. This performance notably surpassed the ensemble model derived from the listwise deletion strategy, which yielded a lower AUROC of 0.6207 despite a comparable accuracy of 0.8389. Furthermore, a critical trade-off analysis was conducted against the Voting (5 Models) ensemble. While the 5-model ensemble achieved a marginally higher accuracy (0.8432), it suffered a significant reduction in discriminatory power (AUROC 0.6110). Consequently, the Voting (3 Models) ensemble was selected as the final proposed model, as it demonstrated a more favorable balance, avoiding the dilution of discriminatory power observed in the larger ensemble while maintaining high classification accuracy.
In the second stage of Experiment 2, we conducted an in-depth comparison of four representative models using four distinct evaluation metrics: ROC curves, PR curves, calibration curves, and Brier scores. To ensure that this comparison was grounded in clinically interpretable performance, we prioritized models that maintained robust discrimination alongside high accuracy. Crucially, calibration analysis confirmed that the Voting (3 Models) ensemble provided the most accurate probabilistic risk estimates with a Brier score of 0.1305, surpassing the 5-model configuration (0.1338) and ensuring reliable risk stratification. These improvements in both discrimination and calibration reflect clinically meaningful gains in identifying high-risk patients. This performance advantage can be attributed to the synergistic effect of the ensemble components and the data preservation strategy: KNN imputation maintained the multivariate structure of the dataset, preventing the information loss associated with listwise deletion. By aggregating the stability of the Bagging Classifier with the gradient boosting capabilities of CatBoost and the sequential pattern recognition of LSTM, the KNN Voting (3 Models) ensemble effectively mitigated the overfitting and calibration issues observed in single high-accuracy models like AdaBoost (AUROC 0.5381). This approach ensured a more robust generalization capability for predicting readmission in complex clinical datasets.
Comparison with existing literature
The predictive performance of our proposed model—specifically the KNN-imputed Voting (3 Models) ensemble—demonstrates competitive predictive capability when evaluated against recent benchmarks. While maximizing accuracy was our primary objective, we prioritized AUROC as the deciding factor when accuracy differences were minimal. To provide a balanced comparison, we discuss our findings across three dimensions: comparisons within the MIMIC-III database, performance relative to other HF readmission studies, and contextualization within broader cardiovascular outcome prediction.
First, benchmarking against studies utilizing the same data source allows for a direct assessment of methodological efficacy. Dafrallah and Akhloufi recently proposed a readmission prediction approach using MIMIC-III discharge notes combined with NLP techniques. While their method leverages rich unstructured text, our study suggests that an ensemble using structured clinical variables can achieve performance within the range reported in the literature, with a favorable balance between precision and recall. Specifically, our model yielded an F1-score of 0.8195, exceeding the 0.733 reported in their NLP-enhanced study.32 This suggests that for HF cohorts in MIMIC-III, structured physiological and administrative data, when processed with effective imputation and ensemble techniques, remain highly effective for identifying at-risk patients.
Expanding the comparison to the broader domain of HF readmission reveals important trade-offs between discrimination (AUROC) and sensitivity (recall). Zhang et al. developed an XGBoost model for acute HF (AHF) patients using a localized dataset from China. While they reported a higher AUROC of 0.763 compared with our 0.6718, their model's recall was limited to 0.660.33 In a clinical context, low recall implies missing a substantial portion of high-risk patients. In contrast, our proposed framework achieved a recall of 0.8413 and an accuracy of 0.8413, indicating a stronger ability to capture actual readmission cases. Furthermore, by utilizing 20 universally available clinical variables rather than region-specific or complex unstructured data, our model offers higher potential for generalizability and easier integration into diverse clinical decision support systems (CDSS).
Similarly, Pikatza-Huerga et al. reported a voting ensemble trained on data from five hospitals, achieving an AUROC of 0.81 and an accuracy of 0.81.34 While their multicenter study exhibited superior discrimination, our Voting (3 Models) ensemble demonstrated a competitive edge in classification accuracy (0.8413 vs. 0.81) and maintained a high F1-score (0.8195). This suggests that our specific configuration of base learners (Bagging, CatBoost, LSTM) prioritizes the balance of precision and recall, which is often more actionable for resource allocation than AUROC alone. Additionally, our findings regarding the importance of discharge location align with recent work by Esteban-Fernández et al., who emphasized patient characteristics and discharge destination as critical determinants of readmission risk.35
Finally, to contextualize the inherent difficulty of predicting short-term adverse outcomes in critical care, we reviewed studies on related cardiovascular conditions. For instance, Chen et al., predicting MACE in STEMI patients, and Zhao et al., predicting in-hospital mortality, reported accuracies of approximately 0.63–0.78.36,37 Although these studies target different endpoints, they illustrate the challenging nature of modeling acute cardiovascular events. In this context, the stable accuracy of 0.8413 achieved by our proposed framework reflects robust performance.
Interpretability in a clinical context
We employed SHAP values to explain the predictive outputs and quantify the contribution of individual features. Glucose emerged as the leading predictor across both models, followed closely by hemodynamic and inflammatory indicators. A distinct pattern in feature contribution was observed: positive SHAP values were predominant among glucose, HR, platelets, and WBC, indicating that elevated levels in these variables were associated with an increased risk of readmission. Conversely, SBP, DBP, and OS exhibited a protective trend, where higher values contributed to negative SHAP values (reduced risk), while lower values in these vital signs were linked to higher readmission probabilities. These findings suggest that patients with hyperglycemia, inflammatory activation (elevated WBC/platelets), and hemodynamic instability (hypotension, tachycardia, hypoxia) are at the highest risk.
First, our SHAP analysis identified blood glucose as the predominant predictor of 30-day readmission, with elevated levels consistently associated with higher risk. This finding aligns with a growing body of evidence highlighting the deleterious impact of stress hyperglycemia in AHF patients. Even in nondiabetic patients, acute hyperglycemia often reflects a surge in cortisol and catecholamines due to physiological stress, which can exacerbate oxidative stress and endothelial dysfunction. Chun et al. demonstrated that in-hospital glycemic variability and hyperglycemia are strong independent predictors of all-cause mortality and readmission in AHF populations.38 Similarly, Carrera et al. emphasized that admission glucose levels serve as a critical biomarker for poor prognosis.39 Our model's heavy reliance on glucose underscores the importance of strict glycemic monitoring and management as a potential strategy to reduce readmission rates. Clinically, the prominence of glucose in the SHAP profile suggests that peridischarge glycemic instability may be a useful signal for identifying patients who warrant closer physiologic surveillance and more structured transitional-care planning after ICU discharge.38–40
Second, we observed that higher SBP/DBP exhibited a protective trend (negative SHAP values), whereas lower blood pressure was linked to increased risk. This relationship supports the “reverse epidemiology” often observed in HF cohorts, where lower systemic blood pressure suggests severe pump failure, low cardiac output, and an inability to tolerate guideline-directed medical therapy. As noted by Huang et al., admission SBP is inversely associated with 1-year clinical outcomes, with hypotension signaling a significantly worse prognosis due to tissue hypoperfusion.41 Conversely, elevated HR contributed to positive SHAP values, indicating increased risk.
Third, our results showed that elevated WBC and platelet counts were associated with higher readmission probabilities, likely reflecting a state of systemic inflammation and hypercoagulability. Recent studies, such as that by Zhang et al., have corroborated that higher inflammatory indices are significantly correlated with HF severity and adverse events.42 The concurrent identification of low OS as a risk factor further points to residual pulmonary congestion or respiratory compromise. Collectively, these SHAP values suggest that our ensemble model effectively captures a high-risk phenotype characterized by the triad of metabolic derangement (hyperglycemia), hemodynamic instability (hypotension, tachycardia), and active inflammation. From a clinical standpoint, this pattern is compatible with a subset of patients leaving the ICU with unresolved inflammatory and respiratory burden, for whom intensified post-ICU monitoring and early follow-up may be particularly relevant within transitional-care pathways.40,42
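For intuition about how these signed contributions arise, SHAP has an exact closed form for a linear model: the contribution of feature j for a given patient is w_j * (x_j - E[x_j]). The sketch below uses a hypothetical cohort and coefficients chosen only to mirror the risk-raising/protective split reported here; it is not our fitted ensemble.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = ["glucose", "hr", "wbc", "sbp", "dbp", "os"]

# Hypothetical standardized cohort: risk rises with glucose/HR/WBC and
# falls with SBP/DBP/OS, mirroring the SHAP directions described above.
X = rng.normal(size=(1000, 6))
logit = (1.2 * X[:, 0] + 0.8 * X[:, 1] + 0.6 * X[:, 2]
         - 0.9 * X[:, 3] - 0.5 * X[:, 4] - 0.7 * X[:, 5])
y = (logit + rng.normal(size=1000) > 0).astype(int)

model = LogisticRegression(max_iter=2000).fit(X, y)

# Exact SHAP values for a linear model with independent features, on the
# log-odds scale: w_j * (x_j - E[x_j]).
shap_values = model.coef_[0] * (X - X.mean(axis=0))

# A patient with high glucose and low SBP receives positive (risk-raising)
# contributions from both features: the glucose excess and the SBP deficit.
patient = np.array([2.0, 0.0, 0.0, -2.0, 0.0, 0.0])
patient_shap = model.coef_[0] * (patient - X.mean(axis=0))
print(dict(zip(features, patient_shap.round(3))))
```

The same additive logic underlies the SHAP values of the ensemble, except that the tree and network contributions must be estimated rather than read off a coefficient vector.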
Clinical perspective and practical impact
In large cohort studies, 30-day readmission rates after an HF hospitalization typically range from approximately 13% to 25% and can be even higher in certain Medicare populations,43 imposing a substantial burden on mortality, quality of life, and healthcare expenditures. Existing systematic reviews have shown that most readmission prediction models achieve a c-statistic in the range of 0.55–0.75, indicating only moderate discriminative performance.6 Consequently, such models are generally used in clinical practice as supportive risk stratification tools rather than as stand-alone determinants of individual patient management decisions.6
In this study, the best-performing configuration was the KNN-imputed Voting (3 Models) ensemble. Its AUROC fell within the moderate performance range commonly reported for readmission models.6 Importantly, using only core structured variables collected during the ICU stay, without incorporating free-text notes or specialized scales, the model achieved moderate discrimination and acceptable calibration. This suggests that such a data configuration has nontrivial potential for HF ICU 30-day readmission risk stratification. Within this context, our HF ICU readmission model can be positioned as an EHR-based risk stratification tool that combines moderate discriminatory ability with reasonable calibration. The primary goal is to supply additional risk information prior to ICU or hospital discharge, helping clinicians identify relatively high-risk patients and informing decisions about the intensity of transitional care.
Clinically, if the 30-day readmission rate after HF hospitalization is around 20%, approximately one in five discharged patients will be readmitted in the short term.43 At the same time, transitional care interventions for high-risk HF patients, such as follow-up by specialized HF nurses, structured telephone calls, home visits, and dedicated HF clinics, are resource-intensive and cannot be universally provided to all patients. Prior randomized trials and systematic reviews indicate that sustained transitional care interventions targeted to selected high-risk HF populations may reduce readmissions and mortality.40,44 Under these circumstances, even a model with only moderate discrimination may, provided that recall, F1-score, and probability calibration remain reasonably stable, theoretically assist clinical teams in allocating limited transitional care resources more efficiently. For example: (1) using predicted risk scores as an adjunctive input to prioritize early outpatient visits or telehealth follow-up for patients classified as higher risk; (2) using relatively favorable Brier scores, as an indicator of more reliable probability estimates, to inform threshold setting for different levels of intervention intensity; and (3) recognizing that these potential benefits remain theoretical at this stage and will require confirmation in future interventional studies.
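The threshold-based prioritization in point (1) can be made concrete with a net-benefit calculation of the kind used in decision-curve analysis, where net benefit = TP/n - FP/n * t/(1 - t) at risk threshold t. The cohort and risk scores below are simulated purely for illustration.

```python
import numpy as np

def net_benefit(y_true, risk, threshold):
    """Net benefit of intervening on all patients with risk >= threshold."""
    n = len(y_true)
    treat = risk >= threshold
    tp = np.sum(treat & (y_true == 1))  # flagged patients who are readmitted
    fp = np.sum(treat & (y_true == 0))  # flagged patients who are not
    return tp / n - fp / n * threshold / (1 - threshold)

rng = np.random.default_rng(7)
# Hypothetical cohort with ~20% readmission rate and a weakly informative score.
y = (rng.random(2000) < 0.2).astype(int)
risk = np.clip(0.2 + 0.15 * (y - 0.2) + rng.normal(0, 0.08, 2000), 0.01, 0.99)

# Net benefit at a few candidate intervention thresholds.
for t in (0.15, 0.20, 0.30):
    print(f"threshold={t:.2f}  net benefit={net_benefit(y, risk, t):+.4f}")
```

A model is only worth acting on at thresholds where its net benefit exceeds both "intervene for everyone" and "intervene for no one," which is the comparison a decision-curve analysis formalizes.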
However, because the present model was developed using a single-center ICU cohort from MIMIC-III, its transportability cannot be assumed and requires explicit external validation. Single-center critical care datasets may encode site-specific case mix, ICU admission/discharge thresholds, local practice patterns (e.g., laboratory ordering frequency, hemodynamic monitoring intensity), and transitional-care capacity, all of which can shift both feature distributions and baseline 30-day readmission risk. In addition, MIMIC-III represents one institution (Beth Israel Deaconess Medical Center) and a specific historical period (2001–2012),15 and temporal changes in HF management and discharge pathways may lead to performance drift when applied to contemporary cohorts. Accordingly, the model should be viewed as a promising structured-data prototype whose generalizability must be demonstrated through staged external validation rather than inferred from internal testing alone.
To address this limitation and align with established reporting and validation guidance,45 we propose the following external validation roadmap: (i) temporal validation within the same health system using a more recent-era dataset to quantify time-related drift in discrimination and calibration;46 (ii) geographic validation in independent multicenter ICU datasets (e.g., eICU-CRD) to test transportability across hospitals with different workflows, measurement frequency, and documentation patterns;47 and (iii) local-site validation in the intended deployment environment, ideally as a prospective “silent-mode” evaluation where predictions are generated but not shown to clinicians until calibration performance and alert burden are acceptable. For each external validation, evaluation should extend beyond AUROC to include calibration-in-the-large, calibration slope, Brier score, and clinically oriented utility analyses to assess net benefit across plausible risk thresholds.48 In addition, external validation should examine performance stability across clinically relevant subgroups and under differing missingness patterns, given that imputation behavior may vary across sites and time periods. If miscalibration is observed, pragmatic options include recalibration (intercept/slope updating) and, if necessary, limited model updating using local data while preserving the original feature set to maintain implementability. Beyond external validation, clinical impact evaluation (e.g., pragmatic trials or stepped-wedge implementation studies) would be required to determine whether risk-guided transitional care allocation improves outcomes without undue alert burden.
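The intercept/slope updating mentioned above can be sketched as a logistic regression of the observed outcome on the logit of the original model's predictions; the "new site" below is simulated, with a hypothetical model that is systematically overconfident.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(1)

# Simulated external cohort: true risks, plus model predictions whose
# logits are stretched (slope 2) and shifted (+0.5) relative to the truth.
true_logit = rng.normal(0.0, 1.0, size=5000) - 1.5
y = (rng.random(5000) < 1 / (1 + np.exp(-true_logit))).astype(int)
model_logit = 2.0 * true_logit + 0.5
p_old = 1 / (1 + np.exp(-model_logit))

# Logistic recalibration: refit only an intercept and slope on the
# original model's logit, leaving the underlying model untouched.
recal = LogisticRegression().fit(model_logit.reshape(-1, 1), y)
slope = recal.coef_[0, 0]  # a well-calibrated model would give ~1.0
p_new = recal.predict_proba(model_logit.reshape(-1, 1))[:, 1]

print(f"calibration slope={slope:.3f}  "
      f"Brier before={brier_score_loss(y, p_old):.4f}  "
      f"after={brier_score_loss(y, p_new):.4f}")
```

A fitted slope well below 1 flags overconfident predictions; updating only the intercept and slope corrects the probability scale without refitting the feature weights, preserving implementability at the new site.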
From a practical standpoint, in terms of system-level usability and scalability, our model primarily relies on structured fields commonly available in most inpatient EHR systems—such as vital sign flowsheets, laboratory codes, and basic administrative variables—rather than on hospital-specific free-text fields or bespoke assessment scales.16,49,50 In practice, this design may allow technical efforts during implementation to focus mainly on mapping data fields and configuring extraction pipelines, without requiring the creation of extensive new paper forms or manual data entry workflows. However, whether the model can in fact be ported smoothly across hospitals and EHR vendors remains an empirical question that depends on real-world deployment and local evaluation.
Furthermore, from the perspective of clinical implementation, the proposed model can be considered within conceptual use cases that may facilitate staged introduction into practice. Drawing on existing experience with EHR-based early-warning and readmission-risk tools,40 we infer that a similar “phased implementation” approach could potentially be applied to HF ICU readmission models. For example, a hospital might consider the following stepwise strategy:
★ Silent phase: The model runs in the background and periodically computes 30-day readmission risk for HF ICU patients approaching discharge, but the outputs are not shown to clinicians. The primary purpose is to evaluate local data quality, calibration, and outcome rates across risk strata.
★ Limited-exposure phase: Risk scores and basic explanatory information are selectively presented during discharge planning meetings or multidisciplinary case conferences, serving as an adjunctive reference while feedback is collected from clinicians regarding usability and acceptable alert frequency.
★ Broader deployment phase: If the preceding phases are satisfactory, the hospital may consider highlighting a subset of higher-risk patients on ward dashboards or discharge lists, linking these flags to preexisting high-risk care pathways or transitional care protocols.
On this basis, one may further hypothesize a clinical usage scenario in which, as an HF ICU patient approaches discharge, the system automatically retrieves recent vital signs and laboratory results (processed with KNN imputation), computes the 30-day readmission risk, and flags patients whose risk exceeds a predefined threshold on a ward dashboard. When a patient is labeled “high-risk,” the care team can treat this as an auxiliary signal, integrating it with clinical judgment to decide whether to initiate a more intensive high-risk discharge pathway, such as prioritizing early in-person or telehealth follow-up; arranging enhanced medication review and self-management education by a pharmacist or HF nurse; or, where appropriate, referring the patient to a specialized HF program or home-based care service.40,44
If the risk score is accompanied by SHAP-based visual explanations that highlight the most influential features for the current prediction (such as elevated glucose, low SBP or DBP, increased inflammatory markers, or reduced OS), this may help clinicians understand the dominant risk signals identified by the model and incorporate that information into their overall assessment.42
Conclusions
This study makes several contributions to the prediction of 30-day readmission among HF patients discharged from the ICU, particularly at the levels of methodological design and model development, with a focus on data utilization strategies, missing-data handling and ensemble evaluation, and model interpretability. At the same time, several limitations warrant careful consideration.
First, with respect to data and missing-data strategies, the models were built on routinely collected structured EHR data from the ICU stay, including demographic information, vital signs, and laboratory tests. This feature space is broadly consistent with that used in prior HF readmission and prognosis models.6,49,50 On this basis, we systematically compared two missing-data handling approaches: KNN imputation and listwise deletion. The AdaBoost model trained on the listwise-deleted data achieved a relatively higher test accuracy, but its AUROC was only 0.5381, indicating limited discriminatory ability. In contrast, under KNN imputation, the Bagging classifier achieved a slightly lower accuracy yet reached an AUROC of 0.6166, providing a more balanced trade-off between sensitivity and specificity.
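A minimal sketch of the practical difference between the two strategies, using scikit-learn's KNNImputer on a synthetic matrix with missingness injected at random (the rates and dimensions are hypothetical, not those of our cohort):

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 6))

# Knock out ~15% of entries at random to mimic EHR missingness.
mask = rng.random(X.shape) < 0.15
X_missing = X.copy()
X_missing[mask] = np.nan

# Listwise deletion: keep only fully observed rows.
complete_rows = ~np.isnan(X_missing).any(axis=1)
X_deleted = X_missing[complete_rows]

# KNN imputation: fill each gap from the nearest neighbors on the observed
# features, preserving the full sample size and multivariate structure.
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X_missing)

print(f"rows kept by listwise deletion: {X_deleted.shape[0]} / 500")
print(f"rows kept by KNN imputation:   {X_imputed.shape[0]} / 500")
```

At a 15% per-entry missingness rate, only about 0.85^6 ≈ 38% of rows survive listwise deletion across six features, which illustrates the information loss that the imputation strategy avoids.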
Subsequent ensemble modeling identified the KNN-imputed Voting (3 Models) ensemble as the optimal predictive framework. While the Voting (5 Models) ensemble achieved a marginally higher accuracy (0.8432), the Voting (3 Models) ensemble yielded a markedly higher AUROC (0.6718) while maintaining robust accuracy (0.8413). Taken together, these findings suggest that in this dataset, a KNN-based strategy that preserves incomplete cases, combined with a selective voting ensemble (Bagging, CatBoost, LSTM), is more conducive to a model that simultaneously maintains discrimination and probability calibration than strategies that simply discard incomplete records or indiscriminately aggregate weaker learners.
Second, in terms of model interpretability, this study illustrates how SHAP values can be used within a multimodel soft-voting ensemble framework to quantify the relative contribution of individual features to predicted risk. In our dataset, features such as blood glucose, blood pressure (SBP/DBP), HR, WBC, platelet count, and OS exhibited particularly prominent effects in the SHAP distributions. Higher glucose and inflammation-related indices were generally associated with increased predicted risk, consistent in direction with previous work showing that stress hyperglycemia and inflammatory states in AHF are linked to adverse outcomes.38,39,42 Lower blood pressure and OS tended to be associated with higher risk, aligning with the clinical observation that hemodynamic instability and impaired oxygenation are common in more severely ill patients. Although these SHAP-based findings are derived from a single database and a specific model architecture and are therefore insufficient for causal inference, they indicate that, in the present context, interpretable ML methods can help delineate a set of high-risk feature patterns that accord with clinical intuition and are broadly consistent with other SHAP-based HF risk modeling studies.33,42
These findings underscore the potential of interpretable AI models as integral components of CDSS. By integrating this model into EHR systems, real-time risk predictions could support proactive patient management. The transparency afforded by SHAP enhances clinical trust and allows for evidence-based, personalized interventions.
With respect to limitations, despite the use of a rigorous train–validation–test procedure within a single database and evaluation of model performance using multiple metrics (ROC/PR curves, calibration curves, and Brier scores), several caveats remain. First, the model was developed in a single dataset and a specific HF ICU population; both the feature space and care processes may differ across institutions or regions. The apparent advantages of KNN imputation and Voting ensembles observed in this study may not generalize directly to other settings. Future work should therefore conduct external and prospective validation across different hospitals, HF ICU populations, and EHR infrastructures6,16 to assess discrimination and calibration in new contexts.
Second, the discussion of system implementation and clinical application in this study remains largely conceptual. For example, the descriptions of “phased implementation” and a “high-risk discharge pathway” are proposed as plausible use cases informed by prior literature on EHR-based risk models and transitional care,42,44 rather than processes that have been deployed and evaluated in a real-world EHR environment. We have not yet examined how such implementation strategies would affect readmission rates, clinical workload, or interprofessional collaboration. Consequently, these designs should currently be regarded as conceptual blueprints that require refinement and confirmation through local pilot testing, user-centered studies, and prospective intervention research.
Finally, although the SHAP analysis provides feature-level information on relative importance, the results are derived from observational data and reflect associations rather than causal relationships. To more deeply investigate the causal roles of specific variables (e.g., changes in blood glucose or inflammatory markers) in HF readmission risk, future work will need to incorporate clinical trials or carefully designed causal inference methodologies. In addition, subsequent studies may build on the present model framework to explore more parsimonious feature subsets or regularization strategies, thereby facilitating easier implementation and maintenance of the model across diverse clinical settings.
Footnotes
Acknowledgments
The authors would like to express their gratitude to the National Science and Technology Council, Taiwan, for its support, and to National Taipei University of Technology for providing administrative and institutional assistance.
Ethics approval and consent to participate
This study used the de-identified MIMIC-III database, which was approved by the Institutional Review Boards of MIT and Beth Israel Deaconess Medical Center. As the data are fully de-identified, additional ethics approval and informed consent were not required.
Contributorship
C-MC conceived the study design, collected and organized the data, and drafted the initial manuscript. C-SW supervised the research process and provided critical revisions. H-YC performed data analysis and contributed to manuscript revision. B-YL developed the methodological framework and contributed to manuscript writing and revision. T-NC provided data and assisted with data collection and organization. All authors contributed to the refinement of the manuscript, and all authors read and approved the final version.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by the National Science and Technology Council (NSTC), Taiwan, through Grant Number NSTC 113-2410-H-027-014.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Availability of data and materials
Declaration of AI and AI-assisted technologies in the writing process
AI-assisted language tools were used to improve readability and language presentation (e.g., spelling and grammar). The authors reviewed and edited the manuscript as needed and take full responsibility for the integrity, accuracy, and final content of the publication.
Appendix
Experimental environment and software versions
scikit-learn 1.5.1
imbalanced-learn 0.12
Hyperparameter configurations for all models (KNN Imputation)
Hyperparameter configurations for all models (Listwise Deletion)
