Abstract
Objectives
Heart failure (HF) patients admitted to intensive care units are prone to early readmission, which leads to adverse outcomes and increased healthcare costs. Existing prediction models often suffer from data heterogeneity, class imbalance, and limited interpretability. This study aimed to develop an interpretable ensemble learning framework to predict 30-day ICU readmission in adult patients with HF and to compare its performance with conventional single-classifier approaches.
Methods
This retrospective study analyzed 5414 adult HF patients from the MIMIC-III database. Clinical and demographic variables collected within the first 24 h of the index ICU admission were used to predict 30-day ICU readmission (return to the ICU). A two-stage ensemble model was developed using stratified sampling and grid-search optimization, with top learners integrated via a soft-voting mechanism. Additionally, SHapley Additive exPlanation (SHAP) analysis was employed to ensure model interpretability and quantify variable contributions to the predictions.
Results
The KNN-imputed Voting (3 Models) ensemble emerged as the optimal framework, achieving an accuracy of 0.8413, F1-score of 0.8195, and AUROC of 0.6718. Despite moderate AUROC, the model achieved strong recall and reliable calibration, making it suitable for risk stratification in post-ICU care transitions. The SHAP analysis identified Glucose, hemodynamic parameters (e.g., blood pressure, heart rate), and inflammatory indicators as key predictors, aligning with established clinical understanding of stress hyperglycemia and hemodynamic instability in HF.
Conclusion
This interpretable ensemble framework predicts 30-day ICU readmission in HF patients with robust performance, effectively balancing sensitivity and discrimination. It supports electronic health record–based risk stratification and timely intervention. Future work should focus on external validation across diverse populations to ensure generalizability.
Introduction
Over the past decade, the integration of artificial intelligence (AI) and machine learning (ML) into healthcare has been increasingly adopted to support clinical decision-making and improve operational efficiency.1–3 These technologies support the analysis of complex datasets by combining physiological indicators with detailed patient demographics and have shown promise in enhancing the accuracy of outcome predictions such as 30-day ICU readmission risk. Here, “30-day readmission” refers specifically to unplanned ICU readmission (i.e., return to the ICU) within 30 days after the index ICU discharge, rather than all-cause hospital readmission after hospital discharge. Through early identification of high-risk individuals, AI and ML facilitate timely interventions and more efficient resource allocation, aligning with international health priorities such as Sustainable Development Goal 3, which advocates for technological innovation to improve health services.4,5 In high-risk cardiovascular populations such as patients with heart failure (HF), existing ML-based 30-day readmission prediction models have already demonstrated promising discriminative performance while exposing important methodological and implementation gaps that motivate further refinement.6 Despite advances in predictive analytics, ICU readmissions continue to present significant clinical and financial burdens. Preventable readmissions are estimated to add substantially to healthcare costs, placing avoidable strain on health systems.7,8 In response, the U.S. Affordable Care Act introduced the Hospital Readmissions Reduction Program (HRRP) in 2010, which imposes penalties on hospitals with excessive readmission rates. Although the HRRP has demonstrated modest success, persistent challenges—including care coordination gaps, disease complexity, and socioeconomic disparities—underscore the need for more effective and scalable prediction strategies.9,10
Among high-risk cardiovascular conditions, HF is a major driver of hospital admissions and rehospitalization. Recent estimates indicate that ∼64.3 million people were living with HF worldwide in 2017, including ∼6.0 million adults in the United States (2015–2018).11 ICU patients with HF are particularly vulnerable due to clinical instability and complex postdischarge care needs.12 Consequently, timely prediction of 30-day ICU readmissions in this population is critical for reducing adverse outcomes and managing hospital resources.
Conventional approaches, however, often fall short due to their limited capacity to model high-dimensional, heterogeneous data and their inability to effectively address class imbalance. Moreover, many prior models for ICU readmission rely on small, proprietary single-center datasets with restricted external validity.13,14 Although MIMIC-III is single-institutional, its scale (over 53,000 adult ICU admissions) and open access make it a valuable resource for transparent, reproducible modeling while also enabling rigorous internal transportability checks to mitigate single-center bias.15
In summary, while existing research has revealed the limitations of small-scale and single-center models, broader challenges remain in balancing predictive accuracy with interpretability to ensure effective clinical implementation. To address these issues, ensemble learning methods—such as those combining Random Forest and XGBoost—offer a compelling solution due to their robustness against overfitting and their capacity to synthesize multiple data signals. When coupled with explainable AI approaches such as SHapley Additive exPlanation (SHAP), these methods hold promise for both accurate prediction and clinically interpretable insights in critical care.16
This study focused on adult patients with a primary diagnosis of HF, considering only the first ICU admission per patient. The outcome of interest was unplanned 30-day ICU readmission. Using structured variables from the MIMIC-III database (with detailed definitions provided in the Appendix), we developed and validated an ensemble classifier optimized for imbalanced clinical data. The framework first evaluated diverse base learners and then integrated top-performing models through a soft-voting mechanism. To enhance clinical interpretability, SHAP-based explanations were incorporated, facilitating clinician trust and decision support. Beyond readmission forecasting, the design can extend to broader applications, such as ICU bed demand prediction, dynamic staffing allocation, or identifying patients requiring extended rehabilitation and follow-up support.1,2,17 The key contributions of this study are threefold: (1) developing a robust ensemble learning model for 30-day ICU readmission risk in HF patients, (2) incorporating SHAP-based interpretability to strengthen transparency and applicability, and (3) presenting a reproducible methodology adaptable to other disease contexts or healthcare settings.
Materials and methods
Study population
This retrospective cohort study utilized data from the publicly available MIMIC-III (v1.4) database, which includes de-identified electronic health records (EHRs) for over 40,000 critical care patients at the Beth Israel Deaconess Medical Center between 2001 and 2012. Ethical access to the MIMIC-III database was obtained by completing a recognized human research participants protection course and securing approval through the formal PhysioNet access process (Certificate No. 35628530).15 From an initial 61,532 ICU stays, we restricted the sample to the first ICU stay per patient, applied predefined inclusion and exclusion criteria, and handled missing data as described in the “Data Cleaning” subsection, yielding a final analytic cohort of 5414 adult HF ICU stays with complete predictor and outcome data.
An initial set of 20 clinically relevant predictors was defined a priori, based on their established use in HF mortality and readmission models and their availability across ICU stays.16,18,19 Variables that are not known or not actionable at the time of ICU discharge—such as postdischarge mortality during follow-up—and sensitive sociodemographic attributes such as religion were explicitly excluded from the candidate predictor set to reflect real-world decision points and to avoid potential fairness concerns. The final list of predictors (e.g., vital signs, common laboratory parameters, Glasgow Coma Scale components, age, and sex) is provided in the Appendix.
Research process
The overall research workflow for this study is illustrated in Figure 1 and depicts the sequential pipeline used to develop and compare single and ensemble models for predicting 30-day ICU readmission among HF patients using MIMIC-III data.15,16 The flowchart comprises four main stages: (1) Data Input, (2) Data Preprocessing, (3) Model Development, and (4) Model Evaluation.

Research process.
The extraction of MIMIC-III records and the definition of the target HF population are described in “Study population” section. The subsequent data preprocessing procedures—including data cleaning, cohort selection, predictor construction, and missing-data handling—are detailed in “Data cleaning” section. To address class imbalance in 30-day readmission outcomes, imbalance-correction techniques are applied only to the training set, as described in “Imbalanced data handling” section.18,20,21 “Model development” section presents the model development stage, in which multiple baseline single classifiers are trained and compared, followed by the construction of ensemble models to enable a systematic single-model versus ensemble-model comparison. Finally, “Model evaluation” section summarizes the model-evaluation procedures, including discrimination and calibration metrics as well as SHAP-based explainability analyses, which are used to quantify predictive performance and to enhance the transparency and reproducibility of the proposed models.18,22–24
Data cleaning
The initial dataset comprised 61,532 ICU stays; restricting the sample to each patient's first ICU stay yielded 46,520 unique admissions.15,19 We then excluded ICU stays shorter than 24 h to ensure that all predictors could be consistently derived from the first 24 h after ICU admission, and we removed patients younger than 16 years. Heart failure cases were identified based on ICD-9-CM codes. Specifically, adult patients (aged ≥16 years) with HF were identified using ICD-9-CM codes 398.91, 402.01, 402.11, 402.91, 404.01, 404.03, 404.11, 404.13, 404.91, 404.93, and 428.xx, consistent with prior HF ICU studies,16,19 yielding 7278 eligible HF ICU stays (Figure 2).

Data selection process.
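As an illustrative sketch (not the study's extraction code), the ICD-9-CM cohort filter described above can be expressed in Python. The helper names and record layout are hypothetical; note that MIMIC-III stores ICD-9 codes without decimal points (e.g., "4280" for 428.0).

```python
# Hypothetical sketch of the ICD-9-CM heart-failure cohort filter.
HF_CODES = {"39891", "40201", "40211", "40291",
            "40401", "40403", "40411", "40413", "40491", "40493"}

def is_hf_diagnosis(icd9_code: str) -> bool:
    """Return True if an ICD-9-CM code marks heart failure (428.xx or a listed code)."""
    code = icd9_code.replace(".", "")
    return code in HF_CODES or code.startswith("428")

def select_hf_stays(stays):
    """Keep first ICU stays of at least 24 h for adults (>= 16 y) with an HF code."""
    return [s for s in stays
            if s["los_hours"] >= 24
            and s["age"] >= 16
            and any(is_hf_diagnosis(c) for c in s["icd9_codes"])]
```

Applied to the full MIMIC-III extract, a filter of this shape yields the 7278 eligible HF ICU stays reported above.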
Missing data were processed using a three-stage procedure adapted from the “30–40–20” strategy proposed in prior MIMIC-III work.18,19 First, patient records with more than 30% missing values across the candidate predictors were excluded.18,19 Second, predictors with more than 40% missingness were removed from the feature set.18,19 Third, among the remaining variables, features with more than 20% missingness were excluded. This procedure resulted in a final KNN-imputed modeling set of 5414 HF ICU admissions with predictors that satisfied the predefined missing-data thresholds and were used for subsequent model development.
For the residual sporadic missing entries (<20%), we constructed two parallel analysis datasets to handle missingness: one using listwise deletion of records with any remaining missing values and one using k-nearest neighbors (KNN) imputation to preserve multivariate relationships.18,19 Both strategies were carried forward to the modeling stage and comparatively evaluated in the experimental analysis.
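A minimal sketch of the two parallel strategies, on illustrative toy data (the study used the 20 clinical predictors): scikit-learn's `KNNImputer` implements the neighbor-based filling described above.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy feature matrix with sporadic missing entries (np.nan).
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan],
              [5.0, 6.0]])

# Strategy 1: listwise deletion — drop every row containing a missing value.
X_listwise = X[~np.isnan(X).any(axis=1)]

# Strategy 2: KNN imputation — fill each gap from the k most similar rows,
# preserving multivariate relationships and the full sample size.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```

The trade-off is visible even here: listwise deletion shrinks the sample, while KNN imputation retains all rows at the cost of estimated values.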
The outcome of interest was unplanned 30-day ICU readmission after index ICU discharge. This outcome is conceptually distinct from general hospital readmission, which refers to rehospitalization after hospital discharge. Admissions without complete 30-day readmission information or those exceeding the missing-data thresholds were excluded. After applying these eligibility criteria and the missing-data procedure, the final analytic cohort comprised 5414 ICU stays for adult HF patients with complete outcome and predictor data, as summarized in the patient selection flowchart.16,18,19
Imbalanced data handling
The naturally imbalanced distribution between patients who were readmitted and those who were not presents a significant challenge for predictive models. This imbalance can bias models toward the majority class, leading to reduced sensitivity in identifying true readmission cases, which are of high clinical importance. To mitigate this issue, data balancing techniques were applied exclusively to the training dataset.
A hybrid resampling approach was employed. The Synthetic Minority Oversampling Technique (SMOTE) was used to generate synthetic samples for the minority class, thereby enhancing its representation without duplicating existing records. Concurrently, Tomek Links were utilized to reduce class overlap and eliminate ambiguous instances, thus improving the separation between classes. These techniques were applied only to the training dataset to prevent data leakage into the validation and testing stages.20,21
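In practice this hybrid step is typically implemented with imbalanced-learn's `SMOTE`/`SMOTETomek`. The core SMOTE idea — creating a synthetic minority sample by interpolating between a real sample and one of its k nearest minority neighbors — can be sketched in plain Python (illustrative only; the Tomek-link cleaning step is omitted here):

```python
import random

def smote_oversample(minority, n_new, k=3, seed=0):
    """Minimal sketch of SMOTE: synthesize n_new minority samples by linear
    interpolation between a sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest minority neighbors of a (excluding a itself), by squared distance
        neighbors = sorted((p for p in minority if p is not a),
                           key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)))[:k]
        b = rng.choice(neighbors)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(x + lam * (y - x) for x, y in zip(a, b)))
    return synthetic
```

Because each synthetic point lies on a segment between two real minority samples, the minority region is densified without duplicating records — the property the text relies on.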
This strategy improved the models' ability to detect both readmitted and nonreadmitted patients, contributing to more stable and accurate classification outcomes.
Model development
Building on the preprocessed cohort described in “Data cleaning” and “Imbalanced data handling” sections, the final analytic dataset was randomly split into training, validation, and testing subsets using stratified sampling to preserve the observed 30-day ICU readmission rate. The training set was used for model fitting and, where necessary, for class-imbalance handling; the validation set was reserved for hyperparameter tuning and model selection; and the held-out test set was used exclusively for the final performance evaluation.
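Assuming the 8:1:1 ratio reported in the Results, the stratified three-way split can be sketched with scikit-learn (the seed and function name are illustrative):

```python
from sklearn.model_selection import train_test_split

def stratified_811_split(X, y, seed=42):
    """Sketch of a stratified 80/10/10 train/validation/test split that preserves
    the observed 30-day readmission rate in every subset."""
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```

Stratifying both splits keeps the minority (readmitted) class proportion stable across all three subsets, which matters for the imbalance handling applied afterward to the training set only.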
Model construction followed the two-experiment framework illustrated in Figure 3. In Experiment 1 (single-model comparison), a unified missing-value procedure produced two versions of the data: one constructed via listwise deletion and the other via KNN imputation. These two strategies were chosen because they represent widely used and conceptually distinct approaches in clinical ML applications: listwise deletion provides a simple, transparent complete-case analysis,25 whereas KNN imputation offers a more flexible, data-driven method that retains all observations by borrowing information from similar patients.26 Under each imputation strategy, 14 candidate machine-learning classifiers—including logistic regression, decision tree, random forest, gradient boosting, XGBoost, LightGBM, CatBoost, bagging, extra trees, support-vector machine, k-nearest neighbors, and neural-network-based models—were trained on the (balanced) training set and evaluated using the validation and test sets.13,18,27,28 This yielded two families of single-model predictors corresponding to listwise deletion and KNN imputation.

Model development framework.
Experiment 2 (ensemble-model comparison) then focused on the best-performing single model selected under each imputation strategy. These two single models were contrasted with ensemble counterparts trained on the same feature representations. For each imputation strategy, soft-voting ensembles were constructed by aggregating the highest-ranked base learners; specifically, two ensemble configurations were explored that combined the top three and the top five classifiers, respectively, according to validation accuracy. All four models—two ensembles and two best single models—were finally compared on the held-out test set to assess the added value of ensemble learning under different missing-data handling strategies.
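A soft-voting ensemble of this kind can be sketched with scikit-learn's `VotingClassifier`. The base learners below are illustrative stand-ins on synthetic data — the study's top-3 configuration combined the Bagging Classifier, CatBoost, and an LSTM, which are not all reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Illustrative binary-classification data standing in for the HF cohort.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Soft voting averages the predicted class probabilities of the base learners.
voting3 = VotingClassifier(
    estimators=[("bag", BaggingClassifier(random_state=0)),
                ("rf", RandomForestClassifier(random_state=0)),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="soft")
voting3.fit(X, y)
proba = voting3.predict_proba(X)  # averaged probabilities, shape (n_samples, 2)
```

Ranking candidate base learners by validation accuracy and passing the top three (or five) into `estimators` reproduces the two configurations compared in Experiment 2.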
Model evaluation
Model performance was assessed using accuracy, precision, recall, F1-score, and AUROC. Since the dataset was imbalanced, recall and F1-score were given particular attention when evaluating the model's sensitivity to correctly identify patients at risk of 30-day ICU readmission.14,29 The AUROC was used to assess discrimination between classes, with a higher value indicating stronger classification ability.30
Accuracy (A): Accuracy was defined as the ratio of correctly predicted observations to the total observations.
F1 Score: The F1 score represented the harmonic mean of precision and recall, useful for imbalanced classification tasks.
Precision (P): Precision measured the proportion of true positives among predicted positives.
Recall (R): Recall measured the proportion of actual positive instances that were correctly identified by the model.
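For concreteness, the four metrics defined above reduce to simple functions of the confusion-matrix counts (TP, FP, FN, TN):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # true positives among predicted positives
    recall = tp / (tp + fn)             # true positives among actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

With an imbalanced outcome such as 30-day readmission, accuracy alone can look strong while recall is poor, which is why the F1-score and recall receive particular attention here.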
To evaluate model calibration, we computed Brier scores and generated calibration plots comparing predicted and observed 30-day readmission probabilities across risk strata, thereby assessing the agreement between estimated risks and empirical event rates. In addition, we summarized threshold-dependent behavior using ROC curves and precision–recall (PR) curves. These graphical assessments complement scalar summary statistics by providing visual insight into prediction reliability and potential regions of miscalibration. For post hoc interpretability, we further applied SHAP and visualized feature attributions.16,31
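The Brier score used here is simply the mean squared difference between predicted probabilities and observed binary outcomes; scikit-learn's `brier_score_loss` and `calibration_curve` provide equivalent functionality for the score and the calibration plots. A minimal sketch:

```python
def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and binary outcome.
    Ranges from 0 (perfect) to 1 (worst); lower values indicate better calibration."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)
```

Unlike AUROC, which depends only on the ranking of predictions, the Brier score penalizes miscalibrated probabilities directly, which is why both are reported.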
Results
Population description
The study cohort included 5414 adult patients with a diagnosis of HF based on the above ICD-9-CM discharge codes, extracted from the MIMIC-III database following preprocessing. A total of 12.5% of these patients were readmitted to the ICU within 30 days. Readmitted patients were older on average (73.6 vs. 69.5 years), while sex distribution showed no substantial difference between the groups. Admission type, insurance type, and discharge location differed markedly between the two groups. Readmitted patients exhibited a higher proportion of emergency admissions and greater reliance on Medicare coverage. They were also less frequently discharged home and exhibited a significantly higher mortality rate (23.01%) compared to the nonreadmitted group (1.88%). Physiologically, readmitted patients presented higher blood urea nitrogen and creatinine levels and lower systolic and diastolic blood pressure (SBP/DBP). Table 1 presents these demographic and clinical differences, emphasizing the predictive relevance of baseline characteristics.
Selected patient demographic information.
Model development and evaluation
According to the Research Framework illustrated in Figure 3, this study was conducted in two experiments: Experiment 1 – Single Model Comparison and Experiment 2 – Ensemble Model Comparison. To ensure robust model evaluation, we conducted multiple experiments on data partitioning strategies and identified that an 8:1:1 ratio for the training, validation, and testing sets, respectively, yielded the optimal configuration. Consequently, the experimental results reported in this section are derived exclusively from the held-out test set. As summarized in the framework, Experiment 1 first evaluated the performance of 14 single classifiers on the balanced training set under two distinct missing-data handling strategies. The test-set results utilizing the KNN imputation strategy are presented in Table 2, while the results derived using the listwise deletion method are summarized in Table 3. All implementation details are provided in the Appendix.
The predictive performance of single models using KNN imputation.
The predictive performance of single models using listwise deletion.
Under the KNN imputation strategy (Table 2), the Bagging Classifier achieved the highest accuracy (0.8229), followed closely by LightGBM (0.8210) and CatBoost (0.8173). Regarding discrimination, Random Forest yielded the highest AUROC (0.6670), with LSTM (0.6547) and Extra Trees (0.6453) also showing comparatively strong performance in separating classes. In contrast, when applying the listwise deletion strategy (Table 3), although AdaBoost achieved the highest accuracy (0.8306) in this subset, followed by the Bagging Classifier (0.8250), their discriminatory capabilities were notably compromised. Specifically, the top-performing AdaBoost model yielded an AUROC of only 0.5381, indicating poor predictive reliability despite its high accuracy. Comparing the top-performing single models from each approach, the Bagging Classifier from the KNN group demonstrated superior overall stability compared to the AdaBoost model from the listwise deletion group. While AdaBoost showed marginally higher accuracy (0.8306 vs. 0.8229), the Bagging Classifier achieved a significantly higher AUROC (0.6166 vs. 0.5381), reflecting a better balance between sensitivity and specificity. This comparison suggests that preserving sample size and multivariate relationships through KNN imputation offers a significant advantage in building robust models over removing incomplete records.
Subsequently, following the identification of top-performing base learners in Experiment 1, Experiment 2 (Table 4) evaluated the efficacy of ensemble integration: soft-voting ensembles combining the top three and top five classifiers were developed under both missing-data handling strategies.
Predictive performance of best ensemble models.
Under the KNN imputation strategy (Table 4), the Voting (5 Models) ensemble achieved the highest overall accuracy of 0.8432; however, this configuration showed a reduction in discriminatory power with an AUROC of 0.6110. In contrast, the Voting (3 Models) ensemble—integrating Bagging Classifier, CatBoost, and LSTM—demonstrated a more favorable balance between metrics, yielding a comparable accuracy of 0.8413 while achieving the highest AUROC of 0.6718. This indicates that adding weaker learners to the ensemble (as in the 5-model configuration) diluted the discriminatory capability without providing a meaningful gain in accuracy.
In the listwise deletion cohort (Table 4), the Voting (3 Models) ensemble consistently outperformed the 5-model ensemble, achieving both higher accuracy (0.8389 vs. 0.8333) and AUROC (0.6207 vs. 0.6019). When comparing the optimal frameworks from both strategies, the KNN-based Voting (3 Models) ensemble emerged as the definitive optimal predictive model. It not only surpassed the best listwise deletion ensemble in discriminatory capability (AUROC 0.6718 vs. 0.6207) but also confirmed that preserving multivariate data structure through imputation, combined with selective ensemble integration, yields the most reliable performance for predicting 30-day readmission.
In the second stage of Experiment 2, we conducted an in-depth comparison of four representative models to validate our selection. Although the KNN-based Voting (5 Models) ensemble achieved the highest nominal accuracy (0.8432), the Voting (3 Models) ensemble was selected as the optimal framework. The difference in accuracy was negligible (<0.2%), whereas the Voting (3 Models) ensemble demonstrated significantly higher discriminatory power (AUROC 0.6718 vs. 0.6110), offering a more robust balance for clinical application. Consequently, for the detailed performance analysis (Figure 4), we selected: (1) AdaBoost (Listwise); (2) Bagging Classifier (KNN) as the best single-model reference; (3) the Listwise Deletion-based Voting (3 Models) ensemble; and (4) the KNN-based Voting (3 Models) ensemble as the proposed optimal model for comparison.

Comparison of ROC curves, precision–recall (PR) curves, and calibration curves for top-performing models across different missing-data handling strategies.
We evaluated these four models using ROC curves, PR curves, and calibration curves. First, the ROC Curve analysis assessed the discriminatory capacity across threshold settings. The KNN-based Voting (3 Models) ensemble exhibited robust performance, confirming its superior AUROC of 0.6718 compared to the listwise counterparts. Second, the PR Curve was analyzed to evaluate model stability given the imbalanced outcome. While the Listwise Deletion ensemble achieved a marginally higher area under the PR curve (0.19) compared to the KNN-based ensemble (0.18), the KNN approach maintained a competitive performance that was superior or comparable to single-model baselines. Third, we examined the Calibration Curve to visualize the agreement between predicted probabilities and observed outcome rates. A perfectly calibrated model would follow the diagonal ideal line. Visual inspection indicated that the KNN-based Voting (3 Models) ensemble aligned most closely with the ideal diagonal, demonstrating superior calibration and reliability in risk estimation compared to the listwise deletion models.
Finally, to quantify this calibration quality, we computed the Brier score, where a lower value indicates superior probabilistic accuracy. The results, summarized in the Brier score comparison table, confirmed the visual trends. The KNN-based Voting (3 Models) ensemble achieved the lowest (best) Brier score of 0.1305, outperforming the KNN-based Voting (5 Models) ensemble (0.1338) and demonstrating the highest reliability in risk estimation. This was followed by the KNN-based Bagging Classifier (0.1398). In contrast, the listwise deletion strategy resulted in poorer calibration, with its Voting (3 Models) ensemble scoring 0.1635 and the single AdaBoost model yielding the worst score of 0.1824. In conclusion, the convergence of the highest discriminatory power (AUROC 0.6718) and the most accurate probability calibration (Brier score 0.1305) identifies the KNN imputation strategy combined with the Voting (3 Models) ensemble as the definitive optimal modeling approach for predicting 30-day ICU readmission.
Model interpretation
The SHAP values were used to interpret the decision logic of the optimal ensemble models derived from the two missing-data handling strategies. The results are visualized as shown in Figure 5, which includes SHAP summary plots highlighting the most influential features in predicting 30-day ICU readmission for both the Listwise Deletion (Voting 3 Models) and KNN Imputation (Voting 3 Models) frameworks. The y-axis lists features in descending order of importance, and the x-axis represents the SHAP value impact. Both ensemble strategies consistently identified glucose, heart rate (HR), SBP, platelets, white blood cell (WBC), temperature, respiratory rate, DBP, and oxygen saturation (OS) as the most important features.

SHapley Additive exPlanation (SHAP) summary plots illustrating feature importance and directional impact for the optimal ensemble models.
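For reference, SHAP assigns each feature $i$ its Shapley value — the weighted average of the feature's marginal contribution over all subsets $S$ of the feature set $F$:

$$\phi_i(f, x) = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f_x(S \cup \{i\}) - f_x(S) \right]$$

where $f_x(S)$ denotes the model's expected output when only the features in $S$ are known. The summary plots in Figure 5 display these per-patient attributions $\phi_i$ for every feature.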
Discussion
The ensemble learning algorithm proposed in this study was designed to address the challenge of class imbalance in HF readmission prediction and was validated through a rigorous experimental framework. Across our experiments, ensemble models generally achieved more favorable performance than single classifiers on key metrics (F1-score, recall, and AUROC). This consistency affirms the algorithm's robustness and supports its potential applicability in clinical settings. This also demonstrates the feasibility of integrating predictive accuracy with interpretability through SHAP, aligning with the goal of explainable AI in healthcare.16,31
Building on this foundation, the following sections present a structured discussion of our findings across four dimensions: (1) model performance and description, (2) comparison with existing literature, (3) interpretability in a clinical context, and (4) clinical perspective and practical impact.
Model performance and description
Heart failure remains a critical determinant of readmission, particularly within the ICU population, necessitating robust predictive tools to manage elevated mortality risks and resource allocation. Given the imbalanced outcome distribution described above, we first benchmarked 14 single classifiers under two missing-data strategies—KNN imputation and listwise deletion. Under the KNN imputation setting, tree-based ensemble learners (e.g., Bagging, LightGBM, CatBoost, Random Forest, Extra Trees) tended to yield stronger discrimination and classification performance than linear and distance-based methods in our dataset. In particular, the Bagging Classifier attained the highest test accuracy, while Random Forest and CatBoost provided competitive AUROC values, indicating a stronger ability to separate readmitted from nonreadmitted patients compared with deep neural networks, logistic regression, SVM, and KNN. In contrast, although listwise deletion yielded marginally higher peak accuracy for AdaBoost, its AUROC remained markedly lower than the top-performing KNN-based models in our experiments, revealing that simply discarding incomplete records can inflate accuracy at the expense of reliable discrimination and clinical utility.
After identifying high-performing base learners in Experiment 1, Experiment 2 was organized into two stages:
In the first stage, we focused on the construction and evaluation of ensemble models under two missing-data strategies: KNN imputation and listwise deletion. The results indicate that the KNN-imputed Voting (3 Models) ensemble showed a favorable balance across discrimination and calibration metrics compared with other tested configurations in our experiments. Specifically, this optimal ensemble—comprising Bagging Classifier, CatBoost, and LSTM—achieved an accuracy of 0.8413 and the highest AUROC of 0.6718. This performance notably surpassed the ensemble model derived from the listwise deletion strategy, which yielded a lower AUROC of 0.6207 despite a comparable accuracy of 0.8389. Furthermore, a critical trade-off analysis was conducted against the Voting (5 Models) ensemble. While the 5-model ensemble achieved a marginally higher accuracy (0.8432), it suffered a significant reduction in discriminatory power (AUROC 0.6110). Consequently, the Voting (3 Models) ensemble was selected as the final proposed model, as it demonstrated a more favorable balance, avoiding the dilution of discriminatory power observed in the larger ensemble while maintaining high classification accuracy.
In the second stage of Experiment 2, we conducted an in-depth comparison of four representative models using four distinct evaluation metrics: ROC curves, PR curves, calibration curves, and Brier scores. To ensure that this comparison was grounded in clinically interpretable performance, we prioritized models that maintained robust discrimination alongside high accuracy. Crucially, calibration analysis confirmed that the Voting (3 Models) ensemble provided the most accurate probabilistic risk estimates with a Brier score of 0.1305, surpassing the 5-model configuration (0.1338) and ensuring reliable risk stratification. These improvements in both discrimination and calibration reflect clinically meaningful gains in identifying high-risk patients. This performance advantage can be attributed to the synergistic effect of the ensemble components and the data preservation strategy: KNN imputation maintained the multivariate structure of the dataset, preventing the information loss associated with listwise deletion. By aggregating the stability of the Bagging Classifier with the gradient boosting capabilities of CatBoost and the sequential pattern recognition of LSTM, the KNN Voting (3 Models) ensemble effectively mitigated the overfitting and calibration issues observed in single high-accuracy models like AdaBoost (AUROC 0.5381). This approach ensured a more robust generalization capability for predicting readmission in complex clinical datasets.
Comparison with existing literature
The predictive performance of our proposed model—specifically the KNN-imputed Voting (3 Models) ensemble—demonstrates competitive predictive capability when evaluated against recent benchmarks. While maximizing accuracy was our primary objective, we prioritized AUROC as the deciding factor when accuracy differences were minimal. To provide a balanced comparison, we discuss our findings across three dimensions: comparisons within the MIMIC-III database, performance relative to other HF readmission studies, and contextualization within broader cardiovascular outcome prediction.
First, benchmarking against studies utilizing the same data source allows for a direct assessment of methodological efficacy. Dafrallah and Akhloufi recently proposed a readmission prediction approach using MIMIC-III discharge notes combined with NLP techniques. While their method leverages rich unstructured text, our study suggests that an ensemble using structured clinical variables can achieve performance within the range reported in the literature, with a favorable balance between precision and recall. Specifically, our model yielded an F1-score of 0.8195, exceeding the 0.733 reported in their NLP-enhanced study.32 This suggests that for HF cohorts in MIMIC-III, structured physiological and administrative data, when processed with effective imputation and ensemble techniques, remain highly effective for identifying at-risk patients.
Expanding the comparison to the broader domain of HF readmission reveals important trade-offs between discrimination (AUROC) and sensitivity (recall). Zhang et al. developed an XGBoost model for acute HF (AHF) patients using a localized dataset from China. While they reported a higher AUROC of 0.763 compared with our 0.6718, their model's recall was limited to 0.660.33 In a clinical context, low recall implies missing a substantial portion of high-risk patients. In contrast, our proposed framework achieved a recall of 0.8413 and an accuracy of 0.8413, indicating a stronger ability to capture actual readmission cases. Furthermore, by utilizing 20 universally available clinical variables rather than region-specific or complex unstructured data, our model offers higher potential for generalizability and easier integration into diverse clinical decision support systems (CDSS).
Similarly, Pikatza-Huerga et al. reported a voting ensemble trained on data from five hospitals, achieving an AUROC of 0.81 and an accuracy of 0.81.34 While their multicenter study exhibited superior discrimination, our Voting (3 Models) ensemble demonstrated a competitive edge in classification accuracy (0.8413 vs. 0.81) and maintained a high F1-score (0.8195). This suggests that our specific configuration of base learners (Bagging, CatBoost, LSTM) prioritizes the balance of precision and recall, which is often more actionable for resource allocation than AUROC alone. Additionally, our findings regarding the importance of discharge location align with recent work by Esteban-Fernández et al., who emphasized patient characteristics and discharge destination as critical determinants of readmission risk.35
Finally, to contextualize the inherent difficulty of predicting short-term adverse outcomes in critical care, we reviewed studies on related cardiovascular conditions. For instance, Chen et al., predicting MACE in STEMI patients, and Zhao et al., predicting in-hospital mortality, reported accuracies of approximately 0.63–0.78.36,37 Although these studies target different endpoints, they illustrate the challenging nature of modeling acute cardiovascular events. In this context, the stable accuracy of 0.8413 achieved by our proposed framework reflects robust performance.
Interpretability in a clinical context
We employed SHAP values to explain the predictive outputs and quantify the contribution of individual features. Glucose emerged as the leading predictor across both models, followed closely by hemodynamic and inflammatory indicators. A distinct pattern in feature contribution was observed: positive SHAP values were predominant among glucose, HR, platelets, and WBC, indicating that elevated levels in these variables were associated with an increased risk of readmission. Conversely, SBP, DBP, and OS exhibited a protective trend, where higher values contributed to negative SHAP values (reduced risk), while lower values in these vital signs were linked to higher readmission probabilities. These findings suggest that patients with hyperglycemia, inflammatory activation (elevated WBC/platelets), and hemodynamic instability (hypotension, tachycardia, hypoxia) are at the highest risk.
First, our SHAP analysis identified blood glucose as the predominant predictor of 30-day readmission, with elevated levels consistently associated with higher risk. This finding aligns with a growing body of evidence highlighting the deleterious impact of stress hyperglycemia in AHF patients. Even in nondiabetic patients, acute hyperglycemia often reflects a surge in cortisol and catecholamines due to physiological stress, which can exacerbate oxidative stress and endothelial dysfunction. Chun et al. demonstrated that in-hospital glycemic variability and hyperglycemia are strong independent predictors of all-cause mortality and readmission in AHF populations.38 Similarly, Carrera et al. emphasized that admission glucose levels serve as a critical biomarker for poor prognosis.39 Our model's heavy reliance on glucose underscores the importance of strict glycemic monitoring and management as a potential strategy to reduce readmission rates. Clinically, the prominence of glucose in the SHAP profile suggests that peridischarge glycemic instability may be a useful signal for identifying patients who warrant closer physiologic surveillance and more structured transitional-care planning after ICU discharge.38–40
Second, we observed that higher SBP/DBP exhibited a protective trend (negative SHAP values), whereas lower blood pressure was linked to increased risk. This relationship supports the “reverse epidemiology” often observed in HF cohorts, where lower systemic blood pressure suggests severe pump failure, low cardiac output, and an inability to tolerate guideline-directed medical therapy. As noted by Huang et al., admission SBP is inversely associated with 1-year clinical outcomes, with hypotension signaling a significantly worse prognosis due to tissue hypoperfusion.41 Conversely, elevated HR contributed to positive SHAP values, indicating increased risk.
Third, our results showed that elevated WBC and platelet counts were associated with higher readmission probabilities, likely reflecting a state of systemic inflammation and hypercoagulability. Recent studies, such as that by Zhang et al., have corroborated that higher inflammatory indices are significantly correlated with HF severity and adverse events.42 The concurrent identification of low OS as a risk factor further points to residual pulmonary congestion or respiratory compromise. Collectively, these SHAP values suggest that our ensemble model effectively captures a high-risk phenotype characterized by the triad of metabolic derangement (hyperglycemia), hemodynamic instability (hypotension, tachycardia), and active inflammation. From a clinical standpoint, this pattern is compatible with a subset of patients leaving the ICU with unresolved inflammatory and respiratory burden, for whom intensified post-ICU monitoring and early follow-up may be particularly relevant within transitional-care pathways.40,42
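For intuition about how these signed contributions arise, SHAP has an exact closed form for a linear model: the contribution of feature j for a given patient is w_j * (x_j - E[x_j]). The sketch below uses a hypothetical cohort and coefficients chosen only to mirror the risk-raising/protective split reported here; it is not our fitted ensemble.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = ["glucose", "hr", "wbc", "sbp", "dbp", "os"]

# Hypothetical standardized cohort: risk rises with glucose/HR/WBC and
# falls with SBP/DBP/OS, mirroring the SHAP directions described above.
X = rng.normal(size=(1000, 6))
logit = (1.2 * X[:, 0] + 0.8 * X[:, 1] + 0.6 * X[:, 2]
         - 0.9 * X[:, 3] - 0.5 * X[:, 4] - 0.7 * X[:, 5])
y = (logit + rng.normal(size=1000) > 0).astype(int)

model = LogisticRegression(max_iter=2000).fit(X, y)

# Exact SHAP values for a linear model with independent features, on the
# log-odds scale: w_j * (x_j - E[x_j]).
shap_values = model.coef_[0] * (X - X.mean(axis=0))

# A patient with high glucose and low SBP receives positive (risk-raising)
# contributions from both features: the glucose excess and the SBP deficit.
patient = np.array([2.0, 0.0, 0.0, -2.0, 0.0, 0.0])
patient_shap = model.coef_[0] * (patient - X.mean(axis=0))
print(dict(zip(features, patient_shap.round(3))))
```

The same additive logic underlies the SHAP values of the ensemble, except that the tree and network contributions must be estimated rather than read off a coefficient vector.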
Clinical perspective and practical impact
In large cohort studies, 30-day readmission rates after an HF hospitalization typically range from approximately 13% to 25% and can be even higher in certain Medicare populations,43 imposing a substantial burden on mortality, quality of life, and healthcare expenditures. Existing systematic reviews have shown that most readmission prediction models achieve a c-statistic in the range of 0.55–0.75, indicating only moderate discriminative performance.6 Consequently, such models are generally used in clinical practice as supportive risk stratification tools rather than as stand-alone determinants of individual patient management decisions.6
In this study, the best-performing configuration was the KNN-imputed Voting (3 Models) ensemble. Its AUROC fell within the moderate performance range commonly reported for readmission models.6 Importantly, using only core structured variables collected during the ICU stay, without incorporating free-text notes or specialized scales, the model achieved moderate discrimination and acceptable calibration. This suggests that such a data configuration has nontrivial potential for HF ICU 30-day readmission risk stratification. Within this context, our HF ICU readmission model can be positioned as an EHR-based risk stratification tool that combines moderate discriminatory ability with reasonable calibration. The primary goal is to supply additional risk information prior to ICU or hospital discharge, helping clinicians identify relatively high-risk patients and informing decisions about the intensity of transitional care.
Clinically, if the 30-day readmission rate after HF hospitalization is around 20%, approximately one in five discharged patients will be readmitted in the short term.43 At the same time, transitional care interventions for high-risk HF patients, such as follow-up by specialized HF nurses, structured telephone calls, home visits, and dedicated HF clinics, are resource-intensive and cannot be universally provided to all patients. Prior randomized trials and systematic reviews indicate that sustained transitional care interventions targeted to selected high-risk HF populations may reduce readmissions and mortality.40,44 Under these circumstances, even a model with only moderate discrimination may, provided that recall, F1-score, and probability calibration remain reasonably stable, theoretically assist clinical teams in allocating limited transitional care resources more efficiently. For example: (1) using predicted risk scores as an adjunctive input to prioritize early outpatient visits or telehealth follow-up for patients classified as higher risk; (2) using relatively favorable Brier scores, as an indicator of more reliable probability estimates, to inform threshold setting for different levels of intervention intensity; and (3) recognizing that these potential benefits remain theoretical at this stage and will require confirmation in future interventional studies.
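The threshold-based prioritization in point (1) can be made concrete with a net-benefit calculation of the kind used in decision-curve analysis, where net benefit = TP/n - FP/n * t/(1 - t) at risk threshold t. The cohort and risk scores below are simulated purely for illustration.

```python
import numpy as np

def net_benefit(y_true, risk, threshold):
    """Net benefit of intervening on all patients with risk >= threshold."""
    n = len(y_true)
    treat = risk >= threshold
    tp = np.sum(treat & (y_true == 1))  # flagged patients who are readmitted
    fp = np.sum(treat & (y_true == 0))  # flagged patients who are not
    return tp / n - fp / n * threshold / (1 - threshold)

rng = np.random.default_rng(7)
# Hypothetical cohort with ~20% readmission rate and a weakly informative score.
y = (rng.random(2000) < 0.2).astype(int)
risk = np.clip(0.2 + 0.15 * (y - 0.2) + rng.normal(0, 0.08, 2000), 0.01, 0.99)

# Net benefit at a few candidate intervention thresholds.
for t in (0.15, 0.20, 0.30):
    print(f"threshold={t:.2f}  net benefit={net_benefit(y, risk, t):+.4f}")
```

A model is only worth acting on at thresholds where its net benefit exceeds both "intervene for everyone" and "intervene for no one," which is the comparison a decision-curve analysis formalizes.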
However, because the present model was developed using a single-center ICU cohort from MIMIC-III, its transportability cannot be assumed and requires explicit external validation. Single-center critical care datasets may encode site-specific case mix, ICU admission/discharge thresholds, local practice patterns (e.g., laboratory ordering frequency, hemodynamic monitoring intensity), and transitional-care capacity, all of which can shift both feature distributions and baseline 30-day readmission risk. In addition, MIMIC-III represents one institution (Beth Israel Deaconess Medical Center) and a specific historical period (2001–2012),15 and temporal changes in HF management and discharge pathways may lead to performance drift when applied to contemporary cohorts. Accordingly, the model should be viewed as a promising structured-data prototype whose generalizability must be demonstrated through staged external validation rather than inferred from internal testing alone.
To address this limitation and align with established reporting and validation guidance,45 we propose the following external validation roadmap: (i) temporal validation within the same health system using a more recent-era dataset to quantify time-related drift in discrimination and calibration;46 (ii) geographic validation in independent multicenter ICU datasets (e.g., eICU-CRD) to test transportability across hospitals with different workflows, measurement frequency, and documentation patterns;47 and (iii) local-site validation in the intended deployment environment, ideally as a prospective “silent-mode” evaluation where predictions are generated but not shown to clinicians until calibration performance and alert burden are acceptable. For each external validation, evaluation should extend beyond AUROC to include calibration-in-the-large, calibration slope, Brier score, and clinically oriented utility analyses to assess net benefit across plausible risk thresholds.48 In addition, external validation should examine performance stability across clinically relevant subgroups and under differing missingness patterns, given that imputation behavior may vary across sites and time periods. If miscalibration is observed, pragmatic options include recalibration (intercept/slope updating) and, if necessary, limited model updating using local data while preserving the original feature set to maintain implementability. Beyond external validation, clinical impact evaluation (e.g., pragmatic trials or stepped-wedge implementation studies) would be required to determine whether risk-guided transitional care allocation improves outcomes without undue alert burden.
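The intercept/slope updating mentioned above can be sketched as a logistic regression of the observed outcome on the logit of the original model's predictions; the "new site" below is simulated, with a hypothetical model that is systematically overconfident.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(1)

# Simulated external cohort: true risks, plus model predictions whose
# logits are stretched (slope 2) and shifted (+0.5) relative to the truth.
true_logit = rng.normal(0.0, 1.0, size=5000) - 1.5
y = (rng.random(5000) < 1 / (1 + np.exp(-true_logit))).astype(int)
model_logit = 2.0 * true_logit + 0.5
p_old = 1 / (1 + np.exp(-model_logit))

# Logistic recalibration: refit only an intercept and slope on the
# original model's logit, leaving the underlying model untouched.
recal = LogisticRegression().fit(model_logit.reshape(-1, 1), y)
slope = recal.coef_[0, 0]  # a well-calibrated model would give ~1.0
p_new = recal.predict_proba(model_logit.reshape(-1, 1))[:, 1]

print(f"calibration slope={slope:.3f}  "
      f"Brier before={brier_score_loss(y, p_old):.4f}  "
      f"after={brier_score_loss(y, p_new):.4f}")
```

A fitted slope well below 1 flags overconfident predictions; updating only the intercept and slope corrects the probability scale without refitting the feature weights, preserving implementability at the new site.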
From a practical standpoint, in terms of system-level usability and scalability, our model primarily relies on structured fields commonly available in most inpatient EHR systems—such as vital sign flowsheets, laboratory codes, and basic administrative variables—rather than on hospital-specific free-text fields or bespoke assessment scales.16,49,50 In practice, this design may allow technical efforts during implementation to focus mainly on mapping data fields and configuring extraction pipelines, without requiring the creation of extensive new paper forms or manual data entry workflows. However, whether the model can in fact be ported smoothly across hospitals and EHR vendors remains an empirical question that depends on real-world deployment and local evaluation.
Furthermore, from the perspective of clinical implementation, the proposed model can be considered within conceptual use cases that may facilitate staged introduction into practice. Drawing on existing experience with EHR-based early-warning and readmission-risk tools,40 we infer that a similar “phased implementation” approach could potentially be applied to HF ICU readmission models. For example, a hospital might consider the following stepwise strategy:
★ Silent phase: The model runs in the background and periodically computes 30-day readmission risk for HF ICU patients approaching discharge, but the outputs are not shown to clinicians. The primary purpose is to evaluate local data quality, calibration, and outcome rates across risk strata.
★ Limited-exposure phase: Risk scores and basic explanatory information are selectively presented during discharge planning meetings or multidisciplinary case conferences, serving as an adjunctive reference while feedback is collected from clinicians regarding usability and acceptable alert frequency.
★ Broader deployment phase: If the preceding phases are satisfactory, the hospital may consider highlighting a subset of higher-risk patients on ward dashboards or discharge lists, linking these flags to preexisting high-risk care pathways or transitional care protocols.
On this basis, one may further hypothesize a clinical usage scenario in which, as an HF ICU patient approaches discharge, the system automatically retrieves recent vital signs and laboratory results (processed with KNN imputation), computes the 30-day readmission risk, and flags patients whose risk exceeds a predefined threshold on a ward dashboard. When a patient is labeled “high-risk,” the care team can treat this as an auxiliary signal, integrating it with clinical judgment to decide whether to initiate a more intensive high-risk discharge pathway, such as prioritizing early in-person or telehealth follow-up; arranging enhanced medication review and self-management education by a pharmacist or HF nurse; or, where appropriate, referring the patient to a specialized HF program or home-based care service.40,44
If the risk score is accompanied by SHAP-based visual explanations that highlight the most influential features for the current prediction (such as elevated glucose, low SBP or DBP, increased inflammatory markers, or reduced OS), this may help clinicians understand the dominant risk signals identified by the model and incorporate that information into their overall assessment.42
Conclusions
This study makes several contributions to the prediction of 30-day readmission among HF patients discharged from the ICU, particularly at the levels of methodological design and model development, with a focus on data utilization strategies, missing-data handling and ensemble evaluation, and model interpretability. At the same time, several limitations warrant careful consideration.
First, with respect to data and missing-data strategies, the models were built on routinely collected structured EHR data from the ICU stay, including demographic information, vital signs, and laboratory tests. This feature space is broadly consistent with that used in prior HF readmission and prognosis models.6,49,50 On this basis, we systematically compared two missing-data handling approaches: KNN imputation and listwise deletion. The AdaBoost model trained on the listwise-deleted data achieved a relatively higher test accuracy, but its AUROC was only 0.5381, indicating limited discriminatory ability. In contrast, under KNN imputation, the Bagging classifier achieved a slightly lower accuracy yet reached an AUROC of 0.6166, providing a more balanced trade-off between sensitivity and specificity.
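A minimal sketch of the practical difference between the two strategies, using scikit-learn's KNNImputer on a synthetic matrix with missingness injected at random (the rates and dimensions are hypothetical, not those of our cohort):

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 6))

# Knock out ~15% of entries at random to mimic EHR missingness.
mask = rng.random(X.shape) < 0.15
X_missing = X.copy()
X_missing[mask] = np.nan

# Listwise deletion: keep only fully observed rows.
complete_rows = ~np.isnan(X_missing).any(axis=1)
X_deleted = X_missing[complete_rows]

# KNN imputation: fill each gap from the nearest neighbors on the observed
# features, preserving the full sample size and multivariate structure.
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X_missing)

print(f"rows kept by listwise deletion: {X_deleted.shape[0]} / 500")
print(f"rows kept by KNN imputation:   {X_imputed.shape[0]} / 500")
```

At a 15% per-entry missingness rate, only about 0.85^6 ≈ 38% of rows survive listwise deletion across six features, which illustrates the information loss that the imputation strategy avoids.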
Subsequent ensemble modeling identified the KNN-imputed Voting (3 Models) ensemble as the optimal predictive framework. While the Voting (5 Models) ensemble achieved a marginally higher accuracy (0.8432), the Voting (3 Models) ensemble yielded a markedly higher AUROC (0.6718) while maintaining robust accuracy (0.8413). Taken together, these findings suggest that in this dataset, a KNN-based strategy that preserves incomplete cases, combined with a selective voting ensemble (Bagging, CatBoost, LSTM), is more conducive to a model that simultaneously maintains discrimination and probability calibration than strategies that simply discard incomplete records or indiscriminately aggregate weaker learners.
Second, in terms of model interpretability, this study illustrates how SHAP values can be used within a multimodel soft-voting ensemble framework to quantify the relative contribution of individual features to predicted risk. In our dataset, features such as blood glucose, blood pressure (SBP/DBP), HR, WBC, platelet count, and OS exhibited particularly prominent effects in the SHAP distributions. Higher glucose and inflammation-related indices were generally associated with increased predicted risk, consistent in direction with previous work showing that stress hyperglycemia and inflammatory states in AHF are linked to adverse outcomes.38,39,42 Lower blood pressure and OS tended to be associated with higher risk, aligning with the clinical observation that hemodynamic instability and impaired oxygenation are common in more severely ill patients. Although these SHAP-based findings are derived from a single database and a specific model architecture and are therefore insufficient for causal inference, they indicate that, in the present context, interpretable ML methods can help delineate a set of high-risk feature patterns that accord with clinical intuition and are broadly consistent with other SHAP-based HF risk modeling studies.33,42
These findings underscore the potential of interpretable AI models as integral components of CDSS. By integrating this model into EHR systems, real-time risk predictions could support proactive patient management. The transparency afforded by SHAP enhances clinical trust and allows for evidence-based, personalized interventions.
With respect to limitations, despite the use of a rigorous train–validation–test procedure within a single database and evaluation of model performance using multiple metrics (ROC/PR curves, calibration curves, and Brier scores), several caveats remain. First, the model was developed in a single dataset and a specific HF ICU population; both the feature space and care processes may differ across institutions or regions. The apparent advantages of KNN imputation and Voting ensembles observed in this study may not generalize directly to other settings. Future work should therefore conduct external and prospective validation across different hospitals, HF ICU populations, and EHR infrastructures6,16 to assess discrimination and calibration in new contexts.
Second, the discussion of system implementation and clinical application in this study remains largely conceptual. For example, the descriptions of “phased implementation” and a “high-risk discharge pathway” are proposed as plausible use cases informed by prior literature on EHR-based risk models and transitional care,42,44 rather than processes that have been deployed and evaluated in a real-world EHR environment. We have not yet examined how such implementation strategies would affect readmission rates, clinical workload, or interprofessional collaboration. Consequently, these designs should currently be regarded as conceptual blueprints that require refinement and confirmation through local pilot testing, user-centered studies, and prospective intervention research.
Finally, although the SHAP analysis provides feature-level information on relative importance, the results are derived from observational data and reflect associations rather than causal relationships. To more deeply investigate the causal roles of specific variables (e.g., changes in blood glucose or inflammatory markers) in HF readmission risk, future work will need to incorporate clinical trials or carefully designed causal inference methodologies. In addition, subsequent studies may build on the present model framework to explore more parsimonious feature subsets or regularization strategies, thereby facilitating easier implementation and maintenance of the model across diverse clinical settings.
Footnotes
Acknowledgments
The authors would like to express their gratitude to the National Science and Technology Council, Taiwan, for its support, and to National Taipei University of Technology for providing administrative and institutional assistance.
Ethics approval and consent to participate
This study used the de-identified MIMIC-III database, which was approved by the Institutional Review Boards of MIT and Beth Israel Deaconess Medical Center. As the data are fully de-identified, additional ethics approval and informed consent were not required.
Contributorship
C-MC conceived the study design, collected and organized the data, and drafted the initial manuscript. C-SW supervised the research process and provided critical revisions. H-YC performed data analysis and contributed to manuscript revision. B-YL developed the methodological framework and contributed to manuscript writing and revision. T-NC provided data and assisted with data collection and organization. All authors contributed to the refinement of the manuscript, and all authors read and approved the final version.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by the National Science and Technology Council (NSTC), Taiwan, through Grant Number NSTC 113-2410-H-027-014.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Availability of data and materials
Declaration of AI and AI-assisted technologies in the writing process
AI-assisted language tools were used to improve readability and language presentation (e.g., spelling and grammar). The authors reviewed and edited the manuscript as needed and take full responsibility for the integrity, accuracy, and final content of the publication.
Appendix
Experimental environment and software versions
scikit-learn 1.5.1
imbalanced-learn 0.12
Hyperparameter configurations for all models (KNN Imputation)
Hyperparameter configurations for all models (Listwise Deletion)
