Abstract
Objective:
This study compared the performance of different machine learning models in predicting in-hospital mortality in patients hospitalized for heart failure.
Methods:
Demographic and clinical data of 229 patients were retrospectively reviewed for this study. The variables included age, sex, hypertension, diabetes, coronary artery disease, peripheral artery disease, chronic kidney disease, Charlson comorbidity index, and length of hospital stay. In-hospital mortality (“in-hospital death”) was defined as the dependent variable. The machine learning methods used included logistic regression, random forest, gradient boosting, and multilayer perceptron models. The model performance was evaluated using the receiver operating characteristic area under the curve, precision–recall area under the curve, F1 score, and Brier score.
Results:
In-hospital mortality was observed in 7 of the 229 patients included in the study (3.1%). The mean age and frequency of chronic kidney disease were higher among deceased patients, whereas the Charlson comorbidity index did not differ significantly between survivors and non-survivors. In model comparisons, the Gradient Boosting model showed the best overall performance (ROC-AUC: 0.87, PR-AUC: 0.76, F1: 0.73; Brier score: 0.12).
Conclusion:
The Gradient Boosting model demonstrated the highest performance in predicting in-hospital mortality in patients with heart failure. Machine learning algorithms offer strong potential beyond traditional statistical approaches for the prognostic prediction of complex clinical scenarios such as heart failure.
Introduction
Decompensated heart failure (DHF) is a major cause of cardiovascular mortality and morbidity. Accurate risk stratification is of great clinical significance because of the progressive course of the disease, frequent hospitalizations, and high early mortality rates.1,2
The Charlson Comorbidity Index (CCI) is a clinical score that has long been used to assess comorbidity burden and is easy to apply.3 The CCI is a valid tool for predicting mortality and readmission risks in different patient groups. However, in heterogeneous patient populations, such as those with heart failure, the prognostic power of the CCI may be limited.4,5
In recent years, machine learning (ML)-based approaches have shown the potential to uncover complex relationships in clinical data that go beyond traditional statistical methods. These methods have proven especially effective in improving the accuracy of mortality prediction using tabular clinical data.6,7
This study aimed to evaluate the prediction of hospital inpatient mortality based on the CCI in patients with DHF using different ML algorithms. This study hypothesizes that while the CCI alone may be insufficient, a clinically meaningful improvement in short-term mortality prediction can be achieved when supported by ML models.
Materials and methods
Study design and population
This single-center retrospective observational study included consecutive patients hospitalized with a diagnosis of DHF between January 2020 and December 2023. Patients were reviewed using electronic health records (EHRs). A total of 229 patients aged > 18 years with a confirmed diagnosis of DHF were included in the study. Patients with acute coronary syndrome, advanced malignancy, active infection, or missing data were excluded from the study.
Data collection and variables
Demographic, clinical, and laboratory data were obtained from hospital information systems. In-hospital mortality was defined as death before discharge and coded as a binary outcome (1 = death [event occurred]; 0 = discharged alive [no event]). The CCI was calculated using variables for comorbid conditions, including myocardial infarction (MI), congestive heart failure, peripheral arterial disease (PAD), and diabetes mellitus (DM).2,3 According to the classification suggested in the literature, CCI scores were categorized as 0–2 (low risk), 3–4 (moderate risk), and ⩾5 (high risk).4 Because all cases in the study population fell within the low-risk range (0–2), the analyses were reported with reference to this classification.
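As a minimal sketch, the risk categorization described above could be implemented as follows (the function name is illustrative and not taken from the study's code):

```python
def categorize_cci(score: int) -> str:
    """Map a Charlson Comorbidity Index score to the risk strata
    cited in the literature: 0-2 low, 3-4 moderate, >=5 high."""
    if score < 0:
        raise ValueError("CCI scores are non-negative")
    if score <= 2:
        return "low"
    if score <= 4:
        return "moderate"
    return "high"
```

In this cohort every patient scored 0–2, so all records fall into the "low" stratum.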
Data preprocessing
Missing data were assessed at the variable level. Records with missing values in any of the predefined predictor variables or the outcome were excluded using a complete-case approach to avoid introducing imputation-related bias in a dataset with a very small number of outcome events. Outliers were screened both clinically and statistically, and implausible values attributable to recording errors were excluded from the analysis.
Owing to the marked class imbalance (in-hospital mortality rate: 3.1%), the Synthetic Minority Over-sampling Technique (SMOTE) was applied exclusively to the training data within the cross-validation folds to increase the representation of the positive class while preventing data leakage.8,9 Continuous variables were evaluated for normality, and logarithmic transformation was applied when appropriate.
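The fold-wise oversampling strategy can be illustrated with a small sketch. The study cites SMOTE (in practice typically the imbalanced-learn implementation); the simplified `smote_oversample` helper below reimplements only its core idea, interpolating between a minority sample and one of its nearest minority neighbours, and the driver applies it inside each training fold only, so no synthetic sample can leak into the held-out fold. All helper names and parameters are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def smote_oversample(X, y, minority_label=1, n_new=10, k=3, rng=None):
    """Minimal SMOTE sketch: create synthetic minority samples by linear
    interpolation between a minority sample and one of its k nearest
    minority-class neighbours."""
    rng = np.random.default_rng(rng)
    minority = X[y == minority_label]
    new_rows = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # distances from sample i to every other minority sample
        d = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbours)
        lam = rng.random()                    # interpolation weight in [0, 1)
        new_rows.append(minority[i] + lam * (minority[j] - minority[i]))
    X_aug = np.vstack([X, new_rows])
    y_aug = np.concatenate([y, np.full(n_new, minority_label)])
    return X_aug, y_aug

def cv_folds_with_smote(X, y, n_splits=5, seed=42):
    """Yield (train, test) fold pairs where oversampling touches the
    training portion only, preventing data leakage into evaluation."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        X_tr, y_tr = smote_oversample(X[train_idx], y[train_idx], rng=seed)
        yield (X_tr, y_tr), (X[test_idx], y[test_idx])
```

Because each held-out fold contains only original records, the reported metrics reflect performance on real, not synthetic, cases.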
Modeling approach
Logistic Regression (LR) was used as a baseline linear classifier because it is widely used in clinical risk modeling and provides interpretable coefficients. In addition, we evaluated two tree-based ensemble methods—Random Forest (RF) and Gradient Boosting (GB)—because they can capture non-linear relationships and interactions without requiring explicit feature engineering, and they often perform robustly on tabular clinical datasets. We also evaluated a Multilayer Perceptron (MLP) to explore a neural-network approach that can model complex decision boundaries.
Predictor variables included age, sex, hypertension, DM, MI, peripheral artery disease, and the CCI. Continuous variables were assessed for distributional properties; log transformation was applied when needed. For LR and MLP, features were standardized within the training data to avoid scale-related optimization issues; tree-based models were trained on the original scale.
Given the rare-event outcome, SMOTE was applied only to the training folds/subsets (never to the test set) to reduce class imbalance while minimizing data leakage.8,9 Hyperparameters for each model were optimized using GridSearchCV within cross-validation. The final performance metrics were computed on the held-out test set and summarized across cross-validation.
Data segregation and validation
The dataset was randomly split into training (80%) and testing (20%) subsets using a stratified strategy to preserve the mortality proportion. To improve generalizability and reduce sampling variability, we applied 5-fold stratified cross-validation within the training set. All analyses were performed in Python (version 3.9; Python Software Foundation, Wilmington, DE, USA) using the scikit-learn library. Hyperparameter tuning was conducted via GridSearchCV within cross-validation.
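The split-and-tune workflow described above can be sketched as follows, using synthetic stand-in data in place of the patient records (the parameter grid and all values are illustrative, not the study's actual search space):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     train_test_split)

# Illustrative stand-in data with an imbalanced outcome; the real study
# used 229 patient records with a 3.1% event rate.
X, y = make_classification(n_samples=229, n_features=7, weights=[0.9],
                           random_state=42)

# 80/20 stratified split preserves the outcome proportion in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 5-fold stratified cross-validation drives the grid search
# inside the training set only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid={"n_estimators": [100, 200], "max_depth": [2, 3]},
    scoring="roc_auc",
    cv=cv,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_)
```

The held-out 20% (`X_te`, `y_te`) is never seen during tuning and is used only for the final performance estimates.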
Performance metrics
The model performance was evaluated using the receiver operating characteristic area under the curve (ROC-AUC), precision–recall area under the curve (PR-AUC), F1 score, Brier score, and Youden J index.10–13 For each model, ROC-AUC values were calculated along with 95% confidence intervals (CI). Cutoff points were determined based on the Youden J index, which provides an optimal balance between sensitivity and specificity.13
The ROC-AUC represents overall discriminative ability, the PR-AUC reflects sensitivity to the positive class in imbalanced datasets, the F1 score captures the balance between precision and recall, the Brier score measures calibration accuracy, and the Youden J index identifies the decision threshold that best balances sensitivity and specificity.
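These metrics can all be computed from a model's predicted probabilities, as in the sketch below (the `evaluate` helper is illustrative; the Youden J threshold is the point on the ROC curve that maximizes tpr − fpr):

```python
import numpy as np
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             f1_score, roc_auc_score, roc_curve)

def evaluate(y_true, y_prob):
    """Compute the study's metric set from predicted probabilities.
    The Youden J statistic is sensitivity + specificity - 1, i.e.
    tpr - fpr along the ROC curve; its maximum picks the cutoff."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    j = tpr - fpr
    cutoff = thresholds[np.argmax(j)]
    y_pred = (y_prob >= cutoff).astype(int)
    return {
        "roc_auc": roc_auc_score(y_true, y_prob),
        "pr_auc": average_precision_score(y_true, y_prob),
        "f1": f1_score(y_true, y_pred),
        "brier": brier_score_loss(y_true, y_prob),
        "youden_j": j.max(),
        "cutoff": cutoff,
    }
```

Note that ROC-AUC, PR-AUC, and the Brier score operate on raw probabilities, whereas the F1 score requires binarized predictions at the chosen cutoff.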
Results
In-hospital mortality occurred in 7 (3.1%) of the 229 patients included in the study. The mean age of the patients was 71.3 ± 6.3 years, and 62.9% were male. Hypertension (74.7%) and diabetes (42.8%) were the most common comorbidities. Chronic kidney disease (CKD) and coronary artery disease (CAD) were present in 32.3% and 52.4% of patients, respectively. No cases of peripheral artery disease were observed. The mean CCI was 1.0 ± 0.8, and the median length of hospital stay was 4 [1–7] days.
Although the mean age (74.0 ± 4.2 years), CKD rate (71.4%), and CAD frequency (85.7%) were higher in patients who developed in-hospital mortality, only the presence of CKD showed a statistically significant difference between survivors and non-survivors.
The performance comparisons of the ML models are presented in Table 1. The ROC-AUC values of the models were as follows:
GB: 0.87 (95% CI: 0.81–0.93).
RF: 0.84 (0.77–0.90).
MLP: 0.81 (0.74–0.88).
LR: 0.78 (0.70–0.86).
Optimal cutoffs were determined using Youden’s J index. Performance metrics of Logistic Regression, Random Forest, and Gradient Boosting models developed to predict in-hospital mortality among patients with decompensated heart failure. All models included age, sex, hypertension, diabetes mellitus, myocardial infarction, peripheral artery disease, and Charlson Comorbidity Index (CCI) as predictors.
Distribution of Charlson Comorbidity Index (CCI) categories and corresponding in-hospital mortality rates.
CCI was classified as low (0–2), moderate (3–4), and high (⩾5) according to standard definitions.
Mortality rates increased progressively across higher CCI strata.
Model performances were evaluated using 5-fold stratified cross-validation.
AUC: area under the curve; PR-AUC: precision–recall area under the curve.
Similarly, the highest PR-AUC value was found in the GB model (0.76; 95% CI: 0.69–0.82). The GB model also had the highest F1 score (0.73) and the lowest Brier score (0.12). While the RF model showed similar accuracy values (ROC-AUC: 0.84, F1: 0.68, Brier: 0.14), the performance of the LR and MLP models was relatively lower.
In the additional analysis in Table 1, the RF (0.52) and GB (0.49) models had the highest discriminative power in terms of the Youden J index, with optimal cutoff values of 0.41 (RF) and 0.43 (GB).
Figure 1 shows the ROC curves of the models, and Figure 2 shows the precision–recall curve. Figure 3 presents a summary of the model performance, where the ROC-AUC, PR-AUC, and F1 scores are presented. As shown in Figure 3, the GB model demonstrates superior performance compared to the other models for all metrics (ROC-AUC = 0.81; 95% CI: 0.70–0.90; PR-AUC = 0.31; 95% CI: 0.24–0.42).

Receiver operating characteristic (ROC) curves of Logistic Regression, Random Forest, and Gradient Boosting models for predicting in-hospital mortality using age, sex, hypertension, diabetes mellitus, myocardial infarction, peripheral artery disease, and (if available) Charlson Comorbidity Index (CCI).

Precision–recall (PR) curves of the same models constructed using age, sex, hypertension, diabetes mellitus, myocardial infarction, peripheral artery disease, and (if available) CCI.

Model performance comparison across ROC-AUC, PR-AUC, and F1 metrics. The bar chart illustrates the comparative performance of four machine learning models—Logistic Regression, Random Forest, Gradient Boosting, and Multilayer Perceptron (MLP)—in predicting in-hospital mortality among patients with decompensated heart failure. Gradient Boosting demonstrated the highest ROC-AUC (0.87), PR-AUC (0.76), and F1 score (0.73), indicating the best overall predictive accuracy and calibration among all models.
The RF model demonstrated similar performance (ROC-AUC = 0.82; 95% CI: 0.71–0.91); however, its sensitivity for the positive class was lower than that of the GB model. LR, used as the baseline model, achieved an ROC-AUC of 0.68 (95% CI: 0.55–0.80), while the MLP reached an ROC-AUC of 0.81 (95% CI: 0.74–0.88).
The Brier scores of all models ranged from 0.12 to 0.15, indicating a satisfactory level of calibration. The model with the highest F1 score was the GB model (0.73). After applying the SMOTE method, a general increase in the PR-AUC and F1 values was observed, suggesting that resampling techniques can improve model performance in small-sample studies with imbalanced class structures.
Figure 4 illustrates the discriminative performance of the ML models using a forest plot based on the AUC values and their 95% confidence intervals. The clear leftward deviation of LR compared with ensemble-based methods highlights the superiority of GB and RF in mortality prediction.

Forest plot demonstrating the discriminative performance of four machine learning models—Gradient Boosting, Random Forest, Multilayer Perceptron (MLP), and Logistic Regression—for predicting in-hospital mortality, based on the area under the receiver operating characteristic curve (AUC). AUC estimates and 95% confidence intervals were obtained using non-parametric bootstrapping with 10,000 resamples and are displayed on a logit scale. Gradient Boosting showed the highest discriminative ability, followed by Random Forest and MLP, whereas Logistic Regression demonstrated the lowest predictive performance.
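The bootstrap procedure described in the caption can be sketched as follows (percentile bounds are shown here; the paper's figure additionally displays the estimates on a logit scale, and the helper name is illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_prob, n_boot=10_000, alpha=0.05, seed=42):
    """Non-parametric bootstrap CI for the ROC-AUC: resample cases with
    replacement, recompute the AUC on each resample, and take the
    percentile bounds of the resulting distribution."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    n = len(y_true)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, n)
        # a resample lacking one of the two classes has no defined AUC
        if y_true[idx].min() == y_true[idx].max():
            continue
        aucs.append(roc_auc_score(y_true[idx], y_prob[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_prob), (lo, hi)
```

With only 7 events in this cohort, many resamples contain very few positives, which is exactly why the resulting confidence intervals are wide.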
Discussion
This study evaluated the relationship between the CCI and in-hospital mortality in patients hospitalized for DHF and compared the predictive performance of different ML models. The findings confirmed a significant association between CCI and mortality, while also showing that including additional clinical variables, such as age, sex, hypertension, DM, CAD, and chronic kidney disease in the model, enhanced its predictive power. Notably, the GB and RF models achieved the highest levels of accuracy in terms of ROC-AUC and PR-AUC (Figures 1 and 2). These results suggest that ML approaches can significantly contribute to the integration of CCI into clinical decision-support systems (Table 2).14–16
Baseline demographic and clinical characteristics of the study population.*
Baseline demographic and clinical characteristics of patients hospitalized with decompensated heart failure. Continuous variables are presented as mean ± standard deviation (SD) or median (interquartile range), and categorical variables as number (percentage). Group comparisons were made according to in-hospital mortality status.
Data are presented as the mean ± standard deviation, median [interquartile range], or number (percentage).
The clinical significance of the CCI
The CCI is a scale that quantitatively assesses the burden of accompanying diseases, is widely used in clinical practice, and has proven reliable in predicting long-term mortality.17,18 In our study, the CCI values were significantly higher in patients with a fatal course, consistent with the literature. However, the ROC-AUC value for the model based solely on the CCI was 0.68, indicating that the index alone is limited in predicting short-term (in-hospital) mortality. Since the CCI primarily reflects the burden of chronic disease, the exclusion of acute hemodynamic or laboratory variables may explain this limitation.10,11
In recent years, modified or clinically enriched forms of CCI have been proposed to increase its predictive power for acute-period outcomes. In our study, higher accuracy was achieved with ML models (especially GB and RF), in which the classical index was supported by the clinical variables. This finding suggests that integrating comorbidity scores with artificial intelligence algorithms may go beyond the classical approaches.
The contribution of ML models
ML algorithms can surpass traditional statistical models because of their ability to capture nonlinear relationships and complex interactions between variables.13,19 In our study, the GB and RF models demonstrated higher discriminatory power than classical LR in both ROC-AUC (0.87 and 0.84) and PR-AUC (0.76 and 0.71). The tree-based structure of these models provides the advantage of learning complex interactions among variables in a multidimensional manner.
Furthermore, the calibration of the models was also satisfactory (Brier score = 0.12–0.14), indicating that the predicted mortality probabilities were consistent with observed rates. Owing to the imbalanced class distribution (mortality rate, 3.1%), the SMOTE method was applied within the training data, increasing the representation of the positive class.8,9 This technique enabled the models to achieve meaningful discriminatory power despite the low number of events in a small sample.
Although ROC-AUC values above 0.8 are statistically considered “good,” their clinical significance should be interpreted cautiously. Owing to the low event rate and single-center design, this performance may not be replicated at the same level in other studies. Nevertheless, the results indicate that ML models offer significant advantages over classical methods.
These findings were also visually reflected in the forest plot (Figure 4), where ensemble-based approaches—particularly GB—showed the most prominent shift toward higher AUC values. This supports the clinical usability of such models for early mortality-risk stratification.
Limitations of the model and future perspectives
The most significant limitation of this study was the low number of mortality events (n = 7; 3.1%).
The model is based solely on basic clinical variables; laboratory parameters [e.g., N-terminal pro–B-type natriuretic peptide (NT-proBNP), troponin, and creatinine] and imaging findings were not included. The integration of these parameters, especially in combination with advanced ML algorithms, could provide higher accuracy in mortality prediction.20,21 Multicenter, prospective studies with larger sample sizes will enhance the generalizability and clinical applicability of the developed models.22,23
In addition, no a priori sample size or power calculation was performed because this was a retrospective analysis of consecutive eligible patients within the study period. The small number of outcome events limits statistical precision and increases the risk of optimistic model performance estimates.
Clinical practice and outcome
Importantly, improved discrimination metrics (e.g., ROC-AUC, PR-AUC, or F1 score) do not automatically translate into improved patient outcomes. Any potential clinical benefit would require prospective implementation studies demonstrating that the model changes clinician behavior, improves processes of care, and ultimately improves meaningful outcomes without unintended harms.
In this context, a clinical decision support system should be interpreted as a bedside- or ward-level risk flag integrated into the EHR to support—rather than replace—clinical judgment. For example, identification of patients at higher predicted risk could prompt closer monitoring, earlier senior clinician review, optimization of guideline-directed medical therapy, evaluation for reversible precipitants, or consideration of higher-acuity placement when clinically appropriate. Importantly, such outputs should not be interpreted as automatic ICU triage decisions but as tools to inform timely and proportionate clinical responses. These workflows would require prospective definition and validation.
Early in-hospital mortality-risk stratification refers to identifying, at or soon after hospital admission, patients with a higher short-term risk of clinical deterioration or death using routinely available clinical variables. The practical value of such stratification lies in enabling targeted allocation of attention and resources, such as monitoring intensity, diagnostic evaluation, and readiness for escalation of care. However, actionable risk thresholds and downstream clinical responses must be clearly defined and prospectively evaluated to avoid alarm fatigue and unnecessary interventions.
This study suggests that classical risk scores may demonstrate improved short-term mortality discrimination when combined with machine-learning-based methods. In particular, GB and RF models showed relatively strong performance and may be suitable for integration into early warning or decision support frameworks. However, before any clinical implementation, these approaches require external validation and prospective evaluation to establish safety, effectiveness, and clinical impact.
Conclusion
In this single-center retrospective cohort with a low in-hospital mortality rate, GB and RF models showed higher discrimination than LR for predicting in-hospital mortality using routinely available clinical variables including CCI. Given the very small number of events, these findings should be interpreted as preliminary and require external validation and prospective evaluation before any clinical implementation.
Footnotes
Authors' note
The authors used AI for language editing and minor grammatical corrections during the preparation of this manuscript. The AI tool was not used for the study design, data collection, statistical analysis, interpretation of results, or generation of core scientific content. After using this tool for language refinement, the authors reviewed and edited the content as needed and took full responsibility for the content of the publication. All scientific content, data analysis, and conclusions represent the independent work and judgment of the authors. The research presented in this manuscript did not involve the development or utilization of any custom software applications or specific codes. Therefore, there is no code availability associated with this study. The analysis and findings relied on standard statistical methods and commercially available software tools (SPSS version 25.0).
Ethical considerations
This study was approved by the Ethics Committee of SBÜ Van Training and Research Hospital, Non-Interventional Clinical Research Ethics Committee (Decision No: B.08.6.YÖH.0.01.00.00/2025-03-04, Date: 28 March 2025).
Consent to participate
This retrospective study was conducted in accordance with the principles of the Declaration of Helsinki. The study protocol was approved by the Ethics Committee of SBÜ Van Training and Research Hospital, which granted a waiver of informed consent owing to the retrospective design and the use of de-identified patient data extracted from EHRs. All data were anonymized to protect patient confidentiality in accordance with institutional and applicable data-protection regulations, and no direct patient contact or intervention occurred as part of this research.
Consent for publication
As this was a retrospective study using de-identified patient data with an ethics committee-approved waiver of informed consent, no individual consent for publication was required. All data were anonymized to protect patient confidentiality, and no personally identifiable information is disclosed in this publication.
Author contributions
AFK, GA, VC, and ÖK conceived and designed this study. GA, VC, and ÖK performed data collection and extraction from the electronic health records, conducted statistical analyses, and drafted the initial manuscript. AFK, GA, VC, and ÖK critically revised the manuscript for important intellectual content. GA supervised the study and provided oversight. All authors contributed to the data interpretation, reviewed and approved the final manuscript, and agreed to be accountable for all aspects of the work.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
All relevant data supporting the findings of this study are available from the corresponding author upon reasonable request.
Code availability
The research presented in this manuscript does not involve the development or utilization of any custom software applications or specific code. Therefore, there is no code availability associated with this study. The analysis and findings rely on standard statistical methods and commercially available software tools. For any inquiries related to the methodology or data analysis, please contact the corresponding author.
