Abstract

Objective

To develop an interpretable stacking ensemble model for predicting in-hospital mortality in intensive care unit (ICU) patients with CKD and sepsis and to deploy it as a web-based tool for bedside clinical use.

Methods

Data were extracted from the MIMIC-IV 3.0 database and split into training and test sets at a 7:3 ratio. Feature selection was performed by combining the least absolute shrinkage and selection operator (LASSO) regression with the Boruta algorithm. Eight machine learning (ML) models were trained and optimized via ten-fold cross-validation and grid search. The two models with the highest area under the curve (AUC) in the training set were combined using a stacking ensemble strategy. SHapley Additive exPlanations (SHAP) were applied to improve interpretability. Model performance was compared with the SOFA score.

Results

A total of 5344 ICU patients with CKD and sepsis were included, with an in-hospital mortality rate of 19.1%. After feature selection, 16 variables were retained. In the training set, XGBoost and LightGBM performed best. The stacking model achieved an AUC of 0.757 on the test set, outperforming SOFA (AUC = 0.668). SHAP analysis identified age, Acute Physiology Score III, Simplified Acute Physiology Score II, and respiratory rate as the top predictors. The model was also deployed as a publicly accessible web application.

Conclusion

The stacking ensemble model demonstrated good discriminatory performance and interpretability for predicting in-hospital mortality in ICU patients with CKD and sepsis. Its web-based deployment provides a convenient platform for early risk assessment, although external validation is needed to confirm its broader applicability.

Keywords

Chronic kidney disease, sepsis, intensive care unit, mortality, machine learning, stacking ensemble model

Introduction

Chronic kidney disease (CKD) is an irreversible, progressive disorder of kidney function that has become a significant global public health challenge.^1,2 As the population ages and access to long-term renal replacement therapy (RRT) improves, the proportion of intensive care unit (ICU) patients with comorbid CKD continues to rise.^3,4 This population commonly exhibits chronic low-grade inflammation, immune dysfunction, and metabolic disturbances, rendering them highly vulnerable to acute decompensation during stressors such as infection, which can rapidly progress to sepsis and poor clinical outcomes.^5–7 Studies have shown that sepsis is the second leading cause of death among patients with CKD and end-stage renal disease, following cardiovascular disease.^4,8 Conversely, among critically ill patients with sepsis or septic shock, the prevalence of CKD reaches as high as 46%,⁹ and their 90-day mortality remains the highest among all chronic underlying conditions, even after adjustment for multiple confounders.¹⁰ Despite advances in clinical management, outcomes for patients with CKD and sepsis remain poor, underscoring the need for early and accurate risk stratification to improve prognosis.

Currently, the assessment of mortality risk in patients with CKD and sepsis primarily relies on traditional scoring systems such as the Sequential Organ Failure Assessment (SOFA)¹¹ and the Simplified Acute Physiology Score II (SAPS II).¹² Although these tools capture disease severity to some extent, they were not designed specifically for patients with CKD. Furthermore, these scoring systems were largely developed based on earlier patient cohorts, and as clinical practice environments and patient characteristics have evolved, their calibration and discrimination performance have declined.^13,14 Beyond scoring systems, some studies have identified prognostic variables through retrospective analyses. For instance, Chen et al.¹⁵ demonstrated an association between red blood cell transfusion and reduced 28-day mortality in patients with CKD and sepsis. However, analyses focusing on single variables fail to capture the complex, nonlinear interactions among clinical factors, limiting their predictive accuracy and clinical applicability.

In recent years, machine learning (ML) techniques have increasingly been applied in risk prediction for critically ill patients to overcome the limitations of traditional scoring methods.^13,16 Several studies have developed ML models to predict mortality risk in patients with sepsis.¹⁶ Some models have further targeted short-term prognosis in patients with sepsis complicated by specific organ injuries, such as liver injury or acute kidney injury.^17,18 However, risk prediction models tailored to patients with CKD and sepsis remain scarce, and most existing studies rely on single algorithms. Stacking, an ensemble learning strategy, enhances model robustness and generalizability by integrating multiple base models with complementary strengths and optimizing their outputs through a secondary learner. This approach is particularly suited for handling high-dimensional and complex clinical data.^19,20 Accordingly, this study aimed to develop a stacking ensemble model to predict in-hospital mortality in ICU patients with CKD and sepsis, enabling timely identification of high-risk individuals and optimizing clinical management.

Methods

Data source

This retrospective study used data from the Medical Information Mart for Intensive Care IV (MIMIC-IV, version 3.0) database.²¹ MIMIC-IV is maintained by Beth Israel Deaconess Medical Center, a large tertiary academic medical center in Boston, Massachusetts, USA, and contains detailed, de-identified clinical information for ICU admissions occurring between 2008 and 2022. All records are fully anonymized with no data that could directly or indirectly identify individual patients. The use of MIMIC-IV was approved by the institutional review boards of MIT and Beth Israel Deaconess Medical Center, with informed consent waived because the data are de-identified. One author (Jianjie Ju, ID: 13963218; Record ID: 66722132) completed the required training and was granted access to the database.

Study population and data extraction

CKD was identified through manual review of diagnostic records, with the corresponding ICD codes listed in Supplementary Table S1. Sepsis was defined according to the Sepsis-3.0 criteria, requiring a SOFA score ≥2 in the presence of confirmed or suspected infection.²² Patients were included if they met the following criteria: (1) first hospital admission and first ICU admission recorded in the MIMIC-IV database; and (2) confirmed diagnoses of both CKD and sepsis during the ICU stay. Patients were excluded if they met any of the following conditions: (1) age <18 years, or (2) an ICU length of stay <24 hours. Because the study aimed to maximize the number of eligible patients to support robust ML model development, no formal a priori sample size calculation was performed. After applying all predefined inclusion and exclusion criteria, the final analytic cohort consisted of 5344 patients with CKD and sepsis available in the MIMIC-IV database. The complete patient selection process is illustrated in Figure 1.

Figure 1.

Flowchart of data screening.

Data extraction and preprocessing were performed using Structured Query Language. Clinical variables were collected on the first ICU day, including demographics, vital signs, laboratory results, interventions, medications, and clinical scores, to comprehensively reflect the initial condition of the patients.

The primary outcome was in-hospital mortality, defined as death during the index hospitalization.

Data preprocessing and feature selection

To mitigate the impact of missing data on model development, variables with more than 20% missingness were excluded, and 55 features were retained for subsequent analysis. The remaining missing values (Supplementary Figure S1) were imputed using the missForest package to minimize bias and preserve sample size.²³ To evaluate whether the imputation process introduced distributional distortion, we compared the mean and standard deviation of key variables before and after imputation. The distributions remained highly consistent, with changes in most variables being less than 1%, indicating that missForest preserved the original data structure well and did not substantially alter variable characteristics (Supplementary Table S2). The data were then randomly split into training and test sets at a 7:3 ratio.
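The missingness filter and the random 7:3 split described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' R pipeline: the function names, the toy rows, and the seed are assumptions, and the missForest imputation step is not reproduced.

```python
import random

def drop_sparse_columns(rows, max_missing=0.20):
    """Keep columns whose fraction of missing (None) values is at most max_missing."""
    n = len(rows)
    keep = [c for c in rows[0].keys()
            if sum(r[c] is None for r in rows) / n <= max_missing]
    return [{c: r[c] for c in keep} for r in rows]

def split_train_test(rows, train_fraction=0.7, seed=42):
    """Shuffle row indices reproducibly, then cut them into training and test sets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test

# Toy data (hypothetical): "lactate" is missing in 3 of 10 rows (30%), so it is dropped.
rows = [{"age": 60 + i, "lactate": None if i < 3 else 2.0, "died": i % 2}
        for i in range(10)]
filtered = drop_sparse_columns(rows)
train, test = split_train_test(filtered)
```

With the cohort size reported in this study, `round(0.7 * 5344)` gives 3741, which matches the training set size stated in the Results.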

During feature selection, univariate analysis was first conducted on the training set to identify candidate variables associated with in-hospital mortality. To improve model parsimony, LASSO regression²⁴ and the Boruta algorithm²⁵ were subsequently applied. LASSO selects variables via L1 regularization, while Boruta assesses feature importance based on random forests to enhance model robustness. The final feature set was defined as the intersection of variables selected by both methods.
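Taking the intersection of the two selections is a simple set operation. The sketch below assumes each method has already produced a list of feature names; the names shown are hypothetical placeholders, not the study's actual selections.

```python
def intersect_selected(lasso_selected, boruta_confirmed):
    """Return features chosen by both methods, preserving the LASSO ordering."""
    confirmed = set(boruta_confirmed)
    return [f for f in lasso_selected if f in confirmed]

lasso_selected = ["age", "bun", "sofa", "sodium"]            # hypothetical
boruta_confirmed = ["sofa", "age", "bun", "weight", "inr"]   # hypothetical
final_features = intersect_selected(lasso_selected, boruta_confirmed)
```

Requiring agreement between a penalized linear method and a tree-based method is conservative: a variable is kept only if it matters under both modeling assumptions.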

Model development and evaluation

Based on the tidymodels framework,²⁶ this study trained eight ML models: random forest (RF), decision tree, Light Gradient Boosting Machine (LightGBM), logistic regression (LR), support vector machine (SVM), extreme gradient boosting (XGBoost), K-nearest neighbors (KNN), and naive Bayes. Given that some models (e.g., LR, SVM, and KNN) are sensitive to the scale of input variables, continuous variables were standardized to ensure the stability of model training. Ten-fold cross-validation combined with grid search was used for hyperparameter optimization. The parameter set achieving the best performance was selected as the final configuration for each model. Model performance was evaluated based on receiver operating characteristic (ROC) curves and corresponding AUC values, while calibration curves and decision curve analysis (DCA) were plotted to assess clinical utility. The SOFA score was included as a baseline comparator for evaluating the ML models.
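The tuning loop described above (ten-fold cross-validation over a hyperparameter grid, scored by AUC) can be sketched in plain Python. Everything here is an assumption made for illustration: the toy one-dimensional k-nearest-neighbors scorer stands in for the eight real models, the data are synthetic, and AUC is computed on pooled out-of-fold scores, whereas resampling frameworks typically average the metric across folds.

```python
import random

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def k_fold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_score(train_x, train_y, x, k):
    """Toy scorer: fraction of positives among the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def grid_search_cv(x, y, k_grid, folds=10):
    """Return (best_k, auc), scoring each candidate k on out-of-fold predictions."""
    fold_sets = k_fold_indices(len(x), folds)
    results = {}
    for k in k_grid:
        oof = [0.0] * len(x)
        for held_out in fold_sets:
            held = set(held_out)
            train = [i for i in range(len(x)) if i not in held]
            tx = [x[i] for i in train]
            ty = [y[i] for i in train]
            for i in held_out:
                oof[i] = knn_score(tx, ty, x[i], k)
        results[k] = roc_auc(y, oof)
    best_k = max(results, key=results.get)
    return best_k, results[best_k]

# Synthetic data (hypothetical): higher x means higher event probability.
rng = random.Random(1)
x = [rng.uniform(0, 1) for _ in range(200)]
y = [1 if rng.random() < xi else 0 for xi in x]
best_k, best_auc = grid_search_cv(x, y, k_grid=[1, 5, 15])
```

Each observation is scored only by a model that never saw it during fitting, which is what keeps the tuning estimate from being optimistically biased.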

Model performance was evaluated using accuracy, negative predictive value (NPV), specificity, AUC, sensitivity, positive predictive value (PPV), and F1 score. All metrics, except AUC, were calculated based on the optimal threshold determined by the Youden index. Although this threshold may not correspond to the optimal cutoff for clinical application, it remains statistically informative.²⁷ AUC differences between models were assessed using the DeLong test.
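The Youden index is J = sensitivity + specificity − 1, and the reported threshold is the cutoff that maximizes J. A minimal sketch, assuming binary labels (1 = death) and predicted probabilities; the function names and the four-patient toy data are illustrative only.

```python
def confusion_at(y_true, scores, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are classified as positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    return tp, fp, tn, fn

def youden_threshold(y_true, scores):
    """Return (threshold, J) for the lowest cutoff that maximizes Youden's J."""
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp, fp, tn, fn = confusion_at(y_true, scores, t)
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

def metrics_at(y_true, scores, threshold):
    """Threshold-dependent metrics; undefined ratios are returned as NaN."""
    tp, fp, tn, fn = confusion_at(y_true, scores, threshold)
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if tp + fp else nan,
        "npv": tn / (tn + fn) if tn + fn else nan,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else nan,
    }

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
threshold, j = youden_threshold(y_true, scores)
metrics = metrics_at(y_true, scores, threshold)
```

In this toy example two cutoffs tie at J = 0.5; the sketch keeps the lower one (0.35), which favors sensitivity. How ties are broken is a design choice that the source does not specify.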

Model stacking and deployment

Based on the AUC of each base model in the training set, the two top-performing models were identified, and a stacking ensemble was trained using the stacks package²⁸ to leverage their predictive strengths and improve overall model performance. A comprehensive performance evaluation was conducted on the test set, with calibration curves and DCA plotted to assess calibration and clinical utility.
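The idea behind stacking can be sketched as a second-level model fitted on the base models' predictions. The example below is a simplification under stated assumptions: it fits an unpenalized logistic regression by gradient descent on hypothetical out-of-fold probabilities from two base models. It does not reproduce the blending procedure of the stacks package or the authors' configuration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_meta_learner(base_preds, y, lr=0.5, epochs=2000):
    """Fit logistic-regression weights on base-model predictions by gradient descent."""
    n_models = len(base_preds[0])
    w = [0.0] * n_models
    b = 0.0
    n = len(y)
    for _ in range(epochs):
        grad_w = [0.0] * n_models
        grad_b = 0.0
        for p, yi in zip(base_preds, y):
            err = sigmoid(b + sum(wj * pj for wj, pj in zip(w, p))) - yi
            for j in range(n_models):
                grad_w[j] += err * p[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def stack_predict(w, b, p):
    """Blend one patient's base-model probabilities into a single risk estimate."""
    return sigmoid(b + sum(wj * pj for wj, pj in zip(w, p)))

# Hypothetical out-of-fold probabilities from two base models, with outcomes.
oof = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (0.6, 0.5),
       (0.7, 0.9), (0.8, 0.6), (0.4, 0.7), (0.9, 0.8)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_meta_learner(oof, y)
high = stack_predict(w, b, (0.9, 0.9))
low = stack_predict(w, b, (0.1, 0.1))
```

Fitting the second-level model on out-of-fold rather than in-sample predictions matters: in-sample predictions from flexible learners are overconfident, and a meta-learner trained on them would inherit that optimism.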

To enhance interpretability, the final stacking ensemble model was further analyzed for feature importance. To visualize the marginal contribution of each variable to model predictions, a SHAP-based beeswarm plot was generated, while a waterfall plot was used to illustrate the model's decision-making process for an individual case. Finally, the model was deployed as a web-based tool, providing an interactive prediction interface to facilitate its application in clinical practice.
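The quantity that SHAP plots display can be illustrated with exact Shapley values on a tiny model. The sketch below enumerates all feature coalitions and replaces absent features with a single baseline patient. Both choices are simplifying assumptions: practical SHAP implementations use efficient approximations and a background data set, and the risk function and values here are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for instance x relative to a baseline instance.

    Features outside a coalition are set to their baseline value.
    """
    n = len(x)
    features = range(n)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in features]
        return predict(mixed)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_risk(v):
    """Hypothetical risk score with an interaction between age and respiratory rate."""
    age, resp_rate, urine = v
    return 0.02 * age + 0.01 * resp_rate + 0.001 * age * resp_rate - 0.1 * urine

x = [80.0, 24.0, 0.5]          # patient being explained (hypothetical)
baseline = [70.0, 18.0, 1.0]   # reference patient (hypothetical)
phi = shapley_values(toy_risk, x, baseline)
```

The contributions sum to the difference between the patient's prediction and the baseline prediction, which is the property a waterfall plot visualizes. They describe the model's behavior, not causal effects.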

Statistical analysis

All data processing and analysis were performed using R (version 4.4.3) and SPSS (version 27.0). Categorical variables were presented as frequencies (%) and compared between groups using the chi-square test or Fisher's exact test. Continuous variables were assessed for normality using the Kolmogorov-Smirnov test. Variables with a normal distribution were expressed as mean ± standard deviation and compared using the independent samples t-test. In contrast, non-normally distributed variables were expressed as median (P₂₅, P₇₅) and compared using the Mann-Whitney U test. All statistical tests were two-sided, and a P value < .05 was considered statistically significant.

Results

Baseline characteristics

Among 9374 patients with CKD, 57.0% (5344/9374) had sepsis (Figure 1). The in-hospital mortality among ICU patients with CKD and sepsis was 19.1% (1019/5344). The cohort was randomly divided into a training set (n = 3741) and a test set (n = 1603) at a 7:3 ratio. No significant differences in baseline characteristics were observed between the two sets (Supplementary Table S3). The median age of the overall cohort was 75.10 years (interquartile range [IQR], 65.23–83.11), and 61.62% were male. Table 1 summarizes the demographic and clinical characteristics of survivors and non-survivors in the training cohort.

Table 1.

Baseline characteristics of survivors and non-survivors in the training set.

Variables	Total (n = 3741)	Survivors (n = 3038)	Non-survivors (n = 703)	P
Age, M (Q₁, Q₃)	75.30 (65.28, 83.20)	74.57 (64.67, 82.85)	77.74 (68.43, 84.90)	<.001
Weight, M (Q₁, Q₃)	80.15 (67.30, 96.00)	80.82 (68.00, 96.40)	78.00 (65.00, 93.45)	.002
Baseline_creatinine, M (Q₁, Q₃)	1.30 (1.00, 2.00)	1.30 (1.00, 1.90)	1.50 (1.00, 2.30)	<.001
Urine_output, M (Q₁, Q₃)	1209.71 (645.00, 1925.00)	1300.00 (732.73, 2013.00)	800.00 (369.00, 1480.00)	<.001
Hb_max, M (Q₁, Q₃)	9.90 (8.80, 11.00)	9.90 (8.81, 10.90)	9.90 (8.75, 11.10)	.890
PLT_max, M (Q₁, Q₃)	184.96 (131.00, 234.00)	184.80 (133.00, 233.17)	185.00 (125.00, 235.00)	.522
WBC_max, M (Q₁, Q₃)	12.20 (9.08, 16.30)	11.96 (8.90, 15.80)	13.88 (9.80, 18.27)	<.001
Anion_gap_max, M (Q₁, Q₃)	16.00 (13.00, 19.00)	15.00 (13.00, 18.00)	17.00 (14.00, 21.00)	<.001
Bicarbonate_max, M (Q₁, Q₃)	22.00 (20.00, 25.00)	22.00 (20.00, 25.00)	21.00 (19.00, 24.00)	<.001
BUN_max, M (Q₁, Q₃)	40.00 (27.00, 59.00)	38.00 (26.00, 56.00)	48.48 (33.00, 70.00)	<.001
Total_calcium_max, M (Q₁, Q₃)	8.40 (8.10, 8.80)	8.40 (8.10, 8.80)	8.40 (8.00, 8.90)	.227
Chloride_max, M (Q₁, Q₃)	105.00 (100.00, 109.00)	105.00 (101.00, 109.00)	104.00 (99.00, 108.00)	<.001
Glucose_max, M (Q₁, Q₃)	139.00 (112.00, 181.57)	137.00 (111.00, 175.88)	150.00 (115.50, 201.50)	<.001
Sodium_max, M (Q₁, Q₃)	139.00 (136.00, 142.00)	139.00 (136.00, 141.00)	139.00 (136.00, 142.00)	.679
Potassium_max, M (Q₁, Q₃)	4.50 (4.10, 4.94)	4.50 (4.10, 4.90)	4.60 (4.10, 5.10)	<.001
INR_max, M (Q₁, Q₃)	1.40 (1.20, 1.80)	1.40 (1.20, 1.70)	1.60 (1.30, 2.20)	<.001
PT_max, M (Q₁, Q₃)	15.60 (13.40, 19.40)	15.40 (13.30, 18.60)	16.90 (14.10, 23.50)	<.001
APTT_max, M (Q₁, Q₃)	36.40 (29.70, 53.30)	35.60 (29.50, 51.50)	41.40 (31.15, 63.07)	<.001
MBP, M (Q₁, Q₃)	74.25 (68.68, 81.17)	74.56 (69.08, 81.44)	72.79 (66.91, 79.76)	<.001
Respiratory_rate_mean, M (Q₁, Q₃)	19.24 (16.94, 22.21)	18.96 (16.78, 21.80)	20.67 (17.67, 23.85)	<.001
Temperature_mean, M (Q₁, Q₃)	36.77 (36.54, 37.05)	36.78 (36.56, 37.05)	36.73 (36.42, 37.04)	<.001
SpO₂_mean, M (Q₁, Q₃)	97.15 (95.70, 98.52)	97.18 (95.72, 98.54)	96.93 (95.49, 98.47)	.059
SOFA, M (Q₁, Q₃)	6.00 (4.00, 9.00)	6.00 (4.00, 9.00)	8.00 (6.00, 11.00)	<.001
APS III, M (Q₁, Q₃)	54.00 (43.00, 68.00)	52.00 (42.00, 65.00)	65.00 (53.00, 81.00)	<.001
SAPS II, M (Q₁, Q₃)	44.00 (37.00, 53.00)	43.00 (36.00, 51.00)	52.00 (42.00, 63.00)	<.001
OASIS, M (Q₁, Q₃)	35.00 (29.00, 41.00)	34.00 (28.00, 40.00)	39.00 (33.00, 45.00)	<.001
GCS_min, M (Q₁, Q₃)	15.00 (13.00, 15.00)	15.00 (14.00, 15.00)	15.00 (13.00, 15.00)	.004
SIRS, M (Q₁, Q₃)	3.00 (2.00, 3.00)	3.00 (2.00, 3.00)	3.00 (2.00, 4.00)	<.001
Admission type, n (%)				.002
No	867 (23.18)	736 (24.23)	131 (18.63)
Yes	2874 (76.82)	2302 (75.77)	572 (81.37)
CHD, n (%)				.116
No	2136 (57.10)	1716 (56.48)	420 (59.74)
Yes	1605 (42.90)	1322 (43.52)	283 (40.26)
Hypertension, n (%)				.040
No	477 (12.75)	371 (12.21)	106 (15.08)
Yes	3264 (87.25)	2667 (87.79)	597 (84.92)
COPD, n (%)				<.001
No	3393 (90.70)	2782 (91.57)	611 (86.91)
Yes	348 (9.30)	256 (8.43)	92 (13.09)
Asthma, n (%)				.097
No	3462 (92.54)	2801 (92.20)	661 (94.03)
Yes	279 (7.46)	237 (7.80)	42 (5.97)
Atrial fibrillation, n (%)				.012
No	2101 (56.16)	1736 (57.14)	365 (51.92)
Yes	1640 (43.84)	1302 (42.86)	338 (48.08)
Myocardial infarction, n (%)				<.001
No	2766 (73.94)	2282 (75.12)	484 (68.85)
Yes	975 (26.06)	756 (24.88)	219 (31.15)
Peripheral vascular disease, n (%)				.460
No	3038 (81.21)	2474 (81.44)	564 (80.23)
Yes	703 (18.79)	564 (18.56)	139 (19.77)
Rheumatic disease, n (%)				.603
No	3573 (95.51)	2899 (95.42)	674 (95.87)
Yes	168 (4.49)	139 (4.58)	29 (4.13)
Peptic ulcer disease, n (%)				.226
No	3600 (96.23)	2929 (96.41)	671 (95.45)
Yes	141 (3.77)	109 (3.59)	32 (4.55)
Paraplegia, n (%)				.083
No	3569 (95.40)	2907 (95.69)	662 (94.17)
Yes	172 (4.60)	131 (4.31)	41 (5.83)
Malignant tumor, n (%)				.003
No	3247 (86.79)	2661 (87.59)	586 (83.36)
Yes	494 (13.21)	377 (12.41)	117 (16.64)
Diabetes, n (%)				.188
No	1848 (49.40)	1485 (48.88)	363 (51.64)
Yes	1893 (50.60)	1553 (51.12)	340 (48.36)
Liver disease, n (%)				<.001
No	3218 (86.02)	2670 (87.89)	548 (77.95)
Yes	523 (13.98)	368 (12.11)	155 (22.05)
RRT, n (%)				.012
No	3204 (85.65)	2623 (86.34)	581 (82.65)
Yes	537 (14.35)	415 (13.66)	122 (17.35)
Ventilation, n (%)				<.001
No	532 (14.22)	471 (15.50)	61 (8.68)
Yes	3209 (85.78)	2567 (84.50)	642 (91.32)
Antibiotics, n (%)				<.001
No	509 (13.61)	381 (12.54)	128 (18.21)
Yes	3232 (86.39)	2657 (87.46)	575 (81.79)
Diuretics, n (%)				.311
No	2266 (60.57)	1852 (60.96)	414 (58.89)
Yes	1475 (39.43)	1186 (39.04)	289 (41.11)
Hydrocortisone or dexamethasone, n (%)				<.001
No	3414 (91.26)	2822 (92.89)	592 (84.21)
Yes	327 (8.74)	216 (7.11)	111 (15.79)
NSAIDs, n (%)				.004
No	2322 (62.07)	1852 (60.96)	470 (66.86)
Yes	1419 (37.93)	1186 (39.04)	233 (33.14)
Antifungals, n (%)				.058
No	3637 (97.22)	2961 (97.47)	676 (96.16)
Yes	104 (2.78)	77 (2.53)	27 (3.84)
Vasopressin, n (%)				<.001
No	3470 (92.76)	2853 (93.91)	617 (87.77)
Yes	271 (7.24)	185 (6.09)	86 (12.23)
Dopamine, n (%)				<.001
No	3596 (96.12)	2937 (96.68)	659 (93.74)
Yes	145 (3.88)	101 (3.32)	44 (6.26)
Norepinephrine, n (%)				<.001
No	2715 (72.57)	2291 (75.41)	424 (60.31)
Yes	1026 (27.43)	747 (24.59)	279 (39.69)
Epinephrine, n (%)				.482
No	3497 (93.48)	2844 (93.61)	653 (92.89)
Yes	244 (6.52)	194 (6.39)	50 (7.11)
ACEI or ARB, n (%)				.001
No	3478 (92.97)	2805 (92.33)	673 (95.73)
Yes	263 (7.03)	233 (7.67)	30 (4.27)

Note: M, median; Q₁, first quartile; Q₃, third quartile; Hb, hemoglobin; PLT, platelet; WBC, white blood cell; BUN, blood urea nitrogen; INR, international normalized ratio; PT, prothrombin time; APTT, activated partial thromboplastin time; MBP, mean blood pressure; SpO₂, peripheral capillary oxygen saturation; SOFA, Sequential Organ Failure Assessment; APS III, Acute Physiology Score III; SAPS II, Simplified Acute Physiology Score II; OASIS, Oxford Acute Severity of Illness Score; GCS, Glasgow Coma Scale; SIRS, Systemic Inflammatory Response Syndrome; CHD, coronary heart disease; COPD, chronic obstructive pulmonary disease; RRT, renal replacement therapy; NSAIDs, non-steroidal anti-inflammatory drugs; ACEI, angiotensin-converting enzyme inhibitor; ARB, angiotensin II receptor blocker.

Feature selection

Based on 39 significant variables identified through univariate analysis, LASSO regression and the Boruta algorithm were applied to further reduce dimensionality and enhance model performance. LASSO, using L1 regularization, achieved variable selection at λ.1se, retaining 18 variables (Figure 2(a)–(b)). Boruta, based on a random forest framework, identified 26 features with stable predictive contributions (Figure 2(c)). Finally, 16 variables identified by the intersection of both methods were selected as model inputs, including age, urine output, anion gap (AG), BUN, SOFA score, chloride, Acute Physiology Score III (APS III), international normalized ratio (INR), Oxford Acute Severity of Illness Score (OASIS), activated partial thromboplastin time (APTT), respiratory rate, SAPS II, Systemic Inflammatory Response Syndrome (SIRS) score, liver disease, antibiotic use, and corticosteroid use (hydrocortisone or dexamethasone) (Figure 2(d)).

Figure 2.

Feature selection workflow integrating LASSO and Boruta algorithms. (a) Ten-fold cross-validation in the LASSO model identified the optimal penalty parameter at λ.1se, resulting in the retention of 18 variables. (b) The coefficient path plot illustrates the convergence of variable coefficients with increasing regularization strength. (c) The Boruta algorithm confirmed 26 variables with stable predictive importance. (d) A total of 16 variables were retained after taking the intersection of the LASSO and Boruta selections, representing the final predictor set used in the model.

Model development and evaluation

Eight ML models were developed to predict in-hospital mortality. In the training set, LightGBM and XGBoost demonstrated the highest discrimination, with AUCs of 0.864 (95% CI, 0.850–0.878) and 0.841 (95% CI, 0.825–0.856), respectively, both clearly superior to the SOFA score (AUC, 0.654; 95% CI, 0.632–0.677) (Figure 3(a)). LightGBM also achieved the best overall balance of performance metrics, with an accuracy of 74.0%, sensitivity of 71.7%, specificity of 84.4%, and an F1 score of 0.818, followed by XGBoost (accuracy, 73.1%; F1 score, 0.811). Detailed performance metrics for all base models in the training set are reported in Supplementary Table S4. In the test set, the stacking model constructed from LightGBM and XGBoost maintained superior discrimination compared with the SOFA score, with an AUC of 0.757 (95% CI, 0.729–0.786) versus 0.668 (95% CI, 0.633–0.702) for SOFA (Figure 3(b)), and it also showed higher accuracy, sensitivity, specificity, and F1 score (Supplementary Table S5). DeLong tests indicated that, in the training set, LightGBM and XGBoost had significantly higher AUCs than most other base models, including SOFA (Supplementary Figure S2), and in the test set, the stacking model achieved a significantly higher AUC than SOFA (P < .001).

Figure 3.

Performance of the prediction models in the training and test sets. This figure illustrates the discrimination, calibration, and clinical utility of the machine learning models and the SOFA score. (a) ROC curves in the training set comparing the eight ML models with the SOFA score. (b) ROC curves in the test set comparing the stacking ensemble model, constructed from LightGBM and XGBoost, with the SOFA score. (c) Calibration plots in the training set for the eight ML models and the SOFA score. (d) Calibration plots in the test set for the stacking ensemble model and the SOFA score. (e) DCA in the training set for the eight machine learning models and the SOFA score. (f) DCA in the test set for the stacking ensemble model and the SOFA score.

Calibration plots showed that, in the training set, all models except RF and naive Bayes had good agreement between predicted and observed risks (Figure 3(c)). In the test set, the stacking model remained well calibrated, whereas SOFA displayed noticeable deviation from the ideal reference line (Figure 3(d)). DCA demonstrated that, in the training set, LightGBM and XGBoost provided greater net clinical benefit across a wide range of threshold probabilities compared with SOFA (Figure 3(e)). In the test set, the stacking model yielded higher net benefit than SOFA over thresholds from 0.10 to 0.50, and within the overlapping region, it provided an incremental net benefit of approximately 0.05, equivalent to about five additional true-positive identifications per 100 patients with no increase in false positives (Figure 3(f)).
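For reference, the net benefit plotted in a decision curve at threshold probability p_t is TP/n − (FP/n) × p_t/(1 − p_t). The sketch below computes it for a model and for the treat-all strategy; the ten-patient data set is hypothetical and the function names are illustrative.

```python
def net_benefit(y_true, probs, threshold):
    """Net benefit of treating patients whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """Net benefit of treating every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Hypothetical cohort of 10 patients, 2 of whom died.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.3, 0.6, 0.1, 0.1, 0.05, 0.15, 0.1, 0.05, 0.1]
nb_model = net_benefit(y_true, probs, threshold=0.2)
nb_all = net_benefit_treat_all(y_true, threshold=0.2)
```

At a threshold of 0.2, each false positive is weighted as one quarter of a true positive, so the toy model's two true positives and one false positive give a net benefit of 0.175, against 0 for treating everyone.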

To assess the robustness of the model across different sepsis severity levels, we performed an interaction analysis and subgroup evaluations. The interaction between predicted mortality probability and septic shock status was not statistically significant (P = .166). In subgroup analyses, the model yielded an AUC of 0.730 (95% CI, 0.690–0.769) in patients with non-shock sepsis and 0.705 (95% CI, 0.653–0.758) in those with septic shock, suggesting that the model's predictive performance was stable across clinically relevant severity subgroups.

Model explainability and web deployment

Age, APS III score, respiratory rate, and SAPS II score were consistently ranked among the top predictors of in-hospital mortality across the base models (Supplementary Figure S3). The SHAP beeswarm plot based on the stacking model further confirmed these key variables, showing that most features were positively associated with mortality risk, except for urine output, antibiotic use, and chloride (Figure 4(a) and (b)). The patient-level SHAP visualization (Figure 4(c)) demonstrated how the model integrates individual clinical variables to produce risk estimates, thereby enhancing interpretability and supporting clinical decision-making. Finally, the stacking model was deployed as an interactive, web-based tool (Figure 4(d)), allowing clinicians to input key variables at the bedside and to obtain real-time mortality risk predictions to aid in the early identification and risk stratification of ICU patients. The tool is publicly available online (https://mdyy1.shinyapps.io/linshuo/), and a user guide has been provided separately as supplementary material.

Figure 4.

Interpretability and clinical deployment of the stacking model. (a) Feature importance ranking of the stacking ensemble model, showing that age, APS III score, SAPS II score, and respiratory rate were the four most influential predictors of in-hospital mortality. (b) SHAP beeswarm plot summarizing the direction and magnitude of each feature's contribution across all patients. (c) SHAP waterfall plot for a randomly selected patient, demonstrating how the model integrates individual-level features to generate a mortality risk prediction. (d) Interface of the web-based application designed for bedside use, allowing clinicians to input patient characteristics and obtain real-time mortality risk estimates.

Discussion

In this study, we developed a stacking ensemble model to predict in-hospital mortality in patients with CKD and sepsis, using a large-scale, real-world critical care database. Compared with the conventional SOFA score, the model demonstrated superior discrimination and calibration and greater clinical utility. By integrating two advanced algorithms and applying feature selection through LASSO regression and the Boruta algorithm, the model effectively captured the complex clinical characteristics of this high-risk population, improving predictive accuracy and robustness.

To date, ML techniques still face “black-box” challenges, which make their implementation in clinical practice questionable.²⁹ To mitigate this issue and better align with clinical needs, we introduced SHAP into the model, generating summary plots to illustrate overall feature importance and waterfall plots to explain individual prediction outcomes. It is important to note that these SHAP-derived associations reflect predictive relationships rather than causal effects; they indicate how individual features influence the model's risk estimates but should not be interpreted as evidence of causation. Furthermore, recognizing the fast-paced clinical environment and the complexity of healthcare information systems, we developed a simple web-based prototype that allows clinicians to input key variables and generate individualized mortality risk estimates, improving accessibility and supporting preliminary clinical evaluation.

Although established scoring systems such as SOFA, SAPS II, and APS III were included as predictors in our model, the stacking ensemble provides several advantages beyond the capabilities of these traditional tools. First, in our cohort, SOFA showed notably lower discrimination, indicating that traditional scores did not fully capture mortality risk in patients with CKD and sepsis. Second, traditional scores rely on linear assumptions and fixed coefficients, restricting their ability to capture nonlinear effects and higher-order interactions among clinical variables. In contrast, the stacking framework integrates complementary strengths of multiple ML algorithms, enabling the model to learn complex relationships that extend beyond what any individual score can represent. Finally, the improvement in AUC was accompanied not only by better calibration but also by a consistently higher net clinical benefit across a broad range of risk thresholds in the DCA. As articulated by Vickers et al.,³⁰ these thresholds represent the predicted mortality probabilities at which clinicians judge the expected benefit of escalating management to outweigh its potential harms, reflecting the inherent trade-off between missing high-risk patients and performing unnecessary interventions. While these results indicate that the stacking model offers a measurable performance advantage over conventional severity scores, the clinical interpretation of the DCA should be approached with caution. DCA is best viewed as a comparative assessment of potential net benefit rather than evidence that the model would directly guide or modify bedside clinical decision-making. It is also important to note that although several recent ML models developed for sepsis-associated organ injuries, including sepsis-associated liver injury¹⁷ and sepsis-associated acute kidney injury,¹⁸ have reported AUC values approaching 0.80, the substantial differences in study populations and clinical phenotypes make direct comparison with our CKD–sepsis cohort inappropriate, even though all studies used mortality as the endpoint.

In this study, age, APS III score, SAPS II score, and respiratory rate emerged as the four strongest predictors of in-hospital mortality, with all showing a positive association with adverse outcomes. Age was the most influential factor, consistent with the well-described decline in physiologic reserve and immune competence in older adults.^31–33 Sepsis further amplifies these age-related vulnerabilities, contributing to substantially higher mortality among elderly patients, with those over 80 years experiencing nearly twice the risk of death compared with individuals younger than 50.³⁴ The APS III and SAPS II scores, which carried substantial weight in the model, provide a comprehensive assessment of acute physiologic disturbances, pre-existing health status, and multi-organ function, thereby capturing systemic imbalance in this high-risk population during infectious stress. The SIRS score, which has been deprecated under Sepsis 3.0 due to poor specificity and limited ability to identify organ dysfunction, similarly demonstrated minimal prognostic value in our cohort.^22,35

Urine output is a well-established indicator of perfusion status, and values below 0.5 mL/kg/h are strongly associated with increased ICU mortality.³⁶ Early urine output has also been recognized as an important predictor in sepsis mortality models.³⁷ BUN, reflecting glomerular filtration function, is similarly incorporated into several risk scoring systems and is strongly associated with adverse outcomes, with thresholds above 21 mg/dL shown to markedly increase sepsis mortality risk.^12,37,38 Liver dysfunction is also common in sepsis, with an incidence of 34%–46%, and mortality rates can reach up to 68% in cases of progressive hepatic failure.³⁹ Coagulation abnormalities are common in sepsis.⁴⁰ Elevation of the INR reflects impaired coagulation and reduced hepatic synthetic function, both of which indicate more severe physiologic derangement in sepsis and are associated with higher mortality risk.^40–42 The AG has also been validated as a marker associated with short-term mortality in sepsis,^43,44 and in our study, low chloride at ICU admission similarly emerged as an early indicator of in-hospital mortality in patients with CKD and sepsis.

The survival benefit of early antibiotic administration in sepsis is well established, with delays linked to substantially increased mortality.^45–47 In our cohort, early antibiotic use remained a strong protective factor in patients with CKD and sepsis, consistent with the critical need for rapid pathogen control in individuals with impaired renal and immune function. The use of glucocorticoids (hydrocortisone or dexamethasone), in contrast, emerged as an independent risk factor for mortality, likely reflecting greater illness severity. It should be emphasized that this study focused on predictive modeling of early in-hospital mortality risk and does not directly negate the potential benefits of glucocorticoid therapy. Rather, it underscores the need to carefully balance the risks of resistance and adverse effects against the evidence of infection, individual patient condition, and treatment response, to guide more precise pharmacologic interventions in clinical practice.

Limitations

This study has several limitations. First, because it was conducted using a pre-existing database, the final sample size was determined by the number of eligible cases rather than by an a priori sample size calculation, which may reduce the statistical power. Second, although we addressed missing data using the missForest algorithm and confirmed that post-imputation distributions of key variables closely mirrored those of the original dataset, the potential for residual bias remains, given that no imputation approach is entirely free of bias. Third, the model was developed using data from the MIMIC-IV database, which represents a single-center critical care population in the United States. Although MIMIC-IV includes patients of diverse racial and ethnic backgrounds, its geographic and institutional homogeneity may limit the model's generalizability, particularly in settings with different patient demographics, clinical practices, or ICU care standards. Furthermore, because healthcare systems, sepsis management protocols, and resource availability vary substantially across countries and regions, the model's performance outside the United States cannot be assumed. External validation using multi-center and international cohorts is therefore essential to establish broader applicability. Fourth, the model was constructed using only clinical variables collected within the first 24 hours of ICU admission, providing a static representation that does not capture the dynamic trajectory of sepsis. Given the importance of temporal patterns in sepsis prognostication, the absence of longitudinal data may limit predictive accuracy and reduce clinical utility. Finally, although the model has been deployed as a web-based tool, it has not yet undergone usability testing, electronic health record integration assessment, or prospective evaluation, all of which are necessary before clinical implementation.
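One way to quantify how closely a post-imputation distribution mirrors the observed one is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distribution functions. The sketch below is illustrative only: the study does not report which comparison it used, so the choice of statistic, the function name `ks_statistic`, and the toy values are all assumptions.

```python
from bisect import bisect_right


def ks_statistic(observed, imputed):
    """Two-sample KS statistic: maximum gap between empirical CDFs.

    `observed` may contain None for missing entries (these are dropped);
    `imputed` is the same variable after imputation, with no missing values.
    """
    a = sorted(v for v in observed if v is not None)
    b = sorted(imputed)
    if not a or not b:
        raise ValueError("both samples need at least one value")
    gap = 0.0
    # The maximum difference between two step functions occurs at a data point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Toy example (not study data): one missing value filled in with 3.0.
before = [1.0, 2.0, None, 4.0, 5.0]
after = [1.0, 2.0, 3.0, 4.0, 5.0]
print(round(ks_statistic(before, after), 3))
```

Values near 0 indicate that imputation left the variable's distribution largely unchanged, whereas values near 1 indicate a marked shift. A small statistic does not rule out bias if the data are not missing at random.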

Conclusion

We developed a stacking-based ML model that predicts in-hospital mortality risk in ICU patients with CKD and sepsis with good performance. A web-based prototype tool was created to facilitate bedside risk assessment. However, the model's clinical applicability and generalizability require further evaluation in external cohorts.

Supplemental Material

Supplemental material for Development and deployment of an interpretable stacking ensemble model for predicting in-hospital mortality in ICU patients with chronic kidney disease and sepsis, by Jianjie Ju, Shuo Lin, Jingjing Chen and Zhouhua Wang, in DIGITAL HEALTH:

sj-docx-1-dhj-10.1177_20552076261415938

sj-pdf-2-dhj-10.1177_20552076261415938

sj-pdf-3-dhj-10.1177_20552076261415938

Footnotes

Abbreviation

Acknowledgements

The authors acknowledge all participants in the MIMIC-IV research team for survey design and data collection.

ORCID iDs

Jianjie Ju

Shuo Lin

Jingjing Chen

Zhouhua Wang

Ethics statement

The MIMIC-IV database is a publicly available dataset containing de-identified patient information; therefore, no additional ethical approval was required for its use. Data extraction for this study was approved by the Institutional Review Board of the Massachusetts Institute of Technology, with access granted to author Jianjie Ju (ID: 13963218).

Contributorship

Jianjie Ju contributed to the study's conceptualization, methodology, data curation, original drafting, and visualization. Shuo Lin contributed to data curation, visualization, and drafting of the manuscript. Jingjing Chen provided data curation and resource support. Zhouhua Wang supervised the study, led the conceptualization and methodology design, secured funding, and contributed substantially to original drafting, review, editing, and data curation.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by the Startup Fund for Scientific Research, Fujian Medical University (Grant No. 2019QH1217), and the Fujian Provincial Health Commission Scientific Research Plan Project (Grant No. 2021QNA076).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability

All data will be available from the corresponding author upon reasonable request.

Supplemental material

Supplemental material for this article is available online.

Use of artificial intelligence tools

Artificial intelligence tools (Grammarly and ChatGPT) were used only for grammar and language refinement. All AI-assisted content was reviewed by the authors, and no AI tools were used for data analysis, modeling, or interpretation.

References

1. Yan, Chao, Lin. Chronic kidney disease: strategies to retard progression. Int J Mol Sci 2021; 22: 10084.

2. Bikbov, Purcell, Levey, et al. Global, regional, and national burden of chronic kidney disease, 1990–2017: a systematic analysis for the global burden of disease study 2017. Lancet 2020; 395: 709–733.

3. Rimes-Stigare, Frumento, Bottai, et al. Long-term mortality and risk factors for development of end-stage renal disease in critically ill patients with and without chronic kidney disease. Crit Care 2015; 19: 83.

4. De Rosa, Samoni, Villa, et al. Management of chronic kidney disease patients in the intensive care unit: mixing acute and chronic illness. Blood Purif 2017; 43: 151–162.

5. Hassan, Duarte, Dix-Peek, et al. Correlation between volume overload, chronic inflammation, and left ventricular dysfunction in chronic kidney disease patients. Clin Nephrol 2016; 86: 131–135.

6. Jiao. The interplay between immune and metabolic pathways in kidney disease. Cells 2023; 12: 1584.

7. Raphael. Metabolic acidosis in CKD: core curriculum 2019. Am J Kidney Dis 2019; 74: 263–275.

8. Sarnak, Jaber. Mortality caused by sepsis in patients with end-stage renal disease compared with the general population. Kidney Int 2000; 58: 1758–1764.

9. Neyra, Mescia, et al. Impact of acute kidney injury and CKD on adverse outcomes in critically ill septic patients. Kidney Int Rep 2018; 3: 1344–1353.

10. Mansur, Mulwande, Steinau, et al. Chronic kidney disease is associated with a higher 90-day mortality than other chronic medical conditions in patients with sepsis. Sci Rep 2015; 5: 10539.

11. Karakike, Kyriazopoulou, Tsangaris, et al. The early change of SOFA score as a prognostic marker of 28-day sepsis mortality: analysis through a derivation and a validation cohort. Crit Care 2019; 23: 87.

12. Le Gall, Lemeshow, Saulnier. A new simplified acute physiology score (SAPS II) based on a European/North American multicenter study. JAMA 1993; 270: 2957–2963.

13. Kong, Lin. Using machine learning methods to predict in-hospital mortality of sepsis patients in the ICU. BMC Med Inform Decis Mak 2020; 20: 51.

14. Khwannimit, Bhurayanontachai, Vattanavanit. Validation of the sepsis severity score compared with updated severity scores in predicting hospital mortality in sepsis patients. Shock 2017; 47: 720–725.

15. Chen, et al. Association between red blood cells transfusion and 28-day mortality rate in septic patients with concomitant chronic kidney disease. Sci Rep 2024; 14: 23769.

16. Yang, Cui, Song. Predicting sepsis onset in ICU using machine learning models: a systematic review and meta-analysis. BMC Infect Dis 2023; 23: 35.

17. Wen, Zhang, et al. An interpretable machine learning model for predicting 28-day mortality in patients with sepsis-associated liver injury. PLoS One 2024; 19: e0303469.

18. Gao, Nong, Luo, et al. Machine learning-based prediction of in-hospital mortality for critically ill patients with sepsis-associated acute kidney injury. Ren Fail 2024; 46: 2316267.

19. Liu, Zhang, Song, et al. An improved stacking model for predicting myocardial infarction risk in imbalanced data. Health Inf Sci Syst 2025; 13: 16.

20. Chiu, Chien, et al. Applying an improved stacking ensemble model to predict the mortality of ICU patients with heart failure. J Clin Med 2022; 11: 6460.

21. Johnson AEW, Bulgarelli, Shen, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data 2023; 10: 1.

22. Singer, Deutschman, Seymour, et al. The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA 2016; 315: 801–810.

23. Stekhoven, Bühlmann. MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics 2012; 28: 112–118.

24. Tibshirani. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol 1996; 58: 267–288.

25. Kursa, Rudnicki. Feature selection with the Boruta package. J Stat Softw 2010; 36: 1–13.

26. Kuhn, Silge. Tidy modeling with R: a framework for modeling in the tidyverse. Sebastopol, CA: O'Reilly Media, Inc., 2022.

27. Lin, Wang, et al. Predictive model of acute kidney injury in critically ill patients with acute pancreatitis: a machine learning approach using the MIMIC-IV database. Ren Fail 2024; 46: 2303395.

28. Couch, Kuhn. Stacks: stacked ensemble modeling with tidy data principles. J Open Source Softw 2022; 7: 4471.

29. Azodi, Tang, Shiu. Opening the black box: interpretable machine learning for geneticists. Trends Genet 2020; 36: 442–455.

30. Vickers, Elkin. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making 2006; 26: 565–574.

31. Kingren, Starr, Saito. Divergent sepsis pathophysiology in older adults. Antioxid Redox Signal 2021; 35: 1358–1375.

32. Montenont, Rondina, Campbell. Altered functions of platelets during aging. Curr Opin Hematol 2019; 26: 336–342.

33. Inoue, Suzuki, Komori, et al. Persistent inflammation and T cell exhaustion in severe sepsis in the elderly. Crit Care 2014; 18: R130.

34. Kotfis, Wittebole, Jaschinski, et al. A worldwide perspective of sepsis epidemiology and survival according to age: observational data from the ICON audit. J Crit Care 2019; 51: 122–132.

35. Cortés-Puch, Hartog. Opening the debate on the new sepsis definition. Change is not necessarily progress: revision of the sepsis definition should be based on new scientific insights. Am J Respir Crit Care Med 2016; 194: 16–18.

36. Heffernan, Judge, Petrie, et al. Association between urine output and mortality in critically ill patients: a machine learning approach. Crit Care Med 2022; 50: e263–e271.

37. Qiu. Development and validation of an interpretable machine learning for mortality prediction in patients with sepsis. Front Artif Intell 2024; 7: 1348907.

38. Weng, Hou, Zhou, et al. Development and validation of a score to predict mortality in ICU patients with sepsis: a multicenter retrospective study. J Transl Med 2021; 19: 22.

39. Yan. The role of the liver in sepsis. Int Rev Immunol 2014; 33: 498–510.

40. Winer, Salyer, Beckmann, et al. Enigmatic role of coagulopathy among sepsis survivors: a review of coagulation abnormalities and their possible link to chronic critical illness. Trauma Surg Acute Care Open 2020; 5: e000462.

41. Lyons, Micek, Hampton, et al. Sepsis-associated coagulopathy severity predicts hospital mortality. Crit Care Med 2018; 46: 736–742.

42. Giustozzi, Ehrlinder, Bongiovanni, et al. Coagulopathy and sepsis: pathophysiology, clinical manifestations and treatment. Blood Rev 2021; 50: 100864.

43. Lou, Zeng, Huang, et al. Association between the anion-gap and 28-day mortality in critically ill adult patients with sepsis: a retrospective cohort study. Medicine (Baltimore) 2024; 103: e39029.

44. Jiang, Wang, et al. Predictive value of the serum anion gap for 28-day in-hospital all-cause mortality in sepsis patients with acute kidney injury: a retrospective analysis of the MIMIC-IV database. Ann Transl Med 2022; 10: 1373.

45. Isaranuwatchai, Buppanharun, Thongbun, et al. Early antibiotics administration reduces mortality in sepsis patients in tertiary care hospital. BMC Infect Dis 2025; 25: 36.

46. Evans, Rhodes, Alhazzani, et al. Surviving sepsis campaign: international guidelines for management of sepsis and septic shock 2021. Intensive Care Med 2021; 47: 1181–1247.

47. Liu, Fielding-Singh, Greene, et al. The timing of early antibiotics and hospital mortality in sepsis. Am J Respir Crit Care Med 2017; 196: 856–863.
