Multicenter validation of an explainable machine learning model for early prediction of acute kidney injury in critically ill patients with digestive system tumors

Abstract

Objective

Critically ill patients with digestive system tumors are at high risk of acute kidney injury (AKI), a complication strongly associated with increased mortality and adverse outcomes. However, early AKI risk identification in the intensive care unit (ICU) remains challenging. This study aimed to develop and externally validate an interpretable model for early AKI risk prediction in critically ill patients with digestive system tumors.

Methods

We retrospectively analyzed 3,821 patients with digestive system tumors from the MIMIC-IV 3.0 database. Routine clinical variables from the first 24 hours after ICU admission were extracted. Least absolute shrinkage and selection operator (LASSO) regression and multivariable logistic regression were used for feature selection, followed by the development and comparison of six machine learning models. Model performance was evaluated using the AUC, calibration analysis, and decision curve analysis. The optimal model was interpreted using Shapley Additive Explanations (SHAP), and an online interactive prediction tool was developed to facilitate clinical translation. External validation was conducted in three independent cohorts from the United States and China, including the eICU Collaborative Research Database (eICU), the Tangshan tumor-related AKI cohort (TS-TAKI), and the Beijing Acute Kidney Injury cohort (BAKIT).

Results

The incidence of AKI was 75.8%. Among all models, extreme gradient boosting (XGBoost) showed the best overall performance, with an AUC of 0.765 (95% CI 0.745–0.786) in the training set and 0.742 (95% CI 0.710–0.773) in the validation set, a low Brier score, good calibration, and meaningful clinical net benefit. SHAP analysis identified the SOFA score, mechanical ventilation, WBC count, age, serum potassium level, sepsis, and vasoactive drug use as key predictors, with consistent interpretability at both the population and individual levels. In external validation, the model demonstrated stable discrimination across multicenter populations (AUCs of 0.719 in eICU, 0.769 in TS-TAKI, and 0.616 in BAKIT); however, calibration performance was notably affected by population heterogeneity, suggesting the need for recalibration using local data in real-world application.

Conclusion

This interpretable XGBoost-based model, based on routine ICU data, enables early AKI risk stratification in critically ill patients with digestive system tumors. The model achieves a balance between predictive performance, transparency, and clinical feasibility. It exhibited stable discriminative performance across multicenter and cross-population cohorts, while calibration was clearly affected by population heterogeneity, highlighting the need for local validation and recalibration in real-world application. An online prediction tool further supports its potential for clinical translation.

Keywords

digestive system tumors; acute kidney injury; prediction model; explainable artificial intelligence; SHAP analysis; multicenter validation

Background

Digestive system tumors rank among the leading malignancies worldwide in both incidence and mortality and have become a major public health problem.¹ Critically ill patients with these malignancies are particularly vulnerable to acute kidney injury (AKI) due to the convergence of tumor-related physiological stress, treatment-associated nephrotoxicity, systemic inflammation, and hemodynamic instability.^2–4 Once AKI develops, it is associated with markedly increased in-hospital mortality, interruption or modification of anticancer therapy, and worse long-term outcomes.⁵ Notably, a nationwide survey in China identified digestive system malignancies as the most common primary cancers among patients with malignancy-associated AKI, accounting for more than half of reported cases, underscoring the clinical importance of early risk stratification in this high-risk population.⁶

Current AKI diagnosis and risk assessment rely predominantly on the Kidney Disease: Improving Global Outcomes (KDIGO) criteria, which are based on elevations in serum creatinine (SCr) or reductions in urine output.⁷ However, these criteria have inherent limitations in early risk prediction. Biomarker changes often lag behind the onset of kidney injury, and in patients with cancer, SCr and urine output may be influenced by baseline renal function, fluid status, and oncologic treatments, thereby obscuring early renal injury.

In recent years, machine learning (ML) approaches have shown considerable promise for the early prediction of AKI.^8,9 However, most existing models have been developed in mixed ICU populations, with limited focus on specific subgroups. As a result, these models may not adequately capture the unique risk profiles and pathophysiological characteristics of critically ill patients with digestive system tumors.

Moreover, many ML models remain “black boxes,” offering limited interpretability of how predictors contribute to risk estimation. This lack of transparency hampers clinical trust and adoption.¹⁰ Improving model interpretability and enabling real-time application in clinical settings therefore remain critical challenges. In addition, few studies have conducted systematic external validation across multicenter and cross-population cohorts, leaving the generalizability of existing models uncertain.^11,12 Robust external validation is essential to ensure model stability and reliability in real-world practice.

To address these limitations, this study incorporates the Shapley Additive Explanations (SHAP) framework to enhance model interpretability. An online interactive tool was further developed to enable real-time AKI risk estimation based on routinely available clinical variables, thereby supporting early risk identification and clinical decision-making. External validation was performed across multiple independent cohorts, aiming to develop a population-specific, interpretable, and clinically applicable prediction model for AKI risk stratification in critically ill patients with digestive system tumors.

Methods

Data source and ethics statement

This retrospective study was conducted in accordance with the Declaration of Helsinki and was reported following the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement. The model development cohort was derived from the Medical Information Mart for Intensive Care IV (MIMIC-IV 3.0) database, which contains detailed clinical data of critically ill patients in the United States.¹³ All data in MIMIC-IV are fully de-identified. Access to the database was granted after completion of the required ethics training and certification (certification ID: 59209829).

External validation was performed using three independent cohorts: the multicenter eICU Collaborative Research Database (eICU-CRD), a single-center Chinese Tangshan Cohort Database for Tumor-Associated Acute Kidney Injury (TS-TAKI), and the multicenter Beijing AKI Clinical Trial (BAKIT) database. The eICU-CRD is a publicly available database derived from multiple intensive care units across the United States and includes detailed, high-resolution clinical data from more than 200,000 ICU admissions. The TS-TAKI database is a single-center database derived from a cohort study on tumor-associated AKI that was registered at the Chinese Clinical Trial Registry (ChiCTR2500103958, registered on 9 June 2025) and conducted at Tangshan People’s Hospital. The BAKIT database is a multicenter prospective cohort focusing on the epidemiology of AKI among ICU patients in Beijing, China, and includes comprehensive clinical data collected from 30 ICUs across 28 tertiary hospitals. Data collection and usage strictly complied with ethical standards. The eICU-CRD consists of fully de-identified data and is exempt from informed consent. The TS-TAKI cohort was approved by the institutional ethics committee of Tangshan People’s Hospital (approval No. RMYY-LLKS-2024039). Use of the BAKIT database was approved by the Ethics Committee of Beijing Fuxing Hospital, Capital Medical University (approval No. 2010FXHEC-KF026), with a waiver of informed consent.

Inclusion and exclusion criteria

Patients were screened from the MIMIC-IV database to construct the study cohort. The inclusion criteria were as follows: (1) age ≥ 18 years and ≤ 89 years; (2) a diagnosis of digestive system tumor, identified according to ICD-9/10 codes; and (3) complete records of renal function monitoring and key clinical variables after ICU admission. The exclusion criteria were: (1) repeated ICU admissions (only the first ICU admission record was retained); (2) ICU length of stay < 24 hours; (3) AKI already diagnosed before or at the time of ICU admission; (4) receipt of renal replacement therapy (RRT) prior to or at ICU admission; and (5) missing values in more than 20% of secondary clinical variables. The external validation cohorts applied the same inclusion and exclusion criteria as those used in the model development phase. The specific study workflow is shown in Figure 1.

Figure 1.

Flow chart of participant selection and exclusion.

Data collection

Digestive system tumors were identified according to International Classification of Diseases, Ninth and Tenth Revision (ICD-9/10) codes, including neoplasms of the esophagus, stomach, colorectum, liver, biliary tract, pancreas, and other digestive organs. Tumors were not further stratified by pathological behavior; benign, malignant, and unspecified neoplasms were all included to reflect real-world clinical practice in the intensive care setting. The prediction starting point was defined as ICU admission. The predictor variables were extracted from the first recorded measurements within the first 24 hours of ICU admission to reflect the patient’s early physiological status. The primary outcome was the new onset of AKI within 72 hours of ICU admission. To ensure temporal consistency in predictions, patients who met the KDIGO diagnostic criteria for AKI prior to or at the time of ICU admission were excluded from the analysis.

AKI diagnosis was based on the KDIGO clinical practice guidelines: a SCr increase of ≥26.5 μmol/L within 48 hours, or an increase to ≥1.5 times the baseline value, or urine output <0.5 mL·kg^-1·h^-1 for ≥6 hours. The baseline SCr was defined as the most recent measurement available within 7 days prior to ICU admission; if no record was available within this timeframe, the lowest SCr value within the first 24 hours after admission was used as the substitute baseline. Patients without any available SCr measurement within either timeframe were excluded because AKI status could not be determined. The hourly urine output records were extracted from the database; when hourly data were incomplete, a 6-hour rolling window was applied to aggregate urine volume, which was then divided by the corresponding duration and patient body weight to derive the hourly rate. Body weight was obtained from admission records, and if unavailable, was imputed using the median weight for the corresponding sex and age group within the cohort.
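The outcome definition above can be sketched in code. The following is a minimal illustration only, not the study's extraction logic: the function names, the input layout (time-stamped SCr pairs and a list of hourly urine volumes), and the choice to compare each SCr value against every earlier value within 48 hours are assumptions made for this example. The urine criterion follows the 6-hour rolling-window aggregation described in the text.

```python
from typing import Sequence, Tuple

ABS_RISE_UMOL_L = 26.5     # absolute SCr rise within 48 h
REL_RISE_FACTOR = 1.5      # SCr relative to baseline
URINE_RATE_ML_KG_H = 0.5   # urine output threshold
WINDOW_H = 6               # rolling window length in hours


def scr_criterion(baseline: float, scr: Sequence[Tuple[float, float]]) -> bool:
    """scr holds (hours_since_icu_admission, scr_umol_per_l) pairs (assumed layout)."""
    ordered = sorted(scr)
    for i, (t_late, v_late) in enumerate(ordered):
        # Relative criterion: value reaches 1.5 times the baseline.
        if v_late >= REL_RISE_FACTOR * baseline:
            return True
        # Absolute criterion: rise of >= 26.5 umol/L versus an earlier value within 48 h.
        for t_early, v_early in ordered[:i]:
            if t_late - t_early <= 48 and v_late - v_early >= ABS_RISE_UMOL_L:
                return True
    return False


def urine_criterion(hourly_ml: Sequence[float], weight_kg: float) -> bool:
    """True if any 6-hour window averages below 0.5 mL/kg/h."""
    for start in range(len(hourly_ml) - WINDOW_H + 1):
        rate = sum(hourly_ml[start:start + WINDOW_H]) / (WINDOW_H * weight_kg)
        if rate < URINE_RATE_ML_KG_H:
            return True
    return False


def has_aki(baseline, scr, hourly_ml, weight_kg) -> bool:
    """Either criterion is sufficient."""
    return scr_criterion(baseline, scr) or urine_criterion(hourly_ml, weight_kg)
```

For example, `has_aki(80.0, [(2, 82.0), (30, 110.0)], [60.0] * 12, 70.0)` is met through the SCr criterion (a rise of 28 μmol/L within 28 hours), even though urine output stays above the threshold.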

A total of 36 routinely available clinical variables were collected, encompassing demographic characteristics, admission-related variables, comorbidities, treatments and interventions, vital signs, and laboratory measurements (Table 1). Data extraction was performed using PostgreSQL (version 12.0).

Table 1.

Baseline characteristics of patients with digestive system tumors.

Variable	Training set (n=2,674)	Validation set (n=1,147)	Z/χ² value	P value
Outcome and demographics
Acute kidney injury, n (%)	2028 (75.8)	870 (75.9)	< 0.001	0.995
Male, n (%)	1615 (60.4)	656 (57.2)	3.417	0.065
Age, years	69.0 (61.0, 78.0)	70.0 (61.0, 78.0)	-0.193^#	0.847
Admission characteristics
Unplanned emergency admission, n (%)	1767 (66.1)	783 (68.3)	1.725	0.189
Sepsis, n (%)	1524 (57.0)	629 (54.8)	1.515	0.218
SOFA score	5.0 (2.0, 7.0)	4.0 (2.0, 7.0)	0.180^#	0.857
Comorbidities
Heart failure, n (%)	585 (21.9)	258 (22.5)	0.177	0.674
Coronary artery disease, n (%)	608 (22.7)	246 (21.4)	0.770	0.380
Pneumonia, n (%)	720 (26.9)	303 (26.4)	0.106	0.745
Chronic obstructive pulmonary disease, n (%)	232 (8.7)	96 (8.4)	0.096	0.757
Chronic liver disease, n (%)	355 (13.3)	130 (11.3)	2.732	0.098
Chronic kidney disease, n (%)	481 (18.0)	227 (19.8)	1.728	0.189
Malignant tumor, n (%)	1562 (58.4)	656 (57.2)	0.492	0.483
Hypertension, n (%)	1081 (40.4)	464 (40.5)	< 0.001	0.988
Diabetes mellitus, n (%)	845 (31.6)	335 (29.2)	2.155	0.142
Shock, n (%)	703 (26.3)	293 (25.5)	0.231	0.631
Stroke, n (%)	207 (7.7)	95 (8.3)	0.323	0.570
Treatments and interventions
Mechanical ventilation, n (%)	1999 (74.8)	868 (75.7)	0.362	0.548
Glucocorticoid use, n (%)	692 (25.9)	269 (23.5)	2.510	0.113
Nephrotoxic drug use, n (%)	1841 (68.8)	804 (70.1)	0.587	0.444
Immunosuppressant use, n (%)	166 (6.2)	52 (4.5)	4.183	0.041
ACEI/ARB use, n (%)	323 (12.1)	121 (10.5)	1.830	0.176
Statin use, n (%)	269 (10.1)	106 (9.2)	0.607	0.436
Vasoactive drug use, n (%)	931 (34.8)	382 (33.3)	0.814	0.367
Laboratory variables
WBC count, ×10⁹/L	10.7 (7.3, 15.6)	10.8 (7.2, 15.6)	-0.280^#	0.779
Platelet count, ×10⁹/L	184.0 (122.0, 261.0)	193.0 (130.5, 284.0)	-2.607^#	0.009
RBC count, ×10¹²/L	3.3 (2.8, 3.8)	3.3 (2.8, 3.8)	-0.805^#	0.421
Hemoglobin, g/dL	9.7 (8.4, 11.0)	9.8 (8.4, 11.2)	-1.088^#	0.276
Serum sodium, mmol/L	138.0 (135.0, 141.0)	138.0 (135.0, 141.0)	-0.408^#	0.683
Serum potassium, mmol/L	4.2 (3.8, 4.6)	4.2 (3.7, 4.6)	0.592^#	0.554
Serum chloride, mmol/L	104.0 (100.0, 108.0)	104.0 (100.0, 108.0)	-0.791^#	0.429
Blood urea nitrogen, mg/dL	20.0 (14.0, 33.0)	21.0 (13.5, 35.0)	-0.271^#	0.787
Vital signs
Heart rate, beats/min	90.0 (78.0, 106.0)	92.0 (78.0, 107.0)	-0.674^#	0.500
Mean arterial pressure, mmHg	79.0 (68.0, 92.0)	79.0 (70.0, 91.0)	-1.071^#	0.284
Body temperature, °C	36.8 (36.5, 37.1)	36.7 (36.4, 37.0)	1.611^#	0.107
Respiratory rate, breaths/min	19.0 (15.0, 23.0)	19.0 (16.0, 23.0)	-1.677^#	0.094

^# represents the Z value.

Abbreviations: AKI, acute kidney injury; SOFA, Sequential Organ Failure Assessment; ACEI, angiotensin-converting enzyme inhibitor; ARB, angiotensin receptor blocker; WBC, white blood cell; RBC, red blood cell. The SOFA score was calculated within the first 24 hours after ICU admission.

Statistical methods

All statistical analyses were performed using SPSS software (version 26.0) and R software (version 4.4.2). Continuous variables with a normal distribution are presented as mean ± standard deviation and were compared using the t test, whereas non-normally distributed variables are expressed as median (interquartile range) and were compared using the Mann–Whitney U test. Categorical variables are presented as number (percentage) and were compared using the chi-square test or Fisher’s exact test, as appropriate. A two-sided P value < 0.05 was considered statistically significant.
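As a worked illustration of the categorical comparisons, the sketch below computes a Pearson chi-square statistic for a 2×2 table using only the standard library. It is not the SPSS or R procedure used in the study. The use of the uncorrected statistic (no continuity correction) is an assumption, and the function name is hypothetical. The counts are the immunosuppressant-use row of Table 1.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int):
    """Uncorrected Pearson chi-square for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Immunosuppressant use: 166 of 2,674 (training) versus 52 of 1,147 (validation).
chi2, p = chi_square_2x2(166, 2674 - 166, 52, 1147 - 52)
print(round(chi2, 3), round(p, 3))
```

Under these assumptions the result is close to the values reported for that row (χ² = 4.183, P = 0.041), which suggests, without confirming, that no continuity correction was applied.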

Model development, evaluation, and external validation

During model development, candidate clinical variables were preprocessed, including standardization of variable types, outlier detection, and handling of missing data. Continuous variables were winsorized at the 1st and 99th percentiles to attenuate extreme outliers, and all continuous predictors were standardized using z-score normalization prior to model training. Variables with ≤20% missingness were imputed using multiple imputation by chained equations (MICE) with five imputations, and estimates were pooled using Rubin’s rules. Variables with >20% missingness or limited clinical relevance were excluded.^14,15
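The winsorization and z-score steps can be sketched as follows. This is an illustrative outline rather than the study's R code: the helper names are hypothetical, the percentile uses simple linear interpolation, and estimating the bounds and scaling parameters on the training data only (then reusing them elsewhere) is an assumption that the text does not state explicitly.

```python
import statistics
from typing import Dict, List, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile, q in [0, 100]."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def fit_preprocessor(train: Sequence[float]) -> Dict[str, float]:
    """Estimate winsorization bounds and z-score parameters on training data."""
    low, high = percentile(train, 1), percentile(train, 99)
    clipped = [min(max(v, low), high) for v in train]
    return {
        "low": low,
        "high": high,
        "mean": statistics.mean(clipped),
        "sd": statistics.stdev(clipped),
    }


def apply_preprocessor(values: Sequence[float], params: Dict[str, float]) -> List[float]:
    """Clip to the stored bounds, then standardize with the stored mean and SD."""
    clipped = [min(max(v, params["low"]), params["high"]) for v in values]
    return [(v - params["mean"]) / params["sd"] for v in clipped]
```

Fitting once and applying the stored parameters to the testing and external cohorts keeps information from those cohorts out of model training; whether the original pipeline did this is not reported.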

The dataset was randomly split into training and testing sets in a 7:3 ratio using stratified sampling. Candidate predictors were first subjected to feature selection using least absolute shrinkage and selection operator (LASSO) regression, with the optimal penalty parameter determined by 10-fold cross-validation under the one-standard-error rule. Selected variables were then entered into a multivariable logistic regression model to estimate odds ratios (ORs) with 95% confidence intervals (CIs), while variance inflation factors (VIFs) were calculated to exclude significant multicollinearity (VIF < 5).
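The one-standard-error rule mentioned above can be illustrated with a short sketch. Given cross-validated error estimates along a penalty path, it picks the largest λ whose mean error is within one standard error of the minimum. The function name and the numbers are hypothetical and are not taken from the study's cross-validation curve.

```python
from typing import Sequence, Tuple


def one_se_lambda(
    lambdas: Sequence[float],
    cv_mean: Sequence[float],
    cv_se: Sequence[float],
) -> Tuple[float, float]:
    """Return (lambda_min, lambda_1se) from cross-validation summaries."""
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[best] + cv_se[best]
    # Larger lambda means stronger shrinkage, hence a sparser model.
    eligible = [lambdas[i] for i in range(len(lambdas)) if cv_mean[i] <= threshold]
    return lambdas[best], max(eligible)


# Hypothetical binomial deviance values for five penalty levels.
lambdas = [0.005, 0.010, 0.020, 0.040, 0.080]
cv_mean = [1.025, 1.012, 1.010, 1.018, 1.045]
cv_se = [0.010, 0.010, 0.010, 0.010, 0.010]
lambda_min, lambda_1se = one_se_lambda(lambdas, cv_mean, cv_se)
```

In this made-up example the minimum error occurs at λ = 0.020, while the rule selects λ = 0.040, trading a small increase in deviance for fewer retained predictors.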

Based on the selected features, six models were developed, including decision tree (DT), k-nearest neighbors (KNN), Light Gradient Boosting Machine (LightGBM), logistic regression (LR), random forest (RF), and extreme gradient boosting (XGBoost). All models were trained on the training set and tuned using 5-fold cross-validation, with the area under the receiver operating characteristic curve (AUC) as the primary metric, supplemented by accuracy and specificity. Hyperparameters were optimized within predefined ranges (e.g., XGBoost: depth 3–8, learning rate 0.0001–0.1; LightGBM: number of trees 100–500; KNN: neighbors 3–11).

Model performance was evaluated on the independent testing set, and the best-performing model was selected based on overall discrimination and stability. SHAP was subsequently applied to enhance model interpretability, and an online decision-support tool was developed to facilitate clinical application.

External validation was conducted using the multicenter eICU database (predominantly Western populations), the single-center TS-TAKI cohort, and the multicenter BAKIT database (predominantly Chinese populations) to assess generalizability across heterogeneous populations. Model performance was evaluated in terms of discrimination, calibration, and clinical utility using receiver operating characteristic (ROC) curves, calibration plots, and decision curve analysis (DCA).
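For readers less familiar with these metrics, the sketch below shows how the AUC and the Brier score can be computed from predicted probabilities. It is a plain-Python illustration with hypothetical function names, not the R routines used in the study, and the pairwise AUC shown here is only practical for small samples.

```python
from typing import Sequence


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Probability that a random case scores higher than a random non-case (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def constant_brier(prevalence: float) -> float:
    """Brier score of a model that predicts the prevalence for every patient."""
    return prevalence * (1 - prevalence)
```

As a point of reference, `constant_brier(870 / 1147)` is about 0.183 for the internal testing set, so Brier scores are most informative when read against this prevalence-only baseline.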

Results

Baseline feature comparison

A total of 3,821 critically ill patients with digestive system tumors were included in this study, of whom 2,898 (75.8%) developed AKI within 72 hours after ICU admission. The cohort was randomly divided into a training set (70%, n=2,674) and an internal validation set (30%, n=1,147).

Baseline characteristics of the two cohorts are presented in Table 1. Overall, demographic characteristics, admission characteristics, comorbidities, treatments and interventions, vital signs, and most laboratory measurements were comparable between the training and validation sets. Although small statistical differences were observed in platelet count and immunosuppressant use, the absolute differences were modest and were not considered clinically meaningful. These findings indicate that the random allocation resulted in two well-balanced cohorts suitable for subsequent model development and validation.

Prediction model variable selection

LASSO regression identified seven predictors with non-zero coefficients at the optimal penalty parameter (λ=0.026) (Figure 2). The missingness of these variables was reviewed during the data preprocessing stage, with rates of 0% for SOFA score, mechanical ventilation, age, sepsis, and vasoactive drug use, and 2.1% and 1.8% for WBC count and serum potassium, respectively. All variables had missingness below 5%, reflecting high data completeness attributable to the use of routinely collected variables within the first 24 hours of ICU admission.

Figure 2.

LASSO regression for feature selection. (A) Ten-fold cross-validation plot of binomial deviance as a function of log(λ). The vertical dashed lines indicate the optimal λ value based on the minimum criteria (left) and the 1-standard-error rule (right), respectively. (B) Coefficient profiles of the candidate variables as a function of log(λ). At the optimal λ determined by the 1-SE rule, 7 variables with nonzero coefficients were retained.

In the subsequent multivariable logistic regression model, all seven predictors (sepsis, mechanical ventilation, vasoactive drug use, SOFA score, age, WBC count, and serum potassium) remained independently associated with AKI occurring within 72 hours after ICU admission (all P < 0.05; Table 2). The VIFs for these predictors ranged from 1.007 to 1.338, indicating no significant multicollinearity.

Table 2.

Multivariable logistic regression analysis for predictors of AKI.

	B	SE	OR(95% CI)	Wald	P value
Sepsis	0.252	0.091	1.29 (1.08–1.54)	2.760	0.006
Mechanical ventilation	0.818	0.087	2.27 (1.91–2.69)	9.366	<0.001
Vasoactive drug use	0.266	0.108	1.30 (1.06–1.61)	2.456	0.014
SOFA score	0.159	0.017	1.17 (1.13–1.21)	9.465	<0.001
Age (years)	0.020	0.003	1.02 (1.01–1.03)	6.081	<0.001
WBC count (10⁹/L)	0.032	0.006	1.03 (1.02–1.04)	5.162	<0.001
Serum potassium (mmol/L)	0.241	0.058	1.27 (1.14–1.43)	4.152	<0.001

B, regression coefficient; SE, standard error; OR, odds ratio; CI, confidence interval; SOFA, Sequential Organ Failure Assessment; WBC, white blood cell.
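The relationship between the columns of Table 2 can be checked with a few lines of arithmetic: the odds ratio is exp(B), the 95% confidence limits are exp(B ± 1.96·SE), and the Wald statistic shown is consistent with B/SE. The sketch below uses the rounded B and SE values from the table; the function name is hypothetical, and small discrepancies from the printed values are expected because the table entries are rounded.

```python
import math
from typing import Tuple


def odds_ratio_with_ci(b: float, se: float, z_crit: float = 1.96) -> Tuple[float, float, float]:
    """Return (OR, lower 95% limit, upper 95% limit) from a coefficient and its SE."""
    return math.exp(b), math.exp(b - z_crit * se), math.exp(b + z_crit * se)


# Rounded coefficients and standard errors copied from Table 2.
table2 = {
    "Sepsis": (0.252, 0.091),
    "Mechanical ventilation": (0.818, 0.087),
    "Serum potassium (mmol/L)": (0.241, 0.058),
}

for name, (b, se) in table2.items():
    or_, low, high = odds_ratio_with_ci(b, se)
    print(f"{name}: OR {or_:.2f} ({low:.2f}-{high:.2f}), Wald z {b / se:.2f}")
```

Because age, WBC count, and serum potassium are continuous, their odds ratios apply per one-unit increase in the stated unit.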

Model performance comparison

As shown in Table 3 and Figure 3, all six machine learning models developed using the seven selected predictors demonstrated acceptable discrimination in the testing cohort (Figure 3(a)). The XGBoost model achieved the highest AUC (0.742; 95% CI 0.710–0.773), followed by logistic regression (0.740) and LightGBM (0.737), with only marginal differences among the three (ΔAUC ≤ 0.005). In contrast, the decision tree model showed relatively weaker performance (AUC=0.600). DeLong’s test (Supplementary Table 1) indicated that XGBoost significantly outperformed both the decision tree (ΔAUC=0.142, P < 0.001) and k-nearest neighbors models (ΔAUC=0.049, P < 0.001), whereas no statistically significant differences were observed compared with logistic regression (P=0.679), random forest (P=0.064), or LightGBM (P=0.117). In terms of overall predictive performance, XGBoost exhibited a well-balanced profile, with an accuracy of 0.657, sensitivity of 0.643, specificity of 0.704, and an F1 score of 0.740 (Table 3). The relatively high specificity indicates a strong ability to correctly identify non-AKI patients, while the moderate sensitivity ensures adequate detection of individuals at high risk of AKI. Furthermore, the high F1 score suggests that the model maintains good stability in handling class imbalance.

Table 3.

Performance comparison of prediction models in the test cohort.

Model	AUC (95% CI)	Brier score	Accuracy	Sensitivity	Specificity	F1 score
DT	0.600 (0.567–0.632)	0.1760	0.715	0.824	0.372	0.814
KNN	0.693 (0.659–0.726)	0.1737	0.617	0.589	0.708	0.700
LightGBM	0.737 (0.704–0.769)	0.1602	0.651	0.638	0.693	0.735
LR	0.740 (0.708–0.772)	0.1593	0.679	0.680	0.675	0.763
RF	0.727 (0.694–0.761)	0.1752	0.656	0.641	0.704	0.739
XGBoost	0.742 (0.710–0.773)	0.1586	0.657	0.643	0.704	0.740

DT, decision tree; KNN, k-nearest neighbor; LightGBM, light gradient boosting machine; LR, logistic regression; RF, random forest; XGBoost, extreme gradient boosting.
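The threshold-based metrics in Table 3 are linked through the confusion matrix. The sketch below reconstructs accuracy and F1 from the reported sensitivity and specificity together with the testing-set class counts (870 AKI, 277 non-AKI, from Table 1). The function name is hypothetical, and the reconstructed counts are fractional because the published rates are rounded.

```python
from typing import Tuple


def metrics_from_rates(
    sensitivity: float, specificity: float, n_pos: int, n_neg: int
) -> Tuple[float, float, float]:
    """Return (accuracy, precision, F1) implied by sensitivity, specificity, and class counts."""
    tp = sensitivity * n_pos
    fn = n_pos - tp
    tn = specificity * n_neg
    fp = n_neg - tn
    accuracy = (tp + tn) / (n_pos + n_neg)
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, precision, f1


# XGBoost row of Table 3 (sensitivity 0.643, specificity 0.704).
xgb_accuracy, xgb_precision, xgb_f1 = metrics_from_rates(0.643, 0.704, 870, 277)

# Reference rule that labels every patient as AKI (sensitivity 1, specificity 0).
all_accuracy, all_precision, all_f1 = metrics_from_rates(1.0, 0.0, 870, 277)
```

Under these inputs the XGBoost values come out near the reported accuracy (0.657) and F1 (0.740), while the label-everyone rule reaches an accuracy of about 0.76 and an F1 of about 0.86. One reading of this arithmetic is that, at this prevalence, accuracy and F1 are best interpreted alongside AUC and specificity rather than in isolation.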

Figure 3.

Performance comparison of six prediction models for AKI in critically ill patients with digestive system tumors. (A) ROC curves. (B) Calibration curves comparing predicted and observed AKI probabilities. (C) DCA in the training set. (D) DCA in the testing set. DT, decision tree; RF, random forest; XGBoost, extreme gradient boosting; LR, logistic regression; LightGBM, light gradient boosting machine; KNN, k-nearest neighbor.

Calibration analysis demonstrated good agreement between predicted and observed probabilities, with XGBoost achieving the lowest Brier score (0.1586), indicating optimal calibration performance (Figure 3(b)).

DCA further showed that XGBoost provided greater net benefit than the other models across a wide range of clinically relevant threshold probabilities, particularly between 10% and 70% (Figure 3(c) and (d)).

Considering its discrimination, calibration, and clinical net benefit, XGBoost was selected as the optimal model for subsequent explainability analysis and external validation.

Model interpretation

To enhance interpretability, SHAP was applied to the optimal XGBoost model. The global SHAP summary plot (Figure 4(a)) ranked predictors by mean absolute SHAP values, identifying SOFA score, mechanical ventilation, WBC count, age, serum potassium, sepsis, and vasoactive drug use as the most influential factors. Higher SOFA score, older age, elevated WBC count, increased serum potassium, and the presence of mechanical ventilation, sepsis, or vasoactive drug use were associated with increased predicted AKI risk, whereas lower values showed negative contributions. These patterns are consistent with established AKI pathophysiology, supporting the biological plausibility of the model and its potential clinical applicability.

Figure 4.

SHAP-based interpretation of the XGBoost model. (A) Global SHAP summary (beeswarm) plot showing the relative importance and direction of effects of the selected predictors, ranked by mean absolute SHAP values. (B) SHAP force plot illustrating the contribution of individual predictors to the predicted AKI risk for a representative patient.

At the individual level, a representative case illustrated by a SHAP force plot (Figure 4(b)) showed that mechanical ventilation and elevated WBC count were the main contributors to increased AKI risk, while lower SOFA score, younger age, absence of sepsis, lower serum potassium, and no vasoactive drug use exerted protective effects. This additive decomposition translates complex model outputs into intuitive, patient-specific explanations, thereby improving clinical interpretability.¹⁶
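The additive decomposition behind a force plot can be demonstrated on a toy example. The sketch below computes exact Shapley values by brute-force enumeration for a hypothetical three-feature risk function, using a single reference patient as the baseline. It is not the study's model and not the tree-specific algorithm typically used for XGBoost; the coefficients, feature values, and function names are invented for illustration.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    reference: Sequence[float],
) -> List[float]:
    """Exact Shapley values of model(x) relative to model(reference)."""
    n = len(x)

    def value(subset) -> float:
        # Features in `subset` take the patient's values; the rest take reference values.
        mixed = [x[i] if i in subset else reference[i] for i in range(n)]
        return model(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def toy_log_odds(features: Sequence[float]) -> float:
    """Hypothetical log-odds model: SOFA score, ventilation (0/1), WBC count."""
    sofa, ventilated, wbc = features
    return -1.0 + 0.16 * sofa + 0.8 * ventilated + 0.03 * wbc + 0.05 * sofa * ventilated


patient = [3.0, 1.0, 18.0]     # invented patient
reference = [5.0, 0.0, 10.7]   # invented reference profile
phi = shapley_values(toy_log_odds, patient, reference)
```

The contributions sum exactly to the difference between the patient's output and the reference output, `toy_log_odds(patient) - toy_log_odds(reference)`, which is the property that lets a force plot stack per-feature effects onto a base value.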

Model visualization and clinical implementation

To facilitate clinical translation, an online interactive prediction tool was developed using the R Shiny framework (https://digestive-aki-model.shinyapps.io/digestive-aki-app1/; Figure 5). Clinicians can input patient-specific variables to obtain real-time predicted probabilities of AKI, accompanied by SHAP summary and force plots for visualized and interpretable outputs, thereby supporting risk assessment and clinical decision-making.

Figure 5.

Online AKI risk prediction tool for critically ill patients with digestive system tumors.

External validation

External validation was performed using three independent cohorts (eICU, n=1,670; TS-TAKI, n=352; BAKIT, n=227), following the same inclusion and exclusion criteria as in the development phase. Standardized mean differences (SMDs) were used to assess baseline balance between the development and validation cohorts (Supplementary Table 2). Substantial heterogeneity was observed across cohorts, with significant differences in key predictors (all P < 0.001). The eICU cohort represented a lower-risk population with lower rates of mechanical ventilation and sepsis, lower SOFA scores, and a lower AKI incidence (17.7%). The TS-TAKI cohort was characterized by high mechanical ventilation but low sepsis rates and lower WBC levels, with an AKI incidence of 15.3%, reflecting differences in disease spectrum among Chinese oncology ICU patients and variability in sepsis diagnostic criteria across centers. The BAKIT cohort showed moderate similarity in some variables but remained imbalanced in sepsis and SOFA score, with an AKI incidence of 38.8%. Notably, substantial variation in sepsis prevalence and AKI incidence across cohorts (SMD ≥ 0.50; range 15.3%–75.8%) indicated marked case-mix differences.
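For a binary characteristic, the SMD can be computed directly from two proportions. The sketch below uses one common formulation (difference in proportions divided by the square root of the average of the two binomial variances); whether Supplementary Table 2 used exactly this formula is an assumption, and the function name is hypothetical. The inputs are the AKI incidences reported in the text.

```python
import math


def smd_binary(p1: float, p2: float) -> float:
    """Standardized mean difference between two proportions."""
    pooled_sd = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
    return (p1 - p2) / pooled_sd


development = 0.758  # AKI incidence in the MIMIC-IV development cohort
external = {"eICU": 0.177, "TS-TAKI": 0.153, "BAKIT": 0.388}

for cohort, incidence in external.items():
    print(cohort, round(smd_binary(development, incidence), 2))
```

Under this formulation, the outcome-incidence SMDs for all three external cohorts exceed the 0.50 level mentioned above, with the two lower-incidence cohorts well above 1.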

In terms of discrimination (Figure 6(a)), the XGBoost-AKI model achieved AUCs of 0.765 (95% CI 0.745–0.786) in the training set and 0.742 (0.710–0.773) in the internal testing set. In external validation, performance was highest in the TS-TAKI cohort (AUC=0.769; 0.701–0.833), followed by eICU (AUC=0.719; 0.684–0.752), and lowest in BAKIT (AUC=0.616; 0.539–0.686). These findings suggest that model discrimination was strongly influenced by differences in disease severity and outcome prevalence; when such differences are substantial, model performance may decline even if individual predictors show small SMDs.

Figure 6.

External validation of the XGBoost model across independent cohorts. (A) ROC curves with 95% confidence intervals for the training set, internal test set, and three external validation cohorts (eICU, TS-TAKI, and BAKIT). (B) Calibration curves comparing predicted and observed probabilities of acute kidney injury in the internal test set and external validation cohorts. (C) Decision curve analysis showing the net clinical benefit of the model compared with treat-all and treat-none strategies across a range of threshold probabilities in each dataset.

Calibration analysis (Figure 6(b)) showed poor agreement in all external cohorts (Hosmer–Lemeshow test, all P < 0.001, Supplementary Table 3). The model tended to overestimate AKI risk in the lower-incidence eICU and TS-TAKI cohorts. In the BAKIT cohort, calibration curves were unstable due to internal variability in predictor distributions, highlighting the central role of systematic shifts in baseline risk and predictor distributions in driving miscalibration.

Decision curve analysis (Figure 6(c)) revealed an overall reduction in the model’s net benefit across the external validation cohorts. While a relative advantage persisted at low threshold probabilities, this benefit progressively attenuated with increasing thresholds. This pattern was primarily attributable to heterogeneity in baseline AKI incidence among cohorts. In high-baseline-risk populations, the treat-all strategy itself already conferred a substantial net benefit, thereby limiting the incremental gain achievable by the model. In cohorts with low baseline risk (such as eICU and TS-TAKI), the model could still improve risk stratification, but its clinical net benefit was constrained by miscalibration.
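The dependence of the treat-all reference on baseline incidence follows from the standard net benefit formula, net benefit = TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability. The sketch below is a minimal illustration with hypothetical function names; the incidences are those reported for the development and eICU cohorts, and the threshold of 0.30 is an arbitrary example.

```python
def net_benefit(tp: float, fp: float, n: float, threshold: float) -> float:
    """Net benefit of a decision rule at a given threshold probability."""
    return tp / n - (fp / n) * threshold / (1 - threshold)


def treat_all_net_benefit(prevalence: float, threshold: float) -> float:
    """Net benefit when every patient is classified as high risk."""
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


threshold = 0.30  # arbitrary example threshold
high_incidence = treat_all_net_benefit(0.758, threshold)  # development cohort
low_incidence = treat_all_net_benefit(0.177, threshold)   # eICU cohort
```

With an incidence of 75.8%, treat-all already yields a net benefit of roughly 0.65 at this threshold, leaving little room for a model to add value; with an incidence of 17.7%, treat-all is negative, and its net benefit crosses zero where the threshold equals the incidence.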

Overall, the XGBoost-AKI model demonstrated favorable discrimination across multicenter external validation cohorts; however, its calibration performance and clinical net benefit varied significantly across heterogeneous populations. Baseline characteristic imbalance, particularly distributional shifts in AKI incidence and disease severity indicators, was the principal driver of these observed discrepancies. These findings underscore the importance of assessing population applicability using SMDs and accounting for heterogeneity in cross-center implementation. Model recalibration may be necessary to improve risk estimation and enhance its real-world clinical utility.
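One simple form of recalibration is an intercept shift on the log-odds scale that accounts for a different outcome incidence in the target cohort. The sketch below illustrates that idea only. It is not a procedure performed in this study; it assumes that only the baseline risk differs between cohorts, and the function names are hypothetical. A fuller approach would re-estimate the calibration intercept and slope on local data.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1 - p))


def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))


def shift_intercept(p: float, dev_incidence: float, local_incidence: float) -> float:
    """Shift a predicted probability by the log-odds difference in incidence."""
    offset = logit(local_incidence) - logit(dev_incidence)
    return sigmoid(logit(p) + offset)


# A prediction equal to the development incidence maps to the local incidence.
adjusted = shift_intercept(0.758, dev_incidence=0.758, local_incidence=0.177)
```

Because the shift is monotonic, it leaves the ranking of patients, and therefore the AUC, unchanged while moving the average predicted risk toward the local incidence.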

Discussion

AKI remains a frequent and serious complication in critically ill patients,¹⁷ with a particularly high burden among those with digestive system tumors. In this population, overlapping renal insults arising from malignancy-related metabolic stress, intensive care interventions, and early multi-organ dysfunction substantially increase the risk of AKI. In the present study, we developed and externally validated an interpretable XGBoost-based model for early AKI risk stratification in this high-risk population.

A major strength of this study lies in its explicit clinical positioning and methodological specificity. Rather than modeling tumor biology, tumor stage, or treatment-specific characteristics, our approach focused on assessing early renal vulnerability at ICU admission using routinely available clinical variables. This design addresses a critical unmet need in real-world critical care settings, where detailed oncologic information is often unavailable or incomplete during the early phase of ICU admission, whereas timely risk assessment remains essential for guiding immediate clinical decisions. Furthermore, most existing AKI prediction models have been developed in mixed ICU populations, with limited focus on specific subgroups such as patients with digestive system tumors. By concentrating on this high-risk population and conducting systematic external validation across multicenter and cross-population cohorts, this study advances beyond generic ICU risk models and provides a population-specific, interpretable, and clinically applicable tool for onco-nephrology risk stratification.

Among the six candidate algorithms, the XGBoost model achieved the most favorable balance of discrimination, calibration, and clinical utility. It is important to contextualize these performance metrics within the framework of class imbalance. The AKI incidence of 75.8% in the development cohort reflects the high baseline renal vulnerability of this specific population, which inherently constrains the utility of overall accuracy as a primary performance indicator. A naive classifier predicting AKI for all patients would achieve 75.8% accuracy yet would provide no clinically actionable discrimination and would misclassify all non-AKI patients. In contrast, our XGBoost model achieved an AUC of 0.742 with meaningful specificity (70.4%) and a high F1 score (0.740), indicating robust performance in distinguishing high-risk from low-risk individuals. When benchmarked against recent machine learning models for AKI prediction in ICU subpopulations, our discriminative performance appears modest relative to those developed in respiratory failure cohorts (AUC 0.902)¹⁸ or persistent sepsis-associated AKI (AUC 0.870–0.932),¹⁹ which often leverage more dynamic physiological parameters or disease-specific biomarkers. However, it aligns closely with performance reported in cardiac-specific ICU cohorts (AUC 0.765),²⁰ suggesting that early prediction in narrowly defined, high-risk populations presents distinct methodological challenges. Notably, unlike single-database studies with limited external validation,²¹ our model was validated across three independent cohorts spanning different countries and healthcare systems, including a prospectively collected Chinese oncology ICU cohort (TS-TAKI). This rigorous validation framework, combined with SHAP-based interpretability and deployment as an online clinical decision-support tool, prioritizes clinical feasibility and generalizability rather than maximal discriminative metrics alone. Moreover, the model demonstrated favorable calibration (Brier score 0.1586) and positive net benefit on decision curve analysis across clinically relevant threshold probabilities. These complementary metrics collectively indicate that the model offers genuine clinical utility beyond prevalence-driven guessing, supporting its value as an early warning decision-support tool.
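The comparison with prevalence-driven guessing can be made concrete with a short calculation. The sketch below is illustrative only and assumes that the reported Brier score refers to a sample whose AKI incidence is close to the 75.8% of the development cohort; the variable names are ours. It contrasts the naive "predict AKI for everyone" rule with a reference model that assigns every patient the prevalence as the predicted risk.

```python
prevalence = 0.758    # AKI incidence in the development cohort (from the text)
model_brier = 0.1586  # Brier score reported for the XGBoost model

# "Predict AKI for everyone": all AKI cases are correct, all non-AKI cases are wrong.
naive_accuracy = prevalence
naive_specificity = 0.0

# A model that gives every patient the prevalence as predicted risk
# has a Brier score of p * (1 - p).
reference_brier = prevalence * (1 - prevalence)

# Brier skill score: share of the reference error removed by the model.
# ASSUMPTION: the reported Brier score applies to a sample with ~75.8% incidence.
brier_skill = 1 - model_brier / reference_brier

print(f"naive accuracy    : {naive_accuracy:.3f}")
print(f"naive specificity : {naive_specificity:.3f}")
print(f"reference Brier   : {reference_brier:.4f}")
print(f"Brier skill score : {brier_skill:.3f}")
```

Under that assumption, the reference Brier score is roughly 0.183 and the skill score roughly 0.135, which is one way to express how far the reported calibration improves on a prevalence-only prediction.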

The seven selected predictors (SOFA score, mechanical ventilation, white blood cell count, age, serum potassium, sepsis, and vasoactive drug use) were all available within the first 24 hours of ICU admission, prioritizing feasibility and early usability for real-world ICU workflows. While these variables are not oncology-specific per se, they capture the cumulative effects of malignancy-related stress, disease severity, and early organ dysfunction on renal susceptibility in this population. SHAP analysis further clarified the clinical relevance of these predictors. The SOFA score emerged as the most influential contributor, underscoring the central role of multi-organ dysfunction in the development of AKI.²²,²³ Mechanical ventilation likely reflects the combined effects of hypoxemia,²⁴ positive pressure–related hemodynamic changes, and overall illness severity.²⁵ Elevated WBC counts highlight the contribution of systemic inflammation,²⁶ whereas increasing age reflects reduced renal reserve and increased susceptibility to injury.⁸ Elevated serum potassium levels may indicate early renal impairment or severe metabolic disturbances.²⁷ Sepsis and the use of vasoactive drugs showed clear positive contributions, consistent with infection-driven inflammatory cascades and circulatory instability.²⁸,²⁹ The direction and magnitude of these effects align closely with established pathophysiological mechanisms of AKI, supporting the biological plausibility of the model and reinforcing the clinical interpretability afforded by SHAP.³⁰

Model generalizability was systematically evaluated through external validation in three independent cohorts spanning different countries, healthcare systems, and population structures. Acceptable discrimination was maintained in the multicenter eICU (AUC=0.719) and TS-TAKI (AUC=0.769) cohorts, supporting the model’s applicability across diverse clinical settings. Notably, the model’s superior performance in the TS-TAKI cohort, which specifically comprised tumor-associated AKI patients, supports its specific applicability to oncology populations rather than functioning merely as a generic ICU AKI predictor. In contrast, model performance declined in the BAKIT cohort, with reduced discrimination (AUC=0.616) and net clinical benefit. This heterogeneity likely reflects differences in patient case mix, baseline AKI incidence, illness severity, ICU admission criteria, and clinical practice patterns across cohorts. The eICU cohort represented a lower-risk Western population with reduced rates of mechanical ventilation, sepsis, and lower SOFA scores, whereas the BAKIT cohort, a prospective multicenter Chinese cohort, showed moderate similarity in some variables but remained imbalanced in sepsis prevalence and SOFA score distribution. Substantial variation in baseline risk and outcome prevalence across cohorts (AKI incidence ranging from 15.3% to 75.8%) indicated marked case-mix differences. Importantly, such heterogeneity does not undermine the model’s overall value but rather highlights the necessity of local validation and recalibration prior to clinical implementation.³¹ Systematic shifts in baseline risk and predictor distributions can drive miscalibration even when individual predictors appear broadly similar, emphasizing that external validation should assess not only discrimination but also calibration across diverse settings.
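One simple form of local recalibration is an intercept update ("calibration-in-the-large"): a single offset is added on the logit scale so that the mean predicted risk matches the AKI incidence observed locally. The sketch below is a minimal illustration under stated assumptions, not the procedure used in this study; the function names and the small local cohort are hypothetical, and predicted probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def fit_intercept_shift(probs, outcomes, lo=-10.0, hi=10.0, tol=1e-9):
    """Bisection search for the logit offset that makes the mean predicted
    risk equal to the observed event rate in the local cohort."""
    target = sum(outcomes) / len(outcomes)
    logits = [logit(p) for p in probs]
    while hi - lo > tol:
        mid = (lo + hi) / 2
        mean_pred = sum(sigmoid(z + mid) for z in logits) / len(logits)
        if mean_pred > target:
            hi = mid  # predictions still too high: shift further down
        else:
            lo = mid
    return (lo + hi) / 2


def recalibrate(probs, shift):
    """Apply the fitted offset to the original predicted probabilities."""
    return [sigmoid(logit(p) + shift) for p in probs]


# Hypothetical local cohort: the original model overestimates risk here.
local_probs = [0.80, 0.70, 0.75, 0.60, 0.85, 0.65, 0.90, 0.55]
local_outcomes = [1, 0, 0, 0, 1, 0, 1, 0]

shift = fit_intercept_shift(local_probs, local_outcomes)
recalibrated = recalibrate(local_probs, shift)

print(f"logit shift         : {shift:.3f}")
print(f"mean risk before    : {sum(local_probs) / len(local_probs):.3f}")
print(f"mean risk after     : {sum(recalibrated) / len(recalibrated):.3f}")
print(f"observed event rate : {sum(local_outcomes) / len(local_outcomes):.3f}")
```

Because the offset is a monotone transformation, the ranking of patients and therefore the AUC are unchanged; an update of this kind can correct systematic over- or underestimation of risk but cannot restore discrimination that is lost in a new population.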

Beyond predictive accuracy, interpretability is essential for clinical translation. By integrating SHAP, complex machine learning outputs were transformed into intuitive, patient-level explanations. Global SHAP analyses clarified the relative importance of predictors, while individual-level visualizations illustrated how specific variables contributed to predicted risk. This approach facilitates clinician understanding, enhances trust in model outputs, and supports patient-level risk communication.³² To facilitate clinical use, the XGBoost-AKI model was deployed as an online prediction tool capable of generating real-time AKI risk estimates and SHAP-based explanations using routinely available ICU variables. This tool is intended as a clinical decision support aid rather than a replacement for clinical judgment. Its primary value lies in improving early risk recognition and enabling more targeted management—for example, prompting enhanced hemodynamic monitoring, guiding nephrotoxin avoidance, facilitating timely fluid stewardship, and triggering early nephrology consultation for high-risk patients. Such integration into ICU workflows could support a proactive rather than reactive kidney care paradigm, potentially mitigating the progression from early renal stress to overt AKI.
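The additive logic behind patient-level SHAP explanations can be illustrated with exact Shapley values for a toy model. The sketch below is not the XGBoost-AKI model: the risk function, its coefficients, the reference patient, and all names are hypothetical. It also uses a single reference patient as the baseline, whereas SHAP implementations typically average over a background dataset and, for tree models, often explain the log-odds output rather than the probability.

```python
from itertools import permutations
from math import exp

FEATURES = ["sofa", "ventilated", "sepsis"]

# Hypothetical reference patient against which contributions are measured.
BASELINE = {"sofa": 4, "ventilated": 0, "sepsis": 0}


def toy_risk(x):
    """Hypothetical logistic risk function with made-up coefficients."""
    z = (-1.0 + 0.25 * x["sofa"] + 0.8 * x["ventilated"] + 0.6 * x["sepsis"]
         + 0.3 * x["ventilated"] * x["sepsis"])
    return 1 / (1 + exp(-z))


def shapley_values(model, patient, baseline, features):
    """Exact Shapley values: the average marginal contribution of each
    feature over every order in which features can be switched from the
    baseline value to the patient's value."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for f in order:
            current[f] = patient[f]
            updated = model(current)
            totals[f] += updated - previous
            previous = updated
    return {f: total / len(orderings) for f, total in totals.items()}


patient = {"sofa": 9, "ventilated": 1, "sepsis": 1}
phi = shapley_values(toy_risk, patient, BASELINE, FEATURES)
base_risk = toy_risk(BASELINE)
patient_risk = toy_risk(patient)

print(f"baseline risk: {base_risk:.3f}")
for name, value in phi.items():
    print(f"  {name:<10} contributes {value:+.3f}")
print(f"patient risk : {patient_risk:.3f}")
```

The contributions sum exactly to the difference between the patient's predicted risk and the baseline risk, which is the property that lets an individual prediction be presented as a set of per-variable increments.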

Limitations

Several limitations should be acknowledged. First, this study was primarily based on retrospective databases, and selection bias and residual confounding cannot be fully excluded despite multicenter external validation. Prospective studies are therefore required to evaluate real-world impact on clinical decision-making and patient outcomes. Second, although tumor stage and treatment-specific variables were intentionally excluded to enhance feasibility during early ICU admission, their absence may limit the characterization of certain oncology-related risk dimensions. Future studies incorporating dynamic oncologic information may further refine model performance. Third, calibration performance declined in some external cohorts, particularly in the BAKIT dataset, indicating that population-specific recalibration may be necessary prior to clinical deployment. Fourth, AKI was defined using KDIGO creatinine and urine output criteria, which may fail to capture subclinical kidney injury detectable by emerging biomarkers. Finally, the model utilizes data from the first 24 hours of ICU admission to predict AKI within 72 hours; while this window aligns with the need for early intervention, it does not capture dynamic physiological changes occurring beyond the initial 24-hour period. Furthermore, for patients with hyperacute kidney deterioration occurring shortly after ICU admission, the boundary between true early prediction and contemporaneous recognition of incipient AKI may narrow; this trade-off between timeliness and lead-time is inherent to early-warning models in critical care.

Conclusion

In summary, we developed and externally validated an interpretable XGBoost-based model for early AKI risk stratification in critically ill patients with digestive system tumors. By leveraging routinely available ICU data and explainable artificial intelligence techniques, the model achieves a balance among predictive performance, transparency, and clinical feasibility. Although model performance varied across populations, these findings support the potential utility of this approach for early risk identification and targeted prevention. They also emphasize the necessity of local validation and recalibration prior to widespread clinical adoption.

Supplemental material

Supplemental material for Multicenter validation of an explainable machine learning model for early prediction of acute kidney injury in critically ill patients with digestive system tumors by DunZhu Guo, Jing Bai, Jian Zhang, Xiuming Xi, YuJuan Chen, ZhiPeng Luo, Kai Feng, JiangWei Zeng, MengXin Zhang, WeiQin Dong, XinXin Xu, Rui Wang, Yu Zhang in DIGITAL HEALTH

Footnotes

Acknowledgements

We extend our special thanks to our colleagues for their dedicated efforts in data acquisition and for providing valuable suggestions on this manuscript. We also gratefully acknowledge the assistance of artificial intelligence tools, which were used solely for language polishing and partial code optimization.

ORCID iD

DunZhu Guo

Ethical considerations

This study was conducted in accordance with the Declaration of Helsinki and relevant institutional and national ethical standards. The development cohort was obtained from the Medical Information Mart for Intensive Care IV (MIMIC-IV) database, which contains fully de-identified data. Database access was granted after completion of the required ethics training (Certification No. 59209829), and informed consent was waived. External validation was conducted using the eICU Collaborative Research Database (eICU-CRD), the Beijing Acute Kidney Injury Clinical Trial (BAKIT) database, and the Tangshan Cohort Database for Tumor-Associated Acute Kidney Injury (TS-TAKI Cohort). The eICU-CRD data are fully de-identified, and informed consent was waived. Use of the BAKIT database was approved by the Ethics Committee of Fuxing Hospital, Capital Medical University (Approval No. 2010FXHEC–KF026). The TS-TAKI Cohort was approved by the institutional ethics committee (Approval No. RMYY-LLKS-2024039) and registered with the Chinese Clinical Trial Registry (ChiCTR2500103958). All data were anonymized, and no identifiable personal information was accessed.

Author contributions

G.D. and B.J. conceived and designed the study, performed data extraction, developed the prediction models, and drafted the manuscript.

Z.J., F.K., Z.J.W., Z.M.X., D.W.Q., X.X.X., and W.R. were responsible for data cleaning, statistical analysis, and interpretation of the results.

X.X.M., C.Y.J., and L.Z.P. contributed to external validation data acquisition and provided methodological guidance.

Z.Y. conceived and supervised the study, reviewed and revised the manuscript, secured funding, and served as the corresponding author.

All authors read and approved the final manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Tangshan City Municipal-level Science and Technology Project (2025, Project No. 25120204C), the Hebei Province “333 Talent Project” (2024, Project No. C2024071), and the Hebei Provincial Medical Scientific Research Project Plan (2023, Project No. 20231803).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

Data are available from the corresponding author on reasonable request.

Supplemental material

Supplemental material for this article is available online.

References

1. Sung, Ferlay, Siegel, et al. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin 2021; 71(3): 209–249. https://doi.org/10.3322/caac.21660

2. Parab, Majety, Ranganathan, et al. Incidence of acute kidney injury and its associated risk factors in patients undergoing elective oesophagectomy surgeries at a tertiary care cancer institute - A pilot prospective observational study. Indian J Anaesth 2024; 68(6): 572–578. https://doi.org/10.4103/ija.ija_98_24

3. Tahir, Nawaz, Iqbal, et al. Incidence and Clinical Outcomes of Postoperative Acute Kidney Injury in Relation to Etiological Factors and Surgical Procedures. Cureus 2025; 17(7): e89155. https://doi.org/10.7759/cureus.89155

4. Can, Rong, Lixia. Incidence and risk factors of acute kidney injury in patients with malignant tumors: a systematic review and meta-analysis. BMC Cancer 2023; 23(1): 1123. https://doi.org/10.1186/s12885-023-11561-3

5. Kitchlu, McArthur, Amir, et al. Acute Kidney Injury in Patients Receiving Systemic Treatment for Cancer: A Population-Based Cohort Study. J Natl Cancer Inst 2019; 111(7): 727–736. https://doi.org/10.1093/jnci/djy167

6. Jin, Wang, Shen, et al. Acute kidney injury in cancer patients: A nationwide survey in China. Sci Rep 2019; 9(1): 3540. https://doi.org/10.1038/s41598-019-39735-9

7. Khwaja. KDIGO clinical practice guidelines for acute kidney injury. Nephron Clin Pract 2012; 120(4): c179–c184. https://doi.org/10.1159/000339789

8. Zeng, Zhang, et al. Analysis of prognostic risk factors in critically ill elderly patients with sepsis-associated acute kidney injury. BMC Nephrol 2025; 26(1): 656. https://doi.org/10.1186/s12882-025-04630-1

9. Jiang, Liu, Cheng, et al. Interpretable machine learning models for early prediction of acute kidney injury after cardiac surgery. BMC Nephrol 2023; 24(1): 326. https://doi.org/10.1186/s12882-023-03324-w

10. Henry, Kornfield, Sridharan, et al. Human-machine teaming is key to AI adoption: clinicians' experiences with a deployed machine learning system. NPJ Digit Med 2022; 5(1): 97. https://doi.org/10.1038/s41746-022-00597-7

11. Wang, Zhu, et al. Interpretable machine learning model for predicting acute kidney injury in critically ill patients. BMC Med Inform Decis Mak 2024; 24(1): 148. https://doi.org/10.1186/s12911-024-02537-9

12. Cai, Xiao, Zou, et al. Predicting acute kidney injury risk in acute myocardial infarction patients: An artificial intelligence model using medical information mart for intensive care databases. Front Cardiovasc Med 2022; 9: 964894. https://doi.org/10.3389/fcvm.2022.964894

13. Johnson AEW, Bulgarelli, Shen, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data 2023; 10(1): 1. https://doi.org/10.1038/s41597-022-01899-x

14. Yue, Huang, et al. Machine learning for the prediction of acute kidney injury in patients with sepsis. J Transl Med 2022; 20(1): 215. https://doi.org/10.1186/s12967-022-03364-0

15. Lin, Wang, et al. Predictive model of acute kidney injury in critically ill patients with acute pancreatitis: a machine learning approach using the MIMIC-IV database. Ren Fail 2024; 46(1): 2303395. https://doi.org/10.1080/0886022X.2024.2303395

16. Alkhanbouli, Matar Abdulla Almadhaani, Alhosani, et al. The role of explainable artificial intelligence in disease prediction: a systematic literature review and future research directions. BMC Med Inform Decis Mak 2025; 25(1): 110. https://doi.org/10.1186/s12911-025-02944-6

17. Hoste, Bagshaw, Bellomo, et al. Epidemiology of acute kidney injury in critically ill patients: the multinational AKI-EPI study. Intensive Care Med 2015; 41(8): 1411–1423. https://doi.org/10.1007/s00134-015-3934-7

18. Qin, Tan, et al. Development of a machine learning-based prediction model for acute kidney injury associated with respiratory failure in the intensive care unit. Clin Exp Med 2025; 25(1): 326. https://doi.org/10.1007/s10238-025-01873-y

19. Jiang, Zhang, Weng, et al. Explainable Machine Learning Model for Predicting Persistent Sepsis-Associated Acute Kidney Injury: Development and Validation Study. J Med Internet Res 2025; 27: e62932. https://doi.org/10.2196/62932

20. Xiao, et al. Machine Learning for the Prediction of Acute Kidney Injury in Critically Ill Patients With Coronary Heart Disease: Algorithm Development and Validation. JMIR Med Inform 2025; 13: e72349. https://doi.org/10.2196/72349

21. Wang, et al. Construction of a machine learning-based interpretable prediction model for acute kidney injury in hospitalized patients. Sci Rep 2025; 15(1): 9313. https://doi.org/10.1038/s41598-025-90459-5

22. Hua, Ding, Jing, et al. Association between SOFA score and risk of acute kidney injury in patients with diabetic ketoacidosis: an analysis of the MIMIC-IV database. Front Endocrinol (Lausanne) 2024; 15: 1462330. https://doi.org/10.3389/fendo.2024.1462330

23. Gao, Wang, Jiang, et al. [A multicenter clinical study of critically ill patients with sepsis complicated with acute kidney injury in Beijing: incidence, clinical characteristics and outcomes]. Zhonghua Wei Zhong Bing Ji Jiu Yi Xue 2024; 36(6): 567–573. https://doi.org/10.3760/cma.j.cn121430-20240210-00124

24. Lundberg, Nair, Vavilala, et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat Biomed Eng 2018; 2(10): 749–760. https://doi.org/10.1038/s41551-018-0304-0

25. Huang, Teng, et al. Internal and external validation of machine learning-assisted prediction models for mechanical ventilation-associated severe acute kidney injury. Aust Crit Care 2023; 36(4): 604–612. https://doi.org/10.1016/j.aucc.2022.06.001

26. Carollo, Benfante, Sorce, et al. Predictive Biomarkers of Acute Kidney Injury in COVID-19: Distinct Inflammatory Pathways in Patients with and Without Pre-Existing Chronic Kidney Disease. Life (Basel) 2025; 15(5): 720. https://doi.org/10.3390/life15050720

27. Wang, Yang, Perry Wilson, et al. Association between hyperkalemia and outcomes in hospitalized patients with acute kidney injury. Intensive Crit Care Nurs 2025; 90: 104118. https://doi.org/10.1016/j.iccn.2025.104118

28. Liu, Xie, et al. Rates, predictors, and mortality of sepsis-associated acute kidney injury: a systematic review and meta-analysis. BMC Nephrol 2020; 21(1): 318. https://doi.org/10.1186/s12882-020-01974-8

29. Nian, Tao, Zhang. Review of research progress in sepsis-associated acute kidney injury. Front Mol Biosci 2025; 12: 1603392. https://doi.org/10.3389/fmolb.2025.1603392

30. Zhang, Wang, et al. Prediction of acute kidney injury in intensive care unit patients based on interpretable machine learning. Digit Health 2025; 11: 20552076241311173. https://doi.org/10.1177/20552076241311173

31. HAAH, Shah, Kant IMJ, et al. Perspectives on validation of clinical predictive algorithms. NPJ Digit Med 2023; 6(1): 86. https://doi.org/10.1038/s41746-023-00832-9

32. Bergomi, Nicora, Orlowska, et al. Which explanations do clinicians prefer? A comparative evaluation of XAI understandability and actionability in predicting the need for hospitalization. BMC Med Inform Decis Mak 2025; 25(1): 269. https://doi.org/10.1186/s12911-025-03045-0
