Abstract
Objective
Acute respiratory distress syndrome (ARDS) drives early mortality in severe acute pancreatitis (AP). Since conventional tools often fail to capture complex physiological interactions, we aimed to develop and validate an interpretable machine learning (ML) model for early ARDS prediction and deploy it as a web-based calculator.
Methods
This multicenter retrospective study utilized data from the MIMIC-IV database for model development and internal validation, and an independent cohort from Changshu Hospital for external validation. Optimal predictors were identified through a hybrid feature selection strategy combining LASSO regression and the Boruta algorithm. Seven ML algorithms were constructed, including random forest (RF), extreme gradient boosting, support vector machine, logistic regression, light gradient boosting machine, k-nearest neighbors, and decision trees. Model performance was evaluated by discrimination (AUC), calibration curves, and clinical utility (DCA). Model interpretability was assessed using SHapley Additive exPlanations (SHAP) and partial dependence plots (PDP).
Results
A total of 905 patients from the MIMIC-IV cohort (25.0% ARDS incidence) and 126 from the external cohort (20.6% incidence) were included. Nine independent predictors were identified: body mass index (BMI), respiratory rate, temperature, SOFA score, white blood cell count, PO2, PCO2, mechanical ventilation, and antibiotic use. The RF model demonstrated best performance (internal AUC 0.851) and maintained robust generalization in the external cohort (AUC 0.823). Calibration curves indicated good agreement between predicted and observed probabilities, and DCA showed superior net benefit across clinically relevant thresholds. SHAP analysis identified ventilation, SOFA score, BMI, PO2, and respiratory rate as the most influential predictors.
Conclusion
A high-performing, interpretable RF model was developed for early ARDS prediction in critically ill AP patients. The model effectively captured complex physiological interactions and demonstrated robustness across diverse populations. By integrating this algorithmic framework into a user-friendly web calculator, the tool supports personalized risk stratification and timely clinical decision-making.
Introduction
As a widespread gastrointestinal condition, acute pancreatitis (AP) has seen an upward trend in incidence, leading to substantial medical and economic costs.1,2 Although most individuals experience only localized and self-resolving inflammation, approximately 20% develop severe manifestations characterized by persistent systemic inflammatory response syndrome (SIRS) and multi-organ functional decline.3,4 Among these systemic complications, acute respiratory distress syndrome (ARDS) is one of the most frequent and lethal manifestations of pulmonary dysfunction, affecting up to 30% of patients. 5 It is a major contributor to early death, responsible for roughly 60% of deaths in patients with severe acute pancreatitis (SAP) during the first week. 6 The onset of ARDS is often rapid, significantly prolonging intensive care unit (ICU) stays and increasing mortality risk. 7 These observations underscore the urgent need for early and accurate prediction of ARDS in AP patients to guide timely clinical interventions and reduce morbidity and mortality in this high-risk population.
Various predictive tools have been evaluated to facilitate early risk identification. Single clinical and laboratory parameters, such as white blood cell (WBC) counts, platelets, lactate dehydrogenase, creatinine, albumin (ALB), and triglycerides, were investigated for their prognostic value.8,9 However, these markers are often non-specific and susceptible to confounding by concurrent infections or other inflammatory conditions, limiting their individual predictive accuracy. Similarly, traditional clinical scoring systems were employed, including organ dysfunction metrics such as the Sequential Organ Failure Assessment (SOFA) and qSOFA, physiological assessments like the Acute Physiology and Chronic Health Evaluation II (APACHE-II), and AP-specific tools such as the Ranson criteria and the Bedside Index for Severity in Acute Pancreatitis (BISAP).5,9,10 Yet, these tools are often limited by delayed assessment windows, operational complexity, and suboptimal predictive accuracy or specificity for pulmonary complications.8,11 Additionally, conventional statistical models based on logistic regression (LR) and Cox proportional hazards models have been developed to integrate multiple risk factors for ARDS in AP contexts.8,9 Despite offering improvements over single indicators, these linear models frequently lack the necessary sophistication to decipher the elaborate, high-dimensional, and non-linear dynamics that are fundamentally inherent in the pathophysiology of critical illness. This limitation potentially leads to suboptimal predictive performance. 12
Machine learning (ML) has become an influential methodology in acute care medicine. By analyzing large sets of clinical information, ML algorithms can identify intricate, non-linear links between health markers and patient outcomes that traditional models often miss.13,14 Although several studies have explored ML for predicting AP complications, significant gaps hinder their translation into clinical practice. Many existing models are derived from single-center datasets, which limits their generalizability across diverse populations. 15 Moreover, the “black box” nature of complex algorithms often obscures how predictions are made, reducing clinician trust. 16 Crucially, few studies have effectively bridged the gap between theoretical algorithms and practical bedside application.
The main goal of this study was to use ML to identify clinical factors that predict the occurrence of ARDS in AP patients. By constructing a prediction model from the MIMIC-IV database, validating it in an external cohort, and deploying it as an online web calculator, we aimed to enable accurate early prediction that can guide timely interventions and improve outcomes for these vulnerable patients.
Methods
Data source
The training and internal validation datasets were extracted from the Medical Information Mart for Intensive Care IV version 3.0 (MIMIC-IV v3.0) database, which contains extensive and well-documented data from 65366 patients admitted to the ICU at Beth Israel Deaconess Medical Center (Boston, MA, USA) between 2008 and 2022. 17 Author YZ (Record ID: 60227322) was granted access to the database after successfully completing the required Collaborative Institutional Training Initiative (CITI) examination. The establishment of the MIMIC-IV database received ethical clearance from the Institutional Review Boards of both the Massachusetts Institute of Technology and Beth Israel Deaconess Medical Center. Due to the retrospective and anonymized nature of the data, the requirement for informed consent was waived. In addition, patients with acute pancreatitis admitted to Changshu Hospital affiliated to Soochow University between January 2019 and March 2024 were included as an external validation cohort. The study protocol was reviewed and approved by the Ethics Committee of Changshu Hospital affiliated to Soochow University (Approval No. L2024026). Due to the retrospective design and the use of routine clinical data, the ethics committee waived the need for written informed consent. All study procedures adhered strictly to the ethical principles outlined in the Declaration of Helsinki.
Participants
In the MIMIC-IV database, adult patients (aged ≥ 18 years) with AP were identified using the International Classification of Diseases, 9th Revision (ICD-9, code 577.0) and 10th Revision (ICD-10, code K85%). For patients with multiple ICU admissions, only data from the first admission were included. For the external validation cohort, the diagnosis of AP was established in strict accordance with the Guidelines for diagnosis and treatment of acute pancreatitis in China (2021). 18 The primary outcome was the development of ARDS during hospitalization, which was defined according to the Berlin Definition. 19 To ensure consistency between the cohorts, the following exclusion criteria were applied: (1) age < 18 years; (2) diagnosis of ARDS prior to or within the first 24 hours of ICU admission; (3) length of ICU stay < 24 hours. This temporal exclusion prevents data leakage and ensures the model functions as an early prognostic warning system rather than a concurrent diagnostic tool, preserving a genuine preemptive window for clinical intervention.
The eligible patients from the MIMIC-IV database were randomly partitioned into a training set and an internal validation set at a ratio of 7:3. To ensure the comparability of these subgroups, a baseline characteristics analysis was conducted to confirm that the random split did not introduce distribution bias between the training and internal validation sets. The training set was utilized for variable selection and model construction, while both the internal and external validation sets were employed for model evaluation. The overall workflow of this study is illustrated in Figure 1.
Figure 1. Flowchart of patient enrollment, model construction, and web-based calculator deployment. MIMIC: Medical Information Mart for Intensive Care; ICU: intensive care unit; RF, random forest; XGBoost, extreme gradient boosting; LightGBM, light gradient boosting machine; DT, decision tree; LR, logistic regression; SVM, support vector machine; KNN, k-nearest neighbors.
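The 7:3 random partition described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name, the seed, and the placeholder patient IDs are assumptions rather than details of the study, and the resulting set sizes need not match those actually used.

```python
import random

def split_train_validation(patient_ids, seed=2024):
    """Shuffle IDs and cut them 7:3 into training and internal validation sets."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # seed is an illustrative assumption
    n_train = len(ids) * 7 // 10      # integer arithmetic avoids float rounding
    return ids[:n_train], ids[n_train:]

# 905 is the MIMIC-IV cohort size reported in the abstract; the IDs are placeholders.
train_ids, valid_ids = split_train_validation(range(905))
print(len(train_ids), len(valid_ids))
```

A purely random split like this does not guarantee equal ARDS incidence in the two subsets, which is why a baseline comparison between them is still needed.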
Data extraction and processing
Data were extracted from the MIMIC-IV database using Structured Query Language (SQL) via pgAdmin 4 software (version 6.21). All clinical and laboratory variables were collected within the first 24 hours of ICU admission. For variables with multiple recordings during this period, only the initial measurement was included. The extracted variables were categorized as follows: (1) Demographic information: age (years), gender, race (Asian, Black, White, Hispanic, other), height (cm), weight (kg), body mass index (BMI, kg/m2), and insurance (Medicare or other); (2) Vital signs: heart rate (beats/min), respiratory rate (breaths/min), systolic blood pressure (SBP, mmHg), diastolic blood pressure (DBP, mmHg), temperature (°C), and oxygen saturation (SpO2, %); (3) Severity scores: Glasgow Coma Scale (GCS), SOFA score, and simplified acute physiology score II (SAPS II); (4) Laboratory data: white blood cell count (WBC, K/μL), hemoglobin (g/dL), platelet count (PLT, K/μL), C-reactive protein (CRP, mg/dL), alanine aminotransferase (ALT, U/L), aspartate aminotransferase (AST, U/L), albumin (ALB, g/dL), creatinine (mg/dL), sodium (mEq/L), potassium (mEq/L), chloride (mEq/L), calcium (mg/dL), blood urea nitrogen (BUN, mg/dL), hematocrit (%), glucose (mg/dL), international normalized ratio (INR), prothrombin time (PT, seconds), partial thromboplastin time (PTT, seconds), pH, partial pressure of oxygen (PO2, mmHg), partial pressure of carbon dioxide (PCO2, mmHg), base excess (BE, mEq/L), lactic acid (Lac, mEq/L), and bicarbonate (HCO3-, mEq/L); (5) Comorbidities: hypertension, diabetes, myocardial infarction, chronic obstructive pulmonary disease (COPD), asthma, acute kidney injury (AKI), sepsis, venous thromboembolism (VTE), and malignant cancer; and (6) Therapeutic interventions: central venous catheterization (CVC), cardiopulmonary bypass (CPB), continuous renal replacement therapy (CRRT), ventilation, heparin, aspirin, antibiotic, and vasopressor use.
The same set of variables was extracted from the external validation cohort at Changshu Hospital affiliated to Soochow University.
Regarding missing data, variables with more than 30% missing values were excluded to minimize potential bias. For variables with less than 5% missing data, mean substitution was used. For variables with 5% to 30% missing data, multiple imputation was performed. Outliers were identified and treated as missing values, which were then handled using the imputation methods described above. Following data completion, multicollinearity among the candidate variables was assessed using Spearman’s rank correlation coefficients. A correlation matrix was constructed and visualized via a heatmap (Figure 2); for each pair of variables with a correlation coefficient exceeding 0.8, one variable was excluded to minimize redundancy.
Figure 2. Spearman’s rank correlation heatmap of candidate clinical variables. BMI: body mass index; HR: heart rate; RR: respiratory rate; SBP: systolic blood pressure; DBP: diastolic blood pressure; SpO2: oxygen saturation; GCS: Glasgow Coma Scale; SOFA: Sequential Organ Failure Assessment; SAPS II: simplified acute physiology score II; WBC: white blood cell count; PLT: platelet; ALT: alanine aminotransferase; BUN: blood urea nitrogen; INR: international normalized ratio; PT: prothrombin time; PTT: partial thromboplastin time; pH: potential of hydrogen; PO2: partial pressure of oxygen; PCO2: partial pressure of carbon dioxide; BE: base excess; Lac: lactic acid; HCO3-: bicarbonate; COPD: chronic obstructive pulmonary disease; ARF: acute respiratory failure; AKI: acute kidney injury; VTE: venous thromboembolism; CVC: central venous catheterization; CPB: cardiopulmonary bypass; CRRT: continuous renal replacement therapy.
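The missing-data tiers and the Spearman redundancy check described above can be illustrated with a short standard-library sketch. The function names and the hemoglobin/hematocrit values are hypothetical, chosen only to show the mechanics; the study's own pipeline (including its multiple imputation and outlier rules) is not reproduced here.

```python
from math import sqrt

def imputation_strategy(missing_fraction):
    """Map a variable's missing fraction to the handling rule stated in the text."""
    if missing_fraction > 0.30:
        return "exclude"
    if missing_fraction < 0.05:
        return "mean substitution"
    return "multiple imputation"

def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)

# Hypothetical values for two closely related laboratory variables.
hemoglobin = [8.1, 9.5, 10.2, 12.0, 13.4, 14.1]
hematocrit = [25.0, 28.9, 30.1, 36.5, 39.8, 43.0]
rho = spearman(hemoglobin, hematocrit)
redundant = rho > 0.8  # one variable of such a pair would be dropped
```

Because Spearman's coefficient works on ranks, it flags any strongly monotonic pair as redundant, not only linearly related ones.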
Feature selection
To identify robust predictors while minimizing overfitting, a hybrid feature selection strategy was implemented. 20 The least absolute shrinkage and selection operator (LASSO) regression was initially applied to address multicollinearity and eliminate redundant variables by shrinking their coefficients to zero. The optimal regularization parameter (λ) was determined through 10-fold cross-validation based on minimum binomial deviance. Subsequently, to capture potential non-linear associations, the Boruta algorithm was employed. As a wrapper method based on the random forest (RF) classifier, Boruta confirms relevant features by comparing their importance Z-scores against those of randomized “shadow attributes.” Ultimately, to ensure model parsimony and clinical interpretability, a consensus strategy was adopted, whereby only variables independently identified by both LASSO and Boruta were retained for final model development.
Model development and evaluation
Based on the selected feature subset, seven ML algorithms were constructed to predict the onset of ARDS: extreme gradient boosting (XGBoost), RF, support vector machine (SVM), logistic regression (LR), light gradient boosting machine (LightGBM), k-nearest neighbors (KNN), and decision tree (DT). To optimize model performance and mitigate overfitting, hyperparameter tuning was conducted using a grid search strategy combined with five-fold cross-validation on the training dataset. The discriminatory power of each model was evaluated in the internal validation set and subsequently tested in the independent external validation cohort. Performance was quantified using the area under the receiver operating characteristic curve (AUC), alongside sensitivity, specificity, accuracy, precision, F1 score, and the Brier score. The F1 score was calculated as 2 × precision × recall/(precision + recall). The Brier score is the mean squared difference between the predicted probabilities and the observed outcomes. It reflects both discrimination and calibration, with lower scores indicating higher accuracy; a score of 0.25 or greater is generally considered non-informative, since 0.25 is the score obtained by assigning a probability of 0.5 to every patient. Furthermore, the clinical utility and reliability of the model were assessed using calibration curves to examine the agreement between predicted and observed probabilities, and decision curve analysis (DCA) to determine the net clinical benefit across a range of threshold probabilities. To address potential methodological circularity associated with mechanical ventilation, an additional sensitivity analysis was conducted. We performed an ablation study by removing the ventilation feature and retraining the optimal RF model using the remaining eight variables. The discriminative performance of this reduced model was evaluated using AUC across all datasets to confirm the independent predictive value of the physiological markers.
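To make the evaluation metrics concrete, the sketch below computes the F1 score, the Brier score, and the AUC for a small set of hypothetical labels and predicted probabilities. The data, the 0.5 classification cut-off, and the function names are illustrative assumptions, not values from the study.

```python
def f1_score(y_true, y_pred):
    """F1 = 2 x precision x recall / (precision + recall)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)

def auc(y_true, y_prob):
    """Chance that a random positive case is scored above a random negative one."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes (1 = ARDS) and predicted probabilities.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.7, 0.05]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # 0.5 cut-off is an assumption

metrics = {
    "f1": f1_score(y_true, y_pred),
    "brier": brier_score(y_true, y_prob),
    "auc": auc(y_true, y_prob),
}
```

Predicting 0.5 for every patient gives a Brier score of exactly 0.25, which is why that value serves as the benchmark for a non-informative model.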
Model interpretability and deployment strategy
To address the inherent “black-box” nature of machine learning and enhance clinical transparency, the SHapley Additive exPlanations (SHAP) method was employed to elucidate the model’s decision-making process. A multi-dimensional visualization strategy was adopted to interpret predictions from both global and local perspectives. Global feature importance was primarily quantified using the SHAP summary bar plot, while the bee swarm plot was concurrently utilized to visualize the directional impact and distribution of each feature’s contribution to ARDS risk. Partial Dependence Plots (PDP) were computed to visualize the marginal effects of key continuous variables, thereby capturing potential non-linear associations and clinically relevant threshold effects. For individual-level interpretation, SHAP force plots were constructed to decompose the specific risk factors driving a single prediction, demonstrating how distinct features shift the model’s output from the baseline. Ultimately, to facilitate the translation of this complex algorithm into bedside practice, the optimal model was deployed as a user-friendly, interactive web-based calculator using the Streamlit framework (https://streamlit.io).
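The additive decomposition that underlies SHAP force plots can be illustrated by computing exact Shapley values for a toy risk function. This is a simplified sketch: it uses a single reference patient as the baseline, whereas SHAP averages over a background dataset, and `toy_risk`, its coefficients, and both patients are hypothetical stand-ins rather than the fitted RF model.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley attributions of predict(x) relative to predict(baseline)."""
    names = list(x)
    n = len(names)

    def value(subset):
        # Features in `subset` take the patient's value; the rest stay at baseline.
        return predict({k: (x[k] if k in subset else baseline[k]) for k in names})

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = value(set(subset) | {name})
                total += weight * (with_feature - value(set(subset)))
        phi[name] = total
    return phi

def toy_risk(f):
    """Hypothetical risk function with an interaction term (not the study model)."""
    risk = 0.05 + 0.02 * f["sofa"] + 0.15 * f["ventilation"] + 0.004 * f["bmi"]
    return risk + 0.01 * f["sofa"] * f["ventilation"]

patient = {"sofa": 11, "ventilation": 1, "bmi": 31.0}
reference = {"sofa": 4, "ventilation": 0, "bmi": 26.0}
phi = shapley_values(toy_risk, patient, reference)

# Additivity: the baseline output plus all attributions equals the prediction.
reconstructed = toy_risk(reference) + sum(phi.values())
```

The interaction between SOFA score and ventilation is shared between the two features, which is how a force plot can attribute a joint effect to individual predictors.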
Statistical analysis
Continuous variables were assessed for normality using the Kolmogorov-Smirnov test. Normally distributed data were expressed as mean ± standard deviation (SD) and compared using the Student's t-test, while non-normally distributed data were presented as median [interquartile range (IQR)] and compared using the Mann-Whitney U test. Categorical variables were reported as frequencies (percentages) and analyzed using the Chi-square test or Fisher’s exact test.
All statistical analyses in the current study were completed using R software (Version 4.1.2), Stata software (Version 16.0), and Python software (Version 3.13). Two-sided P-values <0.05 were considered statistically significant.
Results
Baseline characteristics
Table 1. Baseline characteristics of the MIMIC-IV and external validation cohorts.
MIMIC: Medical Information Mart for Intensive Care; ARDS: acute respiratory distress syndrome; BMI: body mass index; HR: heart rate; RR: respiratory rate; SBP: systolic blood pressure; DBP: diastolic blood pressure; SpO2: oxygen saturation; GCS: Glasgow Coma Scale; SOFA: Sequential Organ Failure Assessment; SAPS II: simplified acute physiology score II; WBC: white blood cell count; PLT: platelet; ALT: alanine aminotransferase; BUN: blood urea nitrogen; INR: international normalized ratio; PTT: partial thromboplastin time; pH: potential of hydrogen; PO2: partial pressure of oxygen; PCO2: partial pressure of carbon dioxide; BE: base excess; Lac: lactic acid; HCO3-: bicarbonate; COPD: chronic obstructive pulmonary disease; ARF: acute respiratory failure; AKI: acute kidney injury; VTE: venous thromboembolism; CPB: cardiopulmonary bypass; CRRT: continuous renal replacement therapy.
Additionally, the comparison of clinical parameters between the randomly assigned training and internal validation sets is presented in Supplemental Table S1. The vast majority of demographic and clinical variables showed no statistically significant differences between the two subsets. Although a few laboratory parameters, including hemoglobin, ALT, potassium, BUN, and PO2, exhibited P-values < 0.05 as statistically expected in multiple comparisons, their absolute numerical differences were clinically negligible. This confirms the overall high comparability between the training and internal validation cohorts, providing a robust dataset for model training.
Key variables
Before model construction, C-reactive protein, aspartate aminotransferase, albumin, and calcium were excluded due to a missing rate exceeding 30%. Subsequently, Spearman’s rank correlation analysis was performed to assess multicollinearity. Hematocrit and prothrombin time were excluded as they exhibited correlation coefficients greater than 0.8 with hemoglobin and international normalized ratio, respectively, to prevent redundancy.
The remaining variables were subjected to the hybrid feature selection process. The LASSO regression analysis, using 10-fold cross-validation to minimize the binomial deviance, identified nine variables with non-zero coefficients: BMI, RR, temperature, SOFA score, WBC, PO2, PCO2, ventilation, and antibiotic use (Figure 3(a) and (b)). Concurrently, the Boruta algorithm identified 22 confirmed features deemed relevant for prediction, including age, BMI, RR, SBP, temperature, SpO2, GCS, SOFA score, SAPS II, WBC, creatinine, BUN, pH, PO2, PCO2, lactate, HCO3-, sepsis, CRRT, ventilation, antibiotic use, and vasopressin (Figure 3(c)).
Figure 3. Feature selection process using LASSO regression and the Boruta algorithm. (A) Coefficient profile plotted against the logarithm of the lambda sequence. (B) Cross-validation plot for determining the optimal penalty term. (C) Importance ranking of features identified by the Boruta algorithm. CPB: cardiopulmonary bypass; CVC: central venous catheterization; INR: international normalized ratio; AKI: acute kidney injury; HR: heart rate; DBP: diastolic blood pressure; SBP: systolic blood pressure; HCO3-: bicarbonate; CRRT: continuous renal replacement therapy; RR: respiratory rate; GCS: Glasgow Coma Scale; pH: potential of hydrogen; BMI: body mass index; SOFA: Sequential Organ Failure Assessment.
Based on the consensus strategy, the intersection of variables selected by both algorithms was retained. Consequently, nine key variables including BMI, RR, temperature, SOFA score, WBC, PO2, PCO2, ventilation, and antibiotic use were ultimately determined as the predictors for model development.
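The consensus step amounts to a set intersection, which can be checked directly against the two variable lists reported above. The variable names below are transcribed from the text; the Python identifiers are illustrative.

```python
# Variables with non-zero LASSO coefficients, as listed in the text.
lasso_selected = [
    "BMI", "RR", "temperature", "SOFA score", "WBC",
    "PO2", "PCO2", "ventilation", "antibiotic use",
]

# Features confirmed by the Boruta algorithm, as listed in the text.
boruta_confirmed = [
    "age", "BMI", "RR", "SBP", "temperature", "SpO2", "GCS", "SOFA score",
    "SAPS II", "WBC", "creatinine", "BUN", "pH", "PO2", "PCO2", "lactate",
    "HCO3-", "sepsis", "CRRT", "ventilation", "antibiotic use", "vasopressin",
]

# Keep only variables selected by both methods, preserving the LASSO order.
boruta_set = set(boruta_confirmed)
consensus = [name for name in lasso_selected if name in boruta_set]
print(len(consensus), consensus)
```

Here every LASSO-selected variable was also confirmed by Boruta, so the consensus set coincides with the LASSO set.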
Model development and validation
Based on the nine key predictors identified, seven ML algorithms were constructed and optimized. Initial evaluation in the training cohort revealed that the RF model exhibited superior learning capabilities, achieving the highest AUC among all classifiers (Figure 4(a)). Internal validation based on ROC analysis further showed that the RF model demonstrated the best predictive performance, achieving an AUC of 0.851 (95% CI: 0.803-0.899), followed by LightGBM (AUC = 0.840), SVM (AUC = 0.837), LR (AUC = 0.833), XGBoost (AUC = 0.833), KNN (AUC = 0.807), and DT (AUC = 0.768) (Figure 4(b)). Detailed performance metrics for each model are presented in Table 2. While some models excelled in specific metrics, the RF model exhibited the most balanced and robust performance profile across all evaluation indicators, maintaining high specificity (0.887) and accuracy (0.807) alongside its superior discriminatory power.
Figure 4. Receiver operating characteristic (ROC) curves of the seven machine learning models and two traditional clinical scoring systems in the training and validation sets. (A) ROC curves of the training set. (B) ROC curves of the internal validation set. (C) ROC curves of the external validation set. AUC, area under the ROC curve; XGBoost, extreme gradient boosting; LightGBM, light gradient boosting machine; SVM, support vector machine; KNN, k-nearest neighbors; SOFA, Sequential Organ Failure Assessment; SAPS II, simplified acute physiology score II.
Table 2. Comprehensive evaluation of machine learning model performance in the internal validation set. AUC, area under the ROC curve; XGBoost, extreme gradient boosting; RF, random forest; SVM, support vector machine; LR, logistic regression; LightGBM, light gradient boosting machine; KNN, k-nearest neighbors; DT, decision tree.
External validation using the independent cohort from Changshu Hospital further confirmed the robustness and generalizability of the RF model (Supplemental Table S2). It achieved a satisfactory AUC of 0.823 (95% CI: 0.735-0.899), maintaining stable discrimination even in a distinct population (Figure 4(c)). To benchmark the clinical utility of our proposed framework, we compared the predictive performance of the optimal RF model against established clinical severity scoring systems, specifically the SOFA and SAPS II scores. As illustrated in Figure 4, the RF model consistently outperformed these traditional scores across all cohorts. In the internal validation set, the RF model’s AUC of 0.851 was substantially higher than that of the SOFA score (AUC = 0.725) and SAPS II score (AUC = 0.710). Similarly, in the external validation cohort, the RF model maintained a distinct advantage (AUC = 0.823) compared to the SOFA score (AUC = 0.713) and SAPS II score (AUC = 0.772). The consistent performance observed across both internal and external validation cohorts underscores the strong clinical potential of the RF model for the early prediction of ARDS in AP patients.
Furthermore, the calibration curve of the RF model demonstrated good agreement between the predicted probabilities and the actual observed ARDS risks (Figure 5(a)), indicating high reliability. Decision curve analysis further confirmed the clinical utility of the model, revealing that the RF model yielded a superior net benefit over a broad range of threshold probabilities in comparison with the default strategies of universal intervention or no intervention (Figure 5(b)).
Figure 5. Calibration curve of the random forest (RF) model in the internal validation set (A). Decision curve analysis (DCA) of the RF model in the internal validation set (B). XGBoost, extreme gradient boosting; LightGBM, light gradient boosting machine; SVM, support vector machine; KNN, k-nearest neighbors.
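The net benefit plotted in a decision curve can be computed with the standard formula, net benefit = TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability. The sketch below applies it to hypothetical data and compares the model with the universal-intervention strategy; no intervention has a net benefit of zero by definition. The labels, probabilities, and chosen threshold are illustrative assumptions.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of intervening on patients whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """Net benefit of intervening on every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Hypothetical outcomes (1 = ARDS) and predicted probabilities.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.7, 0.05]

threshold = 0.25  # illustrative threshold probability
nb_model = net_benefit(y_true, y_prob, threshold)
nb_all = net_benefit_treat_all(y_true, threshold)
nb_none = 0.0
```

Repeating the calculation over a grid of thresholds and plotting the three strategies yields the decision curve.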
A sensitivity analysis was performed to mitigate the circularity bias related to mechanical ventilation. After excluding this feature, the retrained eight-feature model maintained robust discrimination in the internal validation set with an AUC of 0.818. Notably, the AUC in the external validation cohort increased to 0.861, as shown in Supplemental Figure S1. These findings indicate that the model’s fundamental predictive power is driven by underlying physiological and laboratory derangements such as SOFA, BMI, and PO2. The results also suggest that reliance on objective physiological metrics can enhance cross-institutional generalizability by reducing the influence of local clinical practices.
Interpretation of predictive features
The SHAP summary plot identified the relative importance of the predictors, with ventilation usage emerging as the primary factor, followed by SOFA score, BMI, PO2, RR, PCO2, WBC, temperature, and antibiotic use (Figure 6(a)). The bee swarm plot (Figure 6(b)) further elucidated the directionality of these effects, revealing that the requirement for mechanical ventilation, along with elevated values of SOFA, BMI, RR, and PCO2, contributed positively to the predicted risk of ARDS. Conversely, higher PO2 levels and the absence of mechanical ventilation were associated with a lower probability of the outcome.
Figure 6. SHAP summary plot for clinical variables contributing to the random forest (RF) model. (a) Feature importance ranking plot based on the RF model. (b) Scatter plot of variables for SHAP analysis based on the RF model. SHAP: SHapley Additive exPlanations; SOFA: sequential organ failure assessment; BMI: body mass index; PO2: partial pressure of oxygen; RR: respiratory rate; PCO2: partial pressure of carbon dioxide; WBC: white blood cell.
To further quantify these associations, PDP were generated to visualize the marginal effect of the top six features (Figure 7). Analysis of mechanical ventilation usage indicated that it was associated with a substantial elevation in ARDS risk, rising from a baseline of approximately 16% to 43% (Figure 7(a)). The SOFA score demonstrated a strong positive association with ARDS risk, characterized by a steep upward trajectory between 5 and 13 points, after which the curve reached a plateau (Figure 7(b)). BMI exhibited a complex non-linear relationship with the outcome, where the predicted risk remained near baseline below 27 kg/m2, followed by a sharp escalation. This risk then stabilized between 30 and 40 kg/m2 before rising again at higher BMI values (Figure 7(c)). For oxygenation status, the model captured a precipitous rise in risk as PO2 levels dropped below approximately 170 mmHg (Figure 7(d)). Analysis of respiratory parameters revealed that the predicted probability increased markedly when the RR surpassed 25 breaths/min (Figure 7(e)). Similarly, PCO2 exhibited a predominantly monotonic increasing trend, with a more pronounced elevation in ARDS risk observed once PCO2 exceeded 45 mmHg (Figure 7(f)). These patterns indicate that the RF model captured critical physiological thresholds beyond simple linear correlations.
Figure 7. Partial dependence plots (PDP) of the top six features based on the random forest (RF) model. SOFA: sequential organ failure assessment; BMI: body mass index; PO2: partial pressure of oxygen; RR: respiratory rate; PCO2: partial pressure of carbon dioxide.
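A partial dependence curve is obtained by fixing one feature at each grid value for every patient and averaging the resulting predictions. The sketch below shows this for respiratory rate using a hypothetical step-shaped risk function; `toy_risk`, its coefficients, and the four patients are assumptions for illustration and do not represent the fitted RF model.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append(sum(preds) / len(preds))
    return curve

def toy_risk(f):
    """Hypothetical risk with a threshold effect at RR > 25 (not the study model)."""
    risk = 0.15 + 0.25 * f["ventilation"]
    if f["rr"] > 25:
        risk += 0.20
    return risk

# Hypothetical patients; only RR is varied when tracing the curve.
patients = [
    {"rr": 18, "ventilation": 0},
    {"rr": 22, "ventilation": 1},
    {"rr": 30, "ventilation": 0},
    {"rr": 27, "ventilation": 1},
]

rr_grid = [15, 20, 25, 30]
rr_curve = partial_dependence(toy_risk, patients, "rr", rr_grid)
```

The curve stays flat up to 25 breaths/min and then jumps, which is the kind of threshold pattern a PDP makes visible; the other features keep their observed values, so the curve reflects a marginal effect averaged over the cohort.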
On an individual level, SHAP force plots (Figure 8) further illustrated these patterns by quantifying the specific contribution of each feature to a patient’s risk score. For example, in a high-risk case (Figure 8(a)), the patient’s probability of developing ARDS was elevated to 0.50, primarily driven by a high SOFA score (11.0), the requirement for mechanical ventilation, and an elevated WBC (12.6). In contrast, for a lower-risk patient (Figure 8(b)), the predicted probability was suppressed to 0.06. Despite presenting with a SOFA score of 8.0 and a PO2 of 85.0 mmHg, the overall risk remained low due to the combined protective effects of the absence of mechanical ventilation, a lower BMI (22.54), a normal RR (20.0), and a PCO2 of 36 mmHg.
Figure 8. SHAP force plots illustrating individual prediction explanations based on the random forest (RF) model. (a) A high-risk example. (b) A low-risk example. SHAP: SHapley Additive exPlanations; SOFA: sequential organ failure assessment; BMI: body mass index; PO2: partial pressure of oxygen; RR: respiratory rate; PCO2: partial pressure of carbon dioxide; WBC: white blood cell.
Development of the web-based calculator
Based on the validated predictors and the optimal RF model, an interactive web-based calculator was developed to bridge the gap between algorithmic complexity and clinical application (Figure 9). To ensure complete consistency with the underlying algorithm, the interface integrates all nine key predictors, including BMI, RR, temperature, SOFA score, WBC, PO2, PCO2, ventilation status, and antibiotic use. By inputting these parameters via a user-friendly interface, clinicians can instantly obtain the predicted probability of ARDS. This tool is freely accessible at a dedicated website (https://rf-model-6t6jrrgfn4fmdaesdceunt.streamlit.app/) to support rapid, personalized risk stratification in real-time clinical settings.
Figure 9. An online web calculator based on the random forest (RF) machine learning model. SOFA: sequential organ failure assessment; BMI: body mass index; PO2: partial pressure of oxygen; RR: respiratory rate; PCO2: partial pressure of carbon dioxide; WBC: white blood cell.
Discussion
In this multicenter retrospective study, we developed and validated an ML-based framework to predict the development of ARDS in critically ill patients with AP. Our results demonstrate that the RF model outperformed six other common algorithms, achieving a robust AUC of 0.851 in the internal validation set and maintaining an AUC of 0.823 in the independent external cohort. This consistent performance is notable, as many previously reported prediction models for pancreatitis-associated ARDS were derived from single-center cohorts and showed substantial performance degradation when applied to external populations.7,11,21
Several recent studies have highlighted the challenges of generalizability in ARDS prediction, particularly in heterogeneous ICU populations where variations in disease severity, management strategies, and patient demographics can substantially influence model performance.22–24 In this context, the preserved discrimination observed in our external cohort suggests that the proposed model may better accommodate population heterogeneity, supporting its potential applicability across different clinical settings. By integrating a hybrid feature selection strategy and multiple ML algorithms, we identified nine clinically accessible predictors. Importantly, rather than assuming linear or monotonic associations, we explored the functional relationships between key predictors and ARDS risk using SHAP and partial dependence analyses, allowing a more nuanced interpretation of selected variables.25 A web-based risk prediction tool was subsequently implemented, allowing for instantaneous risk assessment and serving as a clinically actionable resource to guide timely interventions in patients at high risk. This strategy aligns with recent recommendations emphasizing that interpretability is essential for the clinical translation of ML models in critical care, particularly for high-stakes outcomes such as ARDS.11,26
The SHAP-based interpretability analysis revealed that mechanical ventilation was the most influential predictor of ARDS development. Partial dependence analysis demonstrated a marked increase in predicted ARDS risk among patients requiring ventilatory support, with the estimated probability rising from approximately 16% to over 40%. This finding is consistent with prior observational studies reporting that invasive ventilation in AP often reflects advanced respiratory compromise and may exacerbate lung injury through ventilator-induced stress in already inflamed pulmonary tissue.5,27 Our results do not imply a direct causal role of ventilation itself, but rather reinforce its value as a composite marker of disease severity and early lung vulnerability. The SOFA score was another dominant contributor to ARDS risk, with a steep increase in predicted risk between SOFA scores of approximately 5 and 13. This observation parallels prior reports demonstrating a strong association between escalating SOFA scores and subsequent development of ARDS and multiple organ dysfunction in critically ill patients.28,29 Notably, the non-linear risk gradient observed in this intermediate range suggests that patients may transition rapidly from compensated organ dysfunction to overt pulmonary failure, a phenomenon that may be underestimated by traditional linear risk models.
BMI emerged as an important predictor with a distinct non-linear risk profile. The predicted probability of ARDS increased sharply once BMI exceeded approximately 27 kg/m². This finding is concordant with existing literature linking obesity to impaired respiratory mechanics, reduced functional residual capacity, and a chronic pro-inflammatory milieu that predisposes patients to lung injury during systemic inflammatory states.30 Our results extend these observations by identifying a clinically relevant BMI range at which ARDS risk begins to accelerate in patients with AP. Respiratory and gas exchange parameters likewise demonstrated clinically meaningful threshold effects. The model identified a pronounced increase in ARDS risk when arterial oxygen tension fell below approximately 170 mmHg, respiratory rate exceeded 25 breaths per minute, and arterial carbon dioxide tension rose above 45 mmHg. These thresholds are broadly concordant with recent physiologic and critical care studies describing early deterioration in ventilatory reserve preceding overt hypoxemic respiratory failure.7,19,31 While factors such as elevated BMI and impaired oxygenation are established general risks, our partial dependence analysis translates these known concepts into objective, quantifiable clinical triggers. Identifying non-linear tipping points, such as the steep risk escalation at a BMI of approximately 27 kg/m² or a PO2 below approximately 170 mmHg, provides parameters that traditional linear scoring systems lack. These data-driven thresholds may help clinicians identify impending respiratory failure and optimize the timing of preventive interventions before irreversible deterioration occurs.
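The way a partial dependence curve exposes such a threshold can be sketched in a few lines. The example below is illustrative only: `toy_model` is a hypothetical step-function model whose cut-points simply echo the thresholds discussed above, and the cohort is randomly generated rather than drawn from MIMIC-IV or the external cohort. Partial dependence at a given value is the average prediction over the cohort after the feature of interest is forced to that value for every patient.

```python
import random


def toy_model(row):
    """Hypothetical non-linear risk model; not the study's random forest."""
    risk = 0.10
    if row["bmi"] > 27.0:
        risk += 0.15
    if row["po2"] < 170.0:
        risk += 0.10
    if row["rr"] > 25.0:
        risk += 0.05
    return risk


def partial_dependence(model, rows, feature, grid):
    """Average prediction with `feature` forced to each grid value in turn."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the cohort is not mutated
            modified[feature] = value
            total += model(modified)
        curve.append((value, total / len(rows)))
    return curve


# Synthetic cohort for illustration only.
rng = random.Random(0)
rows = [
    {
        "bmi": rng.uniform(18, 40),
        "po2": rng.uniform(60, 300),
        "rr": rng.uniform(12, 35),
    }
    for _ in range(200)
]
bmi_curve = partial_dependence(toy_model, rows, "bmi", [22.0, 26.0, 28.0, 34.0])
```

In this toy setting the curve is flat below and above the cut-point and jumps by 0.15 between BMI 26 and 28, which is how a threshold effect appears in a partial dependence plot. A real random forest would typically produce a smoother, noisier transition.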
When compared with conventional statistical models and clinical scoring systems, the present machine learning approach offers several advantages. Traditional tools such as APACHE II, Ranson, and BISAP rely on linear and additive assumptions and were not specifically designed to predict pulmonary complications in AP.5,9 Prior comparative studies have shown that such scores demonstrate only modest discrimination for ARDS and frequently fail to capture higher-order interactions among physiologic variables.8 Our direct benchmark analysis confirms this limitation, as traditional scores such as SOFA and SAPS II achieved substantially lower discrimination (AUCs ranging from 0.710 to 0.772) than our optimal RF model across both internal and external cohorts. In contrast, the RF algorithm can integrate weak but complementary predictors and model complex interactions without prespecified assumptions, which may explain the superior discrimination observed in our study.32 The model’s practical value was further supported by decision curve analysis, which showed a higher net benefit over a broad interval of risk thresholds relative to the default strategies of universal or no intervention. These findings are consistent with recent ML-based ICU studies showing that improved discrimination can translate into meaningful clinical benefit by supporting more selective escalation of monitoring and preventive interventions.7,11 Such an approach may help optimize resource allocation while minimizing unnecessary interventions in low-risk patients.
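The net benefit used in decision curve analysis has a simple closed form: at a risk threshold p_t, it equals TP/n minus FP/n weighted by p_t/(1 − p_t), where TP and FP are the true and false positives among patients whose predicted risk reaches the threshold. The sketch below implements this standard formula together with the "intervene on everyone" reference strategy ("intervene on no one" has a net benefit of zero by definition). The labels and scores are synthetic and serve only to illustrate the calculation.

```python
def net_benefit(labels, scores, threshold):
    """Net benefit of intervening when predicted risk is >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    weight = threshold / (1.0 - threshold)  # relative harm of a false positive
    return tp / n - (fp / n) * weight


def net_benefit_treat_all(labels, threshold):
    """Reference strategy: intervene on every patient."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Synthetic labels (1 = ARDS) and predicted risks, for illustration only.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.50, 0.42, 0.10, 0.30, 0.06, 0.12, 0.20, 0.08]
nb_model = net_benefit(labels, scores, 0.25)
nb_all = net_benefit_treat_all(labels, 0.25)
```

At a threshold of 0.25 this toy model flags two true positives and one false positive, giving a net benefit of about 0.208 versus about 0.167 for treating everyone. Repeating the calculation across a range of thresholds yields the decision curve.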
An important strength of this study is its emphasis on clinical explainability and real-world application. The lack of transparency has been repeatedly cited as a major barrier to clinician acceptance of ML models in critical care.11,33 By combining global feature importance rankings, partial dependence visualization, and individualized risk estimation, our framework provides clinicians with insight into both population-level risk patterns and patient-specific drivers of prediction. The accompanying web-based calculator further facilitates translation into clinical practice by enabling rapid, individualized ARDS risk assessment using routinely available variables, an approach increasingly advocated in precision ICU medicine.13,20
Despite these strengths, several limitations of our work should be noted. First, the retrospective design may introduce selection and diagnostic ascertainment bias. We excluded patients diagnosed with ARDS within the first 24 hours of ICU admission; although this prevents data leakage and ensures that the model functions as a prognostic warning system, it limits applicability to rapid-onset phenotypes. Second, our framework relies on static data from the initial 24-hour window and does not fully capture the temporal progression of critical illness. Incorporating longitudinal or time-series physiological data, such as trajectories of oxygenation or evolving organ failure scores, represents a promising future direction to further refine predictive accuracy and clinical relevance.34 Third, regarding variable selection, although the inclusion of mechanical ventilation involves a degree of methodological circularity, sensitivity analysis confirmed that the underlying physiological derangements retain robust independent predictive power. Fourth, the external validation cohort is limited in size and unbalanced, which results in wide confidence intervals and constrains the precision of calibration and subgroup assessments. Fifth, this study evaluates acute pancreatitis as a homogeneous entity and does not account for the divergent pathophysiological pathways of specific etiological subphenotypes.35 Differences in ethnicity, disease etiology, and clinical practice patterns between regions may further influence model performance.10 Finally, a significant gap remains between risk prediction and actionable clinical decision-making. Future research should employ causal inference frameworks, such as target trial emulation, to determine whether specific interventions guided by these predictions can ultimately improve clinical outcomes.36
Conclusion
This study introduces a generalizable, interpretable, and high-performing RF model for predicting ARDS in critically ill patients with AP. By capturing clinically meaningful non-linear physiological thresholds and translating predictions into a transparent, user-friendly interface, the model represents a substantive improvement over traditional risk assessment tools. Its application may facilitate earlier recognition of high-risk patients and support targeted management, ultimately contributing to improved outcomes in this vulnerable population.
Supplemental material
Supplemental material for An interpretable machine learning model for predicting acute respiratory distress syndrome in critically ill patients with acute pancreatitis: A multicenter retrospective study by Sheng Yan, Xia Ren, Chunyang Xu, Feng Zheng, Luojie Liu, Shun Wen, Xiaodan Xu and Yan Zhang in Digital Health.
Footnotes
Acknowledgements
We would like to express our gratitude to the clinical staff at the Department of Critical Care Medicine and the Department of Emergency Medicine at Changshu Hospital Affiliated to Soochow University for their support in clinical data curation.
Ethical considerations
The study was approved by the Institutional Review Boards of MIT, Beth Israel Deaconess Medical Center, and the Ethics Committee of Changshu Hospital affiliated with Soochow University (Approval No. L2024026). The requirement for informed consent was waived due to the retrospective and anonymous nature of the data. This research adhered to the Declaration of Helsinki.
Consent for publication
Not applicable. This study utilized anonymized data from the public MIMIC-IV database and retrospective, de-identified clinical data from our institution. The ethics committee waived the requirement for individual informed consent.
Author contributions
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by the Suzhou Special Project for Clinical Key Disease Diagnosis and Treatment Technologies (LCZX202334), Key Projects of the Changshu Science and Technology Development Program (CSWS202209), and the Special Research Fund Projects of the China International Medical Foundation (Z-2014-08-2309-1).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
Data guarantor
Yan Zhang, as the corresponding author, accepts full responsibility for the work and the conduct of the study, had access to the data, and controlled the decision to publish.
Supplemental material
Supplemental material for this article is available online.
References
