Sage Journals: Discover world-class research

Abstract

We develop and validate a clinical guideline-integrated LLM for enhanced sepsis mortality prediction. Using MIMIC-IV data from 24,237 ICU sepsis patients, we fine-tuned a large language model with Low-Rank Adaptation, embedding clinical guidelines into the training process. The model’s predictive performance was evaluated using accuracy, F1-score, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC). Ablation studies assessed the specific contributions of clinical guideline integration. The guideline-enhanced fine-tuned LLM demonstrated moderately higher performance across all evaluation metrics including predictive accuracy (0.819), F1-score (0.815), sensitivity (0.815), specificity (0.822), and AUC (0.852) in predicting mortality risk for septic patients compared to traditional machine learning (highest accuracy: 0.774, AUC: 0.850) and deep learning methods (highest accuracy: 0.762, AUC: 0.841). Ablation experiments demonstrated that explicit integration of clinical guideline knowledge substantially improved performance over both direct prompting (accuracy: 0.709, AUC: 0.706) and fine-tuning without clinical guidelines (accuracy: 0.786, AUC: 0.801). These findings demonstrate that incorporating clinical guidelines into the fine-tuning of large language models outperforms both traditional and deep learning baselines across multiple metrics in sepsis mortality prediction, highlighting the value of explicit domain knowledge integration for clinical AI’s robustness.

Keywords

large language model, LoRA, deep learning, sepsis, septic shock, mortality, prediction model

Introduction

Sepsis is an acute organ dysfunction syndrome that results from a dysregulated host response to bacterial, viral, fungal, or parasitic infections.¹ Approximately 48.9 million sepsis cases occur worldwide annually, causing around 11 million deaths and accounting for 19.7% of total global deaths.² In the United States alone, more than a third of hospital deaths are attributed to sepsis, with associated healthcare costs reaching approximately $38 billion in 2017, making it one of the most common and costly conditions leading to hospitalisation.^3,4 Consequently, sepsis poses a severe threat to public health and imposes a substantial socioeconomic burden on healthcare systems.

To improve the quality of treatment and clinical outcomes of septic patients, the Surviving Sepsis Campaign (SSC) regularly updates clinical guidelines to standardise global clinical practice.⁵ Despite significant advances in treatment and management strategies, sepsis mortality remains elevated and overall prognosis is frequently poor.⁶ Accurate and effective prediction of sepsis mortality risk could help clinicians identify high-risk patients quickly, enabling individualised treatment strategies, early initiation of palliative care discussions, and evaluation of healthcare quality and treatment effectiveness within clinical settings.⁷

In recent years, traditional machine learning methods such as logistic regression (LR), random forest (RF), and gradient boosting decision trees (GBDT) have demonstrated efficacy in predicting mortality risk among septic patients in ICU settings.^8–11 However, these algorithms exhibit intrinsic limitations in capturing complex interactions within high-dimensional, multimodal clinical datasets due to their linear assumptions and comparatively simple model structures.^12,13 Deep learning approaches, including convolutional neural networks (CNN), recurrent neural networks (RNN), and Transformers, have shown promise in handling complex data structures, but typically require extensive annotated datasets and suffer from limited interpretability, hindering clinical implementation.^14–17

Existing predictive models for sepsis mortality often neglect the systematic integration of domain-specific clinical guidelines into large language model (LLM) training processes, limiting their clinical applicability and robustness.¹⁸ Recent advances in Transformer-based LLMs have demonstrated strong logical reasoning and contextual understanding capabilities, highlighting their substantial potential for medical diagnostic and prognostic tasks.^19–21 However, current research predominantly relies on pretrained LLMs used through simple prompting methods, and lacks explicit incorporation of clinical expertise and guideline-based knowledge through supervised fine-tuning (SFT).

To address this critical gap, our study innovatively integrates explicit clinical guideline knowledge into supervised fine-tuning of the Qwen2.5-72B LLM using low-rank adaptation (LoRA) technology.^22,23 We systematically evaluate our proposed guideline-enhanced LLM against traditional machine learning algorithms, classical deep learning models, and direct prompting methods, aiming to significantly improve both predictive accuracy and interpretability for sepsis mortality risk. Ultimately, this research seeks to provide ICU clinicians with a robust and reliable decision-support tool to enhance individualised patient care and clinical outcomes.

Materials and method

Data source and variable extraction

Data for this study were obtained from the Medical Information Mart for Intensive Care-IV database (MIMIC-IV, version 2.2),²⁴ provided by Beth Israel Deaconess Medical Center (BIDMC), Boston, USA. This publicly accessible database contains clinical data from over 500,000 patients admitted between 2008 and 2019, including detailed information on vital signs, nursing documentation, severity of disease scores, diagnoses, treatments, and laboratory results. Dr Ruiyi Zhu from our research team completed the required database training and was authorized to access the data (record ID: 59980404).

Eligible patients met the Sepsis-3 diagnostic criteria, defined as suspected infection with a Sequential Organ Failure Assessment (SOFA) score of ≥2 points, or septic shock, characterised by persistent hypotension requiring vasopressors to maintain mean arterial pressure (MAP) ≥65 mmHg together with serum lactate levels >2 mmol/L (18 mg/dL) despite adequate fluid resuscitation. Exclusion criteria were: (1) age under 18 years; (2) ICU stay under 24 h or exceeding 100 days; (3) multiple ICU admissions. The patient selection process is detailed in Figure 1: after screening 32,970 candidate admissions, 24,237 unique adult sepsis cases were retained and stratified 8:1:1 into training (19,389), validation (2,424) and test (2,424) cohorts.

Figure 1.

Flowchart of patients’ selection.

Data extraction was performed using PostgreSQL (Version 14.0; PostgreSQL Global Development Group). The extracted variables included demographic information, vital signs, laboratory measurements, and clinical interventions recorded within 24 h of ICU admission (Appendix A). The primary outcome was all-cause in-hospital mortality occurring after the first 24 h of ICU admission. Existing severity scoring systems, including the Simplified Acute Physiology Score-II (SAPS-II), Acute Physiology Score-III (APS-III), and Sequential Organ Failure Assessment (SOFA) scores, were utilized as comparative benchmarks for evaluating the predictive performance of the developed model.

Construction of guideline-enhanced fine-tuning datasets

We utilised the 2021 adult Surviving Sepsis Campaign guideline as the sole authoritative source, providing evidence-based standards for sepsis management. The computational pipeline for converting unstructured guidelines into executable knowledge representations comprised four sequential stages. First, guideline sections were segmented into individual recommendations. Second, entities and relations were extracted using an instruction-tuned Qwen2.5-72B model, identifying five core entity types (Indicator, Threshold, Action, TimeFrame, Outcome) and seven relation categories (including has_threshold and recommends_action). Third, in an expert fusion phase, two board-certified intensivists reviewed and consolidated the outputs into a sepsis-specific knowledge base; representative outcomes included structured relationships such as (Lactate, has_threshold, ≥4 mmol/L) and (MAP <65 mmHg, recommends_action, “vasopressor initiation”). Finally, the refined knowledge was serialised into a JSON-formatted knowledge graph (see Appendix B for a complete explanation).
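As a rough illustration of the final serialisation stage, the two example relationships quoted above could be stored as follows. This is only a sketch: the key names (`head`, `relation`, `tail`) and the grouping by relation type are assumptions made for illustration, not the schema documented in Appendix B.

```python
import json

# Two triples taken from the text; the dictionary keys are hypothetical.
triples = [
    {"head": "Lactate", "relation": "has_threshold", "tail": ">=4 mmol/L"},
    {"head": "MAP <65 mmHg", "relation": "recommends_action",
     "tail": "vasopressor initiation"},
]

def to_knowledge_graph(triples):
    """Group triples by relation type and return them as a JSON string."""
    graph = {}
    for t in triples:
        graph.setdefault(t["relation"], []).append(
            {"head": t["head"], "tail": t["tail"]}
        )
    return json.dumps(graph, indent=2, ensure_ascii=False)

kg_json = to_knowledge_graph(triples)
print(kg_json)
```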

To explicitly evaluate the impact of integrating clinical guidelines into the fine-tuning of LLMs, we constructed two distinct datasets. The data-driven dataset contained only raw clinical variables (vital signs, laboratory values, and demographic features) without any explanatory context or clinical interpretation. The guideline-enhanced dataset enriched the identical clinical variables through systematic injection of structured medical knowledge: each data point was mapped to the corresponding entities in the sepsis guideline knowledge graph (e.g., associating the MAP <65 mmHg threshold with the “vasopressor initiation” action), facilitating the model’s understanding of clinical correlations and enhancing its interpretability and predictive capability. For each patient, the worst 24-h value of every indicator was matched against the SSC thresholds stored in the knowledge graph. A guideline note, for example ‘Lactate 5.6 mmol/L — exceeds SSC threshold (≥4 mmol/L), high risk’, was added only when the patient’s value crossed the relevant boundary; otherwise the phrase ‘within guideline range’ was used. These conditional comments were appended to the raw features to form the guideline-enhanced prompt. The specific approaches to dataset construction are described in Figure 2. Comparing models trained on these two datasets allowed us to systematically assess the value of explicitly integrating clinical guidelines into LLM fine-tuning and its potential to enhance predictive accuracy and interpretability in clinical decision-making.
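The conditional annotation rule described above can be sketched in a few lines. The lactate and MAP cut-offs come from the text; the `THRESHOLDS` table, the function names, and the exact wording of the generated notes are hypothetical and simplified relative to the templates in Table 1.

```python
# Hypothetical threshold table: indicator -> (comparison, cut-off, unit, message).
# Only the cut-off values are taken from the text; the structure is assumed.
THRESHOLDS = {
    "lactate": (">=", 4.0, "mmol/L", "high risk"),
    "map": ("<", 65.0, "mmHg", "consider vasopressor"),
}

def guideline_note(indicator, value):
    """Return a note only flagged as abnormal when the value crosses its threshold."""
    op, cutoff, unit, message = THRESHOLDS[indicator]
    crossed = value >= cutoff if op == ">=" else value < cutoff
    if crossed:
        return (f"{indicator} {value} {unit}: crosses SSC threshold "
                f"({op}{cutoff} {unit}), {message}")
    return f"{indicator} {value} {unit}: within guideline range"

def build_prompt(raw_features):
    """Append guideline notes to the raw features, separated by ' | '."""
    raw = "; ".join(f"{k} = {v}" for k, v in raw_features.items())
    notes = [guideline_note(k, v) for k, v in raw_features.items()
             if k in THRESHOLDS]
    return raw + " | " + "; ".join(notes)

# Worst 24-h values for an illustrative patient (made-up numbers).
patient = {"age": 67, "lactate": 5.6, "map": 70.0}
print(build_prompt(patient))
```

Indicators without an entry in the threshold table (here, `age`) pass through as raw features only.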

Figure 2.

Construction of guideline-enhanced fine-tuning datasets.

To better predict mortality in sepsis patients, we trained and validated LLM-based models on the two datasets. To further clarify the effect of guideline integration on model performance, comparative analyses were conducted using three methodologies: LLM + Prompt, LLM + SFT, and LLM + Knowledge + SFT. The specific prompt templates are detailed in Table 1.

Table 1.

The prompt templates of different methods.

Method	Inference prompt (single example)	SFT-training pair (format)
LLM + Prompt	You are an ICU decision-support assistant. Based only on the variables below, estimate the probability that the patient will die in-hospital. Age 67 years; sex male; HR 105 bpm; MAP 58 mmHg; temp 38.1°C; RR 28/min; SpO₂ 93%; lactate 5.6 mmol/L; WBC 18×10⁹/L; creatinine 2.1 mg/dL; SOFA 10. Respond with a single number between 0 and 1.	No SFT used
LLM + SFT	Same runtime prompt as above.	{"Input": "Age = 67; Sex = M; HR = 105; MAP = 58; Temp = 38.1; RR = 28; SpO2 = 93; Lactate = 5.6; WBC = 18; Cr = 2.1; SOFA = 10", "Output": "Mortality = 1"}
LLM + Knowledge + SFT (ours)	You are an ICU decision-support assistant. Using both the values and the SSC guideline notes, estimate the probability of in-hospital death. Age 67 years; sex male. Lactate 5.6 mmol/L — exceeds SSC threshold (≥4 mmol/L) → high risk. MAP 58 mmHg — below SSC target (65 mmHg) → consider vasopressor. SOFA 10 — ≥2 points → organ dysfunction present. Other values: HR 105 bpm; temp 38.1°C; RR 28/min; SpO₂ 93%; WBC 18×10⁹/L; creatinine 2.1 mg/dL. Respond with a single number between 0 and 1.	{"Input": "Age = 67; Sex = M; HR = 105; MAP = 58; Temp = 38.1; RR = 28; SpO2 = 93; Lactate = 5.6; WBC = 18; Cr = 2.1; SOFA = 10 | Lactate≥4→high-risk; MAP <65→vasopressor; SOFA≥2→organ-dysfunction", "Output": "Mortality = 1"}

The strategic selection of hyperparameters was critical for optimizing model performance while ensuring clinical validity. We implemented the following optimized hyperparameters (Table 2) to balance computational efficiency with model performance.

Table 2.

Hyperparameter settings.

Hyper-parameter	Value	Note
lora_alpha	32	Scaling (=2 × r)
lora_dropout	0.05	Dropout on LoRA updates
learning_rate	2e-5	AdamW base LR
num_train_epochs	3	Early-stop enabled
per_device_train_batch_size	4	Micro-batch/GPU
gradient_accumulation_steps	8	Effective batch ≈32
warmup_ratio	0.05	5 % linear warm-up
max_seq_length	1024	Tokens per sample
Seed	42	Reproducibility
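
Two quantities follow directly from the notes in Table 2 and can be checked with simple arithmetic. The sketch below copies the tabulated values into a plain dictionary (the variable names are illustrative). The LoRA rank is not listed in the table, so it is inferred here from the note that lora_alpha equals 2 × r.

```python
# Values copied from Table 2; the dictionary itself is only illustrative.
hparams = {
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "learning_rate": 2e-5,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "warmup_ratio": 0.05,
    "max_seq_length": 1024,
    "seed": 42,
}

# Table 2 notes that lora_alpha = 2 x r, so the implied LoRA rank is:
implied_rank = hparams["lora_alpha"] // 2

# Samples contributing to one optimizer step on a single device.
per_device_effective_batch = (
    hparams["per_device_train_batch_size"]
    * hparams["gradient_accumulation_steps"]
)

print(implied_rank, per_device_effective_batch)
```

If training were data-parallel across several GPUs, the global batch would be this per-device figure multiplied by the number of devices; the table does not state which convention the "≈32" refers to.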

Model training

In this study, we used a pre-trained LLM (Qwen2.5-72B) to predict the risk of mortality for sepsis patients. We implemented two distinct model-inference approaches to systematically evaluate the effectiveness of fine-tuning with clinical domain knowledge. First, we applied a direct prompt-based strategy, utilizing clinical indicators from the patient as input without modifying the pre-trained parameters. Second, we employed supervised fine-tuning (SFT) enhanced by Low-Rank Adaptation (LoRA) to efficiently optimize the selected layers of the pre-trained Qwen2.5-72B model. For a pre-trained weight matrix W in a selected layer, LoRA applies the update:

W' = W + ΔW = W + BA

where W ∈ R^{d×k} is the frozen pre-trained weight matrix, B ∈ R^{d×r} and A ∈ R^{r×k} are trainable matrices, and r ≪ min(d, k) is a small rank dimension controlling model complexity and computational efficiency.
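
To make the shapes concrete, the following minimal sketch applies a low-rank update with B of shape d × r and A of shape r × k (the standard LoRA factorisation), so that the product BA matches the d × k shape of W. It uses plain Python lists rather than a tensor library, and the `scale` argument is an assumption added for illustration: it defaults to 1 to match the formula above, whereas LoRA implementations commonly multiply BA by lora_alpha / r.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_update(W, B, A, scale=1.0):
    """Return W' = W + scale * (B A); W itself is left unchanged (frozen)."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def trainable_params(d, k, r):
    """Number of entries in B (d x r) plus A (r x k)."""
    return r * (d + k)

d, k, r = 4, 3, 1
W = [[0.0] * k for _ in range(d)]   # frozen weights, d x k
B = [[1.0], [2.0], [3.0], [4.0]]    # trainable, d x r
A = [[1.0, 0.0, -1.0]]              # trainable, r x k
W_new = lora_update(W, B, A)

# For an illustrative 4096 x 4096 layer with r = 16, the adapter trains
# 131,072 values instead of the 16,777,216 in the full matrix.
print(trainable_params(4096, 4096, 16), 4096 * 4096)
```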

Fine-tuning datasets were constructed from the MIMIC-IV database in two forms: a purely feature-based dataset and a guideline-enhanced dataset explicitly integrating clinical guideline interpretations. Model optimization was performed using the AdamW optimizer (learning rate 2 × 10⁻⁵) with a linear scheduler and gradient accumulation. Early stopping based on validation loss was used to prevent overfitting. The entire model training framework is illustrated in Figure 3, which outlines the end-to-end workflow for both approaches: a purely data-driven version and a knowledge-enhanced version enriched with guideline annotations. The model-training phase proceeds through five sequential steps: (1) loading the pre-trained LLM and configuring the LoRA adapters; (2) preparing and preprocessing the input data; (3) selecting training strategies and optimizing hyper-parameters; (4) defining the loss function and evaluation metrics; and (5) training and validating the model. After validation, the optimised LoRA parameters are merged with the base weights to produce the final, clinically interpretable predictor.

Figure 3.

The framework of guideline-enhanced fine-tuning of LLM.

Baselines

To comprehensively evaluate the performance of our proposed LLM-driven method, we compared it with three categories of baseline methods.

Traditional Machine Learning Methods: These included Support Vector Machine (SVM), Naive Bayes, K-Nearest Neighbor (KNN), Logistic Regression (LR), Gradient Boosting Decision Tree (GBDT), Decision Tree (DT), and Random Forest (RF). These methods classify data based on linear separability (SVM), posterior probabilities (Naive Bayes), proximity of data points (KNN), logistic transformations (LR), ensemble strategies to sequentially correct prediction errors (GBDT), entropy-based partitioning (DT), and ensemble averaging to minimize overfitting (RF).

Deep Learning Methods: Deep learning approaches included Long Short-Term Memory (LSTM) networks, which capture temporal dependencies, Convolutional Neural Networks (CNNs) designed to identify local feature patterns, and Transformer models leveraging self-attention mechanisms for modeling complex, long-range dependencies in sequential data.

Large Language Model without Fine-Tuning (Prompt-based): As an additional baseline, we employed the pretrained LLM (Qwen2.5-72B) without fine-tuning, making predictions based solely on crafted input prompts.

Statistical analysis

Normally distributed continuous data were expressed as mean ± standard deviation (X ± S) and compared between groups using the independent-samples t-test. Non-normally distributed data were expressed as median (P25, P75) and compared using the Mann-Whitney U test. Categorical characteristics were expressed as frequencies (percentages) and compared using the χ² test. A two-sided p-value of less than 0.05 was considered statistically significant. Statistical analyses were performed using SPSS software (Version 29.0).

To quantitatively evaluate the performance of the model in predicting mortality risk in patients with sepsis, we used the following metrics: (1) Area Under the Receiver Operating Characteristic Curve (AUROC), indicating overall predictive capability; (2) Sensitivity (Recall), the proportion of correctly identified death cases; (3) Specificity, the proportion of correctly identified survival cases; (4) Accuracy, reflecting overall correct predictions; and (5) F1-score, the harmonic mean of precision and recall, particularly suitable for imbalanced classification tasks.
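
For readers who want the definitions spelled out, the five metrics can be computed from labels and predicted scores as in the sketch below. It is a plain-Python illustration on made-up data, not the evaluation code used in the study; the 0.5 decision threshold is an assumption, and AUROC is computed through its pairwise-ranking interpretation (the probability that a randomly chosen death case receives a higher score than a randomly chosen survivor, with ties counted as one half).

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy and F1 with death coded as 1.

    Assumes both classes occur and at least one positive prediction is made.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / len(y_true),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

def auroc(y_true, scores):
    """AUROC as the fraction of (death, survivor) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative labels and predicted mortality probabilities (made up).
y_true = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.4, 0.3, 0.6, 0.1, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(confusion_metrics(y_true, y_pred), auroc(y_true, scores))
```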

All model training and evaluation were performed on a hardware platform comprising eight NVIDIA A100 GPUs (80 GB memory each), Intel Xeon Gold 6248 CPUs (64 cores, 2.5 GHz), 512 GB RAM, and 4 TB of NVMe SSD storage. The software environment comprised Python 3.10, PyTorch 2.1, and CUDA 12.6 for GPU acceleration. Mini-batches were drawn with stratified sampling at a 1:5 death-to-survival ratio.

Results

Baseline characteristics of patients

A total of 32,970 patients from the MIMIC-IV database fulfilled the criteria for sepsis, of whom 24,237 were included after applying exclusion criteria. Among these patients, 3,568 (14.7%) died during hospitalization. Table 3 summarizes the baseline demographic and clinical characteristics of survivors and non-survivors. Non-survivors were significantly older (p < 0.01), presented higher SOFA and SAPS II scores (both p < 0.01), and exhibited worse vital signs and laboratory values. Additionally, compared with survivors, non-survivors had a higher proportion receiving vasoactive medications (48.1% vs 23.6%, p < 0.01), invasive mechanical ventilation (76.5% vs 56.2%, p < 0.01), and renal replacement therapy (14.5% vs 3.0%, p < 0.01).

Table 3.

Baseline features of 24,237 sepsis patients categorized by in-hospital mortality.

Features	Total (N = 24,237)	Survivors (N = 20669)	Non-survivors (N = 3568)	p-Value
Demographic information
Male gender, n (%)	14033 (57.9%)	12040 (58.2%)	1993 (55.9%)	0.007
Age (years)	64.7 ± 16	64.1 ± 16	68.5 ± 15	<0.01
Severity of illness
SOFA [median (IQR)]	5 (3-8)	5 (3-7)	8 (5-11)	<0.01
SAPS II [median (IQR)]	38 (31-48)	37 (30-45)	50 (40-61)	<0.01
GCS [median (IQR)]	15 (13-15)	15 (13-15)	15 (12-15)	0.975
Vital signs
Mean arterial pressure, minimum (mmHg)	52 (38-69)	58 (46-71)	49 (34-65)	0.046
Heart rate (bpm)	104 (91-119)	103 (90-117)	111 (95-127)	<0.01
Respiratory rate (bpm)	28 (24-32)	27 (24-32)	30 (26-34)	<0.01
Laboratory findings
White blood cells (K/uL)	13 (9.1-17.9)	12.8 (9-17.5)	14.5 (9.8-20.3)	<0.01
Platelets (K/uL)	162 (111-230)	163 (113-229)	158 (90-234)	0.14
Creatinine (mg/dL)	1.1 (0.8-1.8)	1.1 (0.8-1.7)	1.6 (1.0-2.7)	<0.01
Bilirubin (mg/dL)	1.0 (0.6-1.9)	1.0 (0.6-1.7)	1.2 (0.6-2.7)	<0.01
Troponin-T (ng/mL)	0.8 (0.2-1.5)	0.8 (0.2-1.4)	0.85 (0.19-2.1)	<0.01
Albumin (g/dL)	1.8 (0.9-3.0)	2.2 (1.1-3.2)	1.6 (0.8-3.0)	<0.01
Glucose (mg/dL)	138 (112-182)	138 (111-176)	158 (122-219)	<0.01
PaO2/FiO2 (mmHg)	120 (84-190)	122 (86-190)	112 (72-185)	<0.01
Hemoglobin (g/dL)	9.6 (8.2-11.1)	9.6 (8.2-11.1)	9.4 (8.0-11.1)	<0.01
pH	7.33 (7.27-7.39)	7.33 (7.28-7.39)	7.30 (7.20-7.39)	<0.01
BUN (mg/dL)	23 (15-38)	21 (14-35)	33 (21-54)	<0.01
Comorbidity, n (%)
CHD	5200 (21.5%)	4346 (21%)	854 (23.9%)	<0.01
CKD	4476 (18.5%)	3691 (17.9%)	785 (22%)	<0.01
Liver disease	2070 (8.5%)	1598 (7.7%)	472 (13.2%)	<0.01
Hypertension	2983 (12.3%)	2520 (12.2%)	463 (13%)	0.188
Diabetes	7770 (32.1%)	6640 (32.1%)	1126 (31.6%)	0.488
Treatment status within 24 h, n (%)
Vasopressor (first 24 h)	6603 (27.2%)	4886 (23.6%)	1717 (48.1%)	<0.01
Mechanical ventilation (first 24 h)	14343 (59.2%)	11615 (56.2%)	2728 (76.5%)	<0.01
Renal replacement therapy	1130 (4.7%)	612 (3.0%)	518 (14.5%)	<0.01

SOFA sequential organ failure assessment, SAPS II simplified acute physiology score II, GCS Glasgow coma scale, CHD coronary heart disease, CKD chronic kidney disease.

Comparative evaluation of model performance

To systematically evaluate the performance of the proposed guideline-enhanced fine-tuned LLM (Qwen2.5-72B + LoRA) for predicting mortality risk in patients with sepsis, we conducted comprehensive comparative experiments. Traditional machine learning algorithms including Support Vector Machine (SVM), Naive Bayes (Bayes), K-Nearest Neighbor (KNN), Logistic Regression (LR), Gradient Boosting Decision Tree (GBDT), Decision Tree (DT), and Random Forest (RF) were selected as baseline models. Furthermore, classic deep learning models such as Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), and Transformer were included to enrich the comparative analysis. The proposed method is denoted as LLM+Knowledge+SFT. All baseline hyper-parameters were optimised by 8-fold cross-validated grid/random search. To ensure internal robustness, the 80 % training set was further subjected to stratified eight-fold cross-validation. Model hyper-parameters were tuned on the fold-specific development subset, and the optimised model was retrained on the entire 80 % before final evaluation on the independent 10 % test cohort.
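
The stratified eight-fold partition described above can be illustrated with a small sketch. It is not the authors' code: the function name, the round-robin assignment, and the toy labels are assumptions, and the seed of 42 is borrowed from Table 2 purely for reproducibility of the example.

```python
import random

def stratified_folds(labels, n_folds=8, seed=42):
    """Assign sample indices to folds so each fold keeps the class balance."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across folds.
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds

# Toy cohort: 16 deaths (1) and 80 survivors (0).
labels = [1] * 16 + [0] * 80
folds = stratified_folds(labels)

for held_out in range(len(folds)):
    dev_idx = folds[held_out]                     # fold-specific development subset
    fit_idx = [i for f in folds if f is not dev_idx for i in f]
    # A model would be fitted on fit_idx and tuned on dev_idx here.
```

Each fold in this toy example holds 2 deaths and 10 survivors, so every development subset preserves the overall class ratio.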

Table 4 shows that the proposed guideline-enhanced fine-tuned large language model (LLM+Knowledge+SFT) demonstrated superior performance across multiple metrics compared with traditional machine learning methods (e.g., SVM, Bayes, KNN, LR, GBDT, DT, and RF) and classic deep learning approaches (LSTM, CNN, Transformer), achieving the highest Accuracy (0.819), F1 Score (0.815), and Area Under the ROC Curve (AUC = 0.852). Among traditional machine learning algorithms, ensemble methods such as Gradient Boosting Decision Tree (GBDT) and Random Forest (RF) exhibited relatively strong predictive capability, whereas deep learning models like Long Short-Term Memory (LSTM) and Transformer models showed moderate performance, likely limited by their dependency on larger annotated datasets and the inherent complexity of clinical data features.

Table 4.

Experimental results comparing various models on the sepsis mortality prediction task.

Model	Accuracy	F1 score	AUC	Sensitivity	Specificity
SVM	0.702	0.704	0.753	0.693	0.714
Bayes	0.569	0.009	0.506	0.997	0.005
KNN	0.659	0.647	0.706	0.704	0.599
LR	0.766	0.769	0.833	0.742	0.797
GBDT	0.774	0.772	0.850	0.784	0.760
DT	0.687	0.682	0.686	0.707	0.659
RF	0.770	0.769	0.831	0.774	0.765
MLP	0.665	0.641	0.708	0.505	0.876
LSTM	0.762	0.765	0.841	0.735	0.797
Transformer	0.673	0.671	0.815	0.585	0.788
LLM+Knowledge+SFT (Ours)	0.819	0.815	0.852	0.815	0.822

SVM, Support Vector Machine; Bayes, Naive Bayes; KNN, K-Nearest Neighbor; LR, Logistic Regression; GBDT, Gradient Boosting Decision Tree; DT, Decision Tree; RF, Random Forest; MLP, Multilayer Perceptron; LSTM, Long Short-Term Memory; LLM, Large Language Model.

While the absolute AUC improvement over the strongest baseline (GBDT) is modest (+0.002), the consistent gains over GBDT across all evaluation metrics, particularly the relative improvements in specificity (+8.2%, from 0.760 to 0.822) and accuracy (+5.8%, from 0.774 to 0.819), highlight the clinical utility of guideline-enhanced fine-tuning.
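
The percentages quoted in this comparison correspond to relative gains over GBDT rather than absolute differences, which can be verified from the Table 4 values with a short calculation (the helper names are illustrative):

```python
# Values taken from Table 4.
ours = {"accuracy": 0.819, "specificity": 0.822, "auc": 0.852}
gbdt = {"accuracy": 0.774, "specificity": 0.760, "auc": 0.850}

def absolute_gain(metric):
    """Difference in metric value between the proposed model and GBDT."""
    return ours[metric] - gbdt[metric]

def relative_gain_pct(metric):
    """Gain expressed as a percentage of the GBDT value."""
    return absolute_gain(metric) / gbdt[metric] * 100

for metric in ours:
    print(metric,
          round(absolute_gain(metric), 3),
          round(relative_gain_pct(metric), 1))
```

In absolute terms the specificity and accuracy gains are 0.062 and 0.045, respectively.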

Furthermore, the ROC curves presented in Figure 4 clearly illustrate the superior predictive performance of the guideline-enhanced fine-tuned LLM compared with other baseline methods, highlighting its improved stability and robustness. Overall, our novel integration of clinical guidelines into supervised fine-tuning significantly enhances the accuracy, interpretability, and clinical applicability of the LLM-driven mortality risk prediction, demonstrating considerable potential for informing clinical decision-making and patient management in ICU settings.

Figure 4.

ROC curves presenting the performance of machine learning methods in the MIMIC cohort (N = 24,237). ROC, receiver operating characteristic; SVM, Support Vector Machine; KNN, K-Nearest Neighbor; LR, Logistic Regression; GBDT, Gradient Boosting Decision Tree; DT, Decision Tree; RF, Random Forest; LSTM, Long Short-Term Memory; LLM, Large Language Model.

Ablation study

To deeply investigate the specific contributions of model fine-tuning and domain knowledge enhancement, we designed detailed ablation experiments, comparing three setups: LLM + Prompt (direct prompting without fine-tuning), LLM + SFT (fine-tuning without domain knowledge), and LLM+Knowledge+SFT (complete method).

As shown in Table 5, the guideline-enhanced fine-tuned model (LLM + Knowledge + SFT) achieved higher predictive performance than both the purely prompt-based approach (LLM + Prompt) and the fine-tuned model without explicit domain knowledge (LLM + SFT). Specifically, accuracy improved from 0.709 (LLM + Prompt) to 0.786 (LLM + SFT) and further increased to 0.819 with guideline integration; F1-score and AUC similarly improved from 0.678 and 0.706 (LLM + Prompt) to 0.815 and 0.852, respectively. In addition, we tested DeepSeek with the same prompt templates (Deepseek + Prompt), which reached an accuracy of 0.758 and an AUC of 0.744. These results demonstrate that supervised fine-tuning substantially enhances model performance, and that integrating explicit clinical guideline knowledge further improves predictive accuracy and robustness. Collectively, these findings highlight the essential role of domain-specific expertise in developing clinically relevant artificial intelligence tools for sepsis mortality prediction.

Table 5.

Ablation study comparing different training strategies.

Model	Accuracy	F1 score	AUC	Sensitivity	Specificity
LLM + Prompt	0.709	0.678	0.706	0.675	0.681
Deepseek + Prompt	0.758	0.743	0.744	0.725	0.762
LLM + SFT	0.786	0.778	0.801	0.765	0.792
LLM+Knowledge+SFT (Ours)	0.819	0.815	0.852	0.815	0.822

Compared with the GBDT baseline, the LLM + Knowledge + SFT model correctly re-classified 22 additional deaths and 252 additional survivors in the 4,848-case test set. A paired AUROC test yielded p = 0.02, indicating that the performance difference between the two models is statistically significant at the 0.05 level.

Discussion

Sepsis remains a leading cause of in-hospital mortality, particularly in immunocompromised populations,²⁵ necessitating accurate and timely risk prediction for early intervention. Traditional methods for predicting sepsis mortality have primarily relied on structured clinical data and statistical models, such as logistic regression and the Sequential Organ Failure Assessment (SOFA) score.²⁶ Although these models have provided useful insights into sepsis mortality risk, their predictive performance remains limited due to inadequate incorporation of comprehensive multidimensional and multi-omics information, underscoring the need for better predictive tools in infection management.²⁷ Recently, machine learning (ML) techniques have shown superior performance in mortality prediction by leveraging high-dimensional data, enabling them to capture intricate patterns overlooked by traditional models.

Multiple investigations have leveraged ML models to forecast mortality from sepsis, utilizing structured data from extensive clinical databases such as MIMIC-III, MIMIC-IV, and the eICU Collaborative Research Database.²⁸ Techniques including Random Forest (RF),²⁹ XGBoost,³⁰ and LightGBM have consistently outperformed traditional regression-based approaches. For instance, one group³¹ developed an RF model in the MIMIC-IV database, achieving significantly superior results compared to conventional SOFA-based models. Notably, prior studies utilizing traditional ML approaches have achieved comparable or marginally superior discriminative ability: one study³⁰ reported an AUC of 0.857 using XGBoost on MIMIC-III data, while another³² attained an AUC of 0.888 with gradient-boosted methods in a large administrative cohort. By comparison, our guideline-enhanced LLM achieved an AUC of 0.852. Although these slight differences in absolute performance metrics warrant acknowledgment, our approach confers advantages that extend beyond discriminative power alone: it generates clinically interpretable, guideline-anchored rationales that enhance trust and actionability beyond “black-box” predictions. While computational demands and single-center validation require targeted improvement, this approach prioritizes evidence-based clinical utility over marginal metric gains, bridging a critical gap in AI-assisted critical care decision-making.

Deep learning approaches, such as Long Short-Term Memory (LSTM) networks³³ and Transformer-based architectures,³⁴ have also been explored. These methods effectively capture temporal dependencies within ICU time-series data. Although deep learning methods theoretically possess superior feature extraction abilities, their performance in our study did not surpass that of traditional methods. This outcome may arise from the complexity and heterogeneity of clinical data, where deep learning approaches require substantial amounts of annotated data to achieve effective generalization. Given the relatively limited dataset used in our study, the full potential of deep learning models might have been restricted. Moreover, their clinical applicability is impeded by poor interpretability.³⁵ To mitigate this, some studies have incorporated Shapley Additive Explanations (SHAP) to enhance model transparency and perform more detailed feature importance analyses.³²

In contrast, our guideline-enhanced LLM generates evidence-grounded rationales alongside its clinical predictions through real-time knowledge graph queries to the guideline ontology. This self-contained interpretability mechanism produces clinically actionable narratives that directly support clinical decision-making, rendering supplementary validation via SHAP or attention weight analyses superfluous.

A recent advancement in sepsis mortality prediction involves the integration of large language models (LLMs) with clinical datasets. Unlike traditional machine learning (ML) approaches, LLMs can process unstructured clinical notes, capturing qualitative insights about patient conditions. Studies have demonstrated that combining structured clinical data with LLM-derived summaries enhances predictive accuracy.³⁶ For instance, integrating ChatGPT-generated clinical summaries with ICU data notably improved the area under the receiver operating characteristic curve (AUC), underscoring the benefits of multi-representational learning.³⁷ In contrast, we embed the 2021 Surviving Sepsis Campaign guidelines as a machine-readable knowledge graph and inject its rule-based annotations into the prompt during LoRA fine-tuning. Our approach therefore differs fundamentally from prior LLM-based frameworks such as the Sepsis Early Risk Assessment (SERA) algorithm, which uses natural language processing (NLP) to extract clinically relevant information from unstructured notes. While SERA demonstrates the value of unstructured data, our model explicitly embeds guideline-based diagnostic criteria and therapeutic pathways through real-time knowledge graph traversal, ensuring alignment with evidence-based protocols.^38,39 Given its promising performance, the guideline-enhanced LLM could be leveraged within clinical workflows to facilitate early triage decisions, prioritizing high-risk septic patients for intensive care unit admission and aggressive resuscitation, or prompting earlier reassessment and treatment escalation for deteriorating patients in general wards. However, successful integration requires overcoming barriers such as interoperability with diverse Electronic Health Record (EHR) systems, and it requires building clinician trust through robust validation and interpretable, explainable outputs.
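To make the idea of injecting rule-based guideline annotations into a prompt more concrete, the following is a minimal sketch rather than the authors' implementation. The rule names, feature keys, thresholds, and prompt wording are all hypothetical placeholders chosen for illustration; the actual knowledge graph and prompt template are not described in this section.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class GuidelineRule:
    """One hypothetical guideline-derived rule (illustrative only)."""
    name: str
    applies: Callable[[Dict[str, float]], bool]
    annotation: str


# Hypothetical rules with illustrative thresholds; not taken from the
# paper's knowledge graph.
RULES: List[GuidelineRule] = [
    GuidelineRule(
        name="elevated_lactate",
        applies=lambda p: p.get("lactate_mmol_l", 0.0) > 2.0,
        annotation="Lactate is above 2 mmol/L.",
    ),
    GuidelineRule(
        name="low_map",
        applies=lambda p: p.get("map_mmhg", float("inf")) < 65.0,
        annotation="Mean arterial pressure is below 65 mmHg.",
    ),
]


def annotate(patient: Dict[str, float]) -> List[str]:
    """Return the annotations of every rule the patient record triggers."""
    return [rule.annotation for rule in RULES if rule.applies(patient)]


def build_prompt(patient: Dict[str, float]) -> str:
    """Serialize features and triggered annotations into one prompt string."""
    features = "; ".join(f"{key}={value}" for key, value in sorted(patient.items()))
    notes = annotate(patient)
    note_block = "\n".join(f"- {note}" for note in notes) if notes else "- none triggered"
    return (
        f"Patient features: {features}\n"
        f"Guideline annotations:\n{note_block}\n"
        "Task: estimate the mortality risk for this patient."
    )


if __name__ == "__main__":
    print(build_prompt({"lactate_mmol_l": 3.1, "map_mmhg": 60.0}))
```

In a fine-tuning setting, each training example would pair such a prompt with the observed outcome label, so that the model sees the guideline annotations alongside the raw features.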

However, most current LLM-based studies rely heavily on simple prompting strategies and have not systematically integrated domain-specific clinical guidelines into their training process, potentially limiting their predictive capabilities and clinical applicability.^40–42 To address this gap, our study proposes an innovative approach that explicitly integrates clinical guideline knowledge with supervised fine-tuning (SFT) of an LLM (Qwen2.5-72B) using low-rank adaptation (LoRA) techniques. Our results show that this guideline-enhanced fine-tuned LLM (LLM+Knowledge+SFT) significantly outperforms traditional machine learning models such as SVM, GBDT, and RF, as well as classical deep learning approaches including LSTM and Transformer, in terms of accuracy, F1 score, and AUC. Ablation studies further confirmed that supervised fine-tuning notably improved predictive performance (accuracy increased from 0.709 to 0.786), and explicit incorporation of guideline-based knowledge provided additional significant performance gains (accuracy improved to 0.819, AUC to 0.852). These findings clearly highlight the critical value of embedding expert clinical guideline knowledge into LLM fine-tuning processes to enhance predictive robustness and interpretability, representing a significant advancement in clinical decision support.
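As background on why low-rank adaptation is parameter-efficient, the sketch below illustrates the general LoRA idea: the pretrained weight W stays frozen and only a low-rank update (alpha / r) * B A is trained. The layer dimensions, rank, and scaling value used here are assumptions for illustration and are not the configuration reported for this study.

```python
from typing import List

Matrix = List[List[float]]


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Trainable parameters if the full d_out x d_in weight were updated."""
    return d_in * d_out


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for a small demonstration."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B A, where r is the number of rows of A."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)  # (d_out x r) times (r x d_in) gives d_out x d_in
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


if __name__ == "__main__":
    # Hypothetical square projection layer and rank, for illustration only.
    d, r = 8192, 16
    ratio = lora_param_count(d, d, r) / full_param_count(d, d)
    print(f"LoRA trains {ratio:.4%} of the parameters of this layer")
```

The ratio shrinks as the layer grows, which is why the adapter approach remains feasible for very large models even though, as discussed among the limitations, the frozen base model must still be held in GPU memory.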

Despite these promising results, our study has limitations. First, the model was developed and validated using data from a single-center database (MIMIC-IV). This database, derived from an urban tertiary hospital, was characterized by a predominantly Caucasian population (median age ≈65 years) with a mean SOFA score of 5, which limits generalizability to other demographic groups and healthcare settings. Second, the substantial computational burden associated with fine-tuning the Qwen2.5-72B architecture presents significant deployment barriers in clinical settings. Even with parameter-efficient LoRA techniques, this process requires approximately 9 GPU-hours across eight A100-80 GB cards. Such resource-intensive procedures may prove impractical in resource-constrained clinical settings.⁴³ More critically, the requisite hardware configurations, including multiple high-end GPUs and supporting cooling systems, entail substantial initial capital investment that could impose prohibitive financial burdens on healthcare institutions, creating particularly acute challenges for intensive care units in community hospitals or developing regions.⁴⁴ Third, guideline dependency introduces potential biases. Our knowledge graph exclusively encodes the 2021 adult Surviving Sepsis Campaign guidelines, which predominantly reflect Western medical practices. This framework currently lacks adaptation for paediatric or obstetric populations, excludes considerations for traditional medicine protocols in low-resource settings, and may not accommodate rapid evidence updates, potentially diminishing performance in these contexts. Fourth, real-world integration barriers remain unaddressed. The model was evaluated on retrospectively batched data without direct interfaces to bedside monitoring systems. Future implementations must integrate real-time ICU data streaming to process live vitals and lab results, while rigorously evaluating latency, alert fatigue, and adoption thresholds to ensure timely identification of high-risk patients.

In conclusion, this study demonstrates that incorporating clinical guidelines into the fine-tuning process of large language models significantly enhances the prediction accuracy, stability, and interpretability of sepsis mortality risk assessment. Building on these results, we have begun coordinating multicentre experiments with three hospitals in our regional medical consortium to evaluate the model’s performance across different case-mixes and practice patterns. In parallel, we are collaborating with the ICU information-system vendor to create a lightweight middleware that streams real-time vital signs and laboratory data to the model and feeds the resulting risk score plus guideline-based rationale back to the bedside dashboard. These efforts will enable prospective validation and seamless deployment in routine critical-care workflows.

Conclusion

Our guideline-enhanced LLM outperformed clinical scoring systems and computational baselines across key metrics, validating that explicit domain knowledge integration is essential for clinically viable AI. This approach enhances model robustness and generalisability, providing a replicable framework for reliable mortality prediction in critical care. The methodology enables trustworthy clinical decision-support for early risk stratification and personalised interventions.

Footnotes

Author note

Dr Ruiyi Zhu from our research team completed the required database training and was authorized to access the MIMIC database (record ID: 59980404).

Acknowledgement

The authors would like to acknowledge the support provided by various individuals and organizations during the course of this research. We thank Dr Miaorong Xie for his valuable insights and technical assistance. We also extend our gratitude to the National Natural Science Foundation of China for their financial support through the project (No. 82374069).

ORCID iD

Zhen Zhao

Ethics approval

The Ethics Committee of Beijing Friendship Hospital Affiliated to Capital Medical University waived the need for ethics approval and patient consent for the collection, analysis and publication of the retrospectively obtained and anonymised data for this non-interventional study.

Authors Contribution

All authors contributed to the conceptualization and design of the study. Zhen Zhao and Bo An are co-first authors. They conceived the study, designed the experimental protocols, conducted the literature review, undertook the data analysis, drafted the initial manuscript and contributed to the subsequent drafts. Tianpeng Zhang performed statistical analyses, validated the data, and created visualizations. Ruiyi Zhu and Zihao Fan interpreted the data and summarized the features of the MIMIC patients. Guoxing Wang provided conceptual guidance, revised the manuscript, approved the final version, secured funding and oversaw the research integrity as corresponding author. All authors read and approved the final version.

Funding

This research was supported by the National Natural Science Foundation of China (grant number 82374069).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Appendix A. Summary of Patient Data Extracted from the MIMIC.

Category	Content
Demographic Information	Gender, Age, Body Mass Index (BMI)
Vital Signs (first 24 hours)	Heart Rate, Mean Arterial Pressure (MAP), Respiratory Rate, Peripheral Oxygen Saturation (SpO₂), Arterial Oxygen Partial Pressure to Fractional Inspired Oxygen Ratio (PaO₂/FiO₂ Ratio), Central Venous Pressure (CVP)
Laboratory Results (first 24 hours)	White Blood Cell Count (WBC), Red Blood Cell Count (RBC), Hemoglobin, Hematocrit (HCT), Platelet Count (PLT), Prothrombin Time (PT), Activated Partial Thromboplastin Time (PTT), Lactate Levels, Arterial Blood pH, Arterial Oxygen Partial Pressure (PaO₂), Alanine Aminotransferase (ALT), Aspartate Aminotransferase (AST), Albumin, Total Bilirubin, Blood Urea Nitrogen (BUN), Creatinine, Total Calcium, Chloride, Potassium, Sodium, Urine Output, Fluid Balance
Treatments (first 24 hours)	Use of Vasopressors, Mechanical Ventilation, Renal Replacement Therapy (RRT)
Baseline Comorbidities	Hypertension, Diabetes Mellitus, Acute Myocardial Infarction (AMI), Congestive Heart Failure, Chronic Kidney Disease (CKD)

References

1. Singer, Deutschman, Seymour, et al. The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA 2016; 315(8): 801–810.
2. Rudd, Johnson, Agesa, et al. Global, regional, and national sepsis incidence and mortality, 1990-2017: analysis for the Global Burden of Disease Study. Lancet 2020; 395(10219): 200–211.
3. Rhee, Jones, Hamad, et al. Prevalence, underlying causes, and preventability of sepsis-associated mortality in US acute care hospitals. JAMA Netw Open 2019; 2(2): e187571.
4. Liang, Moore, Soni. National inpatient hospital costs: the most expensive conditions by payer, 2017. Healthcare cost and utilization project (HCUP) statistical briefs. Agency for Healthcare Research and Quality (US), 2006.
5. Evans, Rhodes, Alhazzani, et al. Surviving sepsis campaign: international guidelines for management of sepsis and septic shock 2021. Crit Care Med 2021; 49(11): e1063–e1143.
6. Shariff, Kwan Su Huey, Parag Soni, et al. Unlocking the gut-heart axis: exploring the role of gut microbiota in cardiovascular health and disease. Ann Med Surg 2024; 86(5): 2752–2758.
7. Mary Nnagha, Kayode Ademola, Ann Izevbizua, et al. Tackling sickle cell crisis in Nigeria: the need for newer therapeutic solutions in sickle cell crisis management - short communication. Ann Med Surg 2023; 85(5): 2282–2286.
8. Yao, Jin, Wang, et al. A machine learning-based prediction of hospital mortality in patients with postoperative sepsis. Front Med 2020; 7: 445.
9. van Doorn, Stassen, Borggreve, et al. A comparison of machine learning models versus clinical evaluation for mortality prediction in patients with sepsis. PLoS One 2021; 16(1): e0245157.
10. Jiang, et al. An explainable machine learning algorithm for risk factor analysis of in-hospital mortality in sepsis survivors with ICU readmission. Comput Methods Progr Biomed 2021; 204: 106040.
11. Nemati, Holder, Razmi, et al. An interpretable machine learning model for accurate prediction of sepsis in the ICU. Crit Care Med 2018; 46(4): 547–553.
12. Rangan, Pathinarupothi, Anand KJS, et al. Performance effectiveness of vital parameter combinations for early warning of sepsis-an exhaustive study using machine learning. JAMIA Open 2022; 5(4): ooac080.
13. Singh, Khan, et al. A machine learning model for early prediction and detection of sepsis in intensive care unit patients. J Healthc Eng 2022; 2022: 9263391.
14. Egger, Gsaxner, Pepe, et al. Medical deep learning-A systematic meta-review. Comput Methods Progr Biomed 2022; 221: 106874.
15. Lauritsen, Kalør, Kongsgaard, et al. Early detection of sepsis utilizing deep learning on electronic health record event sequences. Artif Intell Med 2020; 104: 101820.
16. Suzuki. Overview of deep learning in medical imaging. Radiol Phys Technol 2017; 10(3): 257–273.
17. Smith, Oakden-Rayner, Bird, et al. Machine learning and deep learning predictive models for long-term prognosis in patients with chronic obstructive pulmonary disease: a systematic review and meta-analysis. Lancet Digit Health 2023; 5(12): e872–e881.
18. Wornow, Thapa, et al. The shaky foundations of large language models and foundation models for electronic health records. npj Digit Med 2023; 6(1): 135.
19. Waldock, Zhang, Guni, et al. The accuracy and capability of artificial intelligence solutions in health care examinations and certificates: systematic review and meta-analysis. J Med Internet Res 2024; 26: e56532.
20. Azamfirei, Kudchadkar, Fackler. Large language models and the perils of their hallucinations. Crit Care 2023; 27(1): 120.
21. Sallam. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare (Basel) 2023; 11(6): 887.
22. Yuren Mao, Yuhang Ge, et al. A survey on LoRA of large language models. Front Comput Sci 2025; 19: 197605.
23. Chua, Rusli KDB, Aitken. Early warning scores for sepsis identification and prediction of in-hospital mortality in adults with sepsis: a systematic review and meta-analysis. J Clin Nurs 2024; 33(6): 2005–2018.
24. Johnson, Bulgarelli, Pollard, et al. MIMIC-IV. PhysioNet, 2024. version 3.1. https://physionet.org/content/mimiciv
25. Mugisha, Ghanem, Komi OAI, et al. Addressing cardiometabolic challenges in HIV: insights, impact, and best practices for optimal management-A narrative review. Health Sci Rep 2025; 8(4): e70727.
26. Nikravangolsefid, Reddy, Truong, et al. Machine learning for predicting mortality in adult critically ill patients with sepsis: a systematic review. J Crit Care 2024; 84: 154889.
27. Bekele, Uwishema, Bisetegn, et al. Cholera in Africa: a climate change crisis. J Epidemiol Glob Health 2025; 15(1): 68.
28. Johnson AEW, Bulgarelli, Shen, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data 2023; 10(1): 1.
29. Zhang, Huang, et al. Prediction of prognosis in elderly patients with sepsis based on machine learning (random survival forest). BMC Emerg Med 2022; 22(1): 26.
30. Hou, et al. Predicting 30-days mortality for MIMIC-III patients with sepsis-3: a machine learning approach using XGboost. J Transl Med 2020; 18(1): 462.
31. Tian, Cui, Song, et al. Prediction of acute kidney injury in patients with liver cirrhosis using machine learning models: evidence from the MIMIC-III and MIMIC-IV. Int Urol Nephrol 2024; 56(1): 237–247.
32. Park, Hsu, et al. Predicting sepsis mortality in a population-based national database: machine learning approach. J Med Internet Res 2022; 24(4): e29982.
33. Wernly, Mamandipoor, Baldia, et al. Machine learning predicts mortality in septic patients using only routinely available ABG variables: a multi-centre evaluation. Int J Med Inform 2021; 145: 104312.
34. Tang, Zhang. A time series driven model for early sepsis prediction based on transformer module. BMC Med Res Methodol 2024; 24(1): 23.
35. Theodosiou, Read. Artificial intelligence, machine learning and deep learning: potential resources for the infection clinician. J Infect 2023; 87(4): 287–294.
36. Fralick, Sacks, Muller, et al. Large language models. NEJM Evid 2023; 2(8): EVIDstat2300128.
37. Song, et al. Early prediction of sepsis using ChatGPT-generated summaries and structured data. Multimed Tool Appl 2024; 83(41): 89521–89543.
38. Zhang, Sheng, Liu, et al. A heterogeneous multi-modal medical data fusion framework supporting hybrid data exploration. Health Inf Sci Syst 2022; 10(1): 22.
39. Goh, Wang, Yeow AYK, et al. Artificial intelligence in sepsis early prediction and diagnosis using unstructured data in healthcare. Nat Commun 2021; 12(1): 711.
40. Giuffrè, Kresevic, Pugliese, et al. Optimizing large language models in digestive disease: strategies and challenges to improve clinical outcomes. Liver Int 2024; 44(9): 2114–2124.
41. Ghassemi, Naumann, Schulam, et al. A review of challenges and opportunities in machine learning for health. AMIA Jt Summits Transl Sci Proc 2020; 2020: 191–200.
42. Otokiti, Ozoude, Williams, et al. The need to prioritize model-updating processes in clinical artificial intelligence (AI) models: protocol for a scoping review. JMIR Res Protoc 2023; 12: e37685.
43. Uwishema, Frederiksen, Correia IFS, et al. The impact of COVID-19 on patients with neurological disorders and their access to healthcare in Africa: a review of the literature. Brain Behav 2022; 12(9): e2742.
44. Uwishema, Boon. Bridging the gaps: addressing inequities in neurological care for underserved populations. Eur J Neurol 2025; 32(2): e70073.

Integrating clinical guidelines with large language models for improved sepsis mortality prediction
