Sage Journals: Discover world-class research

Abstract

Background

Stroke-associated pneumonia (SAP) is a major cause of mortality following ischemic stroke (IS). However, existing predictive models for SAP often lack transparency and interpretability, limiting their clinical utility. This study aims to develop an interpretable Bayesian network (BN) model for predicting SAP in IS patients, focusing on enhancing both predictive accuracy and clinical interpretability.

Methods

This retrospective study included patients diagnosed with IS and admitted to the Second Affiliated Hospital of Nanchang University between January and December 2019. Clinical data collected within 48 h of admission and SAP occurrences within 7 days were analyzed. Dimensionality reduction was performed using Least Absolute Shrinkage and Selection Operator regression, and class imbalance was addressed using the synthetic minority oversampling technique. A BN model was trained using a hill-climbing algorithm and compared with logistic regression, decision trees, deep neural networks, and existing risk-scoring systems. Decision curve analysis was used to assess clinical usefulness.

Results

Of the 1252 patients, 165 (13.18%) developed SAP within 7 days of admission. The BN model identified age, risk of pressure injury (PI), National Institutes of Health Stroke Scale (NIHSS) score, and C-reactive protein (CRP) as significant prognostic factors. The BN model achieved an area under the curve of 0.85 (95% CI: 0.78–0.92) on the test set, outperforming other models and demonstrating a greater net benefit in clinical decision-making.

Conclusions

Age, risk of PI, NIHSS score, and CRP are significant predictors of SAP in IS patients. The BN model combines superior predictive performance with interpretability, suggesting its potential as a tool for clinical decision support in SAP risk assessment.

Keywords

Bayesian network, stroke-associated pneumonia, ischemic stroke, interpretable predictive modeling, risk-scoring system

Introduction

Stroke is the second leading cause of death globally and a major cause of disability worldwide.^1,2 Among the various complications following a stroke, stroke-associated pneumonia (SAP)—defined as a respiratory tract infection occurring within the first 7 days post-stroke³—represents one of the most prevalent and concerning, with incidence rates reported to range from 9.4% to 27.8%.^4–7 SAP is associated with significantly increased mortality, prolonged hospitalization, and heightened healthcare costs.^8–10 Consequently, the early identification of patients at high risk for SAP is crucial for optimizing care, justifying enhanced monitoring, and enabling the timely implementation of tailored prophylactic interventions.

Risk prediction scores play a pivotal role in managing stroke patients and improving overall outcomes. Several risk factors for SAP, including age, gender, hyperglycemia, dysphagia, stroke severity, and atrial fibrillation, have been identified in previous studies.^11–13 Based on these factors, a number of predictive scoring systems, such as A²DS² and ISAN, have been developed.^14,15 However, the clinical utility and accuracy of these existing scoring systems for SAP remain debated, as they often fail to account for the complex interrelationships between risk factors.¹⁶ Traditional predictive models, such as logistic regression, are also limited in their capacity to model the intricate dependencies between risk factors.¹⁷ These models assume independent relationships between variables, which do not adequately reflect the complex, networked interactions that characterize many clinical phenomena.

In recent years, machine learning (ML) techniques have gained attention for their potential to enhance disease prediction. For example, Nelde et al.¹⁸ developed a prognostic ML model for SAP risk in the acute phase of stroke, achieving promising predictive performance. Among the various ML algorithms, decision trees are particularly appealing due to their transparency and interpretability, making them suitable for clinical application.¹⁹ On the other hand, deep neural networks, with their capacity to process large datasets, can provide highly accurate predictions, though they often lack the interpretability needed for clinical decision-making.²⁰

Bayesian networks (BNs), which employ directed acyclic graphs (DAGs) and conditional probability tables, offer a promising alternative to both traditional and ML-based models.²¹ By modeling the conditional dependencies among variables, BNs allow for a nuanced understanding of the interactions between risk factors. This approach has been successfully applied in medical diagnosis, statistical decision-making, and predictive modeling, providing interpretable and clinically relevant insights into complex health conditions.^22–24

Despite the development of various prediction methods for SAP, significant challenges persist in understanding the underlying relationships between risk factors and quantifying their individual contributions. Many existing models lack interpretability, hindering their clinical applicability. This study aims to address these limitations by constructing an interpretable BN model that not only identifies key risk factors for SAP but also predicts its occurrence in ischemic stroke (IS) patients. To evaluate the effectiveness of our approach, we perform a comparative analysis against two widely used interpretable models (logistic regression and decision trees), deep neural networks, and existing risk-scoring systems. Furthermore, we assess the clinical utility of our BN model, demonstrating its potential for improving clinical decision-making over traditional risk models.

Methods

Study design and participants

This retrospective study utilized data extracted from the electronic medical record system of the Second Affiliated Hospital of Nanchang University, covering the period from January to December 2019. Patients included in the analysis were diagnosed with acute ischemic stroke (AIS) based on the criteria established by the Cerebrovascular Diseases Group of the Chinese Medical Association Neurology Society in 2014²⁵ and were admitted within 48 h of symptom onset. All patients were aged 18 years or older. The exclusion criteria were (1) clinical signs of infection upon admission; (2) a history of pulmonary infection, non-infectious pulmonary interstitial disease, pulmonary tuberculosis, lung tumor, pulmonary edema, pulmonary embolism, or atelectasis prior to the onset of IS; (3) incomplete clinical data.

Data collection

Data were collected upon hospital admission, including patient demographics (e.g. sex and age) and known risk factors for cerebrovascular disease, such as a prior stroke, hypertension, diabetes, atrial fibrillation, coronary heart disease, smoking, and alcohol consumption. Clinical variables included stroke severity, assessed using the National Institutes of Health Stroke Scale (NIHSS); functional status, measured by the Modified Rankin Scale; disturbance of consciousness (DOC); and assessments using the Braden Scale, Glasgow Coma Scale (GCS), A²DS² score, and ISAN score. Laboratory data were obtained within 24 h of admission through venous blood sampling. Laboratory results included white blood cell count, neutrophil count, lymphocyte count, neutrophil-to-lymphocyte ratio, red blood cell count, thrombocytocrit, hemoglobin, platelet count, plasma fibrinogen, total protein, albumin, albumin-to-globulin ratio, total cholesterol, triglycerides, high-density lipoprotein (HDL), low-density lipoprotein, non-HDL cholesterol, and C-reactive protein (CRP). This comprehensive dataset allows for a thorough analysis of the factors contributing to the risk of SAP in IS patients.

Outcome assessment

The primary outcome was the occurrence of pneumonia during hospitalization within the first 7 days after admission. Pneumonia was diagnosed based on the 2019 Chinese Expert Consensus on the Diagnosis and Treatment of Stroke-Associated Pneumonia.²⁶ The clinical diagnostic criteria for SAP included: (1) at least one of the following: fever ≥38°C, leukopenia (≤4 × 10⁹/L), leukocytosis (≥10 × 10⁹/L), age ≥70 years with altered consciousness; and (2) at least two of the following: newly emerged purulent sputum or increased respiratory secretion, newly emerged or aggravated cough/dyspnea/shortness of breath (respiratory rate >25/min), rales/crackles on lung auscultation, impaired gas exchange (e.g. hypoxemia with PaO₂/FiO₂ ≤ 300, increased oxygen demand), or chest imaging showing infiltrating shadows or progression of lung opacities.

Bayesian network modeling

The BN is a probabilistic graphical model that uses DAGs to describe the conditional dependencies between variables.²⁷ BNs provide a clear and intuitive representation of joint probability distributions, making them particularly useful for causal reasoning and risk prediction in medical contexts.^28,29 The BN model was learned in two stages: (1) structural learning (identifying the network structure based on the data), and (2) parameter learning (estimating the conditional probability distributions for the nodes).³⁰ The model was trained using the hill-climbing algorithm, a score-based method that iteratively explores possible network structures by adding, deleting, or reversing arcs to maximize the model's score.³¹ The performance of the BN model was validated using fivefold cross-validation, with the log-likelihood loss function used to evaluate model accuracy.
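The score-based search described above can be illustrated with a minimal sketch. This is not the bnlearn implementation used in the study: it assumes discrete data, uses the BIC as the network score (the score actually used is not stated here), counts parameters from observed parent configurations only, and runs on a hypothetical three-variable toy dataset in which B copies A and C is independent of both.

```python
import itertools
import math
from collections import Counter

def node_bic(data, child, parents):
    """BIC contribution of one discrete node given its parent set."""
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    joint_counts = Counter(
        (tuple(row[p] for p in parents), row[child]) for row in data
    )
    log_lik = sum(
        n * math.log(n / parent_counts[pa]) for (pa, _), n in joint_counts.items()
    )
    child_levels = len({row[child] for row in data})
    # Simplification: only parent configurations seen in the data are counted.
    n_params = (child_levels - 1) * len(parent_counts)
    return log_lik - 0.5 * math.log(len(data)) * n_params

def network_score(data, dag):
    """Sum of per-node BIC terms; dag maps each node to its set of parents."""
    return sum(node_bic(data, v, sorted(dag[v])) for v in dag)

def is_acyclic(dag):
    state = {}  # 1 = on the current DFS path, 2 = fully explored
    def visit(v):
        if state.get(v) == 1:
            return False
        if state.get(v) == 2:
            return True
        state[v] = 1
        ok = all(visit(p) for p in dag[v])
        state[v] = 2
        return ok
    return all(visit(v) for v in dag)

def neighbours(dag):
    """Yield every DAG reachable by adding, deleting, or reversing one arc."""
    for a, b in itertools.permutations(dag, 2):
        new = {v: set(ps) for v, ps in dag.items()}
        if a in dag[b]:                      # arc a -> b exists
            new[b].discard(a)
            yield new                        # delete it
            rev = {v: set(ps) for v, ps in new.items()}
            rev[a].add(b)                    # reverse it to b -> a
            if is_acyclic(rev):
                yield rev
        else:
            new[b].add(a)                    # add a -> b
            if is_acyclic(new):
                yield new

def hill_climb(data, nodes):
    """Greedy search: take the best single-arc change until none improves the score."""
    dag = {v: set() for v in nodes}
    best = network_score(data, dag)
    while True:
        scored = [(network_score(data, cand), cand) for cand in neighbours(dag)]
        top_score, top_dag = max(scored, key=lambda t: t[0])
        if top_score <= best + 1e-9:
            return dag, best
        dag, best = top_dag, top_score

# Hypothetical toy data: B is a copy of A, C is independent of both.
rows = [{"A": a, "B": a, "C": c} for a in (0, 1) for c in (0, 1)] * 25
dag, best = hill_climb(rows, ["A", "B", "C"])
edges = sorted((parent, child) for child, ps in dag.items() for parent in ps)
```

On this toy data the search keeps a single arc between A and B and leaves C unconnected. The direction of that arc is not identified by the score alone, which is one reason learned arcs should be read as dependencies rather than as proven causal directions.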

Risk-scoring systems

For comparison purposes, two established risk-scoring systems were used: the A²DS² score, developed from the Berlin Stroke Register to predict pneumonia during stroke hospitalization,¹⁴ and the ISAN score, derived from a national UK registry to predict pneumonia risk within the first 7 days of IS.¹⁵ These systems were selected based on the variables available in the current dataset and their relevance to SAP prediction.

Model building and interpretation

The overall model-building process is illustrated in Figure 1. The dataset was randomly split into a training set (70% of the patients) and a test set (30%). To address missing data, we applied the missForest imputation algorithm in R. Since class imbalance is a common issue in medical datasets, we employed the synthetic minority oversampling technique (SMOTE) to balance the distribution of the target variable.³² SMOTE generates new instances for the minority class by interpolating between nearby examples, which is intended to improve model robustness while limiting the overfitting that simple duplication of minority cases can cause.³³ SMOTE was applied only to the training set to avoid data leakage. To reduce dimensionality, Least Absolute Shrinkage and Selection Operator (LASSO) regression was used.³⁴ In addition to the BN model, we also explored other statistical and ML algorithms, including logistic regression, decision trees, and a multilayer perceptron (deep neural network). The performance of these models was assessed using multiple evaluation metrics: area under the curve (AUC) with 95% confidence intervals, calibration curves, and decision curves. Decision curve analysis was used to evaluate clinical utility, with a higher net benefit indicating better clinical applicability.³⁵
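The interpolation step of SMOTE can be sketched as follows. This is a simplified illustration for numeric features only, not the implementation used in the study; the function name `smote`, the parameter `k`, and the example rows are assumptions made for the sketch.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating towards near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # k nearest minority-class neighbours by Euclidean distance
        nearest = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(nearest)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical minority-class rows: (age in years, NIHSS score)
sap_cases = [(72.0, 9.0), (80.0, 17.0), (66.0, 4.0), (75.0, 12.0)]
new_cases = smote(sap_cases, n_new=6, k=2)
```

Each synthetic row lies on a line segment between two real minority cases, so it stays within the range of the observed values. In practice the features would usually be scaled first so that no single variable dominates the distance.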

Figure 1.

Overall process of model building. BN: Bayesian network; DNN: deep neural networks.

Statistical analysis

Categorical variables were expressed as counts and percentages, and continuous variables as medians with interquartile ranges (IQR). Differences between the SAP and non-SAP groups were assessed using Student's t-test (for normally distributed data), the Mann–Whitney U test (for non-normally distributed data), and chi-square or Fisher's exact test (for categorical variables). The bnlearn package was used to develop the BN model, while decision trees and deep neural networks were implemented using the party and keras packages in R, respectively. Model performance for decision trees and deep neural networks was evaluated using fivefold cross-validation to ensure robustness and prevent overfitting. All statistical tests were two-tailed, with significance set at p < 0.05. Data were analyzed using R 4.3.1 (The R Foundation, Vienna, Austria).

Results

Baseline characteristics and potential risk factors for SAP

A total of 1252 patients were included in this study, comprising 793 (63.34%) males and 459 (36.66%) females. Of these, 165 (13.18%) patients developed SAP within 7 days of admission. The proportions of missing data for each variable are summarized in Table S1. Table 1 provides an overview of the baseline characteristics of the study cohort. Patients with SAP were older and exhibited a higher prevalence of risk factors. Statistically significant differences were observed in most laboratory results between patients with and without SAP. To identify significant prognostic factors, LASSO regression was employed for dimensionality reduction (Figure 2). After considering clinical relevance and the practical context of the study, we selected the following six predictors: age, CRP level, NIHSS score, GCS score, DOC, and Braden Scale score for risk of pressure injury (PI).

Figure 2.

LASSO regression for dimensionality reduction. LASSO: Least Absolute Shrinkage and Selection Operator.
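For reference, LASSO selection for a binary outcome is commonly written as an L1-penalized logistic regression. The form below is the standard textbook objective, given as an assumption about the general approach rather than the exact specification used in the study. Here y_i indicates SAP for patient i, x_i is the vector of candidate predictors, n is the number of patients, p is the number of predictors, and λ is the penalty strength.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
  - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

As λ increases, more coefficients are shrunk exactly to zero, and the predictors with non-zero coefficients are the ones retained.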

Table 1.

Baseline characteristics of the study population.

Variable	N	Non-SAP(N = 1087)	SAP(N = 165)	p-value^a
Male, n (%)	1252	689 (63%)	104 (63%)	0.999
Age, median (IQR) (years)	1252	65 (56, 74)	72 (63, 79)	<0.001
White blood cell count, median (IQR) (×10⁹/L)	1243	6.87 (5.61, 8.60)	7.97 (6.29, 10.25)	<0.001
Neutrophil count, median (IQR) (×10⁹/L)	1243	4.48 (3.41, 6.05)	6.19 (4.50, 8.47)	<0.001
Lymphocyte count, median (IQR) (×10⁹/L)	1243	1.58 (1.19, 2.02)	1.14 (0.81, 1.49)	<0.001
Neutrophil to lymphocyte ratio, median (IQR)	1243	2.7 (1.9, 4.2)	5.7 (3.3, 8.6)	<0.001
Red blood cell count, median (IQR) (×10¹²/L)	1243	4.40 (4.03, 4.77)	4.27 (3.89, 4.58)	0.001
Thrombocytocrit, median (IQR) (%)	1225	0.22 (0.19, 0.26)	0.20 (0.16, 0.25)	<0.001
Hemoglobin, median (IQR) (g/L)	1243	133 (122, 144)	129 (117, 140)	0.002
Blood platelet count, median (IQR) (×10⁹/L)	1243	204 (167, 245)	188 (147, 231)	0.001
Plasma fibrinogen, median (IQR) (g/L)	1212	2.74 (2.35, 3.27)	3.03 (2.46, 3.84)	<0.001
Total protein, median (IQR) (g/L)	1162	65.1 (61.4, 69.1)	64.3 (59.7, 68.0)	0.010
Albumin, median (IQR) (g/L)	1162	37.2 (35.1, 39.4)	35.8 (33.2, 38.1)	<0.001
Ratio of albumin to globulin, median (IQR)	1160	1.33 (1.19, 1.48)	1.25 (1.12, 1.41)	<0.001
Total cholesterol, median (IQR) (mmol/L)	1170	4.59 (3.85, 5.26)	4.19 (3.48, 5.02)	<0.001
Triacylglycerol, median (IQR) (mmol/L)	1170	1.30 (0.99, 1.83)	1.03 (0.78, 1.31)	<0.001
High-density lipoprotein, median (IQR) (mmol/L)	1170	1.10 (0.92, 1.32)	1.17 (0.93, 1.43)	0.035
Low-density lipoprotein, median (IQR) (mmol/L)	1170	2.63 (2.10, 3.19)	2.33 (1.85, 2.85)	<0.001
Non-high-density lipoprotein, median (IQR) (mmol/L)	1170	3.43 (2.77, 4.13)	3.05 (2.38, 3.65)	<0.001
C-reactive protein, median (IQR) (mg/L)	1121	3 (2, 7)	9 (4, 36)	<0.001
Coronary heart disease, n (%)	1252	58 (5.3%)	11 (6.7%)	0.500
COPD, n (%)	1252	1 (<0.1%)	3 (1.8%)	0.008
Diabetes, n (%)	1252	380 (35%)	41 (25%)	0.010
Hypertension, n (%)	1252	802 (74%)	123 (75%)	0.800
Hyperlipidemia, n (%)	1252	230 (21%)	18 (11%)	0.002
Atrial fibrillation, n (%)	1252	70 (6.4%)	40 (24%)	<0.001
Heart failure, n (%)	1252	45 (4.1%)	20 (12%)	<0.001
GCS, median (IQR)	1252	15 (14, 15)	12 (8, 14)	<0.001
Braden score, median (IQR)	1227	21 (18, 23)	14 (12, 19)	<0.001
Fall risk assessment, median (IQR)	1228	35 (25, 45)	45 (30, 55)	<0.001
DOC, n (%)	1252	34 (3.1%)	60 (36%)	<0.001
NIHSS, median (IQR)	1252	3 (1, 6)	9 (4, 17)	<0.001
Smoke, n (%)	1252	212 (20%)	29 (18%)	0.600
Drink, n (%)	1252	200 (18%)	25 (15%)	0.300
mRS > 2, n (%)	1252	327 (30%)	110 (67%)	<0.001

^a Pearson's Chi-squared test; Wilcoxon rank sum test; Fisher's exact test.

Abbreviations: COPD: chronic obstructive pulmonary disease; GCS: Glasgow Coma Scale; DOC: disturbance of consciousness; NIHSS: National Institutes of Health Stroke Scale; mRS: Modified Rankin Scale score; SAP: stroke-associated pneumonia.
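As an illustration of how a categorical row of Table 1 is tested, the sketch below computes Pearson's chi-squared statistic for the atrial fibrillation row (70 of 1087 non-SAP patients versus 40 of 165 SAP patients). It assumes no continuity correction and uses the closed-form tail probability for one degree of freedom. The function name `chi_square_2x2` is hypothetical, and the study's own analysis was run in R.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row = [a + b, c + d]
    col = [a + c, b + d]
    stat = sum(
        (observed[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
        for i in range(2)
        for j in range(2)
    )
    # Upper-tail probability of a chi-squared variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Atrial fibrillation row of Table 1: 70 of 1087 non-SAP, 40 of 165 SAP
stat, p = chi_square_2x2(70, 1087 - 70, 40, 165 - 40)
```

The resulting p-value is far below 0.001, in line with the value reported in the table.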

Bayesian network structure of SAP assessment

The BN model, developed using the training set, showed no statistical difference in performance between training and test sets.

The final BN model consisted of seven nodes and 12 directed edges, emphasizing the importance of key variables such as age, CRP, NIHSS score, and Braden Scale score in directly influencing SAP risk. Indirect relationships were observed for GCS and DOC. Notably, age and CRP were found to influence SAP both directly and indirectly through their associations with NIHSS levels. The NIHSS score, in turn, influenced SAP both directly and via its connections with GCS and DOC. The BN structure is depicted in Figure 3.

Figure 3.

The Bayesian network structure.

Our findings highlight the significant predictors of SAP in IS patients. The conditional probability distribution from the BN model (Table 2) indicates a 94% likelihood of SAP in patients with moderate-to-severe or severe stroke (NIHSS > 15), age over 65 years, elevated CRP (> 7.40 mg/L), and moderate or high risk of PI (Braden Scale score ≤ 14). Identifying such high-risk patients early allows clinicians to implement preventive measures such as feeding modification, oral care, airway management, and position management.³⁶

Table 2.

Conditional probabilities of the final Bayesian network.

Age (years)	Braden_score	NIHSS	CRP (mg/L)	P(SAP = no)	P(SAP = yes)
≤65	no risk of pressure injury	0–4	≤7.40	0.95	0.05
≤65	no risk of pressure injury	0–4	>7.40	0.91	0.09
>65	no risk of pressure injury	0–4	≤7.40	0.78	0.22
>65	no risk of pressure injury	0–4	>7.40	0.61	0.39
≤65	mild risk	0–4	≤7.40	0.68	0.32
≤65	mild risk	0–4	>7.40	1.00	0.00
>65	mild risk	0–4	≤7.40	0.74	0.26
>65	mild risk	0–4	>7.40	0.56	0.44
≤65	moderate risk	0–4	≤7.40	0.43	0.57
≤65	moderate risk	0–4	>7.40	1.00	0.00
>65	moderate risk	0–4	≤7.40	0.44	0.56
>65	moderate risk	0–4	>7.40	0.33	0.67
≤65	high risk	0–4	≤7.40	0.00	1.00
>65	high risk	0–4	≤7.40	0.00	1.00
≤65	no risk of pressure injury	5–15	≤7.40	0.84	0.16
≤65	no risk of pressure injury	5–15	>7.40	0.46	0.54
>65	no risk of pressure injury	5–15	≤7.40	0.88	0.12
>65	no risk of pressure injury	5–15	>7.40	0.18	0.82
≤65	mild risk	5–15	≤7.40	1.00	0.00
≤65	mild risk	5–15	>7.40	0.59	0.41
>65	mild risk	5–15	≤7.40	0.85	0.15
>65	mild risk	5–15	>7.40	0.34	0.66
≤65	moderate risk	5–15	≤7.40	0.53	0.47
≤65	moderate risk	5–15	>7.40	0.25	0.75
>65	moderate risk	5–15	≤7.40	0.50	0.50
>65	moderate risk	5–15	>7.40	0.37	0.63
≤65	high risk	5–15	≤7.40	0.67	0.33
≤65	high risk	5–15	>7.40	0.23	0.77
>65	high risk	5–15	≤7.40	0.22	0.78
>65	high risk	5–15	>7.40	0.14	0.86
≤65	no risk of pressure injury	>15	≤7.40	1.00	0.00
>65	mild risk	>15	≤7.40	0.25	0.75
>65	mild risk	>15	>7.40	0.00	1.00
≤65	moderate risk	>15	≤7.40	1.00	0.00
≤65	moderate risk	>15	>7.40	0.33	0.67
>65	moderate risk	>15	≤7.40	0.07	0.93
>65	moderate risk	>15	>7.40	0.06	0.94
≤65	high risk	>15	>7.40	0.09	0.91
>65	high risk	>15	≤7.40	0.00	1.00
>65	high risk	>15	>7.40	0.06	0.94

Abbreviations: Braden score: Braden Scale scores were categorized as no risk of pressure injury (19 or higher), mild risk (15 to 18), moderate risk (13 to 14), and high risk (12 or lower); CRP: C-reactive protein; NIHSS: National Institutes of Health Stroke Scale; SAP: stroke-associated pneumonia.
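Reading Table 2 amounts to discretizing a patient's values and looking up the matching row. The sketch below shows this with the Braden categories from the table note and a handful of rows copied from Table 2. The function names and the ASCII band labels are assumptions made for the illustration, and rows not copied here simply return `None`.

```python
def braden_category(score):
    """Map a Braden Scale score to the risk categories used in Table 2."""
    if score >= 19:
        return "no risk of pressure injury"
    if score >= 15:
        return "mild risk"
    if score >= 13:
        return "moderate risk"
    return "high risk"

def nihss_band(score):
    """Map an NIHSS score to the three bands used in Table 2."""
    if score <= 4:
        return "0-4"
    if score <= 15:
        return "5-15"
    return ">15"

# A few rows copied from Table 2:
# (age band, Braden category, NIHSS band, CRP band) -> P(SAP = yes)
P_SAP = {
    ("<=65", "no risk of pressure injury", "0-4", "<=7.40"): 0.05,
    (">65", "no risk of pressure injury", "5-15", ">7.40"): 0.82,
    (">65", "moderate risk", ">15", ">7.40"): 0.94,
    (">65", "high risk", ">15", ">7.40"): 0.94,
}

def sap_probability(age, braden, nihss, crp):
    """Look up P(SAP = yes); returns None for rows not included in P_SAP."""
    key = (
        ">65" if age > 65 else "<=65",
        braden_category(braden),
        nihss_band(nihss),
        ">7.40" if crp > 7.40 else "<=7.40",
    )
    return P_SAP.get(key)

high_risk = sap_probability(age=72, braden=14, nihss=18, crp=20.0)
low_risk = sap_probability(age=50, braden=21, nihss=2, crp=3.0)
```

Some parent combinations do not appear in Table 2 at all, so a lookup of this kind needs an explicit fallback for missing rows.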

Model performance evaluation

The discrimination performance of the BN model and of the other predictive models (logistic regression, decision tree, deep neural network, and the two existing risk-scoring systems, A²DS² and ISAN) is presented in Figure 4 and Table 3. In the test set, the BN model achieved an AUC of 0.85 (95% CI: 0.78–0.92), exceeding the A²DS² and ISAN scores (AUC = 0.74 and 0.75, respectively), as well as the logistic regression, decision tree, and deep neural network models (AUC = 0.80, 0.75, and 0.76, respectively), although by DeLong's test the differences from the deep neural network (p = 0.111) and the ISAN score (p = 0.068) did not reach statistical significance. The BN model also had the highest accuracy (0.91) and specificity (0.98) in the test set, although its sensitivity (0.41) was lower than that of the other models. Figure 5 illustrates the relationship between the predicted risk of SAP derived from the BN model and the actual proportion of SAP observed in the training and test sets. Predicted and observed risks agreed closely, indicating the model's reliable performance. The Brier scores for the BN model were 0.141 in the training set and 0.142 in the test set, suggesting good calibration in both datasets.
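The two summary measures used above can be computed directly from predicted probabilities. The sketch below is a plain-Python illustration on hypothetical labels and predictions, not the study's data: the Brier score is the mean squared difference between predicted risk and outcome, and the AUC is computed as the proportion of positive-negative pairs that the model ranks correctly.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def auc(y_true, y_prob):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes (1 = SAP) and predicted risks for six patients
y = [0, 0, 0, 1, 0, 1]
p = [0.05, 0.22, 0.56, 0.44, 0.12, 0.86]

example_brier = brier_score(y, p)
example_auc = auc(y, p)
```

Lower Brier scores indicate better calibrated and sharper predictions, while an AUC of 0.5 corresponds to chance-level ranking.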

Figure 4.

Receiver operating characteristic (ROC) curves. BN: Bayesian network; DNN: deep neural networks.

Figure 5.

Calibration plots for the BN prediction models in training and test sets. BN: Bayesian network.

Table 3.

The performance of predictive models.

Model	Accuracy	Sensitivity	Specificity	F1 score	AUC (95% CI)	DeLong's test Z^a	DeLong's test P^a
Training set
BN	0.80	0.67	0.88	0.72	0.87(0.84–0.89)
Logistic	0.77	0.51	0.94	0.64	0.83(0.81–0.86)	4.79	<0.001
Decision tree	0.79	0.71	0.84	0.73	0.80(0.78–0.82)	9.07	<0.001
DNN	0.79	0.96	0.53	0.85	0.87(0.85–0.89)	0.43	0.670
ISAN	0.73	0.63	0.80	0.64	0.79(0.77–0.82)	4.49	<0.001
A²DS²	0.69	0.46	0.85	0.54	0.72(0.69–0.75)	7.90	<0.001
Testing set
BN	0.91	0.41	0.98	0.53	0.85(0.78–0.92)
Logistic	0.85	0.52	0.90	0.46	0.80(0.72–0.88)	2.46	0.014
Decision tree	0.79	0.57	0.82	0.39	0.75(0.67–0.83)	4.37	<0.001
DNN	0.89	0.95	0.43	0.94	0.76(0.67–0.85)	1.60	0.111
ISAN	0.79	0.55	0.82	0.38	0.75(0.67–0.83)	1.83	0.068
A²DS²	0.82	0.50	0.86	0.40	0.74(0.66–0.82)	2.00	0.046

^a Comparison with BN model.

BN: Bayesian network; DNN: deep neural networks; AUC: area under the curve.
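The threshold-based columns of Table 3 all derive from a confusion matrix. The sketch below shows the standard definitions. The counts are hypothetical and chosen only for illustration, since the study's confusion matrices are not reported here, and the function name is an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, and F1 score from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of true SAP cases that are flagged
    specificity = tn / (tn + fp)   # share of non-SAP cases that are not flagged
    precision = tp / (tp + fp)     # share of flagged cases that truly have SAP
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Hypothetical counts, not taken from the study
metrics = classification_metrics(tp=20, fp=7, tn=320, fn=29)
```

When the outcome is uncommon, accuracy is dominated by specificity, so a model can show high accuracy while missing many true cases. This is why sensitivity and the F1 score are worth reading alongside it.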

Figure 6 illustrates the decision curves for the BN model, A²DS² score, and ISAN score across both the training and test sets. The BN model consistently outperforms the A²DS² and ISAN scores, indicating a greater net benefit across the range of threshold probabilities. These findings suggest that the BN model provides superior clinical utility for SAP prediction and may support more effective intervention strategies.
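The net benefit plotted in a decision curve follows a simple formula: true positives per patient minus false positives per patient, with the latter weighted by the odds of the threshold probability. The sketch below applies it to hypothetical predictions, not the study's data, and includes the treat-all reference strategy. The function names are assumptions.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating every patient whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

def treat_all_net_benefit(y_true, threshold):
    """Net benefit of the reference strategy that treats every patient."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Hypothetical outcomes (1 = SAP) and predicted risks for six patients
y = [0, 0, 0, 1, 0, 1]
p = [0.05, 0.22, 0.56, 0.44, 0.12, 0.86]

nb_model = net_benefit(y, p, threshold=0.4)
nb_all = treat_all_net_benefit(y, threshold=0.4)
```

A decision curve repeats this calculation over a range of thresholds. A model is considered useful at a given threshold when its net benefit exceeds both the treat-all value and zero, the value for treating no one.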

Figure 6.

Decision curve of the Bayesian network (BN) model, ISAN score and A²DS² score.

Discussion

Given the increasing morbidity and mortality associated with SAP, early-stage prediction of SAP risk is crucial. In this study, we aimed to identify clinical and laboratory factors associated with SAP and to develop an interpretable BN model for assessing SAP risk. Our findings highlight the direct association of age, Braden score, NIHSS level, and CRP with SAP, while GCS and DOC showed indirect associations. Notably, the BN model demonstrated superior predictive robustness compared to two commonly used interpretable models (logistic regression and decision tree). Moreover, compared to traditional SAP prediction scores, including ISAN and A²DS², the BN model showed a better net benefit, highlighting its clinical utility.

Our results are consistent with previous studies linking age to an increased risk of SAP in stroke patients.³⁷ Older age is linked to a higher risk of post-stroke pneumonia,¹³ likely due to increased comorbidities and impaired swallowing function in older individuals.³⁸ As age increases, organ function and immunity decline, making the body more susceptible to infections, thereby increasing the risk of pulmonary infection.³⁷ This finding corroborates the importance of age in traditional SAP prediction models like ISAN and A²DS², which highlight age as a key predictor of pneumonia risk.

In addition to age, the BN model identified elevated CRP levels as a critical risk factor for SAP. Stroke triggers an inflammatory response, potentially reducing immunity and increasing the risk of lung infections, thus exacerbating inflammatory and immune response disorders. These events can further elevate CRP levels, thereby increasing pneumonia risk.¹⁶ Previous studies have indicated that elevated CRP levels are associated with the development of SAP.³⁹ The association between elevated CRP and the increased risk of SAP highlights the importance of monitoring inflammatory markers, such as CRP, to predict and intervene early in at-risk patients.

Furthermore, the severity of IS, as measured by the NIHSS score, was also a strong predictor of SAP. A higher NIHSS score is associated with more severe neurological deficits, which may lead to compromised respiratory function and reduced ability to clear respiratory secretions, thus increasing the risk of pneumonia. These findings are consistent with previous studies that report a correlation between stroke severity and the development of SAP.⁴⁰ In this context, incorporating NIHSS score into clinical practice as part of a risk assessment for SAP could help identify patients who need more intensive monitoring and preventive care.

PI risk, assessed via the Braden Scale, was another critical factor in predicting SAP. Stroke patients, particularly those who are bedridden or have impaired consciousness, are at higher risk of developing pressure injuries, which can compromise skin integrity and contribute to infections, including pneumonia.⁴¹ Moreover, prolonged immobility associated with stroke can impair normal mucociliary clearance, facilitating bacterial colonization and increasing the risk of respiratory infections.⁴² This finding suggests that assessing and managing PI risk early on may not only help prevent skin breakdown but also reduce the likelihood of SAP in stroke patients.

The psychological status of stroke patients, particularly depression, may influence the development of SAP. Depression has been identified as a risk factor for stroke,⁴³ and as a severe post-stroke complication, it adversely affects the prognosis of IS patients.⁴⁴ While our study did not include psychological indicators, existing research demonstrates that depression is commonly observed in stroke survivors.⁴⁵ Additionally, depression may influence immune function,⁴⁶ potentially increasing the risk of infections such as pneumonia. Incorporating psychological assessments into future studies could enhance stroke rehabilitation and improve long-term health outcomes.

One of the key strengths of the BN model is its interpretability, distinguishing itself from complex ML methods that rely on supplementary tools for explanation.^47,48 By representing relationships through a graphical structure and conditional probability distributions, BN provides clear insights into the direct and mediated effects of risk factors on SAP.^49–51 This interpretability is particularly valuable in clinical settings, where actionable and trustworthy insights are essential.

The BN model also provides a nuanced understanding of how variables interact. For example, age and CRP are both directly associated with SAP, but they also indirectly affect SAP risk through their relationship with NIHSS and the Braden score. This capability to model complex interdependencies among factors offers a more comprehensive view of the predictors of SAP, which is invaluable for decision-making in clinical practice.

This study had several limitations. First, there was an imbalance in the number of patients with and without SAP, with a notably smaller sample of patients with SAP. We used SMOTE to balance the class sizes, which helped improve model performance. However, this technique may not fully resolve the problems caused by class imbalance, and prior research has indicated that variability within the minority class may be underestimated, potentially degrading model generalizability.⁵² Future studies should prioritize datasets with a more balanced class distribution, which better reflect model performance in practical applications. Furthermore, SMOTE-based variants such as FLEX-SMOTE,⁵³ which adapts to diverse minority class distributions via density-guided synthetic sample generation, may further improve model performance. Second, due to data limitations, we were unable to compare the BN model with other widely used SAP prediction scales, such as the AIS-APS, which could provide a broader context for evaluating the model's performance. Future research should include comparisons with a wider range of existing models and improve the collection of the variables those models require, so that the BN model's performance in predicting SAP can be validated more robustly. Third, the discretization of continuous variables for BN modeling may have discarded valuable predictive information. Future studies could explore alternative methodologies, such as Gaussian BNs or Conditional Gaussian BNs,^54,55 which are better suited for integrating continuous variables and could potentially improve predictive accuracy. Finally, while the BN model demonstrated strong performance in the training and test sets, the findings are based on a single-center dataset, which may limit their generalizability. Future research should include multi-center studies with diverse patient populations to assess the model's external validity and robustness across different clinical settings.

Conclusion

In conclusion, our study demonstrates that an interpretable BN model can effectively predict the risk of SAP in IS patients. The model's superior predictive performance, coupled with its ability to clarify the interdependencies between risk factors, offers a significant advantage over traditional statistical methods and ML approaches. By providing a more detailed understanding of SAP risk, the BN model could help clinicians identify high-risk patients early, enabling timely interventions and reducing the incidence of SAP. Future research should focus on external validation using multi-center datasets to assess the model's generalizability. Additionally, ongoing efforts to refine the BN model, incorporating more diverse variables and advanced techniques, will enhance its predictive accuracy and applicability across different clinical settings.

Footnotes

ORCID iDs

Xingyu Liu

Jiali Mo

Zuting Liu

Yanqiu Ge

Tian Luo

Jie Kuang

Ethical considerations

This retrospective study was approved by the Biomedical Research Ethics Committee of the Second Affiliated Hospital of Nanchang University (IRB approval number: 20240522) and received a waiver of informed consent due to the use of nonidentifiable data.

Author contributions/CRediT

JK contributed to the study conception and design. Material preparation, data collection and analysis were performed by YG and TL. The first draft of the manuscript was written by XL, JM and ZL; all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by the National Natural Science Foundation of China (82360667, 82160645) and Natural Science Foundation of Jiangxi Province (20212BAB206091).

Conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability

The data that support the findings of this study are not publicly available. Anonymous data are available on reasonable request from the corresponding author.


Enhancing stroke-associated pneumonia prediction in ischemic stroke: An interpretable Bayesian network approach
