Identifying risk factors and predicting stroke using Bayesian networks: Evidence from NHANES 2011

Abstract

Background

Stroke is a leading cause of morbidity and mortality worldwide, representing a major cerebrovascular disorder. Early identification of stroke-related risk factors is essential for implementing effective prevention and management strategies. This study aimed to develop an interpretable Bayesian network (BN)-based predictive model to identify key risk factors associated with stroke and to elucidate their complex interdependencies.

Methods

This study analyzed cross-sectional data derived from the National Health and Nutrition Examination Survey (NHANES) spanning the period 2011–2020. Feature selection was performed using univariate and multivariate logistic regression analyses. The BN structure was constructed using the hybrid HPC algorithm (H2PC), with conditional probability distributions estimated via maximum likelihood estimation. Both qualitative and quantitative analyses were conducted to examine node probabilities and elucidate dependencies between stroke and associated risk factors. Model performance was primarily assessed using the area under the receiver operating characteristic curve (AUROC) and compared against established machine learning algorithms.

Results

The final analytical sample comprised 20,535 individuals. Bayesian network analysis identified five variables with direct dependency relationships to stroke occurrence: age, sleep disorders, alcohol consumption, coronary heart disease, and diabetes. The BN model achieved the highest AUROC on the test set, 0.803 (95% CI: 0.773–0.833), which was statistically comparable to that of random forest and higher than that of the other machine learning approaches.

Conclusions

The developed BN model provides an intuitive visualization of the probabilistic interdependencies among stroke risk factors while achieving competitive predictive accuracy. These findings demonstrate its exploratory value in unmasking complex risk pathways and suggest its potential to inform future stroke risk assessment and prevention strategies upon further longitudinal validation.

Keywords

Bayesian network; stroke; risk factors; disease prediction; machine learning

Introduction

Stroke represents a major global health challenge, constituting the second leading cause of death and third leading cause of disability worldwide according to the World Stroke Organization 2025 report.¹ The global burden of stroke has increased substantially over recent decades, with stroke incidence, mortality, and disability-adjusted life years (DALYs) rising by 70%, 44%, and 32%, respectively between 1990 and 2021—a trend that continues unabated.² This disease affects individuals across all demographic groups, transcending traditional age boundaries and no longer being confined to elderly populations.³ The majority of patients who developed hemorrhagic stroke during COVID-19 infection presented with underlying chronic conditions, such as hypertension or diabetes mellitus, which are recognized as established risk factors for stroke.⁴ Most stroke survivors experience persistent neurological sequelae of varying severity, necessitating intensive, long-term healthcare, rehabilitation, and social support. Such long-term disability creates substantial burdens for patients and healthcare systems alike.⁵ However, more than 85% of initial stroke events are potentially avoidable through effective primary prevention strategies.⁶ Therefore, early identification of high-risk populations and comprehensive management of modifiable risk factors are essential for reducing stroke incidence and disability burden while mitigating strain on healthcare resources.

Current stroke risk assessment paradigms rely predominantly on traditional prediction models that incorporate only demographic data, medical history, and general clinical parameters. Established tools such as the Framingham stroke risk profile,⁷ the CHA2DS2-VASc score,⁸ and the QStroke score,⁹ along with logistic regression models,¹⁰ have provided valuable frameworks for clinical decision-making. However, traditional statistical frameworks frequently fail to account for the intricate non-linear dependencies among multiple risk determinants. This limitation compromises their predictive precision and cross-population generalizability, particularly when confounded by disparities in socioeconomic status, ethnicity, and regional healthcare infrastructures, which ultimately fuels substantial inter-study heterogeneity. The emergence of machine learning (ML) technologies has opened new avenues for addressing these limitations. ML approaches demonstrate superior capability in handling high-dimensional datasets, capturing non-linear relationships, and identifying subtle patterns that may be overlooked by traditional statistical methods.¹¹,¹² Ischemic stroke pathogenesis involves a complex web of interdependent risk factors rather than isolated, independent variables.¹³ Unlike traditional logistic regression models, which often rely on the assumption of variable independence, Bayesian networks (BNs) construct graphical architectures through data-driven learning to explicitly represent complex interactions.¹⁴ This approach facilitates a more efficient utilization of multidimensional data and provides deep insights into the intricate, multifactorial dependencies underlying stroke occurrence.¹⁵ Crucially, the graphical nature of BNs ensures a transparent decision-making process, enabling clinicians to trace the underlying reasoning pathways rather than merely receiving a “black-box” prediction.¹⁶⁻¹⁸ This clinical utility is exemplified by previous work in post-stroke outcomes,¹⁹ suggesting that BNs are uniquely suited for integration into complex medical decision-support systems.

Although numerous predictive models for stroke have been developed, substantial challenges persist in elucidating the intrinsic relationships among risk factors and in quantifying their contributions. This study aims to develop a BN model for stroke prediction that captures complex interactions and dependencies among risk factors to advance understanding of stroke pathogenesis and inform evidence-based prevention strategies. The established model will enable individualized stroke risk assessment using patient-specific clinical variables and probabilistic inference, thereby supporting clinical decision-making and targeted interventions.

Methods

Data source and participants

This study utilized data from the National Health and Nutrition Examination Survey (NHANES) conducted by the National Center for Health Statistics (NCHS) of the Centers for Disease Control and Prevention (CDC). NHANES is a nationally representative, cross-sectional survey designed to assess the health and nutritional status of the non-institutionalized civilian population of the United States. The survey employs a complex, multistage probability sampling design to ensure representative estimates for the US population. Data from five consecutive NHANES cycles spanning 2011–2020 were included in this analysis. NHANES data collection encompasses demographic information, socioeconomic status, dietary intake, health status, physical measurements, and laboratory analyses of blood and urine specimens. The health interview is administered in participants’ homes, whereas the physical examination and laboratory assessments are conducted at Mobile Examination Centers (MECs) by trained medical personnel. The survey protocol was approved by the NCHS Research Ethics Review Board, and all participants provided written informed consent before participation. According to the NHANES data use guidelines, all data are de-identified and released under the public domain for unrestricted research use.

Data processing and feature selection

This study excluded participants aged less than 20 years and those with missing demographic data. A total of 20,535 participants were included in the final analysis. The complete data extraction flowchart is presented in Supplementary Figure S1. Variables with missing data rates exceeding 60% were excluded from the analysis to ensure data quality and statistical robustness. The final dataset comprised 31 variables: age, sex, race, education level, marital status, ratio of family income to poverty (RFIP), body mass index (BMI), waist circumference, systolic blood pressure (SBP), diastolic blood pressure (DBP), pulse, albumin creatinine ratio (ACR), direct high density lipoprotein cholesterol (HDL), triglyceride, low density lipoprotein cholesterol (LDL), glycohemoglobin, fasting glucose, diabetes, drinking, smoking, serum total folate, work hours, sleep disorder, aspirin use, coronary heart disease (CHD), thyroid disorders, liver condition, exercise increase, salt reduction, and fat reduction. The primary outcome for stroke was operationalized as a binary indicator (yes/no), based on participants’ self-reported physician diagnosis. This was derived from the standardized questionnaire item: “Has a doctor or other health professional ever told you that you had a stroke?”. Additionally, participants whose stroke-related responses were recorded as “Refuse”, “Don’t know”, or were missing were removed from the analysis.

For variables with remaining missing values, multiple imputation by chained equations (MICE) was performed using the predictive mean matching (PMM) method. This approach maintains the distributional properties of the original data while providing robust estimates for missing values. Continuous variables were discretized based on established clinical guidelines and expert consensus (Supplementary Table S1). Discretization was adopted to maximize clinical interpretability and to allow the model to capture complex non-linear interactions without the restrictive assumption of multivariate normality. By utilizing thresholds recognized in clinical practice, the resulting conditional probability distributions remain directly applicable to medical decision-making. To balance clinical interpretability with model parsimony, predictors were initially screened via univariate and multivariate logistic regression ($p < 0.05$) to mitigate the “curse of dimensionality” during subsequent BN learning. This pre-filtering constrained the search space to clinically robust variables, thereby preventing spurious edges and enhancing the stability of the resulting dependency structure.
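As an illustration of the discretization step, the following minimal sketch bins systolic blood pressure using the cut points that appear in Table 1 (<90, 90∼120, 121∼140, >140). It assumes integer-valued readings; the function name `discretize_sbp` and the label strings are hypothetical, and the authoritative thresholds remain those listed in Supplementary Table S1.

```python
from bisect import bisect_left

# Inclusive upper bounds of the first three SBP categories, read from Table 1.
SBP_UPPER_BOUNDS = [89, 120, 140]
SBP_LABELS = ["<90", "90~120", "121~140", ">140"]


def discretize_sbp(sbp_mmhg: int) -> str:
    """Map an integer SBP reading (mmHg) to its Table 1 category (hypothetical helper)."""
    # bisect_left returns the index of the first bound that is >= the reading,
    # so a reading equal to a bound falls into the category that bound closes.
    return SBP_LABELS[bisect_left(SBP_UPPER_BOUNDS, sbp_mmhg)]


if __name__ == "__main__":
    for reading in (85, 90, 120, 121, 140, 155):
        print(reading, "->", discretize_sbp(reading))
```

The same pattern could be applied to the other continuous variables by swapping in their own bounds and labels.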

Model construction and evaluation

BNs, also referred to as Bayesian belief networks or directed acyclic graphical models, represent a powerful probabilistic framework for modeling complex relationships among variables under uncertainty. A BN consists of two essential components: a directed acyclic graph (DAG) and a set of conditional probability distributions. The DAG comprises nodes representing random variables and directed edges encoding conditional associations between variables. Each node is associated with a conditional probability table (CPT) that quantifies the probabilistic relationship between the node and its parent variables. Without temporal ordering or experimental intervention, these associations should not be interpreted as evidence of causality. Formally, a BN defines a joint probability distribution over a set of variables $X = (X_{1}, X_{2}, \dots, X_{n})$ through the chain rule decomposition: $P(X_{1}, X_{2}, \dots, X_{n}) = \prod_{i=1}^{n} P(X_{i} \mid \mathrm{Parents}(X_{i}))$, where $\mathrm{Parents}(X_{i})$ denotes the set of parent nodes of variable $X_{i}$ in the graph structure.
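The chain rule decomposition can be made concrete with a small worked example. The sketch below uses a hypothetical three-node network (Age → Diabetes, Age → Stroke, Diabetes → Stroke) with invented CPT values that are not taken from the fitted model; it only shows that multiplying each node's conditional probability given its parents yields a valid joint distribution.

```python
from itertools import product

# Hypothetical CPTs, chosen only for illustration (not estimates from this study).
p_age = {"young": 0.6, "old": 0.4}
p_diabetes = {  # P(Diabetes | Age)
    "young": {"no": 0.9, "yes": 0.1},
    "old": {"no": 0.7, "yes": 0.3},
}
p_stroke = {  # P(Stroke | Age, Diabetes)
    ("young", "no"): {"no": 0.99, "yes": 0.01},
    ("young", "yes"): {"no": 0.95, "yes": 0.05},
    ("old", "no"): {"no": 0.92, "yes": 0.08},
    ("old", "yes"): {"no": 0.80, "yes": 0.20},
}


def joint(age: str, diabetes: str, stroke: str) -> float:
    """P(Age, Diabetes, Stroke) as the product of each node's CPT entry."""
    return p_age[age] * p_diabetes[age][diabetes] * p_stroke[(age, diabetes)][stroke]


# Summing the factorized joint over every configuration should give 1.
total = sum(
    joint(a, d, s) for a, d, s in product(p_age, ("no", "yes"), ("no", "yes"))
)

if __name__ == "__main__":
    print("P(old, diabetes, stroke) =", joint("old", "yes", "yes"))
    print("sum over all states      =", total)
```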

BNs support two primary computational tasks: learning (parameter learning and structure learning) and probabilistic inference. Parameter learning focuses on estimating conditional probability distributions from data, typically accomplished through maximum likelihood estimation or Bayesian parameter estimation. Structure learning, the more challenging task, involves discovering the optimal graph topology from observational data using score-based methods (e.g., Bayesian information criterion), constraint-based approaches (e.g., PC algorithm), or hybrid techniques. Probabilistic inference involves computing posterior probabilities of query variables given observed evidence. The hybrid HPC algorithm (H2PC) was utilized for BN construction. This approach integrates both constraint-based conditional independence testing and score-based optimization techniques. The selection of H2PC was motivated by its ability to significantly reduce the search space through local structural constraints while ensuring global score optimization, thereby enhancing both computational efficiency and structural accuracy in complex datasets. Model selection was performed through comparative analysis of log-likelihood scores, and to ensure biological plausibility, structural constraints were applied to prohibit edge directions that contradict temporal or clinical logic (e.g., preventing edges from disease outcomes to age or sex).
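For discrete nodes, maximum likelihood parameter learning reduces to normalizing counts within each parent configuration. The sketch below illustrates this with a single-parent table, P(Stroke | Diabetes), built from the unweighted counts in Table 1. It is an illustration only: the function `fit_cpt` is hypothetical, and the fitted model's CPT for Stroke conditions on all five parents and was estimated on the training subset.

```python
from collections import Counter, defaultdict


def fit_cpt(records, child, parents):
    """Maximum likelihood CPT: P(child | parents) = N(child, parents) / N(parents)."""
    joint_counts = Counter()
    parent_counts = Counter()
    for row in records:
        key = tuple(row[p] for p in parents)
        joint_counts[(key, row[child])] += 1
        parent_counts[key] += 1
    cpt = defaultdict(dict)
    for (key, value), n in joint_counts.items():
        cpt[key][value] = n / parent_counts[key]
    return dict(cpt)


# (Diabetes, Stroke) -> unweighted count, derived from Table 1:
# diabetes = Y: 314 stroke, 2667 non-stroke; the rest follow from the column totals.
COUNTS = {("Y", "Y"): 314, ("Y", "N"): 2667, ("N", "Y"): 918 - 314, ("N", "N"): 19617 - 2667}
records = [{"Diabetes": d, "Stroke": s} for (d, s), n in COUNTS.items() for _ in range(n)]

cpt = fit_cpt(records, child="Stroke", parents=["Diabetes"])

if __name__ == "__main__":
    for parent_state, dist in sorted(cpt.items()):
        print("Diabetes =", parent_state[0], {k: round(v, 4) for k, v in dist.items()})
```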

The dataset was divided into training and testing subsets using an 8:2 split for model assessment based on the identified features. To evaluate the predictive superiority of the BN, its performance was benchmarked against five distinct ML algorithms: extreme gradient boosting (XGBoost), random forest (RF), support vector machine (SVM), K-nearest neighbor (KNN), and artificial neural network (ANN). All models were developed and evaluated using the same standardized pipeline to ensure a fair comparison. To account for class distribution disparities, the Youden Index (sensitivity + specificity − 1) was employed to establish optimal probability thresholds for classification. Model performance was evaluated using the area under the receiver operating characteristic curve (AUROC) to assess overall discriminative ability. Given the class imbalance in the dataset, the area under the precision-recall curve (AUPRC), recall, and F1-score were also employed to provide a more robust evaluation of the model's capacity to identify the minority class (stroke cases) and to balance precision with sensitivity. Decision curve analysis (DCA) was employed to assess clinical utility by quantifying net clinical benefit across a range of threshold probabilities. Calibration curves were constructed to evaluate model reliability, specifically by examining the agreement between predicted probabilities and observed outcomes, an aspect of particular importance in the presence of class imbalance. To ensure robust and unbiased performance estimation, a 10-fold cross-validation strategy was implemented, with all reported metrics averaged across the test folds to mitigate stochastic variability.
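Threshold selection by the Youden Index can be sketched as follows. The example scans every observed score as a candidate cut-off and keeps the one that maximizes sensitivity + specificity − 1. The labels and scores are toy values invented for illustration, and the function name `youden_threshold` is hypothetical; the study's own implementation is not described in enough detail to reproduce.

```python
def youden_threshold(y_true, y_score):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A case is classified positive when its score is >= threshold.
    Ties are resolved in favor of the lowest threshold.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(y_score)):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
        tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < threshold)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy data: three stroke cases among eight participants (invented values).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.02, 0.03, 0.05, 0.08, 0.10, 0.12, 0.20, 0.35]
threshold, j_stat = youden_threshold(y_true, y_score)

if __name__ == "__main__":
    print("threshold =", threshold, "J =", round(j_stat, 3))
```

With a rare outcome, the selected cut-off typically lies well below 0.5, which is why a fixed 0.5 threshold would miss most stroke cases.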

Statistical analysis

All statistical computations were conducted using R software (version 4.3.3). All data were weighted according to NHANES analytic guidelines. Examination sample weights (WTMEC2YR) from five 2-year cycles (2011–2020) were combined and rescaled by a factor of 0.2 so that the total weight equals the size of the U.S. population represented by a single 2-year cycle. Variance estimates were obtained with the Taylor series linearization method to account for the complex, stratified, multistage cluster sampling design. Study participants were stratified into two cohorts according to stroke occurrence status. Categorical data were summarized as counts and proportions, with between-group comparisons performed using chi-square tests. Statistical significance was defined as $p < 0.05$. Cramér's V was calculated as an effect size to quantify the magnitude of each association. The Scott–Knott (SK) effect size difference test was employed to partition the performance metrics into distinct, non-overlapping groups and evaluate the statistical significance of their differences.²⁰ This hierarchical clustering-based approach inherently controls for Type I error inflation without the need for traditional post-hoc adjustments, ensuring robust results across multiple model comparisons. To visualize the BN topology and perform BN inference, GeNIe 2.0 software was employed.
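To show how the effect size is obtained, the sketch below computes Cramér's V from a contingency table using the Pearson chi-square statistic without continuity correction. It is applied to the unweighted CHD-by-stroke counts in Table 1, for which it gives approximately 0.154. The function name `cramers_v` is hypothetical, and the use of unweighted counts is an assumption of this example.

```python
from math import sqrt


def cramers_v(table):
    """Cramér's V for an r x c contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    smaller_dim = min(len(table), len(table[0]))
    return sqrt(chi2 / (n * (smaller_dim - 1)))


# Rows: non-stroke, stroke. Columns: CHD = Y, CHD = N (unweighted counts, Table 1).
chd_table = [[689, 19617 - 689], [169, 918 - 169]]
chd_v = cramers_v(chd_table)

if __name__ == "__main__":
    print("Cramér's V (CHD vs. stroke) =", round(chd_v, 3))
```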

Results

Baseline characteristics and potential risk factors for stroke

The detailed clinical characteristics of the participants are presented in Table 1. Among the 20,535 participants, 19,617 had no stroke history and 918 had experienced a stroke. The results indicate that age, race, education level, and other factors may affect the prevalence of stroke. Age showed the strongest association with stroke (Cramér's V = 0.198), and the proportion of participants with stroke rose with each advancing age group. Among stroke patients, 56.6% were aged >65 years. Males and females accounted for 51.7% and 48.3% of stroke patients, respectively, with no statistically significant difference between sexes (p = 0.899). The magnitude of the association between categorical risk factors and stroke was assessed using Cramér's V, with values below 0.2 considered indicative of a small effect size.

Table 1.

Baseline characteristics of patients.

Variables	Non-Stroke(N = 19617)	Stroke(N = 918)	Effect size	P value	Variables	Non-Stroke(N = 19617)	Stroke(N = 918)	Effect size	P value
Age (%)			0.198	<0.001	Pulse (%)			0.039	<0.001
20∼35	5356 (27.3)	24 (2.6)			<60	3611 (18.4)	237 (25.8)
36∼50	4979 (25.4)	84 (9.2)			60∼100	15707 (80.1)	668 (72.8)
51∼65	5184 (26.4)	290 (31.6)			>100	299 (1.5)	13 (1.4)
>65	4098 (20.9)	520 (56.6)			ACR >=30 (%)	2448 (12.5)	266 (29.0)	0.101	<0.001
Sex = F (%)	9520 (48.5)	443 (48.3)	0.001	0.899	HDL (%)			0.017	0.045
Race (%)			0.064	<0.001	<40	3982 (20.3)	215 (23.4)
Mexican	2298 (11.7)	63 (6.9)			40∼60	10401 (53.0)	455 (49.6)
Other Hispanic	1958 (10.0)	63 (6.9)			>60	5234 (26.7)	248 (27.0)
White	7311 (37.3)	405 (44.1)			Trigly (%)			0.024	0.003
Black	4794 (24.4)	296 (32.2)			<50	2315 (11.8)	97 (10.6)
Other Race	3256 (16.6)	91 (9.9)			50∼150	13107 (66.8)	582 (63.4)
Education (%)			0.073	<0.001	>150	4195 (21.4)	239 (26.0)
<9th	1609 (8.2)	114 (12.4)			LDL (%)			0.060	<0.001
9∼11th	2444 (12.5)	167 (18.2)			<100	7966 (40.6)	502 (54.7)
High school	4433 (22.6)	266 (29.0)			100∼130	6346 (32.3)	242 (26.4)
some college	6164 (31.4)	242 (26.4)			>130	5305 (27.0)	174 (19.0)
college graduate	4967 (25.3)	129 (14.1)			Glyco (%)			0.088	<0.001
Marital (%)			0.097	<0.001	<4	13 (0.1)	1 (0.1)
Married	12437 (63.4)	448 (48.8)			4∼6	15704 (80.1)	577 (62.9)
Separated	3499 (17.8)	332 (36.2)			>6	3900 (19.9)	340 (37.0)
Never married	3681 (18.8)	138 (15.0)			FGlucose (%)			0.080	<0.001
RFIP (%)			0.057	<0.001	<70	81 (0.4)	6 (0.7)
<1	4402 (22.4)	258 (28.1)			70∼110	14452 (73.7)	519 (56.5)
1∼2	5069 (25.8)	296 (32.2)			>110	5084 (25.9)	393 (42.8)
2.1∼3	2855 (14.6)	141 (15.4)			Folate (%)			0.054	<0.001
>3	7291 (37.2)	223 (24.3)			<5	381 (1.9)	16 (1.7)
BMI (%)			0.022	0.018	5∼25	14955 (76.2)	601 (65.5)
<18.4	314 (1.6)	21 (2.3)			>25	4281 (21.8)	301 (32.8)
18.4∼25	5469 (27.9)	224 (24.4)			Smoking (%)			0.025	0.002
25.1∼30	6181 (31.5)	279 (30.4)			Every day	6736 (34.3)	297 (32.4)
>30	7653 (39.0)	394 (42.9)			Some days	1756 (9.0)	56 (6.1)
Waist >=80 (%)	17394 (88.7)	862 (93.9)	0.034	<0.001	No	11125 (56.7)	565 (61.5)
SBP (%)			0.103	<0.001	Diabetes = Y (%)	2667 (13.6)	314 (34.2)	0.121	<0.001
<90	171 (0.9)	12 (1.3)			Drinking = Y (%)	2966 (15.1)	247 (26.9)	0.067	<0.001
90∼120	9508 (48.5)	259 (28.2)			Sleeping = Y (%)	5046 (25.7)	380 (41.4)	0.073	<0.001
121∼140	6663 (34.0)	343 (37.4)			Aspirin = Y (%)	914 (4.7)	58 (6.3)	0.016	0.025
>140	3275 (16.7)	304 (33.1)			CHD = Y (%)	689 (3.5)	169 (18.4)	0.154	<0.001
DBP (%)			0.038	<0.001	Thyroid = Y (%)	2034 (10.4)	160 (17.4)	0.047	<0.001
<60	2797 (14.3)	177 (19.3)			Liver = Y (%)	847 (4.3)	68 (7.4)	0.031	<0.001
60∼80	12287 (62.6)	519 (56.5)			InExercise = Y (%)	11576 (59.0)	479 (52.2)	0.029	<0.001
81∼90	3277 (16.7)	140 (15.3)			RedSalt = Y (%)	10528 (53.7)	602 (65.6)	0.049	<0.001
>90	1256 (6.4)	82 (8.9)			RedFat = Y (%)	11194 (57.1)	536 (58.4)	0.006	0.448

RFIP: ratio of family income to poverty; BMI: body mass index; SBP: systolic blood pressure; DBP: diastolic blood pressure; ACR: albumin creatinine ratio; HDL: direct high density lipoprotein cholesterol; LDL: low density lipoprotein cholesterol; Glyco: glycohemoglobin; FGlucose: fasting glucose; Sleeping: sleep disorder; CHD: coronary heart disease; InExercise: exercise increase; RedSalt: salt reduction; RedFat: fat reduction; Trigly: triglyceride.

Univariate and multivariate logistic regression (LR) analyses were performed for feature selection, with results presented in Table 2. The top five risk factors associated with stroke, reported as multivariate vs. univariate LR estimates, were age (OR, 2.510; 95% CI, 2.272∼2.777 vs. OR, 2.755; 95% CI, 2.542∼2.994), diabetes (OR, 1.526; 95% CI, 1.233∼1.888 vs. OR, 3.304; 95% CI, 2.863∼3.806), drinking (OR, 1.591; 95% CI, 1.342∼1.882 vs. OR, 2.067; 95% CI, 1.774∼2.400), sleep disorder (OR, 1.544; 95% CI, 1.334∼1.786 vs. OR, 2.040; 95% CI, 1.781∼2.334), and CHD (OR, 2.448; 95% CI, 1.993∼2.995 vs. OR, 6.199; 95% CI, 5.148∼7.430), all with $p < 0.001$. Sex, DBP, and HDL were not significant in either analysis and were removed from the candidate features, leaving 26 variables for subsequent modeling.

Table 2.

Univariate and multivariate binary logistic regression analysis.

Num	Variables	Multivariate LR			Univariate LR
Num	Variables	OR	95% CI	P value	OR	95% CI	P value
1	Age	2.510	2.272∼2.777	<0.001	2.755	2.542∼2.994	<0.001
2	Sex	0.855	0.729∼1.003	0.055	0.989	0.866∼1.129	0.872
3	Race	1.126	1.051∼1.206	<0.001	1.054	0.996∼1.115	0.069
4	Education	0.921	0.865∼0.98	0.009	0.769	0.731∼0.81	<0.001
5	Marital	1.156	1.049∼1.273	0.003	1.181	1.09∼1.279	<0.001
6	RFIP	0.874	0.815∼0.937	<0.001	0.805	0.761∼0.851	<0.001
7	BMI	1.020	0.926∼1.126	0.684	1.087	1.005∼1.176	0.037
8	Waist	1.062	0.777∼1.471	0.713	1.967	1.511∼2.615	<0.001
9	SBP	1.006	0.907∼1.116	0.915	1.783	1.641∼1.939	<0.001
10	DBP	1.075	0.971∼1.19	0.163	0.974	0.89∼1.065	0.571
11	Pulse	0.788	0.674∼0.924	0.003	0.669	0.578∼0.777	<0.001
12	ACR	1.407	1.188∼1.662	<0.001	2.861	2.463∼3.316	<0.001
13	HDL	0.939	0.834∼1.056	0.293	0.942	0.855∼1.038	0.227
14	Trigly	0.958	0.835∼1.099	0.541	1.199	1.068∼1.347	0.002
15	LDL	0.810	0.738∼0.889	<0.001	0.701	0.642∼0.764	<0.001
16	Glyco	0.808	0.655∼0.996	0.047	2.360	2.053∼2.709	<0.001
17	FGlucose	1.089	0.91∼1.301	0.350	2.088	1.826∼2.387	<0.001
18	Folate	1.106	0.953∼1.282	0.183	1.672	1.456∼1.917	<0.001
19	Diabetes	1.526	1.233∼1.888	<0.001	3.304	2.863∼3.806	<0.001
20	Drinking	1.591	1.342∼1.882	<0.001	2.067	1.774∼2.4	<0.001
21	Smoking	0.805	0.74∼0.876	<0.001	1.084	1.008∼1.166	0.030
22	SleepingDis	1.544	1.334∼1.786	<0.001	2.040	1.781∼2.334	<0.001
23	Aspirin	1.006	0.749∼1.329	0.965	1.380	1.039∼1.799	0.021
24	CHD	2.448	1.993∼2.995	<0.001	6.199	5.148∼7.43	<0.001
25	Thyroid	1.104	0.908∼1.335	0.315	1.825	1.525∼2.171	<0.001
26	Liver	1.125	0.851∼1.467	0.396	1.773	1.36∼2.274	<0.001
27	InExercise	0.867	0.745∼1.008	0.063	0.758	0.664∼0.866	<0.001
28	RedSalt	1.285	1.075∼1.539	0.006	1.645	1.432∼1.892	<0.001
29	RedFat	0.791	0.661∼0.948	0.011	1.056	0.924∼1.208	0.428

OR: odds ratio; $p < 0.05$ was regarded as statistically significant. RFIP: ratio of family income to poverty; BMI: body mass index; SBP: systolic blood pressure; DBP: diastolic blood pressure; ACR: albumin creatinine ratio; HDL: direct high density lipoprotein cholesterol; LDL: low density lipoprotein cholesterol; Glyco: glycohemoglobin; FGlucose: fasting glucose; SleepingDis: sleep disorder; CHD: coronary heart disease; InExercise: exercise increase; RedSalt: salt reduction; RedFat: fat reduction; Trigly: triglyceride.
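The screening outcome can be checked directly against Table 2. The sketch below assumes that a variable was retained when it reached $p < 0.05$ in at least one of the two analyses. This rule is inferred from the table rather than stated by the text, and it reproduces the removal of Sex, DBP, and HDL and the retention of 26 variables. Entries reported as "<0.001" are encoded with an arbitrary placeholder below the threshold.

```python
ALPHA = 0.05
LT = 0.0005  # placeholder for entries reported as "<0.001" (assumption of this sketch)

# variable -> (multivariate p, univariate p), transcribed from Table 2.
P_VALUES = {
    "Age": (LT, LT), "Sex": (0.055, 0.872), "Race": (LT, 0.069),
    "Education": (0.009, LT), "Marital": (0.003, LT), "RFIP": (LT, LT),
    "BMI": (0.684, 0.037), "Waist": (0.713, LT), "SBP": (0.915, LT),
    "DBP": (0.163, 0.571), "Pulse": (0.003, LT), "ACR": (LT, LT),
    "HDL": (0.293, 0.227), "Trigly": (0.541, 0.002), "LDL": (LT, LT),
    "Glyco": (0.047, LT), "FGlucose": (0.350, LT), "Folate": (0.183, LT),
    "Diabetes": (LT, LT), "Drinking": (LT, LT), "Smoking": (LT, 0.030),
    "SleepingDis": (LT, LT), "Aspirin": (0.965, 0.021), "CHD": (LT, LT),
    "Thyroid": (0.315, LT), "Liver": (0.396, LT), "InExercise": (0.063, LT),
    "RedSalt": (0.006, LT), "RedFat": (0.011, 0.428),
}

# Keep a variable if either analysis is significant; drop it otherwise.
retained = [name for name, ps in P_VALUES.items() if min(ps) < ALPHA]
removed = [name for name, ps in P_VALUES.items() if min(ps) >= ALPHA]

if __name__ == "__main__":
    print(len(retained), "retained;", "removed:", removed)
```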

Model performance evaluation

Figure 1 and Supplementary Table S2 present the performance metrics of the six ML models for stroke prediction, together with their decision curves and calibration curves. Hyperparameters were optimized using a grid search strategy; the detailed hyperparameter configurations for each model are presented in Supplementary Table S3. Among the evaluated models, BN achieved the highest discriminative ability, with an AUROC of 0.803 (95% CI: 0.773–0.833). This was followed by RF (0.795; 95% CI: 0.763–0.826) and ANN (0.784; 95% CI: 0.748–0.820). While XGBoost demonstrated moderate performance (0.737; 95% CI: 0.701–0.772), SVM and KNN exhibited relatively lower discriminative power (Figure 1A). Notably, the BN model achieved the highest recall (0.864), exceeding that of RF (0.787) and XGBoost (0.775). This sensitivity in identifying stroke cases is particularly important for clinical screening applications, where minimizing false negatives is a priority. The BN model also achieved a comparatively favorable AUPRC (0.139; Figure 1B) and F1-score (0.170). The decision curves showed that the net benefit of all models decreased as the threshold probability increased, and the BN model demonstrated good clinical suitability (Figure 1C). In the calibration plot, where a smaller gap between the apparent line and the dashed reference line indicates better agreement between predicted and observed values, BN exhibited the best fit between the predicted and actual diagnoses (Figure 1D). Ten-fold cross-validation was conducted to evaluate model stability and reduce overfitting risk. Table 3 summarizes the performance metrics averaged across all ten folds. The BN attained the highest AUROC (0.780) and recall (0.790), along with a competitive F1-score (0.176). These results indicate that the BN model is particularly effective at identifying true stroke cases (minimizing false negatives), which is a primary goal in clinical screening. According to the Scott–Knott test (Supplementary Figure S2), no statistically significant difference was found in AUROC, AUPRC, or recall between the BN and RF models; however, both models significantly outperformed the other evaluated models. Overall, the BN emerged as the optimal model for stroke prediction in this cohort.

Figure 1.

Comprehensive analysis of the ML models on the test set. A: area under the receiver operating characteristic curve (AUROC); B: area under the precision-recall curve (AUPRC); C: decision curves; D: calibration curves.

Table 3.

The predictive performance of six ML models using 10-fold cross-validation.

Model	Specificity	Precision	F1	Recall	Accuracy	AUROC	AUPRC
XGBoost	0.670	0.095	0.167	0.725	0.672	0.741	0.115
RF	0.665	0.101	0.177	0.781	0.670	0.772	0.139
ANN	0.681	0.099	0.173	0.729	0.684	0.736	0.134
KNN	0.850	0.104	0.162	0.369	0.829	0.609	0.088
SVM	0.593	0.077	0.137	0.692	0.598	0.672	0.106
BN	0.662	0.099	0.176	0.790	0.667	0.780	0.138
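The low precision and F1 values in Table 3 follow largely from the low prevalence of stroke in the sample (918 of 20,535 participants). As a rough consistency check, the sketch below derives precision from recall, specificity, and prevalence via Bayes' rule, then computes F1. Using the BN row, it gives values close to the reported 0.099 and 0.176. Small differences are expected because the table reports fold averages, and applying the full-sample prevalence to each fold is an assumption of this example.

```python
def precision_from_rates(recall: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value implied by recall, specificity, and prevalence."""
    true_pos = recall * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


prevalence = 918 / 20535  # stroke cases / all participants
bn_precision = precision_from_rates(recall=0.790, specificity=0.662, prevalence=prevalence)
bn_f1 = f1_score(bn_precision, 0.790)

if __name__ == "__main__":
    print("implied precision =", round(bn_precision, 3))
    print("implied F1        =", round(bn_f1, 3))
```

The same calculation shows why a model with roughly 79% recall and 66% specificity still labels about ten non-cases for every true case at this prevalence.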

Bayesian network structure of stroke

To enhance the biological plausibility of the derived network, specific structural constraints were imposed. For instance, demographic variables (e.g., age and race) were defined as root nodes, as they cannot be influenced by other physiological or clinical variables in this context. Consequently, all edges directed toward age or race were prohibited. Furthermore, stroke was defined as a terminal outcome variable, precluding it from influencing other risk factors. All other edges were learned directly from the data without prior constraints. The BN was constructed using the 80% of the data allocated as the training dataset. To enhance the robustness and reliability of the network structure, 100 bootstrap resampling iterations were performed on the training dataset. Only edges with an occurrence frequency exceeding 80% across all resampling runs were retained to filter spurious associations. The final BN, depicting the probabilistic dependencies between stroke and its potential predictors, is illustrated in Figure 2. In this network, age, diabetes, sleep disorder, CHD, and drinking had direct connections with stroke, representing the most proximal risk determinants within the model hierarchy. Other variables were related to stroke indirectly through intermediate nodes. For instance, dietary fat reduction exhibited an indirect relationship with stroke risk, mediated by alcohol consumption. To facilitate clinical interpretation, pink nodes highlight direct risk factors, while green nodes identify upstream modifiable lifestyle factors, providing a visual roadmap for precision prevention strategies. The detailed dependency explanations are presented in Supplementary Table S4. Moreover, five pathways mediated the indirect association between age and stroke: (1) age→CHD→Stroke, (2) age→Diabetes→Stroke, (3) age→Diabetes→CHD→Stroke, (4) age→Smoking→RedFat→Drinking→Stroke, (5) age→RFIP→Smoking→RedFat→Drinking→Stroke.
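The five mediating pathways can be recovered by enumerating directed paths in the graph. The sketch below does this on a subgraph reconstructed only from the edges named in the text (the five direct parents of Stroke plus the edges appearing in the listed pathways). The full network in Figure 2 contains further nodes and edges that are not reproduced here, so the edge list is an assumption of this example.

```python
# Adjacency list: parent -> children. Reconstructed from the text, not from Figure 2.
EDGES = {
    "Age": ["CHD", "Diabetes", "Smoking", "RFIP", "Stroke"],
    "Diabetes": ["CHD", "Stroke"],
    "CHD": ["Stroke"],
    "RFIP": ["Smoking"],
    "Smoking": ["RedFat"],
    "RedFat": ["Drinking"],
    "Drinking": ["Stroke"],
    "SleepDis": ["Stroke"],
}


def directed_paths(graph, source, target, prefix=None):
    """Return every directed path from source to target (graph must be acyclic)."""
    prefix = (prefix or []) + [source]
    if source == target:
        return [prefix]
    paths = []
    for child in graph.get(source, []):
        paths.extend(directed_paths(graph, child, target, prefix))
    return paths


all_paths = directed_paths(EDGES, "Age", "Stroke")
# Paths with more than two nodes pass through at least one intermediate variable.
indirect_paths = [p for p in all_paths if len(p) > 2]

if __name__ == "__main__":
    for path in indirect_paths:
        print(" -> ".join(path))
```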

Figure 2.

The BN topology of Stroke. RFIP, ratio of family income to poverty; BMI, body mass index; SBP, systolic blood pressure; ACR, albumin creatinine ratio; LDL, low density lipoprotein cholesterol; Glycol, glycohemoglobin; Fglucose, fasting glucose; SleepDis, sleep disorder; CHD, coronary heart disease; InExercise, exercise increase; RedSalt, salt reduction; RedFat, fat reduction; Trigly, triglyceride. Pink nodes represent direct risk determinants; Green nodes represent upstream modifiable lifestyle factors. Edges (arrows) represent probabilistic dependencies between variables, with the direction indicating the structural flow of risk propagation. Note: Several direct determinants (e.g., diabetes, drinking, and sleep disorders) are also clinically modifiable, but are highlighted in pink to emphasize their structural proximity to Stroke in the network.

Model inference and sensitivity analysis

Unlike standard ML models such as random forest, which provide a static ranking of feature importance that highlights the key predictors used for classification (Supplementary Figure S3), the BN enables dynamic simulations and provides superior structural interpretability; a detailed comparison is presented in Supplementary Table S5. Based on patient demographics and available clinical records (Supplementary Figure S4A), the BN enabled inference of individualized stroke occurrence probabilities. For example, a 70-year-old patient with comorbid diabetes and coronary heart disease who also presented with sleep disorders and alcohol consumption had an estimated stroke probability of 32%. Intervention analysis revealed differential risk reduction patterns: alcohol cessation alone reduced stroke risk to 17% (an absolute reduction of 15 percentage points), while sleep disorder management alone decreased risk to 26% (6 percentage points). However, concurrent implementation of both interventions (alcohol cessation and sleep disorder management) resulted in a stroke risk of 20%, a reduction of 12 percentage points from baseline (Supplementary Figures S4B–S4D illustrate this reasoning process). The observed non-additive effect of combined interventions suggests complex interactions between alcohol consumption and sleep disorders in stroke pathogenesis.
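The arithmetic behind these scenario comparisons is sketched below. The example separates absolute reductions (percentage points) from relative reductions (fraction of baseline risk) and compares the combined scenario with what simple additivity of the two single-intervention effects would predict. The probabilities are the ones reported above; the function names are hypothetical.

```python
BASELINE = 0.32
SCENARIOS = {
    "alcohol cessation": 0.17,
    "sleep disorder management": 0.26,
    "both interventions": 0.20,
}


def absolute_reduction(baseline: float, new_risk: float) -> float:
    """Risk difference, as a probability (multiply by 100 for percentage points)."""
    return baseline - new_risk


def relative_reduction(baseline: float, new_risk: float) -> float:
    """Risk difference as a fraction of the baseline risk."""
    return (baseline - new_risk) / baseline


# Reduction expected if the two single-intervention effects simply added up.
additive_expectation = (
    absolute_reduction(BASELINE, SCENARIOS["alcohol cessation"])
    + absolute_reduction(BASELINE, SCENARIOS["sleep disorder management"])
)
observed_combined = absolute_reduction(BASELINE, SCENARIOS["both interventions"])

if __name__ == "__main__":
    for name, risk in SCENARIOS.items():
        print(
            f"{name}: -{absolute_reduction(BASELINE, risk) * 100:.0f} points, "
            f"{relative_reduction(BASELINE, risk) * 100:.1f}% relative"
        )
    print(f"additive expectation: {additive_expectation * 100:.0f} points; "
          f"observed: {observed_combined * 100:.0f} points")
```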

Figure 3 shows the sensitivity analysis for Stroke. Nodes colored in red contain parameters that are important for calculating the posterior probability distributions of the nodes marked as targets (here, Stroke). Gray nodes do not contain any parameters used in calculating the posterior probability distributions of the target variables (Figure 3A). The tornado diagram (Figure 3B) shows the ten most sensitive parameters for a selected state of the target node. One-way sensitivity analysis using ±10% parameter perturbations demonstrated that baseline stroke probability varied modestly from 4.14% to 4.53%. Age distribution was the predominant factor influencing stroke probability. A 10% increase in the proportion of the oldest age group (age4) raised stroke probability to 4.53% (sensitivity coefficient +0.087), while increased prevalence in the youngest age group (age1) exerted a protective influence. Among risk factors, diabetes mellitus ranked after age: a 10% increase in diabetes prevalence yielded measurable increments in stroke risk. Sleep disorder prevalence demonstrated a substantially smaller effect.
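The idea of a one-way sensitivity analysis can be illustrated on a two-node fragment, Age → Stroke, built from the unweighted age-group counts in Table 1. The sketch scales one age-group probability by 10%, renormalizes the remaining groups proportionally, and recomputes the marginal stroke probability. This is a simplified stand-in for the analysis performed in GeNIe on the full network, so its numbers are not expected to match the 4.14% to 4.53% range reported above; the proportional renormalization scheme is an assumption.

```python
# age group -> (non-stroke count, stroke count), from Table 1.
AGE_COUNTS = {
    "age1": (5356, 24),
    "age2": (4979, 84),
    "age3": (5184, 290),
    "age4": (4098, 520),
}

TOTAL = sum(n + s for n, s in AGE_COUNTS.values())
P_AGE = {g: (n + s) / TOTAL for g, (n, s) in AGE_COUNTS.items()}
P_STROKE_GIVEN_AGE = {g: s / (n + s) for g, (n, s) in AGE_COUNTS.items()}


def marginal_stroke(p_age):
    """P(Stroke = Y) = sum over age groups of P(age) * P(Stroke = Y | age)."""
    return sum(p_age[g] * P_STROKE_GIVEN_AGE[g] for g in p_age)


def perturb(p_age, group, factor):
    """Scale one probability by `factor`; rescale the others so the total stays 1."""
    new_p = p_age[group] * factor
    scale = (1 - new_p) / (1 - p_age[group])
    return {g: (new_p if g == group else p * scale) for g, p in p_age.items()}


baseline = marginal_stroke(P_AGE)
more_oldest = marginal_stroke(perturb(P_AGE, "age4", 1.10))
more_youngest = marginal_stroke(perturb(P_AGE, "age1", 1.10))

if __name__ == "__main__":
    print(f"baseline          : {baseline:.4f}")
    print(f"age4 share +10%   : {more_oldest:.4f}")
    print(f"age1 share +10%   : {more_youngest:.4f}")
```

In this fragment, increasing the share of the oldest group raises the marginal stroke probability and increasing the share of the youngest group lowers it, which is the same direction of effect described for the full model.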

Figure 3.

Sensitivity analysis for Stroke. A: Risk reasoning of the BN model for stroke. Age: age1∼age4 correspond to age groups 20∼35, 36∼50, 51∼65, >65; Race: race1∼race5 correspond to race groups Mexican, other Hispanic, White, Black, and other race; BMI: bmi1∼bmi4 correspond to BMI groups <18.4, 18.4∼25, 25.1∼30, >30; Education: edu1∼edu5 correspond to education level groups <9th, 9∼11th, high school, some college, college graduate; Marital: mar1∼mar3 correspond to marital status groups married, separated, and never married; RFIP: fip1∼fip4 correspond to RFIP groups <1, 1∼2, 2.1∼3, >3. B: Tornado diagram. Parameters are sorted from most to least sensitive; the horizontal axis shows the absolute change in the posterior probability of Stroke = Y when each parameter changes by that percentage; red indicates a negative change and green a positive change.

Discussion

Utilizing an extensive NHANES dataset spanning a decade, this study presents a robust evaluation of stroke risk within a comparative ML framework. The findings underscore the capacity of BNs to effectively capture conditional dependencies among sociodemographic and lifestyle factors, thereby elucidating five critical risk nodes and their associated mediating pathways. This approach moves beyond simple risk factor identification to a systems-level understanding of stroke etiology. By providing a biologically plausible and interpretable structure, this framework serves as a valuable tool for personalized risk assessment and the optimization of public health interventions.

BN analysis identified five critical risk factors exhibiting direct associations with stroke: age, diabetes, sleep disorders, CHD, and alcohol consumption. While these findings align with established epidemiological evidence, our model provides novel insights by framing these factors as actionable clinical gateways. Age represents a non-modifiable risk factor that consistently emerges as the strongest predictor across stroke prevention studies, reflecting the cumulative vascular damage and physiological decline associated with aging.^21,22 The identification of diabetes^23 and CHD^24 as direct risk factors corroborates extensive literature demonstrating their roles in accelerating atherosclerosis and promoting thrombotic events through chronic inflammatory pathways and endothelial dysfunction.^25 Notably, sleep disorders emerged as a direct, independent determinant, whereas traditional risk scores often treat sleep hygiene as a peripheral concern.^26 Our model positions sleep as a primary intervention target, hypothesized to act through intermittent hypoxia and sympathetic activation.^27,28 Similarly, alcohol consumption had a direct arc to stroke within a complex comorbidity web, consistent with its immediate impact on blood pressure and coagulation pathways.^29,30 By unmasking these direct dependencies, the BN identifies candidate high-leverage points where clinicians may be able to disrupt risk progression more effectively than is possible with static, linear assessment tools.

The learned DAG unmasks robust pathways that function as a structural roadmap for precision prevention (Figure 2). In this framework, age functions as both a distal root cause and a catalyst for intermediate cardiometabolic and behavioral cascades. CHD and diabetes mellitus emerged as the predominant mediators, accounting for the majority of age-attributable stroke risk. Existing evidence demonstrates that both conditions synergistically elevate stroke incidence.^31,32 Their sequential arrangement (age → diabetes → CHD → stroke) suggests that chronic hyperglycemia accelerates atherosclerotic progression, thereby amplifying cerebrovascular risk.^33,34 The network also captured lifestyle-oriented cascades initiated by socioeconomic status. The upstream positioning of the ratio of family income to poverty (RFIP) relative to smoking indicates that lower socioeconomic status propagates stroke risk through cascading adverse health behaviors rather than isolated exposures.^35–37 Notably, dietary fat reduction functions as an intermediate mediator between smoking and alcohol consumption; this complex behavioral pattern may reflect compensatory dietary modifications.^38,39 Collectively, these pathways indicate that effective stroke prevention strategies must target cardiometabolic control (diabetes and CHD management) while simultaneously addressing the upstream socioeconomic determinants that influence smoking and alcohol consumption behaviors.^40,41 Stroke arises from a multifaceted interplay of factors, including advanced age, systemic comorbidities, and emerging biomarkers, each of which requires further refinement to establish definitive prognostic utility in clinical practice.^42 Future interventional studies should use this hypothesized causal structure to quantify the comparative effectiveness of cardiometabolic versus socioeconomic policy interventions, thereby optimizing resource allocation and maximizing population-level stroke risk reduction.

This study presents several distinct advantages. First, leveraging a decade of population-based data from the NHANES database ensures substantial statistical power and high representativeness, facilitating a robust identification of stroke predictors that are generalizable to the broader U.S. population. Second, by employing a comparative benchmarking framework, this study demonstrated that BNs can achieve high predictive accuracy—comparable to state-of-the-art “black-box” models like XGBoost and Random Forest—while uniquely maintaining structural interpretability. Unlike traditional ML approaches, this probabilistic framework allows for bi-directional inference and “what-if” scenario analysis, enabling clinicians to move beyond static risk scores toward a more dynamic and nuanced understanding of patient-specific risk profiles. Traditional risk scores often fail to account for the synergistic effects of lifestyle and clinical factors. The BN structure addresses this by unmasking the dependency relationships, thereby offering a more granular tool for stroke risk assessment than conventional scoring systems. Third, the integration of a comprehensive spectrum of sociodemographic, clinical, and lifestyle variables offers a holistic perspective on stroke determinants. Notably, the identification of modifiable factors, such as sleep disorders and alcohol consumption, within a network-based structure provides actionable, evidence-based targets that account for the complex interdependencies inherent in stroke etiology.
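To make the notion of bi-directional inference concrete, the following minimal sketch computes both a predictive query (from a risk factor to stroke) and a diagnostic query (from stroke back to the risk factor) by brute-force enumeration. It assumes a simplified three-node chain (Diabetes → CHD → Stroke) with hypothetical probabilities; neither the structure nor the numbers are the learned NHANES network, and the helper names `joint` and `query` are illustrative.

```python
from itertools import product

# Hypothetical parameters for a toy chain Diabetes -> CHD -> Stroke.
# These values are assumptions for illustration, not fitted estimates.
p_diabetes = {True: 0.12, False: 0.88}
p_chd_given_diabetes = {True: 0.10, False: 0.03}  # P(CHD = yes | Diabetes)
p_stroke_given_chd = {True: 0.12, False: 0.03}    # P(Stroke = yes | CHD)

NAMES = ("diabetes", "chd", "stroke")


def joint(diabetes, chd, stroke):
    """Joint probability of one full assignment under the chain factorization."""
    p_c = p_chd_given_diabetes[diabetes] if chd else 1.0 - p_chd_given_diabetes[diabetes]
    p_s = p_stroke_given_chd[chd] if stroke else 1.0 - p_stroke_given_chd[chd]
    return p_diabetes[diabetes] * p_c * p_s


def query(target, evidence):
    """P(target = True | evidence), summing the joint over all consistent worlds."""
    numerator = denominator = 0.0
    for values in product((True, False), repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(*values)
        denominator += p
        if world[target]:
            numerator += p
    return numerator / denominator


# Predictive direction: "what if" the patient has diabetes?
p_stroke_if_diabetes = query("stroke", {"diabetes": True})
p_stroke_if_no_diabetes = query("stroke", {"diabetes": False})

# Diagnostic direction: given a stroke, how likely is underlying diabetes?
p_diabetes_if_stroke = query("diabetes", {"stroke": True})

if __name__ == "__main__":
    print(f"P(stroke | diabetes)    = {p_stroke_if_diabetes:.4f}")
    print(f"P(stroke | no diabetes) = {p_stroke_if_no_diabetes:.4f}")
    print(f"P(diabetes | stroke)    = {p_diabetes_if_stroke:.4f}")
```

Enumeration is practical only for very small networks; it is used here because it keeps the arithmetic transparent. Dedicated BN software performs the same queries with more efficient exact or approximate inference algorithms.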

Several limitations warrant consideration. First, residual confounding from unmeasured variables may influence the observed associations. Additionally, the applied feature selection strategy may have introduced selection bias by excluding variables with structural importance but low independent significance. Second, stroke outcomes in NHANES are primarily ascertained through self-reported data based on prior clinical diagnoses, which are susceptible to recall bias and misclassification. The absence of standardized neuroimaging confirmation or detailed clinical stroke subtype classification (e.g., ischemic vs. hemorrhagic) further limits the precision and clinical granularity of outcome assessment. Third, while BNs efficiently model complex probabilistic dependencies, their capacity to capture dynamic clinical progressions is constrained by the cross-sectional design of the NHANES data. Specifically, the lack of longitudinal tracking precludes the observation of temporal stroke stages, such as the transition from acute to chronic phases. Consequently, the identified pathways should be interpreted as probabilistic structural associations rather than validated causal trajectories. Fourth, despite using a nationally representative sample, this study lacks external validation in independent geographic or clinical cohorts. While our internal validation demonstrated high stability, the transportability of this BN-based risk architecture to diverse global populations requires further prospective validation before clinical implementation. Finally, the categorization of several continuous variables (e.g., age, BMI, SBP) into discrete bins, while facilitating interpretability in the BN, may oversimplify underlying dose–response relationships and attenuate the model's ability to capture nuanced gradients in risk.
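The information lost through discretization can be illustrated with a short sketch that maps age in years onto the four age groups defined in the Figure 3 legend (20∼35, 36∼50, 51∼65, >65). The function name `age_group` and the treatment of ages below 20 as out of range are assumptions made for this illustration, not a description of the preprocessing code used in the study.

```python
from bisect import bisect_left

# Upper bounds (inclusive) of age1, age2 and age3 as given in the Figure 3 legend;
# anything above the last bound falls into age4.
AGE_UPPER_BOUNDS = (35, 50, 65)
AGE_LABELS = ("age1", "age2", "age3", "age4")


def age_group(age_years):
    """Map age in years to one of the four categorical age groups."""
    if age_years < 20:
        # Assumption: the legend starts at 20, so younger ages are treated as out of range.
        raise ValueError("age below the lowest group boundary (20 years)")
    return AGE_LABELS[bisect_left(AGE_UPPER_BOUNDS, age_years)]


if __name__ == "__main__":
    for age in (20, 35, 36, 51, 65, 66, 80):
        print(age, "->", age_group(age))
```

Under this binning, a 51-year-old and a 65-year-old receive the same state (age3), so any within-group gradient in risk is invisible to the network, whereas a 65-year-old and a 66-year-old are assigned to different states despite differing by a single year.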

Conclusions

This study demonstrates that an interpretable BN model serves as a valuable exploratory tool for mapping the complex interplay of risk factors contributing to stroke. Through a large-scale analysis of NHANES data, five critical factors—age, diabetes, sleep disorders, coronary heart disease, and alcohol consumption—were identified, with their probabilistic interdependencies systematically elucidated. While these findings are derived from cross-sectional data and require further validation, the BN framework offers a transparent alternative to traditional “black-box” models, facilitating the generation of testable hypotheses regarding risk pathways. This underscores the model's potential to inform risk stratification and the development of targeted prevention strategies. Future research is essential to validate these exploratory findings using longitudinal, multi-center datasets to establish temporal relationships and ensure the generalizability of the identified risk architecture across diverse clinical settings.

Supplemental Material

Supplemental material, sj-docx-1-dhj-10.1177_20552076261434648 for Identifying risk factors and predicting stroke using Bayesian networks: Evidence from NHANES 2011–2020 by Ju Zhao, Mingyang Zhang and Hongnian Wang in DIGITAL HEALTH

Footnotes

Acknowledgements

We thank the National Center for Health Statistics for providing the NHANES database and all participants who contributed to the NHANES study.

ORCID iDs

Mingyang Zhang

Hongnian Wang

Ethics approval and consent to participate

This study used publicly available de-identified data from NHANES. The original NHANES study was approved by the National Center for Health Statistics Research Ethics Review Board. No additional ethics approval was required for this secondary analysis.

Consent for publication

Not applicable.

Authors’ contributions

JZ conceived the study, performed the analysis, and drafted the manuscript. MZ processed the data, developed the methodology, and conducted visualization. HW supervised the project and revised the manuscript. All authors reviewed the manuscript and approved the final version for publication.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Joint Co-construction Project of the Henan Provincial Medical Science and Technology Program (LHGJ20250479) and the Sichuan Science and Technology Program (2026NSFSC1446).

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability

The NHANES data are publicly accessible at .

Guarantor

HW.

Clinical trial number

Not applicable.

References

1. Feigin, Brainin, Norrving, et al. World stroke organization: global stroke fact sheet 2025. Int J Stroke 2025; 20: 132–144.

2. Feigin, Owolabi, Abd-Allah, et al. Pragmatic solutions to reduce the global burden of stroke: a world stroke organization–lancet neurology commission. Lancet Neurol 2023; 22: 1160–1206.

3. Feigin, Abate, et al. Global, regional, and national burden of stroke and its risk factors, 1990-2021: a systematic analysis for the global burden of disease study 2021. Lancet Neurol 2024; 23: 973–1003.

4. Small, Mehkri, Panther, et al. Coronavirus disease-2019 and stroke: pathophysiology and management. Can J Neurol Sci 2023; 50: 495–502.

5. Chiaramonte, Civello, Laganga Senzio, et al. Multidimensional stratification of severe disability: demographic, clinical, geographic, socio-economic profiles and healthcare pathways in a cross-sectional Italian cohort. Healthcare 2025; 13: 3200.

6. Law, Matuja, Heldner, et al. Regional burden and region-specific stroke risk factors in lower and middle-income countries. In: Ozturk (eds) The global burden of stroke and changing risk factors. Rijeka: IntechOpen, 2025.

7. Dufouil, Beiser, McLure, et al. Revised Framingham stroke risk profile to reflect temporal trends. Circulation 2017; 135: 1145–1159.

8. Lip, Nieuwlaat, Pisters, et al. Refining clinical risk stratification for predicting stroke and thromboembolism in atrial fibrillation using a novel risk factor-based approach: the euro heart survey on atrial fibrillation. Chest 2010; 137: 263–272.

9. Hippisley-Cox, Coupland, Brindle. Derivation and validation of QStroke score for predicting risk of ischaemic stroke in primary care and comparison with other risk scores: a prospective open cohort study. Br Med J 2013; 346: f2573.

10. Han, Zhang, Luo, et al. Relationship between stroke and estimated glucose disposal rate: results from two prospective cohort studies. Lipids Health Dis 2024; 23: 92.

11. Zhang, Wang, Zhao. Use machine learning models to identify and assess risk factors for coronary artery disease. PLoS One 2024; 19: e0307952.

12. Wang, Zhang, Mai, et al. An effective multi-step feature selection framework for clinical outcome prediction using electronic medical records. BMC Med Inform Decis Mak 2025; 25: 1–15.

13. Salaudeen, Bello, Danraka, et al. Understanding the pathophysiology of ischemic stroke: the basis of current therapies and opportunity for new ones. Biomolecules 2024; 14: 05.

14. Park, Chang H-J, Nam. A Bayesian network model for predicting post-stroke outcomes with available risk factors. Front Neurol 2018; 9: 99.

15. Arora, Boyne, Slater, et al. Bayesian networks for risk prediction using real-world data: a tool for precision medicine. Value Health 2019; 22: 439–445.

16. Zhang, Dai, et al. Development and validation of a multi-causal investigation and discovery framework for knowledge harmonization (MINDMerge): a case study with acute kidney injury risk factor discovery using electronic medical records. Int J Med Inf 2024; 191: 105588.

17. Fan Z-X, Wang C-B, Fang L-B, et al. Risk factors and a Bayesian network model to predict ischemic stroke in patients with dilated cardiomyopathy. Front Neurosci 2022; 16: 1043922.

18. Delucchi, Spinner, Scutari, et al. Bayesian network analysis reveals the interplay of intracranial aneurysm rupture risk factors. Comput Biol Med 2022; 147: 105740.

19. Park, Chang, Nam. A Bayesian network model for predicting post-stroke outcomes with available risk factors. Front Neurol 2018; 9: 99.

20. Jelihovschi, Faria. ScottKnott: a package for performing the Scott-Knott clustering algorithm in R. TEMA (São Carlos) 2014; 15: 3–17.

21. Soriano-Tárraga, Lazcano, Jiménez-Conde, et al. Biological age is a novel biomarker to predict stroke recurrence. J Neurol 2021; 268: 285–292.

22. Hunter, Kelleher. Age specific models to capture the change in risk factor contribution by age to short term primary ischemic stroke risk. Front Neurol 2022; 13: 803749.

23. Maida, Daidone, Pacinella, et al. Diabetes and ischemic stroke: an old and new relationship an overview of the close interaction between these diseases. Int J Mol Sci 2022; 23: 2397.

24. Odat, Idrees, Jain, et al. Risk of stroke in patients with congenital heart disease: a systematic review and meta-analysis. BMC Neurol 2024; 24: 65.

25. Howard, Banach, Kissela, et al. Age-related differences in the role of risk factors for ischemic stroke. Neurology 2023; 100: e1444–e1453.

26. Mayer-Suess, Ibrahim, Moelgg, et al. Sleep disorders as both risk factors for, and a consequence of, stroke: a narrative review. Int J Stroke 2024; 19: 490–498.

27. Cai, Wang, Yang. Sleep disorders in stroke: an update on management. Aging Dis 2021; 12: 570–585.

28. Brunetti, Rollo, Broccolini, et al. Sleep and stroke: opening our eyes to current knowledge of a key relationship. Curr Neurol Neurosci Rep 2022; 22: 767–779.

29. Ali, Baranchuk. Editorial commentary: the relationship between alcohol intake and cardiovascular health: gaps in knowledge. Trends Cardiovasc Med 2025; 35: 254–257.

30. Liu, Ding, Zhang, et al. Association between alcohol consumption and risk of stroke among adults: results from a prospective cohort study in Chongqing, China. BMC Public Health 2023; 23: 1593.

31. Mosenzon, Cheng, Rabinstein, et al. Diabetes and stroke: what are the connections? J Stroke 2023; 25: 26–38.

32. Tsao, Aday, Almarzooq, et al. Heart disease and stroke statistics—2023 update: a report from the American Heart Association. Circulation 2023; 147: e93–e621.

33. Peters SAE, Huxley, Woodward. Diabetes as risk factor for incident coronary heart disease in women compared with men: a systematic review and meta-analysis of 64 cohorts including 858,507 individuals and 28,203 coronary events. Diabetologia 2014; 57: 1542–1551.

34. Chen, Guan, Yan, et al. Prognostic value of red blood cell distribution width-to-albumin ratio in ICU patients with coronary heart disease and diabetes mellitus. Front Endocrinol (Lausanne) 2024; 15: 1359345.

35. Pantoja-Ruiz, Akinyemi, Lucumi-Cuesta, et al. Socioeconomic status and stroke: a review of the latest evidence on inequalities and their drivers. Stroke 2025; 56: 794–805.

36. Kim, Twardzik, Judd, et al. Neighborhood socioeconomic status and stroke incidence: a systematic review. Neurology 2021; 96: 897–907.

37. Lee D-Y. Prevalence and risk factors of stroke in Korean older adults: focusing on demographic and health behavior factors. Korean Soc Phys Med 2024; 19: 103–110.

38. Bekele, Trijsburg, Brouwer, et al. Dietary recommendations for Ethiopians on the basis of priority diet-related diseases and causes of death in Ethiopia: an umbrella review. Adv Nutr 2023; 14: 895–913.

39. Kontogianni, Panagiotakos. Dietary patterns and stroke: a systematic review and re-meta-analysis. Maturitas 2014; 79: 41–47.

40. Jeong S-M, Lee, Han, et al. Association of change in alcohol consumption with risk of ischemic stroke. Stroke 2022; 53: 2488–2496.

41. Boden-Albala, Sacco. Lifestyle factors and stroke risk: exercise, alcohol, diet, obesity, smoking, drug use, and stress. Curr Atheroscler Rep 2000; 2: 160–166.

42. Hoh, Amin-Hanjani, et al. 2023 Guideline for the management of patients with aneurysmal subarachnoid hemorrhage: a guideline from the American Heart Association/American Stroke Association. Stroke 2023; 54: e314–e370.