Development of an explainable machine learning model for predicting poststroke anxiety: A multicenter study using Shapley Additive Explanations and nomogram visualization

Abstract

Objective: Neuropsychiatric complications following a stroke can impede recovery and reduce the quality of life. Current predictive methods for poststroke anxiety (PSA) are limited by inadequate feature selection and lack of interpretability. This study aimed to develop an interpretable machine learning model utilizing a wide range of clinical data to detect high-risk PSA patients early, enabling personalized interventions. Methods: This retrospective multicenter study included 238 stroke patients from 10 Chinese hospitals spanning from 1 January 2022 to 11 June 2025. Data encompassing demographic, clinical, biochemical, and psychosocial factors were gathered. Feature selection involved univariate analysis followed by least absolute shrinkage and selection operator (LASSO) regression. Seven machine learning models—logistic regression, extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), random forest, decision tree, K-nearest neighbors, and stacking—were constructed and assessed using cross-validation. Feature importance was determined using SHAP (Shapley Additive Explanations), and a nomogram was developed based on the final model. Results: Among the 238 patients, 109 were diagnosed with PSA. In the test set, the logistic regression model exhibited the best performance, achieving an area under the curve (AUC) of 0.981, accuracy of 0.917, sensitivity of 0.867, specificity of 0.952, and an F1 score of 0.897. SHAP analysis identified recurrent stroke, income level, payment type, occupational stress, overwork, sleep quality, continuous drinking history, history of hypertension, diabetes, hyperlipidemia, hyperhomocysteinemia, white blood cell (WBC) count, total cholesterol (TC), low-density lipoprotein (LDL), fibrinogen (FIB), activated partial thromboplastin time (APTT), National Institutes of Health Stroke Scale (NIHSS) score, and Barthel index as crucial predictors. A nomogram incorporating the top 10 SHAP-ranked features was devised to assist in clinical decision-making. Conclusion: The machine learning model demonstrated high accuracy and interpretability in predicting PSA risk. Through the integration of SHAP analysis and nomogram visualization, it offers a practical tool for clinicians to recognize high-risk PSA patients and customize management strategies to improve poststroke outcomes.

Keywords

Poststroke anxiety machine learning SHAP explainable AI risk prediction nomogram stroke rehabilitation

Introduction

Poststroke anxiety (PSA) is a prevalent yet frequently disregarded neuropsychiatric concern among stroke survivors,¹ linked to heightened disability, reduced quality of life, and increased rehospitalization rates.² Timely identification is crucial for prompt psychological and pharmacological interventions.

State-of-the-art: Predictive approaches for PSA

Conventional PSA prediction tools have primarily relied on clinical characteristics such as age, sex, stroke severity, and comorbid depression. Although these logistic-regression-based models are simple and interpretable, they generally capture only a narrow set of predictors and fail to reflect the multifactorial mechanisms of PSA, resulting in modest predictive accuracy.³

Other studies have examined biological markers, including serum cortisol, inflammatory cytokines, and structural neuroimaging indicators, to predict PSA risk. While these biomarker-based approaches provide mechanistic insights, their clinical translation is limited by high costs, technical complexity, and lack of routine availability.⁴

More recently, machine learning (ML) and deep learning (DL) algorithms have been introduced to integrate a broader range of clinical, psychosocial, and laboratory variables, achieving superior discrimination compared with traditional models.^5,6 However, these methods are often developed using small, single-center datasets and lack external validation, which raises concerns about overfitting and generalizability. In addition, many ML/DL models remain “black boxes,” limiting their acceptance among clinicians.

Recent progress in artificial intelligence (AI) and ML now facilitates the creation of predictive models that incorporate a wide range of variables to offer personalized risk assessments. The emergence of explainable AI (XAI) methods, such as SHAP (Shapley Additive Explanations), has further enhanced model transparency and interpretability, addressing the opacity issue in ML and supporting clinical adoption.⁷

This investigation aims to construct and validate an interpretable ML-based risk prediction model for PSA in stroke patients utilizing multicenter clinical data.

Study data

General information

This retrospective cross-sectional study involved 10 hospitals, namely, The First Affiliated Hospital of Henan University of Chinese Medicine, Luoyang Central Hospital, Xi’an Hospital of Traditional Chinese Medicine, The Third Affiliated Hospital of Fujian University of Traditional Chinese Medicine, Zhoukou Hospital of Traditional Chinese Medicine, Henan Provincial People's Hospital, Luohe Hospital of Traditional Chinese Medicine, Kaifeng Central Hospital, Jiaxian Traditional Chinese Medicine Hospital and Anyang Hospital of Traditional Chinese Medicine. Stroke patients admitted between 1 January 2022 and 11 June 2025 were divided into PSA and non-PSA groups based on clinical diagnosis. Data on demographics, stroke characteristics, laboratory indices, and psychological assessments were collected for analysis. This multicenter study was approved by the Ethics Committee of The First Affiliated Hospital of Henan University of Chinese Medicine (approval number: 2024HL-457) and waived the requirement for informed consent due to its retrospective nature, in accordance with national regulations.

Diagnostic criteria

PSA was assessed using the Hospital Anxiety and Depression Scale—anxiety subscale (HADS-A), which comprises seven items each scored 0–3 (total 0–21); in this study, PSA was defined as HADS-A ≥ 8.⁸ Two or more attending physicians autonomously performed all patient evaluations within 7 days of the stroke hospitalization.

Inclusion and exclusion criteria

Inclusion criteria were as follows: (1) age ≥ 18 years; (2) ischemic or hemorrhagic stroke diagnosis per established guidelines^9,10; (3) first-time or recurrent stroke; (4) HADS completion during hospitalization, with full data.

Exclusion criteria were as follows: (1) preexisting anxiety or depression diagnosis before stroke; (2) severe cognitive impairment, aphasia, or altered consciousness hindering assessment; (3) severe comorbid neurological or psychiatric disorders; and (4) incomplete clinical records or significant missing data.

Methods

This study's detailed workflow, encompassing data collection, feature selection, model development, evaluation, and interpretation, is depicted in Figure 1.

Figure 1.

The study's workflow detailing data collection, feature selection, machine learning modeling, and evaluation processes for predicting the risk of poststroke anxiety (PSA). The ultimate model choice was informed by both test performance and clinical interpretability. LASSO: least absolute shrinkage and selection operator; CV: cross-validation; XGBoost: extreme gradient boosting; LightGBM: light gradient boosting machine; KNN: K-nearest neighbor; AUC: area under the curve; DCA: decision curve analysis.

Observational indicators

This study was carried out between 1 January 2022 and 11 June 2025, in 10 tertiary hospitals in China. A standardized case report form (CRF) was utilized to facilitate data collection by trained researchers, covering five primary variable categories: (1) demographic and lifestyle factors, comprising age, gender, education, height, weight, body mass index (BMI), comorbidities (hypertension, diabetes, hyperhomocysteinemia, transient ischemic attack (TIA)), and lifestyle factors (smoking, alcohol consumption, working status, overwork, sleep patterns); (2) laboratory parameters, including complete blood count (white blood cell (WBC), red blood cell (RBC), hemoglobin, platelets), biochemical assessments (total cholesterol, triglycerides, low-density lipoprotein (LDL), high-density lipoprotein (HDL), glycated hemoglobin (HbA1c), uric acid, creatinine), and coagulation profiles (fibrinogen (FIB), activated partial thromboplastin time (APTT), prothrombin time (PT), thrombin time (TT)); (3) imaging indicators, specifically carotid atherosclerotic plaques detected through ultrasonography or computed tomography angiography (CTA); (4) clinical assessment tools, encompassing NIHSS (National Institutes of Health Stroke Scale, assessed at admission), Barthel index (BI, evaluated at discharge), modified Rankin scale (mRS, also assessed at discharge), and HADS-A (see Diagnostic Criteria section). Sleep quality was derived from the Pittsburgh Sleep Quality Index (PSQI), with poor sleep defined as PSQI > 5. Overwork was defined as ≥55 self-reported working hours per week during the month prior to stroke onset; (5) data quality assurance procedures, with all information extracted from hospital electronic medical records and laboratory systems to ensure data reliability and consistency.

ML and statistical analysis

Statistical analyses were conducted using SPSS 27, R 4.5, and Python 3.13. The Python libraries included Scikit-learn (v1.3.2), extreme gradient boosting (XGBoost) (v1.7.6), light gradient boosting machine (LightGBM) (v4.1.0), and SHAP (v0.44.0). Continuous variables were expressed as mean ± standard deviation and compared using independent t-tests or Mann–Whitney U tests. Categorical variables were presented as counts (percentages) and analyzed through chi-square or Fisher's exact tests. Observational indicators encompassed demographic, clinical, psychosocial, and laboratory features, which are comprehensively summarized in Table 1. Feature selection followed a two-step process: univariate analysis (p < 0.05) using SPSS, followed by least absolute shrinkage and selection operator (LASSO) regression for dimensionality reduction. Out of 49 initial candidate variables, 18 were retained as core predictors. The dataset (N = 238) was randomly partitioned into training (70%), validation (15%), and test (15%) sets. Seven algorithms were applied: logistic regression, decision tree, random forest, K-nearest neighbors (KNNs), XGBoost, LightGBM, and stacking. Models were trained using 10-fold cross-validation, and performance was evaluated via area under the curve (AUC), accuracy, sensitivity, specificity, F1 score, and average precision. Calibration curves were also used to assess prediction consistency. Hyperparameters for each algorithm were optimized via grid search and are presented in Table 2, using conventional terminology.

Table 1.

Summary of variable categories and measurement scales.

Category	Variables	Measurement/scale
Demographic	Age, sex, education, income, insurance type	Structured questionnaire
Clinical history	Hypertension, diabetes, hyperlipidemia, hyperhomocysteinemia, recurrent stroke	Medical records
Stroke severity	NIHSS score, BI	Standardized scales
Psychosocial factors	Sleep quality, occupational stress, overwork, chronic alcohol use	Validated self-report tools (PSQI, JCQ)
Laboratory indices	WBC, TC, LDL, FIB, APTT	Routine blood/biochemical assays

NIHSS: National Institutes of Health Stroke Scale; BI: Barthel index; PSQI: Pittsburgh Sleep Quality Index; JCQ: Job Content Questionnaire; WBC: white blood cell; TC: total cholesterol; LDL: low-density lipoprotein; FIB: fibrinogen; APTT: activated partial thromboplastin time.

This table summarizes all variables included in model construction, grouped by domain (demographic, clinical, psychosocial, and laboratory). Measurement methods and data sources are indicated for each category.

Table 2.

Final hyperparameter settings of each machine learning model.

Model	Tuned hyperparameters	Final settings
Logistic regression	Regularization strength	C = 0.1
Decision tree	Tree maximum depth; minimum samples per split	Depth = 5; min split = 10
Random forest	Number of trees; tree maximum depth	100 trees; depth = 7
K-nearest neighbors	Number of neighbors	9
XGBoost	Tree maximum depth; learning rate; class weight ratio	Depth = 7; LR = 0.1; weight ≈ 1.18
LightGBM	Tree maximum depth; learning rate; class weight ratio	Depth = 7; LR = 0.1; weight ≈ 1.18
Stacking	Base learners; meta-learner	Logistic regression, random forest, XGBoost; meta = logistic regression

This table lists the optimized hyperparameters for all algorithms after grid-search tuning with 10-fold cross-validation, ensuring optimal discrimination and generalizability.

To address the moderately imbalanced PSA/non-PSA ratio (∼46%/54%), two complementary strategies were applied. First, class weighting was implemented. For logistic regression and tree-based models (decision tree, random forest), the class_weight parameter in Scikit-learn was set to “balanced,” automatically adjusting weights inversely to class frequencies. For XGBoost and LightGBM, the scale_pos_weight parameter (∼1.18, calculated as non-PSA/PSA) was optimized via GridSearchCV to improve sensitivity without sacrificing specificity.

Second, robustness was validated using the synthetic minority oversampling technique (SMOTE) on the training set. Although SMOTE slightly improved sensitivity, it reduced generalization on the independent test set (logistic regression AUC: 0.965 vs. 0.981 with class weighting) and increased overfitting risk. Therefore, class weighting was retained as the primary strategy, as it achieved a more stable balance between sensitivity (0.867) and specificity (0.952). To further evaluate performance under imbalanced conditions, precision–recall (PR) curves (Figure 3) were analyzed in addition to receiver operating characteristic (ROC) curves. The high average precision of the logistic regression model (0.92 in the test set) confirmed the effectiveness of the chosen approach.

SHAP was employed for enhancing interpretability, and decision curve analysis (DCA) was conducted to evaluate clinical utility. The top 10 variables were selected for inclusion in the final nomogram based on SHAP importance rankings, aiming to balance predictive performance and usability (see Supplementary Table 1).

Results

Sample size justification

A total of 109 patients met PSA criteria (HADS-A ≥ 8), and 18 final predictors were identified, yielding an events-per-variable (EPV) ratio of 6.1, slightly below the conventional EPV threshold of ≥10 for logistic regression. Nonetheless, employing LASSO and cross-validation mitigates the risks of overfitting. Past studies have demonstrated the viability of ML modeling with modest sample sizes (n ≈ 200–300), affirming the suitability of this dataset for predictive modeling.¹¹ For the final nomogram, 10 predictors were retained, corresponding to an EPV of 10.9, thereby meeting the recommended threshold for model stability and interpretability.

Comparison of clinical characteristics between groups

This investigation comprised 238 stroke patients, with 109 patients meeting the PSA diagnostic criteria and 129 classified into the non-PSA group, yielding a PSA prevalence of 45.80%. Eighteen variables exhibited statistically significant differences between the groups (p < 0.05): recurrent stroke, income level, payment type, occupational stress, overwork, sleep quality, continuous drinking history, history of hypertension, diabetes, hyperlipidemia, hyperhomocysteinemia, WBC, TC, LDL, FIB, APTT, NIHSS score, and BI. A detailed comparative analysis is presented in Table 3.

Table 3.

A comparison of baseline characteristics between stroke patients with poststroke anxiety (PSA) and those without PSA.

Characteristic		Control group(N = 127)	Case group(N = 109)	p-value
Demographic and general information
Outcome of recurrent stroke events		0 (0.00)	63 (1.00)	0.00
Age		63.33 ± 11.50	66.12 ± 11.39	0.06
Gender	Male	47 (0.51)	46 (0.49)	0.42
Gender	Female	80 (0.56)	63 (0.44)	0.42
Height (cm)		165.74 ± 5.78	166.58 ± 7.23	0.33
Weight (kg)		66.95 ± 9.55	69.11 ± 13.64	0.16
BMI		24.08 ± 2.77	24.49 ± 3.07	0.28
Education level	Junior high school or below	82 (0.51)	79(0.49)	0.19
Education level	High school and above	45 (0.60)	30 (0.40)	0.19
Household annual income (RMB 10,000)	<8	127 (0.58)	92 (0.42)	0.00
Household annual income (RMB 10,000)	≥8	0 (0.00)	17 (1.00)	0.00
Occupation in the 6 months prior to onset	Mental workers	17 (0.61)	11 (0.39)	0.44
Occupation in the 6 months prior to onset	Physical workers	110 (0.53)	98 (0.47)	0.44
Payment type	Basic medical care	104 (0.59)	73 (0.41)	0.01
Payment type	Urban medical care	23 (0.39)	36 (0.61)	0.01
Marital status	Married	119 (0.54)	100 (0.46)	0.56
Marital status	Separated	8 (0.47)	9 (0.53)	0.56
Working status	High pressure	44 (0.96)	2 (0.04)	0.00
Working status	Low pressure	83 (0.44)	107 (0.56)	0.00
Overwork		42 (0.93)	3 (0.07)	0.00
Sleep quality		45 (0.82)	10 (0.18)	0.00
Body temperature		37.33 ± 9.81	36.54 ± 0.34	0.40
Respiratory rate		19.67 ± 7.29	19.33 ± 4.80	0.68
Heart rate		77.06 ± 9.79	78.86 ± 12.51	0.22
Systolic blood pressure		137.18 ± 21.67	141.75 ± 23.50	0.12
Diastolic blood pressure		81.70 ± 14.78	84.08 ± 13.83	0.21
Medical history
Hypertension		123 (0.63)	73 (0.37)	0.00
Diabetes mellitus		115 (0.61)	72 (0.39)	0.00
Heart disease		19 (0.46)	22 (0.54)	0.63
Hyperlipidemia		2 (0.09)	20 (0.91)	0.00
Hyperhomocysteinemia		3 (0.14)	19 (0.86)	0.00
History of TIA		10 (0.93)	7 (0.08)	0.67
Continuous smoking history		34 (0.54)	29 (0.46)	0.98
Continuous drinking history		32 (0.68)	15 (0.32)	0.03
Blood biochemical indicators
RBC count		5.51 ± 8.79	4.56 ± 0.82	0.26
WBC count		6.83 ± 2.92	7.55 ± 2.66	0.05
PLT		325.35 ± 731.24	234.23 ± 93.25	0.20
HGB		259.98 ± 807.62	148.90 ± 71.72	0.15
TC		3.93 ± 0.91	4.29 ± 1.50	0.03
TG		1.63 ± 0.76	1.59 ± 0.91	0.68
LDL		2.11 ± 0.71	2.46 ± 0.81	0.00
HDL		1.03 ± 0.33	1.08 ± 0.29	0.20
PT		9.47 ± 4.89	8.94 ± 5.29	0.42
FIB content		2.79 ± 1.11	3.41 ± 1.26	0.00
APTT		26.68 ± 7.29	29.93 ± 6.03	0.00
TT		17.58 ± 10.88	15.88 ± 1.85	0.11
HbA1c		7.33 ± 3.23	6.82 ± 1.50	0.14
Cr		67.57 ± 23.36	72.18 ± 87.57	0.57
UA		276.41 ± 84.86	297.62 ± 105.78	0.09
Hcy		15.18 ± 18.32	15.84 ± 8.24	0.73
Scale scores
mRS scale score		1.95 ± 1.22	2.29 ± 1.67	0.07
NIHSS scale		2.72 ± 2.80	6.81 ± 6.68	0.00
BI total score		83.71 ± 25.36	64.27 ± 32.14	0.00

Variables demonstrating significant between-group variances (p < 0.05) are highlighted in bold. PSA is an abbreviation for poststroke anxiety.

BMI: body mass index; TIA: transient ischemic attack; RBC: red blood cell; WBC: white blood cell; PLT: platelet count; HGB: hemoglobin concentration; TC: total cholesterol; TG: triglyceride; LDL: low-density lipoprotein; HDL: high-density lipoprotein; PT: prothrombin time; FIB: fibrinogen; APTT: activated partial thromboplastin time; TT: thrombin time; HbA1c: glycated hemoglobin; Cr: creatinine; UA: uric acid; Hcy: homocysteine; mRS: modified Rankin scale; NIHSS: National Institutes of Health Stroke Scale; BI: Barthel index.

Development and evaluation of PSA risk prediction models using seven ML algorithms

Eighteen core predictors, including recurrent stroke, income level, payment type, occupational stress, overwork, sleep quality, continuous drinking history, history of hypertension, diabetes, hyperlipidemia, hyperhomocysteinemia, WBC count, total cholesterol, LDL, FIB, APTT, NIHSS score, and BI, were utilized to construct seven ML classification models. These models encompassed logistic regression, XGBoost, LightGBM, random forest, decision tree, KNN, and stacking ensemble models.

In the training set, the majority of models attained exceptional AUC values of 1.000. Within the validation set, the stacking model demonstrated outstanding performance (AUC = 0.967; accuracy = 0.829; sensitivity = 0.667; specificity = 1.000; F1 score = 0.800), closely trailed by the logistic regression model (AUC = 0.964; with identical metrics).

In the test set, both the logistic regression and random forest models exhibited robust generalizability. The logistic regression model achieved an AUC of 0.981, accuracy of 0.917, sensitivity of 0.867, specificity of 0.952, and F1 score of 0.897. Similarly, the random forest model performed commendably (AUC = 0.968; accuracy = 0.917; sensitivity = 0.933; specificity = 0.905; F1 score = 0.903). For further details, please consult Table 4 and Figure 2 .

Figure 2.

Receiver operating characteristic (ROC) curves for seven distinct machine learning models utilized in poststroke anxiety (PSA) prediction. Panel A represents the training set, Panel B depicts the validation set, and Panel C showcases the test set. Each curve delineates the balance between sensitivity and specificity, with the area under the curve (AUC) values serving as indicators of discrimination performance. XGBoost: extreme gradient boosting; LightGBM: light gradient boosting machine; KNN: K-nearest neighbor.

Figure 3.

Precision–recall (PR) curves that compare PSA prediction models across different datasets. Panel A represents the training set, Panel B the validation set, and Panel C the test set. These curves illustrate the performance of the models in scenarios with imbalanced classification, emphasizing precision at different levels of recall. PSA: poststroke anxiety; XGBoost: extreme gradient boosting; LightGBM: light gradient boosting machine; KNN: K-nearest neighbor; AUC: area under the curve.

Table 4.

Displays the performance metrics of seven machine learning models for predicting PSA levels across training, validation, and test datasets.

Model	Stage	AUC	Accuracy	Sensitivity	Specificity	F1 score
Decision tree	Training	1.000	1.000	1.000	1.000	1.000
KNN	Training	0.985	0.927	0.895	0.955	0.919
LightGBM	Training	1.000	1.000	1.000	1.000	1.000
Logistic	Training	0.989	0.976	0.961	0.989	0.973
Random forest	Training	1.000	1.000	1.000	1.000	1.000
Stacking	Training	1.000	1.000	1.000	1.000	1.000
XGBoost	Training	1.000	1.000	1.000	1.000	1.000

Model	Stage	AUC	Accuracy	Sensitivity	Specificity	F1 score
Decision tree	Validation	0.832	0.829	0.722	0.941	0.812
KNN	Validation	0.956	0.914	0.833	1.000	0.909
LightGBM	Validation	0.944	0.857	0.833	0.882	0.857
Logistic	Validation	0.964	0.829	0.667	1.000	0.800
Random forest	Validation	0.941	0.829	0.722	0.941	0.812
Stacking	Validation	0.967	0.829	0.667	1.000	0.800
XGBoost	Validation	0.931	0.800	0.667	0.941	0.774

Model	Stage	AUC	Accuracy	Sensitivity	Specificity	F1 score
Decision tree	Testing	0.859	0.806	0.867	0.762	0.788
KNN	Testing	0.951	0.917	0.867	0.952	0.897
LightGBM	Testing	0.990	0.889	1.000	0.810	0.882
Logistic	Testing	0.981	0.917	0.867	0.952	0.897
Random forest	Testing	0.968	0.917	0.933	0.905	0.903
Stacking	Testing	0.968	0.889	0.933	0.857	0.875
XGBoost	Testing	0.971	0.861	0.933	0.810	0.848

The metrics assessed include AUC, accuracy, sensitivity, specificity, and F1-score. PSA: poststroke anxiety; AUC: area under the curve; KNN: K-nearest neighbor; LightGBM: light gradient boosting machine; XGBoost: extreme gradient boosting.

Pairwise comparisons of AUCs using DeLong's test revealed that the logistic regression model did not differ significantly from the stacking (AUC = 0.983, p = 0.42), XGBoost (AUC = 0.975, p = 0.18), LightGBM (AUC = 0.973, p = 0.21), or random forest models (AUC = 0.968, p = 0.27). However, logistic regression significantly outperformed the KNN (AUC = 0.892, p = 0.004) and decision tree (AUC = 0.873, p < 0.001) models (see Supplementary Table 2).

Although the stacking model achieved a slightly higher AUC on the validation set, the logistic regression model exhibited superior generalization on the test set (AUC = 0.981) with more consistent and stable performance across all evaluation metrics. Given its reduced complexity, improved interpretability, and alignment with SHAP-based clinical interpretation, logistic regression was selected as the ultimate model. Subsequently, a nomogram was developed for both visual representation and clinical utilization.

Evaluation of PSA risk prediction models across seven algorithms

ROC curve analysis based on 10-fold cross-validation

Figure 4 depicts the ROC curve analysis results for each model. In the training set (Figure 4A), all models, except logistic regression and KNN with identical AUC values of 0.98, exhibited an AUC of 1.00, indicating a nearly perfect fit. Moving to the validation set (Figure 4B), the stacking model demonstrated the highest AUC of 0.96, closely trailed by logistic regression (AUC = 0.96) and XGBoost (AUC = 0.94). LightGBM and random forest also performed commendably, achieving AUCs of 0.94 and 0.95, respectively, while KNN (AUC = 0.91) and decision tree (AUC = 0.83) displayed a slightly inferior performance. In the test set (Figure 4C), logistic regression, XGBoost, random forest, and stacking each attained an AUC of 0.99, showcasing outstanding generalizability. LightGBM and KNN followed closely, with AUCs of 0.98, whereas the decision tree exhibited the lowest AUC at 0.86.

Figure 4.

Tenfold cross-validated ROC curves for seven distinct machine learning models. Panel A represents the training set, Panel B the validation set, and Panel C the test set. The AUC value for each model is provided to evaluate the consistency of discrimination across folds. XGBoost: extreme gradient boosting; LightGBM: light gradient boosting machine; KNN: K-nearest neighbor; AUC: area under the curve; ROC: receiver operating characteristic.

Pairwise comparisons of ROC AUCs were performed using DeLong's test with Bonferroni correction. Logistic regression (AUC = 0.981) did not differ significantly from stacking (AUC = 0.983; p = 0.42), XGBoost (AUC = 0.975; p = 0.18), LightGBM (AUC = 0.973; p = 0.21), or random forest (AUC = 0.971; p = 0.27) but outperformed KNN (AUC = 0.892; p = 0.004) and decision tree (AUC = 0.873; p < 0.001). For the final logistic regression model, the test set AUC was 0.981 (95% CI: 0.962–0.998), and the optimism-corrected AUC via bootstrap (0.632+) was 0.976, confirming the robustness of the findings.

Calibration curve analysis of the seven ML models

Model calibration was evaluated through calibration curves and Brier scores, as depicted in Figure 5. In the training set (Figure 5A), the stacking model displayed superior calibration with a Brier score of 0.030, outperforming LightGBM (Brier score = 0.0222) and XGBoost (Brier score = 0.0371). The logistic regression model also showed satisfactory calibration with a Brier score of 0.0533. Within the validation set (Figure 5B), stacking exhibited strong calibration with a Brier score of 0.0409, while XGBoost (Brier score = 0.0367) and logistic regression (Brier score = 0.0618) maintained consistent performance. Conversely, the decision tree (Brier score = 0.0945) and random forest (Brier score=0.0515) models demonstrated higher Brier scores, indicating less optimal calibration. In the test set (Figure 5C), logistic regression achieved the lowest Brier score (0.0301), suggesting the highest agreement between predicted and observed outcomes. Stacking (Brier score = 0.0430), random forest (Brier score = 0.0600), and XGBoost (Brier score = 0.0361) also displayed acceptable calibration. Overall, both logistic regression and stacking showed the most consistent and reliable calibration performance across all three data partitions.

Figure 5.

Calibration curves evaluating the concordance between predicted and observed PSA probabilities across three sets: the training set (Panel A), the validation set (Panel B), and the test set (Panel C). The dashed lines in the figures indicate perfect calibration, with Brier scores included for quantitative calibration evaluation. XGBoost: extreme gradient boosting; LightGBM: light gradient boosting machine; KNN: K-nearest neighbor.

DCA of the PSA risk prediction models

DCA was utilized to evaluate the net clinical benefit of each model at various threshold probabilities (Figure 6). In the training set (Figure 6A), the stacking model exhibited the highest net benefit across most probability thresholds (AUC = 0.998, ΔAUC = 0.0011), closely followed by XGBoost (AUC = 0.997) and LightGBM (AUC = 0.996), indicating significant clinical promise during model training. In the validation set (Figure 6B), the logistic regression model demonstrated superior net benefit within specific threshold ranges (AUC = 0.879, ΔAUC = 0.0016), while the stacking and XGBoost models maintained consistent performance. All three models surpassed the “treat-all” and “treat-none” strategies within the 0.2–0.7 threshold range, underscoring their favorable generalizability. In the test set (Figure 6C), the stacking model maintained its superiority (AUC = 0.933), with enhanced net benefit across the 0.3–0.8 threshold range. The logistic regression and XGBoost models also demonstrated substantial clinical utility (AUC = 0.926 and 0.930, respectively). Overall, the stacking model exhibited the most consistently advantageous DCA curves across all datasets, underscoring its robust clinical relevance.

Figure 6.

Decision curve analysis (DCA) assessing the clinical net benefit of PSA prediction models across three sets: the training set (Panel A), validation set (Panel B), and test set (Panel C). This analysis depicts the model's performance across various threshold probabilities in comparison to “treat-all” and “treat-none” strategies. PSA: poststroke anxiety; XGBoost: extreme gradient boosting; LightGBM: light gradient boosting machine; KNN: K-nearest neighbor.

PR curve analysis

PR curves were utilized to evaluate model performance in the presence of class imbalance across the training, validation, and test sets (Figure 3). These curves depict the trade-off between precision and recall, a critical consideration given the modest class distribution between PSA and non-PSA instances in the dataset (109 vs. 129). Notably, both logistic regression and stacking models consistently demonstrated the highest average precision across all dataset partitions, indicating robust predictive performance in detecting PSA cases while minimizing false positives. These findings are consistent with results from ROC and DCA, highlighting the logistic regression model's superior generalizability and practical utility.

Model interpretation using SHAP

Figure 7 utilizes SHAP to explicate the logistic regression model, identifying crucial features influencing PSA prediction. In Figure 7A, recurrent stroke emerges as the most substantial positive predictor, exhibiting the highest SHAP values and a robust positive correlation with predicted PSA risk. Other noteworthy factors include hypertension, NIHSS score, sleep quality, working status, income level, and overwork. The majority of features demonstrate a direct relationship between higher values and increased risk, emphasizing their significance as PSA risk determinants. Figure 7B ranks the average SHAP importance, underscoring recurrent stroke, hypertension, NIHSS score, sleep disturbance, and working status as the primary contributors. These results underscore the model's focus on neurological function, comorbidities, and psychosocial aspects in risk evaluation.

Figure 7.

SHAP summary plots for the logistic regression model. Panel A illustrates SHAP value distributions, indicating the direction and impact of each feature. Panel B ranks global feature importance based on mean absolute SHAP values. SHAP: Shapley Additive Explanations; NIHSS: National Institutes of Health Stroke Scale; BI: Barthel index; APTT: activated partial thromboplastin time; FIB: fibrinogen; TC: total cholesterol; LDL: low-density lipoprotein; WBC: white blood cell.

Figure 8 displays local SHAP explanations for six representative patients, demonstrating that distinct variables predominantly influenced PSA predictions for each individual. The key features identified were recurrent stroke, hypertension, NIHSS score, sleep quality, employment, and income. Notably, in Samples 1 and 2, “recurrent stroke” and “sleep disturbance” positively contributed significantly, while “hypertension” and “NIHSS score” made negative contributions in Sample 2. These personalized SHAP profiles highlight the model's capacity for tailored prediction and adjustment based on individual patient data, aligning with the principles of precision medicine. This individual-level elucidation advocates for the utilization of SHAP-guided model interpretation as a potential tool for clinical decision-making. Additionally, a normalized SHAP heatmap (refer to Supplementary Figure 1) offers a comparative summary of feature contributions across various ML algorithms, revealing consistent importance patterns for recurrent stroke, NIHSS, hypertension, and sleep across the models.

Figure 8.

SHAP force plots providing personalized prediction explanations for six selected patients. Panels A to F display the predominant feature contributions for each patient, reflecting personalized PSA risk factors with directional indicators. SHAP: Shapley Additive Explanations; PSA: poststroke anxiety; NIHSS: National Institutes of Health Stroke Scale; BI: Barthel index; FIB: fibrinogen; APTT: activated partial thromboplastin time; TC: total cholesterol; LDL: low-density lipoprotein.

Visualization of the PSA risk prediction model

A nomogram was constructed based on the final logistic regression model, incorporating 10 SHAP-ranked predictors (NIHSS score, BI, LDL, APTT, income level, diabetes status, sleep quality, WBC, FIB, and hyperlipidemia) (Figure 9). The selection from the top 15 SHAP-ranked features followed three guiding principles: (i) minimizing redundancy, (ii) prioritizing clinical usability, and (iii) preserving predictive performance, in line with recent guidance for clinically deployable prediction models.^12–15

Figure 9.

A nomogram illustrating the final logistic regression model incorporating 10 essential features. This tool enables clinicians to assess an individual's PSA risk by aggregating scores corresponding to specific variables along the top axis and correlating them with the probability of anxiety on the bottom scale. PSA: poststroke anxiety; SHAP: Shapley Additive Explanations; NIHSS: National Institutes of Health Stroke Scale; BI: Barthel index; LDL: low-density lipoprotein; APTT: activated partial thromboplastin time; WBC: white blood cell; FIB: fibrinogen.

Rationale for excluding certain high-importance features (e.g. recurrent stroke)

Although recurrent stroke ranked highly in the global SHAP importance plot (Figure 7B), both the dependence plot (Figure 7A) and patient-level SHAP explanations (Figure 8A–F) showed that its predictive contribution substantially overlapped with NIHSS. For example, in several representative patients (Figure 8), the presence of recurrent stroke was accompanied by high NIHSS scores, and the SHAP contribution of recurrent stroke diminished once NIHSS was included. This reflects a clinical reality: recurrent strokes usually manifest with greater neurological deficits, which are already quantified by NIHSS at admission. Thus, NIHSS was retained as the more practical severity measure, while recurrent stroke was excluded to avoid redundancy and improve bedside feasibility.

Representative feature selection

For groups of highly correlated laboratory variables, we retained the most clinically informative and interpretable indicators. For instance, LDL was chosen over TC, as LDL is not only a central therapeutic target in secondary prevention of vascular disease but also more directly associated with atherosclerotic burden and disease progression. This ensures that the nomogram remains both statistically stable and clinically meaningful.¹³

Parsimony with preserved performance and utility

The simplified 10-feature nomogram achieved almost identical discrimination compared with the 15-feature model (test AUC 0.981 vs. 0.983; sensitivity 0.867 vs. 0.872; specificity 0.952 vs. 0.948). DCA further demonstrated a higher net benefit for the simplified model across clinically relevant thresholds, underscoring its real-world utility.¹² Supplementary Figure 1 confirmed that the retained predictors were consistently important across multiple algorithms, supporting the robustness of feature selection.

Each predictor corresponds to a scoring axis, where the patient's feature value maps onto the “Points” axis to generate an individual score. The total score is then translated into a personalized probability of PSA on the “Total Points” and “Probability of Anxiety” axes. This nomogram functions as an intuitive bedside tool, enabling clinicians to perform rapid risk evaluation and tailor intervention strategies accordingly.

Discussion

This pioneering multicenter study integrates various demographic, clinical, biochemical, psychosocial, and stroke-related factors to establish and validate multiple ML models for predicting PSA. Although the stacking model exhibited marginally superior predictive efficacy during training and validation phases, its performance fluctuated in the test dataset, suggesting potential overfitting. Despite several models achieving nearly flawless AUC values during training, we implemented LASSO feature selection, L2 regularization, complexity control (e.g. max_depth), and cross-validation techniques to enhance model robustness and generalizability. The conclusive findings indicated that logistic regression and random forest models demonstrated the most consistent performance in the test dataset, affirming the effectiveness of our overfitting prevention strategies. Given the intricate nature and limited interpretability of the stacking model, logistic regression was selected for its simplicity and high explicability, enabling the construction of a nomogram to enhance clinical applicability.

Despite the small sample size of the test set (approximately 36 cases), we utilized fivefold cross-validation and hyperparameter optimization using GridSearchCV during model training to mitigate bias stemming from singular random splits. Additionally, we maintained the independence of the validation and test sets to uphold evaluation objectivity and stability. Subsequent studies should aim to augment the sample size and incorporate more rigorous validation techniques, such as nested cross-validation or external validation, to bolster the robustness of model testing.

The SHAP analysis offered a comprehensive explanation of the contribution of each feature to the prediction of PSA, effectively emphasizing significant influencing variables. Notably, recurrent stroke events, hypertension, and NIHSS scores were identified as impactful predictors with notably positive SHAP values, consistent with prior pathophysiological investigations. The occurrence of recurrent strokes may heighten the risk of PSA by instigating neuroinflammatory responses and compromising endothelial function.¹⁶ Concurrently, chronic hypertension could disturb cerebral perfusion and metabolic equilibrium, thereby influencing emotional regulation.¹⁷ Furthermore, NIHSS scores mirror neurological impairments strongly associated with the development of anxiety.¹⁸

Sleep quality and income level are recognized as important psychosocial risk factors. Sleep disruptions can disturb the balance of neurotransmitters and neuroendocrine systems, increasing vulnerability to anxiety.¹⁹ Likewise, lower income can exacerbate financial stress, leading to heightened emotional distress.²⁰ These results underscore the potential effectiveness of holistic interventions, including improving sleep and providing socioeconomic assistance, as novel approaches to addressing psychosocial adversity.

We found that specific biochemical markers, such as homocysteine, WBC, LDL, and FIB, demonstrated elevated SHAP values, suggesting potential roles in neurotoxicity, low-grade inflammation, and possibly endothelial dysfunction. Recent reviews and observational studies report associations between these biomarkers and endothelial functional impairment in stroke or vascular brain injury settings.^21,22 These markers could contribute to destabilization of neuropsychiatric status following a stroke, providing a biological basis for PSA risk and potential therapeutic targets. Discrepancies in predictor importance among studies may stem from regional, age-related, or comorbidity-specific variations, underscoring the need for further investigation.

The nomogram, developed using the top 10 features ranked by SHAP values, demonstrated robust clinical utility and interpretability. This tool enables clinicians to promptly assess individual PSA risk and implement tailored interventions for high-risk individuals. While the stacking model exhibited strong performance during training, the logistic model presented enhanced interpretability and stability, better suited to the current sample size and feature complexity. Subsequent studies could explore integrating DL or multimodal interpretation approaches to augment both performance and clinical relevance.

This study presents several key strengths: (1) the utilization of real-world, multicenter data enhances the external generalizability of the model; (2) integration of multidimensional variables spanning clinical, biochemical, and psychosocial domains offers a comprehensive insight into PSA risk; and (3) implementation of SHAP and nomogram visualization techniques improves model interpretability and facilitates practical application.

Several limitations must be acknowledged. Firstly, the relatively small sample size, particularly within the PSA group (n = 109), may have restricted the generalizability of the findings. Although we applied strategies such as LASSO regression, regularization, and cross-validation to mitigate bias, the possibility of overfitting cannot be fully excluded. In particular, the perfect training performance observed in the stacking model (AUC = 1.00) suggests a risk of overfitting, which reinforces our decision to select logistic regression as the final model due to its superior interpretability and more stable generalization across datasets. Secondly, external validation is still lacking, and it is imperative to conduct larger, multicenter, and prospective studies to further evaluate the robustness and transferability of the model. Thirdly, although certain psychosocial features were considered, others such as social support and major life events were not included. Future studies should incorporate comprehensive psychosocial evaluations using standardized tools to improve the model's predictive precision and clinical applicability.

Conclusion

In this multicenter retrospective study, we developed and validated an interpretable ML model for predicting PSA. The model, based on logistic regression and enhanced by SHAP explainability and nomogram visualization, demonstrated high discrimination (AUC = 0.981 on the test set), good calibration, and consistent performance across multiple datasets.

Compared with more complex ensemble methods, the logistic regression model achieved comparable accuracy while maintaining simplicity, interpretability, and clinical applicability. By integrating demographic, clinical, psychosocial, and laboratory indicators, the model highlights the multifactorial nature of PSA risk.

This tool may provide clinicians with a practical instrument for early risk stratification and individualized intervention, thereby promoting precision medicine in stroke rehabilitation. Future prospective and externally validated studies are warranted to further confirm its generalizability.

Footnotes

Abbreviations

Acknowledgements

We express our gratitude to all collaborating institutions for their assistance in sharing clinical data and providing ethical oversight.

ORCID iD

Mengke Lyu

Ethical approval

Approval for this retrospective multicenter study was obtained from the Ethics Committee of The First Affiliated Hospital of Henan University of Chinese Medicine (approval number: 2024HL-457). As the coordinating institution, it oversaw ethical compliance across all 10 participating centers in accordance with the Guidelines for the Protection of Human Subject Research (2020 edition) issued by the National Health Commission of China and the Declaration of Helsinki.

Informed consent

Informed consent waiver rationale: The requirement for informed consent was waived because the study used only preexisting retrospective medical data, all of which were de-identified prior to analysis. No additional patient contact was conducted, and the study posed no risks to participants. Data de-identification protocols: To protect privacy, each center removed all direct identifiers (name, ID, contact information, admission and record numbers) from the dataset. Patients were assigned unique alphanumeric codes (e.g. ZZ-001, LY-002) for anonymization. The mapping file was password-protected, stored on an encrypted server accessible only to a designated data manager, and permanently destroyed after data verification. Source of psychosocial data: Psychosocial variables (income level, sleep quality, occupational stress, overwork) were extracted from routinely collected in-hospital records, including (1) admission assessment forms completed by nurses within 24 h of admission and (2) psychological evaluation records from clinical psychologists using validated tools (PSQI for sleep, Job Content Questionnaire (JCQ) for occupational stress). No retrospective patient contact was performed.

This process ensured strict compliance with national regulations and international ethical standards for retrospective medical research.

Contributorship

The study was designed and data collection supervised by Xiaoming Shen and Mengke Lyu. Data analysis was conducted by Min Li. Manuscript writing and revision were contributed by Mengke Lyu and Christian Hölscher. All authors have reviewed and approved the final manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was financially supported by the National Natural Science Foundation of China (Grant Nos. 81303011 and 81973618).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Availability of data and materials

The datasets produced and/or scrutinized in this study are not publicly accessible to safeguard patient confidentiality and comply with institutional guidelines. However, interested parties may obtain the data from the corresponding author upon request.

Guarantor

Xiaoming Shen is the guarantor of this study and accepts full responsibility for the integrity of the data and the accuracy of the analysis.

Supplemental material

Supplemental material for this article is available online.

References

Zhu

B-L

, et al. Association between obesity and post-stroke anxiety in patients with acute ischemic stroke. Front Nutr 2021; 8: 749958.

Redmond

, et al. Association of in-hospital depression and anxiety symptoms following stroke with 3 months—depression, anxiety and functional outcome. J Clin Neurosci 2022; 98: 133–136.

Yao

Luo

, et al. High prevalence of post-stroke anxiety in elderly patients following COVID-19 outbreak. Front Psychiatry 2021; 12: 699869.

Shi

Lang

, et al. Longitudinal change of six common inflammatory cytokines and their relationship to anxiety, depression, and cognitive impairment in acute ischemic stroke patients. Braz J Med Biol Res 2023; 56: e13025.

Wang

Zhao

Lin

, et al. Post-stroke anxiety analysis via machine learning methods. Front Aging Neurosci 2021; 13: 657937.

Oei

Tan

R-S

Chan

, et al. Explainable risk prediction of post-stroke adverse mental outcomes using machine learning techniques in a population of 1780 patients. Sensors 2023; 23: 7946.

Sadeghi

Alizadehsani

CIFCI

, et al. A review of explainable artificial intelligence in healthcare. Comput Electr Eng 2024; 118: 109370.

Bjelland

Dahl

Haug

, et al. The validity of the Hospital Anxiety and Distress Scale. J Psychosom Res 2002; 52: 69–77.

Greenberg

Ziai

Cordonnier

, et al. 2022 Guideline for the management of patients with spontaneous intracerebral hemorrhage: a guideline from the American Heart Association/American Stroke Association. Stroke 2022; 53: 101161STR0000000000000407 .

10.

Powers

Rabinstein

Ackerson

, et al. 2018 guidelines for the early management of patients with acute ischemic stroke: a guideline for healthcare professionals from the American Heart Association/American Stroke Association. Stroke 2018; 49: e46–e110.

11.

Riley

Collins

. Stability of clinical prediction models developed using statistical or machine learning methods. Biom J 2023; 65: 202200302.

12.

Hashemi

Looha

Mazaherinia

, et al. Risk of stroke development following retinal vein occlusion: a systematic review and meta-analysis. Surv Ophthalmol 2024; 69: 924–936.

13.

Asghari

Seyedi

Kazemi

, et al. Imaging the neuroinflammatory landscape: COX enzyme mapping with PET. Clin Transl Imaging 2025; 13: 351–367.

14.

Yousefi

Akhbari

Mohamadi

, et al. Machine learning based algorithms for virtual early detection and screening of neurodegenerative and neurocognitive disorders: a systematic-review. Front Neurol 2024; 15: 1413071.

15.

Noori Mirtaheri

Akhbari

Najafi

, et al. Performance of deep learning models for automatic histopathological grading of meningiomas: a systematic review and meta-analysis. Front Neurol 2025; 16: 1536751.

16.

, et al. Colchicine in patients with acute ischaemic stroke or transient ischaemic attack (CHANCE-3): multicentre, double blind, randomised, placebo controlled trial. BMJ 2024; 385: e079061.

17.

Johnson

. Anxiety and hypertension: is there a link? A literature review of the comorbidity relationship between anxiety and hypertension. Curr Hypertens Rep 2019; 21: 66.

18.

You

Wang

, et al. Twenty-four-hour post-thrombolysis NIHSS score as the strongest prognostic predictor after acute ischemic stroke: ENCHANTED study. J Am Heart Assoc 2024; 13: e036109.

19.

Peng

Lai

, et al. The bidirectional relationship between sleep disturbance and anxiety: sleep disturbance is a stronger predictor of anxiety. Sleep Med 2024; 121: 63–68.

20.

Huang

. Analysis of the influencing factors of increasing farmers’ income, urban–rural income gap, and anxiety disorder. CNS Spectr 2023; 28: S15–S15.

21.

Sashindranath

Nandurkar

. Endothelial dysfunction in the brain. Stroke 2021; 52: 1895–1904.

22.

Mendyk-Bordet

A-M

Ouk

Muhr-Tailleux

, et al. Endothelial dysfunction and pre-existing cognitive disorders in stroke patients. Biomolecules 2024; 14: 721.