Abstract

Introduction

Acute dizziness accounts for approximately 4% of emergency department (ED) visits, with stroke often missed. Current methods for stroke detection in dizzy patients have notable limitations, with vestibular strokes missed in a substantial proportion of ED visits. This study aimed to develop a machine learning (ML) tool to assess stroke risk in patients with acute dizziness.

Methods

We developed an ensemble model combining four ML algorithms using structured electronic medical record data and unstructured ED physician notes. Model performance was evaluated on a holdout test set and compared with the ABCD² score using area under the receiver operating characteristic curve (AUC), net reclassification improvement (NRI), integrated discrimination improvement (IDI), and decision curve analysis.

Results

The ensemble model achieved the highest AUC at 0.880, significantly outperforming the ABCD² score (AUC 0.673) and individual ML models. The ensemble model demonstrated superior calibration with the lowest Brier score and showed greater clinical utility across different risk thresholds. Features extracted from unstructured clinical text substantially enhanced model performance, with models combining structured and unstructured data consistently outperforming those trained on structured data alone.

Conclusions

Our ensemble prediction model effectively stratifies stroke risk in ED patients with acute dizziness. By integrating natural language processing of clinical notes with structured patient data, the model offers a more accurate risk assessment than traditional methods. The implementation of this tool could improve patient outcomes by directing advanced neuroimaging to high-risk patients while avoiding unnecessary testing in low-risk patients, ultimately enhancing patient safety and optimizing resource utilization.

Keywords

dizziness, emergency department, machine learning, prediction, stroke, vertigo

Introduction

Acute dizziness accounts for approximately 4% of emergency department (ED) visits worldwide, ranging from benign vestibular disorders to life-threatening stroke.¹ The challenge for emergency physicians lies in accurately differentiating between these conditions, as misdiagnosis can cause significant morbidity and mortality. Despite its importance, stroke remains one of the most frequently missed diagnoses in patients presenting with dizziness or vertigo in emergency settings.

Current diagnostic methods face notable limitations. Routine neurological exams often fail to detect subtle deficits from posterior circulation strokes, which can be small and spare main neural pathways.² Vestibular strokes are missed in up to 35% of ED visits, versus only 4% for those presenting with motor weakness.³ Computed tomography scans have poor sensitivity for posterior fossa ischemia,⁴ while magnetic resonance imaging (MRI), though more sensitive, is more resource-intensive and often unavailable in emergency settings.

Existing assessment tools include the ABCD² score,⁵ the head impulse, nystagmus, and test of skew (HINTS) exam,⁶ the STANDING diagnostic algorithm,⁷ and the TriAGe+ score.⁸ While the latter three outperform ABCD² in identifying the causes of central vertigo,^9–11 emergency physicians often struggle with administering and interpreting these tests properly.¹² HINTS can produce false positives in patients without acute vestibular syndrome, leading to unnecessary imaging.¹¹ At the same time, HINTS, STANDING and TriAGe+ require specialized neurological assessment that may be difficult to perform in busy EDs with varying levels of staff expertise.¹³

Known stroke predictors in patients with dizziness include advanced age, male gender, diabetes, atrial fibrillation, previous cerebrovascular disease, recurrent vertigo, and high blood pressure at ED presentation.^14–18 These factors, combined with clinical features, can help assess risk. Electronic medical records (EMRs) include both structured data (such as demographics and vital signs) and unstructured text (like clinical notes) that may contain useful information for predicting stroke risk in these patients.

A clinical decision tool for stroke risk assessment in patients with acute dizziness is urgently needed due to current diagnostic challenges. With stroke frequently misdiagnosed and existing methods limited, emergency care has a significant gap. Therefore, this study aims to develop and validate a prediction model using EMR data through natural language processing (NLP) and machine learning (ML) to assess the risk of stroke in ED patients with acute dizziness.

Methods

Study setting and data source

This retrospective study was conducted at Ditmanson Medical Foundation Chia-Yi Christian Hospital, a 1000-bed tertiary teaching facility in southern Taiwan. The ED handles 75,000 to 105,000 annual visits. Data came from the Ditmanson Research Database (DRD), a de-identified research repository containing administrative claims, EMRs, and National Death Index vital status data. As previously described,¹⁹ the DRD holds clinical data for approximately 1.6 million patients of the hospital. It contains structured data (demographics, diagnoses, prescriptions, procedures, ED triage information, physiological measurements, laboratory results) and unstructured text (physician notes, nursing notes, radiology reports, and pathology reports).

Study population

Taiwan’s EDs use the Taiwan Triage and Acuity Scale,²⁰ adapted from the Canadian system, which organizes complaints by organ systems. Triage nurses assess each patient’s primary complaint, which formed the basis for our study population selection.

As shown in Supplemental Figure S1, we extracted triage data using SQL from the DRD for patients ≥20 years who visited the ED for acute dizziness (2012 to 2021). Dizziness complaints included vertigo, lightheadedness, gait imbalance, and nonspecific dizziness. We included only initial visits for patients with multiple presentations, excluded those lost to follow-up (no visits within two years), and removed patients with confirmed stroke during the ED visit, unavailable EMR data, or no blood tests.

To predict acute stroke in dizzy patients, we focused on strokes within seven days post-ED visit. We excluded patients with stroke diagnoses from 8 to 365 days post-visit. Those diagnosed within seven days were classified as stroke cases, while those without any stroke diagnosis within one year served as non-stroke controls.
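The labeling rule above can be written as a small function. This is a minimal sketch, not the study's code: the function name and the `days_to_stroke` input (days from the index ED visit to the first subsequent stroke diagnosis, or `None` if there was none) are assumptions made for illustration, and patients with stroke confirmed during the index visit are assumed to have been removed already.

```python
from typing import Optional


def label_outcome(days_to_stroke: Optional[int]) -> Optional[int]:
    """Return 1 for a stroke case, 0 for a non-stroke control, None if excluded.

    `days_to_stroke` is a hypothetical input: days from the index ED visit
    to the first later stroke diagnosis, or None when no stroke was recorded.
    """
    if days_to_stroke is None:
        return 0  # no stroke diagnosis at all: control
    if days_to_stroke < 0:
        raise ValueError("days_to_stroke must refer to a diagnosis after the visit")
    if days_to_stroke <= 7:
        return 1  # stroke within seven days: case
    if days_to_stroke <= 365:
        return None  # stroke from day 8 to day 365: excluded
    return 0  # no stroke within one year: control
```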

Features

Features were extracted from structured and unstructured data. Structured data (Supplemental Table S1) includes demographics, stroke risk factors, triage information, vital signs, and laboratory results. Due to the link between inflammation and stroke risk,²¹ inflammatory markers (neutrophil-to-lymphocyte, monocyte-to-lymphocyte, platelet-to-lymphocyte, and platelet-to-white blood cell ratios) were also considered.
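As an illustration, the four inflammatory ratios can be derived from a complete blood count as follows. This is a sketch under the assumption that all inputs are absolute counts in the same units (for example, 10⁹/L); the text does not state whether absolute counts or differential percentages were used, and the function name is hypothetical.

```python
def inflammatory_ratios(neutrophils: float, lymphocytes: float,
                        monocytes: float, platelets: float,
                        wbc: float) -> dict:
    """Compute NLR, MLR, PLR, and PWR from counts given in the same units.

    Assumption: inputs are absolute counts (e.g., 10^9/L), not percentages.
    """
    if lymphocytes <= 0 or wbc <= 0:
        raise ValueError("lymphocyte and WBC counts must be positive")
    return {
        "NLR": neutrophils / lymphocytes,  # neutrophil-to-lymphocyte
        "MLR": monocytes / lymphocytes,    # monocyte-to-lymphocyte
        "PLR": platelets / lymphocytes,    # platelet-to-lymphocyte
        "PWR": platelets / wbc,            # platelet-to-white blood cell
    }
```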

Variables with >50% missing data were excluded. Outliers, defined as values more than 4 standard deviations above or below the mean,²² were imputed along with missing data using multivariate imputation by chained equations (MICE). Imputation was performed first on the training set and then on the holdout test set using the prediction matrix derived from the training set. The MICE algorithm sequentially fills in missing values one variable at a time using other variables as predictors.²³ This process iterates until convergence or a preset iteration limit is reached.

For continuous variables, predictive mean matching prevented out-of-range imputations, while logistic regression handled categorical variables. MICE created five imputed datasets; continuous variable values were averaged,²⁴ and categorical variables used the mode. After imputation, continuous variables were normalized to zero mean and unit standard deviation, while categorical variables were binary encoded (0 = absent, 1 = present).
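The steps surrounding the imputation can be sketched in plain Python: flagging outliers so they are treated as missing, pooling the five imputed values, and standardizing. The chained-equation imputation itself is not reproduced here. All function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not specify it.

```python
from statistics import mean, mode, stdev


def mask_outliers(values, k=4.0):
    """Replace values more than k SDs from the mean with None (to be imputed)."""
    observed = [v for v in values if v is not None]
    mu, sd = mean(observed), stdev(observed)
    if sd == 0:
        return list(values)
    return [None if v is not None and abs(v - mu) > k * sd else v
            for v in values]


def pool_continuous(imputed_values):
    """Average one continuous cell across the imputed datasets."""
    return mean(imputed_values)


def pool_categorical(imputed_values):
    """Take the most frequent category across the imputed datasets."""
    return mode(imputed_values)


def standardize(values):
    """Scale values to zero mean and unit standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]
```

In practice the mean and standard deviation used for scaling would normally be estimated on the training set and reused for the test set; the sketch computes them from the values it is given for brevity.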

We extracted unstructured data from ED physician notes. Two text vectorization methods were tested: a basic “bag-of-words” (BOW) approach and a deep learning approach using the bidirectional encoder representations from transformers (BERT) language model. Text preprocessing included: (1) spell-checking and auto-correction using Jazzy spell checker (https://github.com/kinow/jazzy); (2) expanding acronyms with a clinical terms list; (3) removing non-ASCII and non-word special characters; (4) converting to lowercase; (5) lemmatizing words; (6) removing stop words. Only steps 1 through 3 were needed for the BERT approach.
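A simplified version of this pipeline might look like the sketch below. It covers steps 2, 3, 4, and 6 only; spell-checking (step 1) and lemmatization (step 5) are omitted because they depend on external tools. The acronym map and stop-word list are tiny hypothetical stand-ins, not the clinical terms list used in the study.

```python
import re

# Hypothetical examples only; the study used its own clinical terms list.
ACRONYMS = {"htn": "hypertension", "dm": "diabetes mellitus",
            "af": "atrial fibrillation"}
STOP_WORDS = {"a", "an", "the", "of", "and", "with", "to", "in"}


def preprocess(note: str, for_bert: bool = False) -> str:
    """Clean a clinical note for BOW (default) or BERT vectorization."""
    # Step 2: expand acronyms word by word (case-insensitive lookup).
    text = re.sub(r"\w+",
                  lambda m: ACRONYMS.get(m.group(0).lower(), m.group(0)),
                  note)
    # Step 3: drop non-ASCII characters, then non-word special characters.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    if for_bert:
        # The BERT approach stops after step 3; only whitespace is tidied.
        return " ".join(text.split())
    # Step 4: lowercase. Step 6: remove stop words.
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)
```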

For BOW, we built a document-term matrix with columns representing unique words and rows representing patient documents. Matrix cells indicated word presence. To reduce redundancy and improve efficiency,²⁵ we removed words that appeared in fewer than 5% or more than 95% of training documents and applied penalized logistic regression with 10-fold cross-validation to identify predictive words.^26,27
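The presence-based document-term matrix and the document-frequency filter can be illustrated without external libraries. This is a sketch with hypothetical function names; the penalized logistic regression step that selects predictive words is not shown.

```python
def build_vocabulary(train_docs, min_df=0.05, max_df=0.95):
    """Keep words whose training document frequency lies in [min_df, max_df]."""
    n_docs = len(train_docs)
    doc_freq = {}
    for doc in train_docs:
        for word in set(doc.split()):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return sorted(w for w, c in doc_freq.items()
                  if min_df <= c / n_docs <= max_df)


def document_term_matrix(docs, vocabulary):
    """Rows are documents, columns are words, cells mark presence (1) or absence (0)."""
    matrix = []
    for doc in docs:
        words = set(doc.split())
        matrix.append([1 if w in words else 0 for w in vocabulary])
    return matrix
```

The vocabulary is built from training documents only and then reused for the test documents, which keeps test-set information out of feature construction. With scikit-learn, `CountVectorizer(binary=True, min_df=0.05, max_df=0.95)` offers comparable behavior, although whether the study used it is not stated.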

For BERT, we used the language model to convert clinical text into context-aware vectors. BERT is a deep neural network that uses bidirectional transformers trained through masked language modeling and next-sentence prediction.²⁸ We specifically employed ClinicalBERT,²⁹ a version adapted for medical text.

Model building

Figure 1 illustrates the model-building process. We combined features from both unstructured clinical text and structured data. The dataset was divided by year into a training set (2012–2018) and a holdout test set (2019–2021), as shown in Supplemental Figure S1. Models were developed using the training data, with the holdout test set reserved exclusively for final evaluation to prevent information leakage. To enhance the robustness of our findings, we conducted supplementary analyses to provide preliminary evidence of model transferability within our single-center cohort. First, we split the holdout test set into two subsets by time period (year 2019 and years 2020 to 2021) because documentation styles may have changed over time with physician turnover. Second, we split the holdout test set by sex to assess consistency across patient subgroups.
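A temporal split of this kind is straightforward to express. The sketch below assumes each record is a dictionary with a hypothetical `visit_year` key; the cutoff year follows the text.

```python
def temporal_split(records, last_training_year=2018):
    """Split records into training (up to the cutoff year) and holdout test sets."""
    train = [r for r in records if r["visit_year"] <= last_training_year]
    test = [r for r in records if r["visit_year"] > last_training_year]
    return train, test
```

Splitting by calendar year rather than at random means the test set mimics future patients, which is closer to how a deployed tool would be used.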

Figure 1.

Machine-learning model development workflow.

Supplemental Methods detail the implementation of model construction, including handling class imbalance, hyperparameter tuning, and cross-validation. In brief, to address class imbalance in medical datasets that biases ML models toward majority classes,³⁰ we tested random oversampling and undersampling techniques to balance minority and majority classes. We developed prediction models using four algorithms: artificial neural networks (ANN), logistic regression (LR), random forest (RF), and support vector machines (SVM). We performed 10-fold cross-validation three times on the training set to optimize hyperparameters (Supplemental Table S2), text vectorization, and class balancing methods. Model performance was evaluated using the area under the receiver operating characteristic curve (AUC).
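Random oversampling and undersampling can be illustrated in a few lines. This is a sketch of the general techniques with hypothetical function names, not the study's implementation; the imbalanced-learn package listed among the software provides `RandomOverSampler` and `RandomUnderSampler` for the same purpose.

```python
import random


def _group_by_class(X, y):
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return groups


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Randomly drop rows until every class matches the smallest one."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        X_out.extend(rng.sample(rows, target))
        y_out.extend([label] * target)
    return X_out, y_out
```

Resampling should be applied to training folds only, so that validation and test data keep the original class distribution.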

We then created four ensemble models using the optimized ML models as base models. Ensemble learning improves performance over individual models by reducing variance or bias.^31,32 Using predicted probabilities from the four base models, we calculated mean, median, maximum, and minimum values for each patient to create four ensemble models. We evaluated these models on the holdout test set, using an LR model trained solely on structured data as the baseline. Shapley additive explanation (SHAP) analysis³³ was used to interpret model outputs and identify important features.
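The four aggregation rules can be sketched as follows. The function name and input layout (a dictionary mapping each base model to its per-patient predicted probabilities) are assumptions for illustration.

```python
from statistics import mean, median


def ensemble_probabilities(base_probs):
    """Combine base-model probabilities per patient by mean, median, max, and min.

    `base_probs` maps a model name to a list of predicted probabilities,
    one per patient, with all lists in the same patient order.
    """
    per_patient = list(zip(*base_probs.values()))
    return {
        "mean": [mean(p) for p in per_patient],
        "median": [median(p) for p in per_patient],
        "max": [max(p) for p in per_patient],
        "min": [min(p) for p in per_patient],
    }
```

Each rule yields one ensemble: the maximum favors sensitivity because any single confident base model flags the patient, while the minimum requires agreement among all base models.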

Experiments were conducted on a system equipped with an NVIDIA GeForce RTX 3090 GPU, running Linux 6.8.0-60-generic-x86_64 with glibc 2.35, Python 3.10.12, and CUDA 12.1. Model training utilized imbalanced-learn 0.12.3, scikit-learn 1.5.2, torch 2.4.1, transformers 4.45.2, and shap 0.46.0.

Clinical model for comparison

The ABCD² score evaluates stroke risk after transient ischemic attack.⁵ It assigns points for age (1 for ≥60 years), blood pressure (1 for systolic ≥140 mmHg or diastolic ≥90 mmHg), clinical features (2 for unilateral weakness; 1 for speech disturbance without weakness), symptom duration (1 for 10–59 minutes and 2 for ≥60 minutes), and diabetes (1 point). The score ranges from 0 to 7 and helps estimate the likelihood of a stroke within 2–7 days. Studies have demonstrated its ability to predict stroke risk in patients with dizziness.^9,34
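For reference, the score can be computed as in the sketch below, which follows the published ABCD² scoring in which age ≥60 years earns a single point, so that the maximum is 7. The function and parameter names are illustrative.

```python
def abcd2_score(age: int, sbp: float, dbp: float,
                unilateral_weakness: bool, speech_disturbance: bool,
                duration_min: float, diabetes: bool) -> int:
    """Return the ABCD2 score (0 to 7)."""
    score = 0
    if age >= 60:                      # Age
        score += 1
    if sbp >= 140 or dbp >= 90:        # Blood pressure
        score += 1
    if unilateral_weakness:            # Clinical features
        score += 2
    elif speech_disturbance:
        score += 1
    if duration_min >= 60:             # Duration of symptoms
        score += 2
    elif duration_min >= 10:
        score += 1
    if diabetes:                       # Diabetes
        score += 1
    return score
```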

Statistical analysis

Patient characteristics were summarized using counts and percentages for categorical variables and means with standard deviations for continuous variables. Group differences were assessed using chi-squared tests (categorical) and t-tests (continuous).

For each model, we calculated sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and likelihood ratios (LR). Sensitivity measures the ability to correctly identify true stroke cases, while specificity measures the ability to correctly identify those who will not have a stroke. PPV and NPV indicate the probability that a “positive” or “negative” prediction by the model is correct in a real-world clinical setting. LRs indicate how much a test result will change the odds of having a stroke. A higher positive LR (LR+) and a lower negative LR (LR-) suggest a more clinically useful test.
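These metrics all derive from the four cells of a confusion matrix, as the following sketch shows. The function name is hypothetical.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute diagnostic accuracy metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_pos": sensitivity / (1 - specificity),  # LR+
        "lr_neg": (1 - sensitivity) / specificity,  # LR-
    }
```

Because PPV and NPV depend on outcome prevalence while likelihood ratios do not, the two sets of metrics answer different questions: how trustworthy a result is in this population, and how strongly the test shifts the odds in any population.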

The ABCD² threshold was ≥4.^9,34 Model discrimination was assessed using the AUC on the holdout test set. AUC is a global measure of how well the model distinguishes between stroke and non-stroke cases. An AUC of 1.0 is perfect, and an AUC of 0.5 is no better than a coin flip. An AUC value above 0.7 is considered clinically acceptable.³⁵
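The AUC equals the probability that a randomly chosen stroke case receives a higher score than a randomly chosen non-stroke case, with ties counted as one half. The sketch below computes it directly from that definition; it is written for clarity rather than speed, and the function name is illustrative.

```python
def auc(y_true, scores):
    """AUC as the fraction of case-control pairs ranked correctly (ties = 0.5)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```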

The ensemble model’s discrimination ability was compared to ABCD² and other models using DeLong’s method.³⁶ Continuous net reclassification improvement (NRI) and integrated discrimination improvement (IDI) indices³⁷ were calculated, with higher values indicating better discrimination. Calibration was assessed via the Brier score, where lower scores signify better calibration. Finally, decision curve analysis was used to estimate net benefit for examining clinical utility, which reflects the improved decision-making capacity provided by each model.³⁸
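Two of these measures are simple enough to sketch directly. The Brier score is the mean squared difference between predicted probability and observed outcome, and the net benefit at a risk threshold p_t is TP/n − (FP/n) × p_t/(1 − p_t), the standard definition used in decision curve analysis. Function names are illustrative.

```python
def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, y_true)) / len(y_true)


def net_benefit(y_true, probs, threshold):
    """Net benefit of treating patients whose predicted risk is >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1 - threshold)
```

A decision curve is obtained by evaluating `net_benefit` over a range of thresholds and comparing each model against the "treat all" and "treat none" strategies.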

All data preprocessing and statistical analyses were conducted using R version 4.3.1 (R Foundation for Statistical Computing, Vienna, Austria) and Stata 15.1 (StataCorp, College Station, Texas). Two-tailed p-values < 0.05 were considered statistically significant.

Results

Characteristics of the study population

From 41,342 ED visits for acute dizziness, we excluded repeat visits (n = 12,013) and those without a two-year follow-up (n = 5,063). After further excluding initial stroke diagnoses (n = 589), unavailable EMR data (n = 849), missing blood test results (n = 8,354), and delayed strokes (n = 194), our study population consisted of 14,280 patients (Table 1). Of these, 595 developed strokes within 7 days. Stroke patients were older, more frequently male, and had higher rates of vascular risk factors (hypertension, diabetes, hyperlipidemia, atrial fibrillation, prior stroke). The groups differed significantly in terms of triage details, vital signs, and laboratory results. We divided the population into training (n = 11,885) and holdout test (n = 2,395) sets by visit year (Supplemental Figure S1), with the characteristics detailed in Supplemental Table S1.

Table 1.

Characteristics of the study population.

Variables	Stroke (n = 595)	Non-stroke (n = 13685)	P
Demographics
Age, year	68.0 (12.6)	61.8 (16.9)	<0.001
Male	351 (59.0)	5736 (41.2)	<0.001
Risk factors
Hypertension	458 (77.0)	6698 (48.9)	<0.001
Diabetes mellitus	264 (44.4)	3753 (27.4)	<0.001
Hyperlipidemia	320 (53.8)	2692 (19.7)	<0.001
Atrial fibrillation	58 (9.7)	468 (3.4)	<0.001
Prior stroke	108 (18.2)	1122 (8.2)	<0.001
IHD	73 (12.3)	1529 (11.2)	0.407
Triage details
Acuity level			<0.001
Level 1	1 (0.2)	37 (0.3)
Level 2	300 (50.4)	4869 (35.6)
Level 3	292 (49.1)	8763 (64.0)
Level 4	2 (0.3)	16 (0.1)
Modifier of vertigo/dizziness			<0.001
Non-positional, with or without other neurological symptoms	297 (49.9)	4825 (35.3)
Positional, without other neurological symptoms	297 (49.9)	8838 (64.6)
Chronic recurrent vertigo	1 (0.2)	17 (0.1)
Others	0 (0)	5 (0.0)
Shift of ED arrival			<0.001
Night (0–8)	58 (9.7)	1852 (13.5)
Day (8–16)	330 (55.5)	6579 (48.1)
Evening (16–24)	207 (34.8)	5254 (38.4)
Patient activity			<0.001
Able to ambulate independently	158 (26.6)	6362 (46.5)
Requires assistance to ambulate	224 (37.6)	3966 (29.0)
Unable to ambulate	213 (35.8)	3355 (24.5)
Vital signs
PR, beats/min	79.5 (15.6)	81.5 (17.2)	0.004
RR, breaths/min	19.3 (1.3)	19.5 (1.4)	0.013
BT, °C	36.3 (0.7)	36.3 (0.7)	0.490
SBP, mmHg	160.0 (27.1)	144.0 (28.3)	<0.001
DBP, mmHg	91.1 (15.0)	84.5 (15.1)	<0.001
Laboratory results
RBC count, 10¹²/L	4.6 (0.7)	4.4 (0.7)	<0.001
Hemoglobin, g/L	135.3 (21.3)	128.1 (22.4)	<0.001
Hematocrit	0.4 (0.1)	0.4 (0.1)	<0.001
MCV, fL	87.7 (8.2)	87.9 (8.0)	0.699
MCH, fmol/cell	1.8 (0.2)	1.8 (0.2)	0.650
MCHC, g/L	336.5 (13.1)	335.4 (14.2)	0.055
RDW-CV, %	13.6 (1.5)	13.7 (1.8)	0.122
WBC count, 10⁹/L	8.7 (4.2)	8.1 (3.6)	<0.001
Segment, %	68.3 (12.9)	67.5 (13.2)	0.155
Lymphocyte, %	24.2 (11.2)	24.9 (11.8)	0.164
Monocyte, %	5.2 (2.2)	5.4 (2.5)	0.149
Eosinophil, %	1.9 (2.8)	1.8 (2.3)	0.123
Basophil, %	0.3 (0.3)	0.3 (0.3)	0.233
Platelet count, 10⁹/L	217.5 (75.5)	212.0 (75.5)	0.080
MPV, fL	10.0 (0.9)	9.9 (0.8)	0.003
Glucose, mmol/L	8.9 (4.1)	8.0 (3.7)	<0.001
AST, μkat/L	0.5 (0.5)	0.6 (1.5)	0.020
ALT, μkat/L	0.4 (0.4)	0.5 (1.1)	0.038
BUN, mmol/L	6.9 (3.8)	7.3 (6.0)	0.186
Creatinine, μmol/L	97.7 (94.0)	106.6 (140.6)	0.126
eGFR, mL/min/1.73m²	80.1 (34.4)	84.1 (41.6)	0.021
Sodium, mmol/L	137.9 (3.7)	137.4 (4.8)	0.004
Potassium, mmol/L	3.8 (0.5)	3.8 (0.5)	0.465
NLR	4.2 (4.1)	4.3 (5.2)	0.525
MLR	0.3 (0.2)	0.3 (0.3)	0.119
PLR	695.3 (462.1)	775.3 (729.8)	0.008
PWR	27.5 (10.7)	29.0 (13.7)	0.006

Data are given as n (%) or mean (standard deviation).

ALT, alanine aminotransferase; AST, aspartate aminotransferase; BT, body temperature; BUN, blood urea nitrogen; DBP, diastolic blood pressure; ED, emergency department; eGFR, estimated glomerular filtration rate; IHD, ischemic heart disease; MCH, mean corpuscular hemoglobin; MCHC, mean corpuscular hemoglobin concentration; MCV, mean corpuscular volume; MLR, monocyte-to-lymphocyte ratio; MPV, mean platelet volume; NLR, neutrophil-to-lymphocyte ratio; PLR, platelet-to-lymphocyte ratio; PR, pulse rate; PWR, platelet-to-white blood cell ratio; RBC, red blood cell; RDW-CV, red cell distribution width-coefficient of variation; RR, respiratory rate; SBP, systolic blood pressure; WBC, white blood cell.

Construction of prediction models

Supplemental Figure S2A shows the highest AUC values after hyperparameter optimization from three 10-fold cross-validations on the training set across various algorithm-feature combinations. Models using only structured data showed lower AUCs than those using unstructured clinical text. Models combining both data types achieved higher AUCs across all algorithms. The baseline model (structured data only) reached an AUC of 0.803. The highest AUCs for ANN, LR, RF, and SVM models were 0.857, 0.852, 0.872, and 0.879, respectively. Their specific resampling methods, feature sets, and hyperparameters appear in Supplemental Table S3. Among ensemble models, the one averaging predictions from all four models achieved the highest training set AUC (Supplemental Figure S2B).

Comparison of prediction performance

Table 2 compares performance metrics of various ML models, including the best ensemble model from training, against the baseline model and the ABCD² score for stroke detection on the holdout test set. The ABCD² score had the lowest AUC (0.673), followed by the baseline model (0.791). AUCs for ANN, LR, RF, and SVM models were 0.850, 0.857, 0.846, and 0.858, respectively (Figure 2). The ensemble model achieved the highest AUC (0.880), significantly outperforming both ABCD² and other ML models (Supplemental Table S4). NRI and IDI indices confirmed the ensemble model’s superior discrimination ability. It also had the lowest Brier score, indicating better calibration.

Table 2.

Comparative performance of machine learning models and the ABCD² score for stroke prediction. This table evaluates model performance on a holdout test set. Values represent the primary estimate followed by the 95% CI in parentheses.

Metric	ABCD² ≥ 4	Baseline	Ensemble	ANN	LR	RF	SVM
AUC	0.673 (0.632–0.713)	0.791 (0.755–0.827)	0.880 (0.857–0.903)	0.850 (0.818–0.881)	0.857 (0.830–0.883)	0.846 (0.813–0.878)	0.858 (0.831–0.886)
Sensitivity	73.4 (64.9–80.9)	74.2 (65.7–81.5)	72.7 (64.1–80.2)	50.0 (41.0–59.0)	78.9 (70.8–85.6)	82.8 (75.1–88.9)	75.8 (67.4–82.9)
Specificity	53.0 (50.9–55.0)	70.0 (68.1–71.9)	84.3 (82.8–85.8)	91.0 (89.8–92.2)	77.0 (75.2–78.7)	70.6 (68.7–72.5)	77.9 (76.2–79.6)
PPV	8.1 (6.6–9.8)	12.3 (10.0–14.8)	20.8 (17.1–24.8)	24.0 (19.0–29.6)	16.2 (13.4–19.4)	13.7 (11.4–16.4)	16.2 (13.4–19.5)
NPV	97.2 (96.2–98.1)	98.0 (97.2–98.6)	98.2 (97.5–98.7)	97.0 (96.2–97.7)	98.5 (97.8–99.0)	98.6 (98.0–99.1)	98.3 (97.6–98.8)
LR+	1.56 (1.39–1.75)	2.47 (2.19–2.79)	4.64 (4.02–5.35)	5.58 (4.49–6.94)	3.43 (3.05–3.86)	2.82 (2.55–3.12)	3.44 (3.03–3.89)
LR-	0.50 (0.38–0.67)	0.37 (0.27–0.50)	0.32 (0.24–0.43)	0.55 (0.46–0.65)	0.27 (0.20–0.38)	0.24 (0.17–0.36)	0.31 (0.23–0.42)

Data are reported as percentages (95% CI), except for AUC and LRs.

AUC, area under the receiver operating characteristic curve; ANN, artificial neural network; CI, confidence interval; LR, logistic regression; LR+, positive likelihood ratio; LR-, negative likelihood ratio; NPV, negative predictive value; PPV, positive predictive value; RF, random forest; SVM, support vector machine.

Figure 2.

ROC curves comparing model performance.

All models showed high NPVs, with ABCD² having the lowest PPV (8.1%) and LR+ (1.56). Among ML models, ANN achieved the highest PPV (24.0%) and LR+ (5.58), but the lowest sensitivity (50.0%). The ensemble model ranked second in PPV (20.8%) and LR+ (4.64), with comparable sensitivity (72.7%). Figure 3 shows net benefit curves across risk thresholds, with the ensemble model demonstrating superior clinical utility.

Figure 3.

Decision curve analysis comparing clinical net benefit across prediction models.

Supplemental Tables S5 to S8 present the results of supplementary analyses that used different holdout test sets to assess consistency across varying time periods and patient populations. These results were consistent with the main analysis. The ABCD² score had the lowest AUCs, ranging from 0.644 to 0.704, whereas the ensemble achieved the highest AUCs, ranging from 0.854 to 0.901. Similarly, the ABCD² score had the lowest PPV and LR+ values among all the models, while the ensemble model ranked second in terms of PPV and LR+.

Feature importance

We used SHAP analysis to identify key features for stroke prediction in patients with acute dizziness. Figure 4a shows the top 20 features from the baseline model (structured data), while Figure 4b shows the top 20 from the RF model (combined structured and unstructured data). Important predictors included age, male gender, risk factors (hyperlipidemia, hypertension), vital signs (systolic blood pressure, temperature, etc.), laboratory values (creatinine, hematocrit, etc.), and inflammatory markers (monocyte-to-lymphocyte and platelet-to-white blood cell ratios). Words like “left,” “tomography,” and “brain” in ED physician notes were also significant contributors.

Figure 4.

Bee swarm plots illustrating the relative impact of clinical, demographic, and textual features on the predictive output of the baseline (a) and random forest (b) models on the holdout test set.

Discussion

We developed a stroke risk assessment tool for ED patients with acute dizziness using an ensemble of four ML models. This ensemble outperformed both the ABCD² score and individual ML models in discrimination, calibration, and clinical utility. Our study found that incorporating unstructured clinical text features significantly improved prediction performance compared to the baseline model, which used structured data alone.

Comparison of models

We evaluated various ML algorithms and feature sets to develop our clinical decision tool. All models exceeded the acceptable AUC threshold of 0.7.³⁵ The ensemble approach integrated predictions from base models, enhancing discriminatory ability. Ensemble learning enhances accuracy, robustness, and efficiency in the presence of complex or noisy data,^39,40 making it a popular choice in healthcare analytics.

Prior studies have investigated ML models for stroke prediction in dizzy ED patients,^41–44 using various algorithms and features. Most relied on video-oculography,^41–43 requiring specialized equipment. One study, like ours, utilized EMR clinical data and provider notes,⁴⁴ but created separate models for structured and unstructured data without integration, and did not explore advanced NLP technologies, such as large language models.

Utility of clinical text

Clinical text analysis enhances ML models for predicting patient outcomes and diagnosing diseases by leveraging rich, unstructured data from clinical narratives.⁴⁵ Research shows that combining free-text notes with structured EMR data enhances stroke prediction accuracy.^46–48 This integration provides a deeper understanding of patient histories and symptoms, thereby improving the performance of stroke prediction models.⁴⁹

For ML algorithms to process clinical text effectively, textual data must be converted to numerical vectors. Various NLP techniques can vectorize text, from simple word-count methods like BOW to sophisticated BERT embeddings. The optimal vectorization method varies by algorithm and classification task.^50,51 Our results are consistent with this: the best-performing vectorization approach differed across the ML algorithms we tested, underscoring the importance of selecting the technique that suits each algorithm to maximize model performance.

Influential features

Beyond traditional risk factors like age, hyperlipidemia, and hypertension, other influential features included systolic blood pressure, hemoglobin, creatinine, and blood cell counts, confirming findings from a previous study.⁴⁴ We also identified inflammatory markers (monocyte-to-lymphocyte and platelet-to-white blood cell ratios) among the top features. While these markers correlate with stroke severity and prognosis,^52,53 their reliability as biomarkers for differentiating acute ischemic stroke requires further research.

While ED physician notes improved model performance, black-box BERT models reduce interpretability.⁵⁴ Despite this, we incorporated several approaches to enhance the interpretability of the prediction model. We used BOW vectorization, which, through SHAP analyses, provides transparent insights into which clinical variables and textual features drive model decisions and offers more interpretable textual features than BERT embeddings alone. Although this approach identified influential words, their interpretation remains challenging and often requires domain experts to evaluate contextual meaning. Nevertheless, expert-verified textual features could help refine prediction models.

Clinical implications

A critical finding in our study is that approximately 4.2% (595/14,280) of patients initially discharged from the ED without stroke identification received delayed stroke diagnoses within seven days. This rate aligns with published literature reporting that 3–5% of acutely dizzy patients in the ED have cerebrovascular disease.¹ Although this proportion is modest, it represents a significant patient safety concern,^55,56 as delayed diagnosis is associated with increased morbidity, missed opportunities for time-sensitive interventions such as thrombolysis and thrombectomy, and failure to initiate secondary prevention, resulting in higher rates of recurrent stroke. Several factors contribute to this diagnostic challenge⁵⁶: posterior circulation strokes often present with isolated dizziness and no obvious focal deficits²; computed tomography scans have poor sensitivity for acute posterior fossa strokes⁴; and cognitive biases may lead to premature closure on benign diagnoses, such as benign paroxysmal positional vertigo.

Our ensemble prediction tool addresses these challenges by systematically integrating vascular risk factors, vital signs, laboratory results, and analysis of free-text physician notes to capture subtle stroke indicators that may be overlooked during time-pressured evaluations. Our tool was trained on ED patients with acute dizziness, not just those with acute vestibular syndrome. It achieves a discrimination ability (AUC 0.880) comparable to existing tools, such as HINTS and TriAGe+ (both with an AUC of 0.88).¹¹

A sensitive clinical decision support tool could flag high-risk patients who might otherwise be discharged, prompting additional evaluation. Clinical implementation should integrate the tool into ED workflows to provide real-time risk stratification and actionable recommendations, while allowing physician overrides with documented reasoning. However, it is important to recognize that not all delayed diagnoses are preventable. Furthermore, model sensitivity must be balanced against specificity to avoid excessive false positives. Our tool’s threshold is adjustable to optimize sensitivity and PPV. Test-positive cases can undergo expedited MRI^57,58 for stroke screening, enabling timely treatment and mitigating patient safety issues.

Strengths and limitations

This study’s main strengths include effectively managing missing data and class imbalance, common issues in medical datasets that impact model reliability. Systematic missing data can render models less representative of the population, introducing bias and limiting their generalizability.⁵⁹ Missing data also reduces statistical power, affecting reproducibility. Similarly, training data biased toward one class can skew model predictions,³⁰ reducing accuracy for underrepresented classes and limiting generalization to new datasets.⁶⁰

This study has several limitations. First, although various holdout sets were used to verify model transferability, this study was conducted on a single-site dataset, which limits the generalizability of its findings. It should be considered an initial proof-of-concept for integrating structured EMR data with unstructured clinical text using advanced NLP and ML techniques to predict stroke risk in dizzy patients in the ED. Prospective multicenter validation studies are necessary to confirm the model’s generalizability across diverse healthcare settings before widespread clinical implementation. Second, there was some outcome assessment bias, as patients hospitalized for stroke outside the study hospital could not be identified. This could lead to an underestimation of stroke risk. Third, we did not compare the developed models to physiology-based approaches, such as the HINTS exam, because no such data were available. Fourth, differences in terminology and style in clinical documentation across healthcare organizations may affect the transferability of the developed models.

Fifth, using the BERT language model limits interpretability. Future research may explore attention mechanisms to identify which portions of clinical text most influence BERT-based predictions. Moreover, clinical validation studies in which physicians review model predictions alongside feature importance explanations to assess clinical face validity are important. Finally, our tool has not been tested in real-world clinical settings. Prospective validation is needed to confirm that implementation reduces delayed diagnosis rates in clinical practice. With appropriate implementation and ongoing validation, such tools have the potential to reduce diagnostic errors, improve patient outcomes, and optimize resource utilization in emergency care settings.

Conclusions

We developed an ensemble prediction model using NLP and ML techniques to assess stroke risk in dizzy ED patients. The model could be integrated into the hospital information system to provide personalized and efficient risk assessment. By targeting advanced neuroimaging at high-risk patients while avoiding unnecessary tests for low-risk ones, this tool could enhance patient safety, reduce errors, and optimize resource allocation.
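The trade-off between imaging high-risk patients and sparing low-risk patients can be quantified with the net benefit statistic from decision curve analysis (Vickers and Elkin): net benefit = TP/n − (FP/n) × pₜ/(1 − pₜ), where pₜ is the risk threshold above which imaging is ordered. The Python sketch below is a minimal illustration on made-up toy data. The outcomes, predicted risks, threshold, and function names are all assumptions and do not come from this study.

```python
def net_benefit(y_true, risk, threshold):
    """Net benefit of imaging every patient with predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 1)
    fp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 0)
    # False positives are weighted by the odds of the chosen threshold.
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_image_all(y_true, threshold):
    """Net benefit of the default strategy of imaging every patient."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data (assumption): 10 patients, 2 strokes, with hypothetical model risks.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
risk = [0.60, 0.30, 0.25, 0.05, 0.04, 0.03, 0.02, 0.02, 0.01, 0.01]
threshold = 0.20  # hypothetical risk threshold for ordering neuroimaging

nb_model = net_benefit(y_true, risk, threshold)
nb_all = net_benefit_image_all(y_true, threshold)
nb_none = 0.0  # imaging nobody yields no true or false positives

print(f"model: {nb_model:.3f}, image all: {nb_all:.3f}, image none: {nb_none:.3f}")
```

In this toy example the model-guided strategy has a higher net benefit than imaging everyone or no one at the chosen threshold. A decision curve repeats the same calculation across a range of thresholds.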

Supplemental material

Supplemental material for Developing a clinical decision support tool for stratifying stroke risk in patients presenting with dizziness to the emergency department: A retrospective cohort study by Sheng-Feng Sung and Ya-Han Hu in Digital Health.

Footnotes

Acknowledgments

The authors thank the Clinical Data Center of Ditmanson Medical Foundation Chia-Yi Christian Hospital for providing administrative and technical support. This study is based on data from the Ditmanson Research Database (DRD), provided by Ditmanson Medical Foundation Chia-Yi Christian Hospital. The interpretation and conclusions contained herein do not represent the position of Ditmanson Medical Foundation Chia-Yi Christian Hospital.

ORCID iDs

Sheng-Feng Sung

Ya-Han Hu

Ethical considerations

The study protocol was approved by the Institutional Review Board of Ditmanson Medical Foundation Chia-Yi Christian Hospital, which waived the requirement for informed consent for this study (approval number: IRB2022111).

Consent to participate

The need for informed consent was waived for this study due to its retrospective design.

Author contributions

SFS conceived the study, performed statistical modeling, and drafted the manuscript. YHH supervised the study and critically revised the manuscript. All authors participated in manuscript revision and approved the final version.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Science and Technology Council [grant number NSTC 112-2221-E-705-001-MY2]. The research funder had no role in the design, conduct, or interpretation of the study, nor in the decision to submit it for publication.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The authors do not have permission to share the data due to local regulations governing the use of electronic medical records.

Guarantor

YHH.

Supplemental material

Supplemental material for this article is available online.

Code availability

The code used to develop and evaluate the prediction models in this study is publicly available at the following link: .

References

1. Newman-Toker, Hsieh Y-H, Camargo, et al. Spectrum of Dizziness Visits to US Emergency Departments: Cross-Sectional Analysis From a Nationally Representative Sample. Mayo Clin Proc 2008; 83: 765–775. https://doi.org/10.4065/83.7.765

2. Tehrani ASS, Kattah, Mantokoudis, et al. Small strokes causing severe vertigo. Neurology 2014; 83: 169–173. https://doi.org/10.1212/wnl.0000000000000573

3. Morgenstern, Lisabeth, Mecozzi, et al. A population-based study of acute stroke and TIA diagnosis. Neurology 2004; 62: 895–900. https://doi.org/10.1212/01.wnl.0000115103.49326.5e

4. Hwang, Silva, Furie, et al. Comparative Sensitivity of Computed Tomography vs. Magnetic Resonance Imaging for Detecting Acute Posterior Fossa Infarct. J Emerg Med 2012; 42: 559–565. https://doi.org/10.1016/j.jemermed.2011.05.101

5. Johnston, Rothwell, Nguyen-Huynh, et al. Validation and refinement of scores to predict very early stroke risk after transient ischaemic attack. Lancet 2007; 369: 283–292. https://doi.org/10.1016/S0140-6736(07)60150-0

6. Kattah, Talkad, Wang, et al. HINTS to Diagnose Stroke in the Acute Vestibular Syndrome. Stroke 2009; 40: 3504–3510. https://doi.org/10.1161/STROKEAHA.109.551234

7. Vanni, Nazerian, Casati, et al. Can emergency physicians accurately and reliably assess acute vertigo in the emergency department? Emerg Med Australas 2015; 27: 126–131. https://doi.org/10.1111/1742-6723.12372

8. Kuroda, Nakada, Ojima, et al. The TriAGe+ Score for Vertigo or Dizziness: A Diagnostic Model for Stroke in the Emergency Department. J Stroke Cerebrovasc Dis 2017; 26: 1144–1153. https://doi.org/10.1016/j.jstrokecerebrovasdis.2017.01.009

9. Newman-Toker, Kerber, Hsieh, et al. HINTS Outperforms ABCD² to Screen for Stroke in Acute Continuous Vertigo and Dizziness. Acad Emerg Med 2013; 20: 986–996. https://doi.org/10.1111/acem.12223

10. Gerlier, Hoarau, Fels, et al. Differentiating central from peripheral causes of acute vertigo in an emergency setting with the HINTS, STANDING, and ABCD² tests: A diagnostic cohort study. Acad Emerg Med 2021; 28: 1368–1378. https://doi.org/10.1111/acem.14337

11. Toplu ACO, Aslan, Akoglu, et al. The role of the HINTS exam, TriAGe+ score, and ABCD² score in predicting stroke in acute vertigo patients in the ED. Am J Emerg Med 2025; 91: 110–117. https://doi.org/10.1016/j.ajem.2025.02.027

12. Newman-Toker, Curthoys, Halmagyi. Diagnosing Stroke in Acute Vertigo: The HINTS Family of Eye Movement Tests and the Future of the “Eye ECG”. Semin Neurol 2015; 35: 506–521. https://doi.org/10.1055/s-0035-1564298

13. Morrow, Koohi, Kaski. Hyperacute assessment of vertigo in suspected stroke. Front Stroke 2023; 2: 1267251. https://doi.org/10.3389/fstro.2023.1267251

14. Kerber, Meurer, Brown, et al. Stroke risk stratification in acute dizziness presentations. Neurology 2015; 85: 1869–1878. https://doi.org/10.1212/WNL.0000000000002141

15. Courand P-Y, Serraille, Grandjean, et al. Recurrent vertigo is a predictor of stroke in a large cohort of hypertensive patients. J Hypertens 2019; 37: 942–948.

16. Mármol-Szombathy, Domínguez-Durán, Calero-Ramos, et al. Identification of dizzy patients who will develop an acute cerebrovascular syndrome: a descriptive study among emergency department patients. Eur Arch Otorhinolaryngol 2018; 275: 1709–1713. https://doi.org/10.1007/s00405-018-4988-2

17. Kim, Bae, Kim, et al. Stroke prediction in patients presenting with isolated dizziness in the emergency department. Sci Rep 2021; 11: 6114. https://doi.org/10.1038/s41598-021-85725-1

18. Choi, Kim, Boo, et al. Risk of future stroke in patients with a diagnosis of peripheral vertigo in the emergency department. Eur J Neurol 2023; 30: 2062–2069. Epub ahead of print 2022. https://doi.org/10.1111/ene.15543

19. Tsai H-C, Hsieh C-Y, Sung S-F. Application of machine learning and natural language processing for predicting stroke-associated pneumonia. Front Public Health 2022; 10: 1009164. https://doi.org/10.3389/fpubh.2022.1009164

20. C-J, Yen Z-S, Tsai JC-H, et al. Validation of the Taiwan triage and acuity scale: a new computerised five-level triage system. Emerg Med J 2010; 28: 1026–1031. https://doi.org/10.1136/emj.2010.094185

21. Zhang, Liu, et al. Association of neutrophil-to-lymphocyte ratio with stroke morbidity and mortality: evidence from the NHANES 1999–2020. Front Med 2025; 12: 1570630. https://doi.org/10.3389/fmed.2025.1570630

22. Yang, Hutcheon. Identifying outliers and implausible values in growth trajectory data. Ann Epidemiol 2016; 26: 77–80.e2. https://doi.org/10.1016/j.annepidem.2015.10.002

23. Azur, Stuart, Frangakis, et al. Multiple imputation by chained equations: what is it and how does it work? Int J Methods Psychiatr Res 2011; 20: 40–49. https://doi.org/10.1002/mpr.329

24. Burns, Butterworth, Kiely, et al. Multiple imputation was an efficient method for harmonizing the Mini-Mental State Examination with missing item-level data. J Clin Epidemiol 2011; 64: 787–793. https://doi.org/10.1016/j.jclinepi.2010.10.011

25. Deng, Weng, et al. Feature selection for text classification: A review. Multimed Tools Appl 2018; 78: 3797–3816. https://doi.org/10.1007/s11042-018-6083-5

26. Weissman, Hubbard, Ungar, et al. Inclusion of Unstructured Clinical Text Improves Early Prediction of Death or Prolonged ICU Stay. Crit Care Med 2018; 46: 1125–1132. https://doi.org/10.1097/CCM.0000000000003148

27. Huang. Penalized feature selection and classification in bioinformatics. Brief Bioinform 2008; 9: 392–403. https://doi.org/10.1093/bib/bbn027

28. Devlin, Chang M-W, Lee, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, 4171–4186.

29. Alsentzer, Murphy, Boag, et al. Publicly Available Clinical BERT Embeddings. Proc 2nd Clin Nat Lang Process Work 2019; 72–78.

30. López, Fernández, Moreno-Torres, et al. Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics. Expert Syst Appl 2012; 39: 6585–6608. https://doi.org/10.1016/j.eswa.2011.12.043

31. Rokach. Ensemble-based classifiers. Artif Intell Rev 2010; 33: 1–39. https://doi.org/10.1007/s10462-009-9124-7

32. Doganer. Different Approaches to Reducing Bias in Classification of Medical Data by Ensemble Learning Methods. Int J Big Data Anal Healthc 2021; 6: 15–30. https://doi.org/10.4018/ijbdah.20210701.oa2

33. Lundberg, Erion, Chen, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell 2020; 2: 56–67. https://doi.org/10.1038/s42256-019-0138-9

34. Navi, Kamel, Shah, et al. Application of the ABCD² score to identify cerebrovascular causes of dizziness in the emergency department. Stroke 2012; 43: 1484–1489. https://doi.org/10.1161/STROKEAHA.111.646414

35. LaValley. Logistic regression. Circulation 2008; 117: 2395–2399. https://doi.org/10.1161/CIRCULATIONAHA.106.682658

36. DeLong, Clarke-Pearson. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988; 44: 837–845.

37. Pencina, D’Agostino, Steyerberg. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat Med 2011; 30: 11–21. https://doi.org/10.1002/sim.4085

38. Vickers, Elkin. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making 2006; 26: 565–574. https://doi.org/10.1177/0272989X06295361

39. Naderalvojoud, Hernandez-Boussard. Improving machine learning with ensemble learning on observational healthcare data. AMIA Annu Symp Proc 2023; 2023: 521–529.

40. Chowdhury, Tabassum, Shatabda, et al. An optimized data analytics pipeline for improving healthcare diagnosis using ensemble learning. Inform Med Unlocked 2025; 53: 101623. https://doi.org/10.1016/j.imu.2025.101623

41. Ahmadi S-A, Vivar, Navab, et al. Modern machine-learning can support diagnostic differentiation of central and peripheral acute vestibular disorders. J Neurol 2020; 267: 143–152. https://doi.org/10.1007/s00415-020-09931-z

42. Korda, Wimmer, Wyss, et al. Artificial intelligence for early stroke diagnosis in acute vestibular syndrome. Front Neurol 2022; 13: 919777. https://doi.org/10.3389/fneur.2022.919777

43. Wang, Sreerama, Nham, et al. Separation of stroke from vestibular neuritis using the video head impulse test: machine learning models versus expert clinicians. J Neurol 2025; 272: 248. https://doi.org/10.1007/s00415-025-12918-3

44. Abedi, Misra, Chaudhary, et al. Machine Learning-Based Prediction of Stroke in Emergency Departments. Ther Adv Neurol Disord 2024; 17: 17562864241239108. https://doi.org/10.1177/17562864241239108

45. Rosario, Pitarch-Corresa, Pedrosa, et al. Applications of Natural Language Processing for the Management of Stroke Disorders: Scoping Review. JMIR Med Inform 2023; 11: e48693. https://doi.org/10.2196/48693

46. Sung S-F, Lin C-Y, Y-H. EMR-Based Phenotyping of Ischemic Stroke Using Supervised Machine Learning and Text Mining Techniques. IEEE J Biomed Health Inform 2020; 24: 2922–2931. https://doi.org/10.1109/JBHI.2020.2976931

47. Sung, Chen, Pan, et al. Natural Language Processing Enhances Prediction of Functional Outcome After Acute Ischemic Stroke. J Am Heart Assoc 2021; 10: e023486. https://doi.org/10.1161/JAHA.121.023486

48. Lee H-J, Schwamm, Sansing, et al. StrokeClassifier: ischemic stroke etiology classification by ensemble consensus modeling using electronic health records. npj Digit Med 2024; 7: 130. https://doi.org/10.1038/s41746-024-01120-w

49. Spasic, Nenadic. Clinical Text Data in Machine Learning: Systematic Review. JMIR Med Inform 2020; 8: e17984. https://doi.org/10.2196/17984

50. Zhan, Humbert-Droz, Mukherjee, et al. Structuring clinical text with AI: Old versus new natural language processing techniques evaluated on eight common cardiovascular diseases. Patterns 2021; 2: 100289. https://doi.org/10.1016/j.patter.2021.100289

51. Verma, Goyal, Bansal, et al. Comparing the performance of various Encoder Models and Vectorization Techniques on Text Classification. In: 2023 14th Int Conf Comput Commun Netw Technol (ICCCNT), 2023, 1–7.

52. Amalia, Dalimonthe. Clinical significance of Platelet-to-White Blood Cell Ratio (PWR) and National Institute of Health Stroke Scale (NIHSS) in acute ischemic stroke. Heliyon 2020; 6: e05033. https://doi.org/10.1016/j.heliyon.2020.e05033

53. Wang, Pang, Huan-Yu, et al. Monocyte-to-lymphocyte ratio affects prognosis in LAA-type stroke patients. Heliyon 2022; 8: e10948. https://doi.org/10.1016/j.heliyon.2022.e10948

54. Moradi, Samwald. Explaining Black-Box Models for Biomedical Text Classification. IEEE J Biomed Health Inform 2020; 25: 3112–3120. https://doi.org/10.1109/JBHI.2021.3056748

55. Cummings, Kasner, Mullen, et al. Delays in the Identification and Assessment of in-Hospital Stroke Patients. J Stroke Cerebrovasc Dis 2022; 31: 106327. https://doi.org/10.1016/j.jstrokecerebrovasdis.2022.106327

56. Bakradze, Liberman. Diagnostic Error in Stroke—Reasons and Proposed Solutions. Curr Atheroscler Rep 2018; 20: 11. https://doi.org/10.1007/s11883-018-0712-3

57. Shah, Luby, Poole, et al. Screening with MRI for Accurate and Rapid Stroke Treatment. Neurology 2015; 84: 2438–2444. https://doi.org/10.1212/WNL.0000000000001678

58. Nael, Khan, Choudhary, et al. Six-Minute Magnetic Resonance Imaging Protocol for Evaluation of Acute Ischemic Stroke. Stroke 2014; 45: 1985–1991. https://doi.org/10.1161/strokeaha.114.005305

59. Emmanuel, Maupong, Mpoeleng, et al. A survey on missing data in machine learning. J Big Data 2021; 8: 140. https://doi.org/10.1186/s40537-021-00516-9

60. Johnson, Khoshgoftaar. Survey on deep learning with class imbalance. J Big Data 2019; 6: 27. https://doi.org/10.1186/s40537-019-0192-5