Abstract
Stroke is the fifth leading cause of death in the United States and a major cause of severe disability worldwide. Yet, recognizing the signs of stroke in an acute setting is still challenging and leads to loss of opportunity to intervene, given the narrow therapeutic window. A decision support system using artificial intelligence (AI) and clinical data from electronic health records combined with patients’ presenting symptoms can be designed to support emergency department providers in stroke diagnosis and subsequently reduce the treatment delay. In this article, we present a practical framework to develop a decision support system using AI by reflecting on the various stages, which could eventually improve patient care and outcome. We also discuss the technical, operational, and ethical challenges of the process.
Keywords
Introduction
Stroke is the fifth leading cause of death in the United States and a significant cause of severe disability in adults. 1 Each year, around 800,000 Americans experience a new or recurrent stroke. 2 Rapid diagnosis and treatment of stroke is crucial and leads to improved outcomes and prognosis among patients treated within the ‘Golden Hour’.3,4
However, strokes, especially posterior circulation strokes, are associated with significant (>10%) diagnostic error. 5 The latter could be due to (1) some patients with acute stroke present with non-focal symptoms such as dizziness, diplopia, dysarthria, or ataxia, 6 which may not trigger a neurology consult or a need for a more detailed neurological examination; (2) stroke is commonly misdiagnosed in younger patients7,8; and (3) the emergency department (ED) is a challenging environment for providers, especially with the multiplicity of care protocols, and the dynamic nature of patient care.8,9 Triage, consultations, admissions, discharge, and other steps in emergency care are time-sensitive, complex, and always changing to further improve efficacy and quality of care. Therefore, identifying potential stroke symptoms can be challenging,10–12 especially when the providers are in training.13,14 Besides, the risk of misdiagnosis can be higher among walk-in patients, 15 when the providers do not receive a pre-arrival notification from emergency medical services, 16 or when a neurologist is not readily available for an urgent consultation.17–19 Scoring systems for the diagnosis of stroke and recurrent stroke do not have a high sensitivity to diagnose the posterior circulation stroke.20,21 Furthermore, these tools are also not automatic, and require that the physicians suspect stroke as a differential diagnosis to apply the scoring system.
Artificial intelligence (AI), a computational framework meaning to emulate human insight, is one of the most transformative technologies.22,23 The era of augmented intelligence in healthcare is driven by the notion that intelligent algorithms can support providers in diagnosis, treatment, and outcome prediction, especially with growing digital and connected patient data and advances in computational abilities.24–26 The augmented-diagnostic model for stroke may be particularly helpful in low volume or non-stroke centers’ ED, where emergency providers have limited daily exposure to stroke. An automated, computer-assisted screening tool that can be seamlessly integrated into clinical workflow to quickly analyze patient symptoms and clinical data and suggest a diagnosis of stroke (‘StrokeAlert’ pop-up) in an ED setting could be valuable. Such a system will also help bring access and timely diagnosis for patients who choose to self-present to an ED. In this paper, we present a practical framework and summarize the stages needed to create a machine learning (ML)-enabled clinical decision support system for the screening of stroke patients in ED using data from electronic health records (EHRs) combined with the patient’s presenting symptoms at the point of care. We have assembled a team of experts and are leading such effort at Geisinger. Figure 1 summarizes the key steps of such a system.

Key steps for a stroke ML-enabled decision support system for EDs.
Building the training and testing cohorts
ML-enabled clinical decision support systems, with providers-in-the-loop, are essential advances in ensuring that decisions at the bedside are timely and data-driven. However, transparent reporting of prediction models is key to building confidence and improving reproducibility regardless of the study design or findings. 27
Case/control design
The initial phase is to create representative examples for model training. The inclusion and exclusion criteria should be restrictive enough to ensure that cases and controls have clear separation and are aligned with clinical pipelines. For instance, the case (confirmed stroke and transient ischemic event) cohort should have at least these conditions: (1) patient encounters should be of a minimum duration to ensure the severity of the condition and to remove noises from repetitive coding (e.g. >24 h); (2) discharge codes should be focused on primary diagnosis; (3) patient should have had confirmatory neuroimaging [e.g., brain magnetic resonance imaging (MRI)]. The same level of detail should be given to the design of inclusion and exclusion criteria to create a labeled dataset for the control group(s). The goal of creating control groups is to capture stroke mimics, stroke misdiagnoses, while capturing some level of diversity by including patients with similar general presentations as stroke, including, for instance, patients with a discharge diagnosis of migraine headache, seizure, and peripheral neuropathy.
Data extraction/processing
Maintaining data integrity during data aggregation is key. The past medical history should be defined carefully and not include new information from the index encounter. Quality control methods should (1) ensure variables have the same units and correct if needed; (2) remove clinically implausible values according to expert knowledge and aligned with the available literature; (3) apply filters to capture the relevant data elements within the desired timeframe; (4) identify and remove extreme outliers for longitudinal data and where multiple values are available; (5) use median as opposed to mean if multiple values are available; and (6) use imputation techniques apply to the training and testing dataset separately. If a laboratory value is missing, an estimated value should be generated to be used in the model. Nonetheless, in an operational stroke prediction model, laboratory values might not be essential, since, in real life, the diagnosis of stroke is made before any laboratory results are available. However, for model building, the results might be important to confirm other diagnoses. Some imputation methods include using median or mean values, K-nearest neighbors, multivariate imputation by chained equation (MICE), 28 or imputation designed for EHR. 29 Extraction and processing of clinical notes, especially triage/ED provider notes with information about the patient’s symptoms, are critical for a technology designed for acute conditions. During natural language processing (NLP), a set of positive and negative keywords such as dizziness, vertigo, headache, confusion, etc., are extracted from a subset of cases and controls, are used to build a custom dictionary for the NLP. These keywords are expanded to medical concepts using the Unified Medical Language System (UMLS) dictionary, 30 allowing for efficient translation and interoperability. The NLP pipeline will generate added insights, such as the polarity of the words and context.
Designing the ML-enabled diagnostic tool
The ML development process will consist of various iterative steps for training, testing, and predicting the probability of patients presenting to ED with stroke.
Exploratory data analysis will help modelers understand feature distribution, multicollinearity among features, missing data, data quality. Features that are irrelevant or partially relevant (such as procedure codes that are no longer in active use), or highly sparse can be removed or merged. Feature selection will help reduce overfitting, training time, and improve the model accuracy. Feature engineering, also an important step, can be used to construct robust high-level feature representations from complex and high-dimensional concepts (e.g. diagnoses, medications). During the model development, typically, an 80–20 split is performed to test the model performance; that is 20% of the data is marked as unseen and is used for model testing, while 80% is used for model development. Furthermore, as the number of cases is likely to be significantly less than the number of controls, it is crucial to address the class imbalance. Standard techniques to address class imbalance include up-sampling (by use of SMOTE algorithm, etc.) the minority class and/or down-sampling the majority class. 31
Training predictive models are done by using the training dataset to identify the best performing candidate model. Models like logistic regression, decision trees, random forest, autoencoders, and neural networks can be used in the training process. Typically, an interpretable framework such as logistic regression is used for benchmarking. Finally, nested K-fold cross-validation applied to 80% of the data should be performed for hyperparameter tuning and making modeling framework choices – to avoid the underfitting and overfitting. Various model metrics such as the area under the receiver operating characteristics curve, sensitivity, specificity, F-score, positive predictive value (PPV), negative predictive value (NPV), and missed classification rates can be used to identify the final candidate model for implementation and testing, prospectively. These metrics should be used for model selection only after careful evaluation to better understand the clinical needs in actual care settings. In the case of stroke, the cost of misdiagnosis is asymmetric, meaning that underdiagnosis (labeling true stroke patients as non-stroke) might have a higher consequence than overdiagnosis. Therefore, the system should enjoy a high sensitivity and NPV while keeping the specificity in a reasonable range. Finally, depending on the model used, knowing the most influential driving features helping the stroke prediction for each patient would be helpful and can provide insights into the prediction and ultimately assist the physician in decision making.
Workflow and system implementation
In a typical ED setting, a patient arrives at the hospital, either by ambulance or by a private vehicle. Although our proposed ML-enabled clinical decision support system is designed to work for all patients regardless of the arrival mode, we believe such a system will be more beneficial among patients self-presenting with milder and atypical symptoms. These patients meet a nurse at the point-of-care desk where they are asked preliminary questions while their vitals are recorded. The patients’ symptoms and any pertinent information are entered in the EHR and can be available to the NLP pipeline for modeling. The patient is then sent back to the waiting area. The ED providers rely on the clinical presentation and vitals to prioritize the patients and performing the physical examination (which may not include neurological examination), ordering labs, and, in some cases, imaging procedures.
The average time (in the United States) in EDs before admission is approximately 5.5 hours, and time until the patient is sent home is approximately 3 hours. 32 The waiting time varies depending on the season and the physicians’ workload. The imaging procedures ordered in ED also vary as per hospital protocol. Furthermore, if only brain computed tomography (CT) is ordered, it is still likely that the ischemic stroke is missed for a patient with atypical symptoms, as the head CT cannot reveal a hyperacute stroke in the majority of cases, and it has reduced sensitivity for lacunar strokes. 33 Despite the rapid increase in the use of advanced neuroimaging, it may be challenging to reduce misdiagnosis of stroke as the use of urgent MRI to diagnose stroke in ED is still limited. 34 Nevertheless, in reality, a provider should consider the possible stroke diagnosis to order additional neuroimaging.
Patients’ notes at the point-of-care contain vital information that, if combined with the patients’ profile and medical history, can be ingested by an intelligent system to identify at-risk patients. Implementation of such a system must be seamless without affecting the ED workflow. If a patient has a relatively significant chance of stroke, a ‘stroke alert’ will be generated when the patient’s chart is viewed by the next ED provider. The provider will then have a chance to act upon the alert, and, if needed, call stroke-alert, request an urgent neurological consult, or order confirmatory imaging.
System adoption and evaluation
Promoting adoption
A methodical plan for promoting adoption should be part of a thoughtful implementation. In general, physicians have relatively positive attitudes toward the idea of decision support systems.35,36 However, many challenges including low specificity,37,38 work-flow interruption,39–41 computer literacy and confusing interface,42,43 low confidence in the evidence, 44 awareness of the information, 45 requirement for a lot of data,41,46 interference with physician autonomy,35,47 or lack of relevance, 41 limit the effective use and adoption of such systems in many healthcare systems. “Alert fatigue” can be caused by poorly designed and implemented clinical decision support system.35,48–50 Too many alerts will discourage adoption.
Improving coordination and capacity building
To identify barriers and facilitators in this process, it is essential to design a model based on the unified theory of acceptance and use of technology.50,51 A proposed model for user acceptance and adaptation should include the following stakeholders in the planning stages: (1) end-users (care providers); (2) hospital and service-line leaders; (3) EHR engineers, innovation team; and (4) next-providers (inpatient/outpatient neurologists/hospitalists). The discussion objectives should focus on understanding the current workflow; how the intervention(s) may impact the people and processes; what the care providers’ perceptions are; and how to achieve their buy-in while communicating the current gaps, clinical goals, and how the proposed system can help.
System evaluation
A mixture of targeted chart-review, systematic evaluation, and assessment of trends and rate of misidentification is important. The process should be designed to be agile and iterative. Prospective evaluation is critical and cannot be a one-time process. As long as a decision support system remains in operation, the model outcome should be re-assessed at regular intervals, and integration of variables with most and least promising relevance can be reassessed. Fine-tuning should be subtle and iterative to ensure smooth and clear improvements in both effectiveness and performance. Presenting the summary findings to the stakeholders to demonstrate transparency, gaps, areas of misclassifications, and the overall added value is important.
Challenges and opportunities
Technical challenges – tool or model-dependencies
Selection of the right mix of tools, techniques, and languages for productionalizing
Specific tools/languages (e.g., Spark) might be conducive to handling large data volumes; however, they might not have mature data science libraries such as the ones offered in Python to develop stable models. The proper selection of tools will cause downstream technical challenges, especially if the designed pipeline is not agnostic to the implementation language.
Model drifting
Models deteriorate in terms of their predictive power and clinical utility if not continuously adjusted. In healthcare, addressing model drift takes on a larger dimension since changing trends in population health manifests itself as both concept drift and data drift. Continuous domain adaptation is an active field of research to address such challenges. 52 Healthcare systems that are in stable regions with low drop-out rates (such as Geisinger Health System) are better equipped to incorporate continual learning in training the model for improved clinical utility.
Model generalizability
Developing generalizable models require a comprehensive and multi-level view of the patients with an added effort to ensure adequate patient representation to reduce algorithmic bias. The utilization of data from two or more centers will be important to develop generalizable models. Techniques such as transfer learning can be designed to evaluate model generalizability and transferability across health systems. 53 These solutions are critical for the development of models that can function well in smaller systems, to drive technological advances for the mainstream, in rural and urban areas alike.
Operational challenges
Implementation of an ML model to provide predictions and recommendations in real-time in EHR requires specialized programming expertise and purchase of specialized products from the vendor, which could be prohibitive for smaller healthcare systems with limited resources, especially since, for a real clinical utility, there will always be a need for maintenance with added financial burden. Operational challenges also entail usability and adoption. A less discussed challenge is the need for continuous learning. Feedback loop process creation can be a tool to facilitate creating an automated process to generate data used for continual learning of the model to improve performance; the latter needs commitments from users to add needed information into the system. Adding additional steps to the already busy schedules of the users is a challenge; however, the development of tools with clinical-experts-in-the-loop from the initial phase could provide opportunities for better adoption.
Ethical challenges
Defining how an “ethical” AI system should perform in this context is somewhat subjective and encompasses our experience in the field and perception about how the AI software operates. Overall, the generic goal would be to ensure fairness, efficiency, efficacy, and patient safety. In an ideal scenario, a system would benefit all identified groups equally; however, in practice, such a goal is often impossible, and it will be necessary to define an acceptable bias. Rigorous regulatory institutions, such as the United States Food and Drug Administration (FDA), are starting to guide the development and maintenance of AI systems to ensure compliance and best practices for data-driven models. Nevertheless, the highest standard of AI-driven triage system for stroke will require to go a step further, and can be achieved only by a joint effort between care providers and modelers; the former will be able to define the needs, current limitations, and acceptable biases regarding stroke diagnostic, while the latter will provide insights about objective functions, computational fairness constraints, and estimation of projected model performance. A possible solution is to instantiate a Clinical-AI Review Board within institutions.
Our roadmap and framework targets ED; emergency medical services and telemedicine could also be a viable target for similar systems. The ML-enabled prediction model can seamlessly triage patients in real-time and alert the provider and the care team.
Footnotes
Acknowledgements
The authors would like to thank Sparkle Russell-Puleri for insightful discussion.
Author contributions
VA and RZ conceived of the presented idea. VA, AK, RZ wrote the manuscript. DPC, DMi, VeA, DMa, KAM, CMS, CK, NC, XL, SF, JL provided critical feedback and contributed to different sections of the manuscript. All authors reviewed and approved the final version of the manuscript.
Conflict of interest statement
The authors declare no competing interests. NC and XL are Genentech/Roche paid employees. Genentech/Roche had no role in study design, data collection, and interpretation, or the decision to submit the work for publication.
Funding
This work was sponsored in part by funds from the Geisinger Health Plan Quality Fund and National Institute of Health R56HL116832 (sub-award) to VA and RZ. The funders had no role in study design, data collection, and interpretation, or the decision to submit the work for publication.
