Introduction
Stroke is recognized as one of the leading causes of death worldwide. In addition to its fatal consequences, stroke is a major contributor to long-term disability in adults, placing a substantial burden on healthcare systems globally.1–3 More than half of stroke survivors experience unfavorable outcomes,4 and older patients, in particular, tend to suffer functional decline between 18 and 60 months post-stroke.5
Given these challenges, it is crucial to develop an early warning system capable of accurately predicting a patient’s functional recovery following a stroke. Accurate outcome prediction can help patients and their families prepare for necessary post-acute care, while also enabling healthcare policymakers to strategically plan staffing and allocate resources for the medium- and long-term care of stroke patients.6–8
Previous research on functional outcome prediction has utilized various types of data. These include structured data, such as demographics, stroke subtypes (e.g., total anterior circulation infarcts, partial anterior circulation infarcts), and the presence of cerebellar symptoms9–14; imaging data, such as X-rays and angiographic images15–17; and unstructured text data, such as clinical and radiology reports.18–23
With the increasing availability of clinical text data and the advancement of text mining technologies, clinical text classification has become a prominent research area.24–26 Many existing studies have employed traditional text mining techniques for feature extraction and representation, such as term frequency-inverse document frequency (TF-IDF), alongside traditional machine learning models like k-nearest neighbors (KNN) and support vector machines (SVM).18–22,27–29
Meanwhile, deep learning techniques have been widely adopted for various text classification tasks. Models such as bidirectional encoder representations from transformers (BERT) for feature representation and convolutional neural networks (CNN) for prediction have demonstrated superior performance over traditional approaches in general text classification settings.30,31 However, to date, there is a lack of studies evaluating the effectiveness of deep learning methods for predicting functional outcomes following stroke.
This study aims to fill this gap by comparing the performance of several well-established traditional machine learning and deep learning approaches for functional outcome prediction. In addition, we explore the impact of feature fusion—where multiple types of features from various sources are combined—on model performance. Prior work has shown that feature fusion often leads to improved classification accuracy compared to single-feature approaches.32–35
The contributions of this paper are threefold. First, we systematically evaluate the predictive performance of traditional and deep learning methods for functional outcome prediction after stroke. Second, we identify the best-performing model, which can serve as a guideline for medical institutions to implement early warning systems and support proactive care planning. This model also provides a strong baseline for future research and potential performance enhancement. Third, we investigate the impact of concatenating multiple types of textual features on prediction accuracy, providing insights into the benefits of multi-feature integration in clinical text mining applications.
Methods
The text mining procedure
The text mining procedure for functional outcome prediction is illustrated in Figure 1. The dataset comprises narrative clinical text documents along with other relevant medical records. The process begins with text preprocessing, which involves selecting appropriate clinical notes and assigning class labels based on the patients’ functional outcomes. Subsequently, feature representation and model construction are carried out using both traditional machine learning and deep learning approaches. Finally, the predictive performance of each model is evaluated to identify the most effective method for forecasting functional outcomes following a stroke event.

Figure 1. The text mining procedure for functional outcome prediction.
Data collection
The experimental dataset was collected from a local hospital in Taiwan and includes records of over 6000 patients who were hospitalized for ischemic stroke between 2006 and 2022.
Narrative clinical notes, specifically admission notes documenting patients’ clinical symptoms, were extracted from the hospital’s electronic medical records (EMR) database. Figure 2 presents an example of such a note. After excluding records with missing data, a total of 5191 text documents corresponding to 5191 patients were retained for analysis.

Figure 2. An example of the narrative clinical notes: “This 65-year-old man has past medical history of hypertension with regular medical control at LMD for 2–3 years. This time, he suffered from dizziness with gradually progressive L’t side limb weakness and an unsteady gait was noted this afternoon. He denied fever, nausea, vomiting, diarrhea or numbness. Therefore, he visited our emergency room for help. His laboratory data were within the normal range. A CT scan of the brain showed no active brain lesion. After initial treatment at the emergency room, under the impression of stroke and hypertension, he was admitted to the neurologic ward for further treatment.”
Basic information about the experimental datasets.
Textual feature extraction and representation
In this study, each clinical text document was processed using four distinct feature extraction methods, resulting in four types of feature representations. The methods employed were bag-of-words (BOW), term frequency-inverse document frequency (TF-IDF), embeddings from language models (ELMo), and bidirectional encoder representations from transformers (BERT). Prior to feature extraction, text preprocessing was conducted, including the removal of specific punctuation marks (e.g., % and $) and the expansion of contractions (e.g., “can’t” to “cannot” and “don’t” to “do not”).
BOW and TF-IDF
The bag-of-words (BOW) method represents a document by calculating the term frequency of each word in a predefined dictionary, that is, the number of times a given term appears within the document. These word frequencies are then used as features to represent the document. However, not all words contribute equally to a document’s meaning; some may appear frequently but carry little informative value. To address this limitation, the term frequency-inverse document frequency (TF-IDF) method is employed. TF-IDF adjusts term weights by considering how common or rare a word is across the entire corpus, thereby emphasizing more informative and discriminative terms.36
TF-IDF is based on the weighting

tfidf(t, d) = tf(t, d) × log(N / df(t)),

where tf(t, d) is the frequency of term t in document d, N is the total number of documents in the corpus, and df(t) is the number of documents that contain term t. Terms that occur in many documents thus receive low weights, while terms concentrated in few documents receive high weights.
For BOW and TF-IDF feature extraction, stop word removal and lemmatization were performed as part of the preprocessing steps. The Scikit-learn library in Python was used to implement both BOW and TF-IDF feature extraction. Following the recommendations of Dessi et al.,37 Lin et al.,38 and Sheikh et al.,39 300 terms were selected to represent each document, resulting in 300-dimensional feature vectors.
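As a minimal sketch of this step (assuming Scikit-learn’s vectorizers and toy documents; stop word removal is shown via the built-in English list, while lemmatization, e.g. with NLTK, is omitted for brevity), the 300-term cap translates directly into the max_features argument:

```python
# Minimal sketch of BOW and TF-IDF extraction with Scikit-learn.
# The toy documents and variable names are illustrative, not the study's data.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "patient suffered from dizziness and progressive limb weakness",
    "ct scan of the brain showed no active brain lesion",
]

# BOW: raw term counts over a vocabulary capped at the 300 most frequent terms.
bow = CountVectorizer(max_features=300, stop_words="english")
bow_features = bow.fit_transform(documents)        # shape: (n_docs, <=300)

# TF-IDF: same vocabulary cap, with counts re-weighted by inverse document frequency.
tfidf = TfidfVectorizer(max_features=300, stop_words="english")
tfidf_features = tfidf.fit_transform(documents)

print(bow_features.shape, tfidf_features.shape)
```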
ELMo and BERT
Contextualized text representation is a dynamic word embedding technique that differs from prediction-based methods like Word2Vec. It was introduced to address the issue of polysemy, where a single word can have multiple meanings depending on context, which traditional word embedding methods fail to capture. This approach leverages deep learning architectures such as bidirectional long short-term memory (BiLSTM) networks.40 Two representative models that utilize contextualized embeddings are Embeddings from Language Models (ELMo) and Bidirectional Encoder Representations from Transformers (BERT).
ELMo is based on a two-layer BiLSTM network that learns contextual information by processing input sequences in both forward and backward directions. The forward pass captures the current word along with its preceding context, while the backward pass captures the word along with its succeeding context. The final ELMo embedding is derived from a weighted sum of the internal states of the BiLSTM layers. In this study, ELMo features were extracted using pre-trained models from the AllenNLP library, producing 1024-dimensional feature vectors for each document.
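A hedged sketch of this extraction, assuming the AllenNLP 0.x ElmoEmbedder API and a simple average over layers and tokens to obtain one 1024-dimensional vector per document (the paper does not specify its pooling strategy):

```python
# Sketch of document-level ELMo features via AllenNLP's pre-trained ElmoEmbedder.
# The layer/token averaging below is an assumed pooling choice, not the authors'.
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # downloads the default pre-trained weights on first use

tokens = "he suffered from dizziness and limb weakness".split()
# embed_sentence returns a (3, num_tokens, 1024) array: the character-CNN layer
# plus the two BiLSTM layers described above.
layers = elmo.embed_sentence(tokens)

doc_vector = layers.mean(axis=(0, 1))  # one 1024-dimensional document vector
print(doc_vector.shape)                # (1024,)
```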
In contrast, BERT is built on the transformer architecture, which utilizes self-attention mechanisms rather than recurrent layers. Unlike traditional encoder-decoder models that struggle to retain long-range dependencies, BERT’s attention mechanism allows the model to capture relationships between all words in a sentence simultaneously. The bidirectional nature of BERT enables it to understand context from both preceding and succeeding words, making it highly effective for language modeling. In this architecture, the transformer encoder generates rich contextual embeddings by assigning attention weights that help the model focus on the most relevant parts of the input sequence.
For this study, BERT features were extracted using pre-trained models from Google’s TensorFlow library. Each document was represented by a 768-dimensional feature vector.
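The study extracts these features with Google’s TensorFlow tooling; as an assumed but functionally equivalent sketch, the Hugging Face transformers TF interface can produce the same 768-dimensional vectors, here by taking the final hidden state of the [CLS] token:

```python
# Hedged sketch of 768-dimensional BERT document features; the transformers
# library stands in for the authors' exact TensorFlow pipeline.
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")

text = "He suffered from dizziness with gradually progressive limb weakness."
inputs = tokenizer(text, return_tensors="tf", truncation=True, max_length=512)
outputs = bert(inputs)

# Final hidden state of the [CLS] token as the document representation.
doc_vector = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
print(doc_vector.shape)
```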
Prediction techniques
This study applies four prediction (or classification) techniques to develop functional outcome prediction models: k-nearest neighbor (KNN), support vector machine (SVM), convolutional neural network (CNN), and long short-term memory (LSTM). Specifically, KNN and SVM are implemented using the Scikit-learn library, while CNN and LSTM are developed using the TensorFlow framework. To evaluate model performance, a 5-fold cross-validation strategy is employed, in which each experimental dataset is split into 80% training and 20% testing subsets. The area under the receiver operating characteristic (ROC) curve (AUC) is used as the primary evaluation metric to assess the predictive performance of the models.
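A minimal sketch of this evaluation protocol, with synthetic data standing in for the clinical feature matrices: StratifiedKFold with five splits yields the 80%/20% train/test partitions described above, and each fold is scored by AUC.

```python
# Sketch of 5-fold cross-validation scored by AUC; make_classification stands
# in for the real (imbalanced) clinical feature matrix and outcome labels.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=300,
                           weights=[0.7, 0.3], random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="linear", C=1.0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```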
KNN
The k-nearest neighbor (KNN) classifier operates by measuring the distance between an unknown test instance and its k nearest neighbors in the training set to determine its class label. Typically, the Euclidean distance is used as the distance metric. In the simplest case where k = 1, the class label of the single nearest neighbor is assigned to the test instance. When k is greater than 1 (e.g., k = 5), the final classification is determined by a majority vote among the class labels of the k nearest neighbors.41 In this study, the KNN classifier is implemented using the default settings of the Scikit-learn library, with k set to 5.
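To make the decision rule concrete, a from-scratch sketch (purely didactic; the study itself uses Scikit-learn’s implementation) of the Euclidean-distance majority vote with k = 5:

```python
# Didactic KNN: Euclidean distance to every training point, then a majority
# vote among the k nearest neighbors.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=5):
    distances = np.linalg.norm(X_train - x_test, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                   # k closest training points
    votes = Counter(y_train[nearest])                     # class counts among them
    return votes.most_common(1)[0][0]                     # majority class label

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0], [1.1, 0.9]])
y_train = np.array([0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.0, 1.0]), k=5))  # -> 1
```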
SVM
A support vector machine (SVM) employs a kernel function to transform the original feature space of a two-class training dataset into a higher-dimensional space, where a separating hyperplane can be constructed to distinguish between the two classes. Training seeks the hyperplane that maximizes the margin between the classes, thereby improving the classifier’s ability to generalize. Commonly used kernel functions include the linear kernel, polynomial kernel, and radial basis function (RBF) kernel.42 In this study, the SVM classifier is implemented using the Scikit-learn library with a linear kernel and the default regularization parameter of C = 1.0.
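For reference, the kernels named above map onto Scikit-learn’s SVC as follows (toy data; only kernel and C are set, matching the configuration used here, with everything else left at library defaults):

```python
# Sketch of the SVC configurations corresponding to the kernels listed above.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(kernel, f"training accuracy = {clf.score(X, y):.3f}")
```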
CNN
A convolutional neural network (CNN) consists of an input layer, convolutional layers, pooling layers, a flattening layer, fully connected layers, and an output layer. Originally developed for computer vision tasks, CNNs utilize convolutional layers to extract local features and generate feature maps, while pooling layers reduce the dimensionality of the data, thereby lowering the number of training parameters and computational complexity. This process helps retain essential information while enabling the extraction of deeper hierarchical features.43
For text classification tasks, the input is typically represented as an n × d matrix, where n is the number of tokens in a document and d is the dimensionality of their embeddings. Convolutional filters slide over windows of adjacent tokens to extract local n-gram features, which pooling layers then condense before the fully connected layers produce the final class prediction.
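A hedged Keras sketch of such a text CNN; the sequence length (128 tokens) and embedding width (768, matching BERT) are assumed shapes rather than the study’s exact configuration:

```python
# Sketch of a 1D text CNN: convolution filters slide over windows of token
# embeddings (3-gram filters here), max pooling keeps the strongest response,
# and a sigmoid output predicts good vs. poor outcome.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(128, 768)),                              # n x d input matrix
    tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),  # local n-gram features
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
model.summary()
```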
LSTM
Long short-term memory (LSTM) networks44 are a specialized type of recurrent neural network (RNN) designed to retain and utilize information over long sequences. Unlike traditional RNNs, LSTMs are capable of preserving dependencies between past and current inputs, making them particularly well-suited for sequential data. In LSTMs, the output at each time step is influenced not only by the current input but also by the information retained from previous time steps, allowing the model to capture temporal relationships effectively.
LSTMs address the common issues of vanishing and exploding gradients in standard RNNs through a gated architecture composed of three primary components: the forget gate, input gate, and output gate, along with a memory cell. These gates regulate the flow of information, determining what to keep, update, or discard from the memory cell. Each gate is controlled by learnable parameters, and their activation is determined by weighted combinations of the input data. This mechanism enables the model to protect and manage its internal memory state dynamically throughout the learning process.
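A matching Keras sketch of the LSTM classifier, with the same assumed input shape as the CNN example and illustrative layer sizes; the gates and memory cell described above are encapsulated in the single LSTM layer:

```python
# Sketch of an LSTM classifier over a sequence of token embeddings; the
# forget/input/output gates and memory cell live inside tf.keras.layers.LSTM.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(128, 768)),  # sequence of token embeddings
    tf.keras.layers.LSTM(64),          # final hidden state summarizes the sequence
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```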
Results
Single text feature prediction models
AUC rates from different prediction models for different feature representations.
Figure 3 compares the AUC scores obtained from different prediction models at 30, 90, and 180 days following a stroke. Once again, the KNN and SVM models, when combined with BOW or TF-IDF features, consistently outperform models using BERT and ELMo representations. Moreover, these traditional classifiers and feature extraction methods demonstrate superior performance in predicting long-term functional outcomes. Among them, the combination of BOW features and the SVM classifier emerges as the most effective approach.

Figure 3. AUC rates from different prediction models for mRS scores at 30, 90, and 180 days after a stroke.
In contrast, among the deep learning methods, both CNN and LSTM perform better when using BERT for text feature representation compared to the other methods. However, regardless of the feature representation used, CNN and LSTM consistently underperform relative to the traditional classifiers KNN and SVM.
Prediction models obtained by concatenating multiple text features
AUC rates from different prediction models obtained by using different feature representation combinations.
Figure 4 compares the performance of various prediction models based on mRS scores at 30, 90, and 180 days post-stroke. For the KNN and SVM classifiers, the top three feature representation combinations are BOW + BERT, BOW + TF-IDF, and TF-IDF + BERT. Among these, BOW + BERT and BOW + TF-IDF achieve the best performance when used with the SVM classifier, with nearly identical results. However, the BOW + TF-IDF combination is recommended due to its lower feature dimensionality (600 features compared to 1068 for BOW + BERT), making it a more computationally efficient option. Additionally, this combination demonstrates superior performance for long-term functional outcome prediction.

Figure 4. AUC rates from different prediction models for mRS scores at 30, 90, and 180 days after a stroke, obtained with different feature representation combinations.
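The fusion itself is plain horizontal concatenation of per-document feature vectors; a toy sketch (random arrays standing in for real features) reproduces the dimensionalities cited above:

```python
# Feature fusion by concatenation; random arrays stand in for the real features.
import numpy as np

n_docs = 4
bow = np.random.rand(n_docs, 300)    # 300-dimensional BOW features
tfidf = np.random.rand(n_docs, 300)  # 300-dimensional TF-IDF features
bert = np.random.rand(n_docs, 768)   # 768-dimensional BERT features

bow_tfidf = np.hstack([bow, tfidf])  # (n_docs, 600)
bow_bert = np.hstack([bow, bert])    # (n_docs, 1068)
print(bow_tfidf.shape, bow_bert.shape)
```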
For the deep learning models, the TF-IDF + BERT feature representation yields the best performance when used with the CNN classifier. In contrast, the LSTM model shows only a slight improvement with TF-IDF + BERT compared to other feature combinations for short-term functional outcome prediction. Although the BOW + BERT combination enables LSTM to achieve high AUC scores for predicting the mRS score at 30 days post-stroke (mRS30), the long-term prediction performance remains comparable across the different feature representation combinations.
Discussion
Overall, the two experimental studies demonstrate that traditional text mining approaches, specifically feature representation methods such as BOW and TF-IDF combined with classification techniques like KNN and SVM, outperform deep learning-based approaches, that is, feature representations such as BERT and ELMo combined with classifiers such as CNN and LSTM, in the context of functional outcome prediction. Among the evaluated approaches, the best results are achieved using SVM in combination with BOW, BOW + TF-IDF, or BOW + BERT, all of which yield similarly high AUC scores.
Comparing prediction models based on individual and combined feature representations, the results indicate that a feature fusion strategy, that is, concatenating multiple types of text features, does not necessarily lead to improved prediction performance. Beyond AUC scores, it is therefore also critical to assess the impact of type I error in functional outcome prediction. Here, type I error refers to instances where the model incorrectly classifies patients with poor outcomes as having good outcomes. High type I error rates can mislead patients and their families in making follow-up care decisions and may negatively affect health policy planning, particularly in the allocation of medical personnel and resources for stroke rehabilitation.
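For concreteness, the sketch below computes this type I error rate from true and predicted labels; the encoding (1 = poor outcome, 0 = good outcome) is an assumption for illustration:

```python
# Type I error as defined above: among truly poor-outcome patients, the
# fraction predicted as having a good outcome.
import numpy as np

def type_i_error_rate(y_true, y_pred, poor=1, good=0):
    poor_mask = (y_true == poor)
    return np.mean(y_pred[poor_mask] == good)

y_true = np.array([1, 1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1])
print(type_i_error_rate(y_true, y_pred))  # 0.25: 1 of 4 poor outcomes misclassified
```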
Type I errors from SVM by BOW, BOW + TF-IDF, and BOW + BERT.
However, a slight reduction in type I error is observed when using the BOW + TF-IDF feature representation with SVM, compared to using BOW alone as the baseline. Although the performance difference is modest, it holds practical significance in real-world applications. For instance, among 1000 stroke patients with poor outcomes at discharge (as indicated by their mRS scores), the SVM model using BOW + TF-IDF correctly classifies approximately four more patients than the model using only BOW.
Therefore, despite the increased computational complexity associated with extracting both BOW and TF-IDF features (resulting in 600 feature dimensions) and training the corresponding SVM model, the use of BOW + TF-IDF is recommended for feature representation in developing SVM-based functional outcome prediction models.
While this study provides an empirical comparison of several well-known traditional machine learning and deep learning approaches to identify the most effective feature representation methods and classification techniques for functional outcome prediction, several limitations remain that warrant further investigation:

(1) Class imbalance: The experimental datasets used in this study exhibit class imbalance. Future research could apply data re-sampling or augmentation techniques to re-balance the training sets and evaluate whether these methods improve prediction performance.

(2) Alternative BERT variants: In addition to conventional BERT embeddings, domain-specific models such as ClinicalBERT could be explored to assess whether contextual embeddings tailored to medical texts yield better results.

(3) Multimodal learning: Future work could incorporate multimodal learning techniques by integrating structured data, imaging data, and clinical text. Such an approach may enhance prediction accuracy by leveraging complementary information across modalities.

(4) Missing data handling: This study excluded records with missing data from the original EMR database. Employing imputation techniques to recover missing values would increase the dataset size and may lead to improved model performance. It would be worthwhile to investigate whether models trained on imputed datasets outperform those developed using only complete cases.
Conclusion
This study focuses on the application of text mining techniques for predicting the functional outcomes of stroke patients, incorporating both traditional machine learning and deep learning approaches. Specifically, bag-of-words (BOW) and term frequency-inverse document frequency (TF-IDF) were used as traditional feature representation methods, while k-nearest neighbor (KNN) and support vector machine (SVM) served as the corresponding prediction models. For deep learning-based methods, embeddings from language models (ELMo) and bidirectional encoder representations from transformers (BERT) were used as pre-trained feature representations, with convolutional neural networks (CNN) and long short-term memory networks (LSTM) employed as the prediction models.
Experimental results based on narrative clinical notes of stroke patients collected from a Taiwanese hospital demonstrated that traditional machine learning methods significantly outperformed their deep learning counterparts. In particular, the best performance was achieved using BOW for feature representation and SVM for classification.
Additionally, this study explored the use of a feature fusion strategy by concatenating multiple types of features. However, the results revealed that feature fusion does not necessarily enhance model performance. Among the various feature combinations, the best performance with SVM was observed using BOW + TF-IDF and BOW + BERT, both of which performed comparably to SVM with BOW alone, without statistically significant differences in AUC scores.
However, further analysis of type I error, defined as the misclassification of patients with poor outcomes into the good outcome class, revealed that the lowest error rate was achieved using the BOW + TF-IDF combination with the SVM model.
Therefore, considering both high prediction accuracy and reduced type I error, the combination of BOW + TF-IDF for feature representation and SVM for classification is recommended for functional outcome prediction in stroke patients.
Footnotes
Ethics considerations
This study was approved by the Ditmanson Medical Foundation Chia-Yi Christian Hospital Institutional Review Board (CYCH-IRB No. 2022086). Patient identifiers were removed to ensure patient confidentiality and privacy. In addition, patient consent was not required, and patient data will not be shared with third parties.
Author contributions
Conceptualization: Yu-Hsiang Su and Chih-Fong Tsai; Methodology: Chih-Fong Tsai; Software: Chih-Fong Tsai; Supervision: Yu-Hsiang Su; Validation: Chih-Fong Tsai; Resources: Yu-Hsiang Su; Data curation: Yu-Hsiang Su; Writing – Original Draft: Yu-Hsiang Su and Chih-Fong Tsai; Writing – Review & Editing: Yu-Hsiang Su and Chih-Fong Tsai.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Ditmanson Medical Foundation Chia-Yi Christian Hospital (grant number R112-022-2).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
