Abstract
Introduction
Vital signs, including heart rate (HR), body temperature (T), respiration rate (RR), blood pressure (BP), and blood oxygen saturation (SPO2), are important parameters that reflect a patient's health condition and provide information critical to accurate assessment and diagnosis. 1 Monitoring these parameters regularly is essential, especially in Intensive Care Units (ICUs), where timely detection of deterioration enables prompt and proper intervention. Many Early Warning Systems (EWS) are available to identify patients at risk based on their monitored vital signs. Typically, these systems assign a score to each vital sign range, with a higher score indicating a greater level of deterioration. 2
EWSs are valuable for monitoring patient deterioration because of their simplicity and effectiveness in most medical settings. 3 The most common EWSs that incorporate vital signs are the National Early Warning Score (NEWS), the Modified Early Warning Score (MEWS), the Hamilton Early Warning Score (HEWS), the Rapid Acute Physiology Score (RAPS), the Simple Early Warning Score (SEWS), and the Rapid Emergency Medicine Score (REMS).4,5 Of the available EWSs, numerous research publications have shown that NEWS is the most accurate, with superior sensitivity and specificity in predicting deterioration in emergency departments and various other healthcare settings.6,7
While NEWS has been shown to be useful in healthcare monitoring, its implementation without automation or machine learning (ML) has significant disadvantages.8,9 These include, but are not limited to, a higher frequency of false alarms, increased human error, a limited ability to analyze large datasets quickly, and a slower response to emerging health situations. 10 ML can automate EWSs such as NEWS and improve their accuracy and responsiveness. ML models can learn complex patterns and trends in vital signs data and predict the NEWS score quickly and accurately, which leads to earlier and more precise classification of patient deterioration.
This study compares the performance of different neural network (NN) architectures in predicting NEWS scores from five key vital signs. The aim is to identify the most effective approach for integrating ML with NEWS and enhancing its predictive capabilities in terms of precision, speed, and potential for hardware deployment. Performance is evaluated in terms of accuracy, sensitivity, and complexity.
Literature review
The adequacy of EWSs, particularly NEWS, has encouraged researchers to pursue standardized patient monitoring across various clinical settings. The main goal has been to establish advanced techniques that overcome the limitations of NEWS, whose threshold-based decisions cause problems of sensitivity and specificity. Various ML approaches have been explored, including Fuzzy Logic (FL) and NN.
A fuzzy logic approach to designing and implementing an equivalent warning system for patient status classification was compared to MEWS. 11 The approach uses a large number of manually generated fuzzy rules (1800). Although the system is relatively complex to design and deploy and no evaluation metrics are reported, the authors claim that the results are acceptable.
A study based on fuzzy logic compared 16 different algorithms and verified their performance by simulation on 12 datasets. 12 The evaluation indicates that some fuzzy approaches tend to have a large rule base, while others are less accurate than needed. However, a classification system using fuzzy logic and gene expression programming (GPR) was found to have good classification performance with a small number of explainable rules.
Of the ML approaches applied to EWS, NNs have received the greatest attention because of their generalization capabilities and their power in dealing with complex datasets. A brief guide to deep learning in healthcare summarizes various applications such as computer vision, generalization, and reinforcement learning, and shows how these could benefit important medical applications such as disease risk and genetic traits. 13
The advantages and disadvantages of ML in predicting patients’ health deterioration in medical settings are discussed in a published systematic review. 14 The review examined 29 research papers and found that the various ML models had an area under the curve (AUC) ranging from 55% to 99%. Although many models are used to automate the assessment of deterioration risk, there is still a need for further improvement, especially in real-world situations and in areas related to patient clinical deterioration.
A scoping review reached the same conclusion: ML-based EWS models are promising but need further research before they can be applied successfully in clinical practice. 15 The article presented many ML models, including kernel-based, tree-based, and regression-based models, for deterioration risk prediction, with AUC ranging from 57% to 97%. A simple feedforward NN that predicts deterioration from five vital signs (HR, T, BP, RR, and SPO2) achieved 95% precision. 16
Many publications have investigated the time-series approach to vital signs monitoring using various ML algorithms. For example, a hybrid KNN-LS-SVM learning algorithm was used to predict future vital signs values with less than 5% mean absolute percentage error. 17 A Recurrent NN (RNN) with three layers achieved an AUROC of 87%. 18 Another time-series work using six statistical ML algorithms (Naive Bayes, Gradient Boosting, Decision Trees, Ensemble Methods, Logistic Regression, and Random Forest) achieved an AUC ranging from 84% to 96%. 19 Similar approaches have been applied to respiratory deterioration, where EWSs need improvement. 20 In predicting deterioration 24 h ahead, the algorithm achieved an AUROC of 94% with 70% accuracy. Other publications using similar ML algorithms (logistic regression, Naive Bayes classifier, decision trees, support vector classifier, K-Neighbors classifier, and gradient boosting classifier) achieved accuracy ranging from 84% to 89% and AUROC ranging from 68% to 94%.21–23 ML has also been applied to EWS for specific objectives such as predicting mortality and cardiac arrest risk, and has shown potential for fast identification of high-risk patients, with reasonable accuracy and the capability to reduce daily alarm rates by over 20%.24–27
The review of previous work indicates good potential for applying ML to EWS. Many of the evaluated approaches provide reasonable accuracy. However, the complexity of these approaches and their potential for deployment are not thoroughly discussed. In addition, advances in methodologies and tools now make higher precision at lower complexity more attainable, which this paper investigates.
Methodology
Data collection and processing
NEWS scoring system.
NEWS clinical risk level classification.
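The threshold-based scoring and risk classification named in the two captions above can be sketched in Python as follows. This is only an illustration under stated assumptions: the tables themselves are not reproduced in this text, so the bands below follow the commonly published NEWS chart for the five vital signs used here, BP is assumed to be systolic pressure, the supplemental oxygen and consciousness items of NEWS are omitted, and the function names are hypothetical. The risk bands follow the usual NEWS convention and may differ from the classification used in this study.

```python
INF = float("inf")

# (inclusive upper bound, score) pairs per vital sign.
# ASSUMPTION: taken from the commonly published NEWS chart, not from this paper.
NEWS_BANDS = {
    "rr":   [(8, 3), (11, 1), (20, 0), (24, 2), (INF, 3)],              # breaths/min
    "spo2": [(91, 3), (93, 2), (95, 1), (INF, 0)],                      # %
    "temp": [(35.0, 3), (36.0, 1), (38.0, 0), (39.0, 1), (INF, 2)],     # deg C
    "sbp":  [(90, 3), (100, 2), (110, 1), (219, 0), (INF, 3)],          # mmHg
    "hr":   [(40, 3), (50, 1), (90, 0), (110, 1), (130, 2), (INF, 3)],  # beats/min
}


def band_score(value, bands):
    """Return the score of the first band whose upper bound covers value."""
    for upper, score in bands:
        if value <= upper:
            return score
    raise ValueError(f"value {value!r} is not covered by any band")


def news_score(vitals):
    """Return (aggregate score, per-sign scores) for a dict of five vitals."""
    parts = {name: band_score(vitals[name], bands)
             for name, bands in NEWS_BANDS.items()}
    return sum(parts.values()), parts


def risk_level(total, parts):
    """Map scores to a risk level (assumed standard NEWS convention)."""
    if total >= 7:
        return "high"
    if total >= 5:
        return "medium"
    if any(score == 3 for score in parts.values()):
        return "low-medium"  # a single extreme value raises a low aggregate
    return "low"


patient = {"rr": 22, "spo2": 94, "temp": 38.5, "sbp": 105, "hr": 95}
total, parts = news_score(patient)
print(total, risk_level(total, parts))  # prints: 6 medium
```

A learned model has to reproduce exactly this kind of piecewise mapping from five continuous inputs to a small set of classes, which is why both the decision boundaries near the thresholds and the coverage of extreme values in the data matter.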
Data points are synthetically generated to represent various situations. Synthetic data is preferred for the evaluation because real data tends to represent normal and near-normal situations, with little or no representation of extreme clinical values. A set of 9000 cases is divided into training and testing sets (80% of the data for training and 20% for testing). The training and testing sets are well balanced, with all classes equally represented in both.
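A minimal sketch of this kind of data preparation is given below. The generator, value ranges, number of classes, and labelling rule used in the study are not specified in this text, so everything here is an assumption: cases are drawn uniformly from hypothetical ranges wide enough to include extreme values, a placeholder rule (not NEWS) assigns one of three classes, and rejection sampling keeps an equal number of cases per class before each class is split 80/20.

```python
import random
from collections import defaultdict

# ASSUMPTION: hypothetical sampling ranges that include extreme values.
RANGES = {"hr": (30, 180), "temp": (34.0, 41.0), "rr": (5, 35),
          "sbp": (70, 230), "spo2": (85, 100)}
# ASSUMPTION: hypothetical normal ranges, used only by the placeholder labeller.
NORMAL = {"hr": (51, 90), "temp": (36.1, 38.0), "rr": (12, 20),
          "sbp": (111, 219), "spo2": (96, 100)}


def sample_case(rng):
    """Draw one case uniformly over the full range of every vital sign."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in RANGES.items()}


def placeholder_label(case):
    """Toy 3-class label from the count of out-of-range vitals (not NEWS)."""
    abnormal = sum(not (NORMAL[k][0] <= v <= NORMAL[k][1])
                   for k, v in case.items())
    return min(max(abnormal - 2, 0), 2)  # 0-2 abnormal -> 0, 3 -> 1, 4-5 -> 2


def balanced_dataset(n_per_class, n_classes, label_fn, rng):
    """Rejection-sample until every class holds exactly n_per_class cases."""
    by_class = defaultdict(list)
    while any(len(by_class[c]) < n_per_class for c in range(n_classes)):
        case = sample_case(rng)
        label = label_fn(case)
        if len(by_class[label]) < n_per_class:
            by_class[label].append(case)
    return by_class


def stratified_split(by_class, train_frac, rng):
    """Split each class separately so both sets keep the class balance."""
    train, test = [], []
    for label, cases in by_class.items():
        cases = cases[:]  # copy so the caller's lists are not reordered
        rng.shuffle(cases)
        cut = round(train_frac * len(cases))
        train += [(c, label) for c in cases[:cut]]
        test += [(c, label) for c in cases[cut:]]
    return train, test


rng = random.Random(0)  # fixed seed for reproducibility
data = balanced_dataset(3000, 3, placeholder_label, rng)  # 9000 cases in total
train, test = stratified_split(data, 0.8, rng)
print(len(train), len(test))  # prints: 7200 1800
```

Splitting per class rather than over the pooled data is what guarantees that the 80/20 proportion holds inside every class, so the testing accuracy is not skewed by an over-represented class.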
NN architectures
List of NN Architectures evaluated in this study.
Evaluation metrics
All the architectures will be evaluated using the following metrics:
• Testing Accuracy (%), knowing that the data used is balanced.
• Total Testing Cost, assuming equal penalties for all misclassifications during testing.
• Prediction Speed in number of observations per second (obs/sec), representing the number of data samples processed by the model per second during testing.
• Model Size in kB or MB, useful for deployment study.
• Precision.
• Recall.
• F1 Score.
• Mean AUROC.
• Execution Time (using: i7-6500U, 2.50 GHz, 8.00 GB, 64-bit OS).
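As an illustration of how the classification metrics in this list relate to one another, the sketch below derives them from a multi-class confusion matrix. It rests on assumptions not stated in the text: precision, recall, and F1 score are macro-averaged over classes, and the total testing cost uses a unit penalty per misclassification, so it equals the number of misclassified cases. Mean AUROC needs class scores rather than hard predictions and is not covered here; the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def summarize(cm):
    """Accuracy, total cost, and macro precision/recall/F1 from a matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "accuracy": correct / total,
        "total_cost": total - correct,  # ASSUMPTION: unit penalty per error
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
print(cm)  # prints: [[2, 1, 0], [0, 3, 0], [1, 0, 2]]
print(summarize(cm))
```

With balanced classes, accuracy and macro-averaged recall coincide, which is one reason the balance of the testing set is stated alongside the accuracy metric.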
Results
After training and evaluating all 29 architectures, three performance groups are identified and labelled: low performers, average performers, and top performers.
Low-performing architectures
Testing Results for the low performing architectures.

Confusion Matrix for one of the low-performing architectures (Ensemble subspace KNN).

ROC plot for one of the low-performing architectures (Coarse Tree).
Average-performing architectures
Testing Results for the average performing architectures.

Confusion Matrix for one of the average-performing architectures (Logistic Regression Kernel).

ROC plot for one of the average-performing architectures (SVM Kernel).
Top-performing architectures
Testing Results for the top performing architectures.

Confusion Matrix for one of the top-performing architectures (Narrow NN).

ROC plot for one of the top-performing architectures (Linear Discriminant).
Complexity summary of the top performing architectures.
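The two complexity measures used in this summary, prediction speed in observations per second and model size, could be measured along the following lines. This is a sketch under assumptions, not the procedure used in the study: the model is a hypothetical linear classifier with arbitrary weights, its size is taken as the length of its Python pickle, and the numbers it prints depend on the machine and toolchain, so they are not comparable to the reported figures.

```python
import pickle
import random
import time


def predict(model, x):
    """Score each class with a linear function and return the best class."""
    scores = [sum(w * v for w, v in zip(row, x)) + b
              for row, b in zip(model["weights"], model["biases"])]
    return scores.index(max(scores))


def observations_per_second(model, samples):
    """Wall-clock throughput of predict() over a list of feature vectors."""
    start = time.perf_counter()
    for x in samples:
        predict(model, x)
    elapsed = time.perf_counter() - start
    return len(samples) / max(elapsed, 1e-12)  # guard against a zero interval


def model_size_kb(model):
    """Size of the serialized model in kB (pickle is an assumed format)."""
    return len(pickle.dumps(model)) / 1024


rng = random.Random(0)
# ASSUMPTION: 5 inputs and 3 classes with arbitrary weights, for timing only.
model = {
    "weights": [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(3)],
    "biases": [rng.uniform(-1, 1) for _ in range(3)],
}
samples = [[rng.random() for _ in range(5)] for _ in range(10_000)]

print(f"{observations_per_second(model, samples):.0f} obs/sec")
print(f"{model_size_kb(model):.2f} kB")
```

Timing a full pass over the test set, rather than a single prediction, averages out timer resolution and gives a figure directly comparable across models on the same hardware.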
Discussion
The comparative analysis of 29 neural network architectures for NEWS score prediction reveals a broad spectrum of performance. Most of the models, primarily those based on KNN, decision trees, and ensemble methods, showed insufficient accuracy and precision for reliable clinical applications, underscoring the critical importance of careful model selection in high-stakes medical contexts. While several models showed acceptable performance, only seven (Linear Discriminant, narrow and medium neural networks, three SVM configurations, and Efficient Logistic Regression) achieved perfect accuracy and precision. The superior performance and efficiency of Linear Discriminant and the simpler neural networks make them particularly promising candidates for real-time deployment in clinical settings.
However, there is still a crucial need for explainable AI (XAI) in healthcare. High accuracy is essential, but the ability to understand why a model generates a specific prediction is paramount for building clinician trust and ensuring responsible AI use. The need for transparency and interpretable results, despite the challenges, is evident. 29 In high-stakes medical applications like early warning systems, understanding the model’s reasoning is essential for ensuring accurate and prompt clinical intervention. Therefore, future research should prioritize the development of XAI methods, such as SHAP or LIME, to enhance the interpretability and trustworthiness of these models. In conclusion, this study identifies promising architectures for NEWS score prediction. However, further research is needed to fully realize the potential of machine learning in this critical area.
Limitations
Several limitations of this study warrant consideration. The use of synthetic data, while allowing for important insights, may not fully capture the complexity and variability inherent in real-world clinical data, potentially limiting the generalizability of the findings. The study also did not explicitly assess model performance under conditions of noisy or incomplete data, a common occurrence in clinical practice. Furthermore, the integration of these models into existing clinical workflows and their impact on human-computer interaction require further investigation before practical implementation. These factors need further research to improve model robustness and address practical challenges in clinical settings.
Conclusions
This comparative analysis reveals a significant disparity in performance among the 29 NN architectures investigated. While several architectures, notably those based on KNN, ensemble methods, and decision trees, showed inferior performance, a group of 11 showed acceptable accuracy, suggesting potential for real-world applications. However, only seven architectures achieved exceptional performance, reaching 100% accuracy and precision: Linear Discriminant, NN (narrow and medium), SVM (Linear, Quadratic, and Coarse Gaussian), and Efficient Logistic Regression. Among these top performers, Linear Discriminant, narrow NN, and medium NN stand out for their exceptional speed (over 120,000 observations per second), minimal model size (less than 10 kB), and real-time monitoring potential, making them ideal candidates for deployment in real-world clinical environments such as ICUs. This research underscores the critical need for comprehensive analysis and careful selection of NN architectures to improve the performance and feasibility of ML-based EWS in healthcare.
Footnotes
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
