Abstract
Background:
Several early warning scores have been designed to optimize acute care by identifying patients at risk of deterioration.
Methods:
In this post hoc dual center study, we analyzed the performance of six clinical scores (the Goodacre score, Groarke score, Worthing Physiological Score, Rapid Acute Physiology Score, Rapid Emergency Medicine Score, and United Kingdom National Early Warning Score). The primary outcome was 30-day all-cause mortality after inclusion. Data were obtained from previous studies performed at two emergency departments on two continents (Denmark, Europe, and Hong Kong, Asia).
Results:
We included 2952 people; 1482 (50.2%) were male, mean age (standard deviation) was 65.7 (18.3) years, and 109 (3.7%) died within 30 days. Mortality rate increased steadily with increasing scores for all six scoring systems in Hong Kong while this was less obvious in Denmark. In all patients, Rapid Acute Physiology Score had the lowest discriminatory power while National Early Warning Score had the highest. National Early Warning Score performed best in Hong Kong while Worthing performed marginally better in Denmark.
Discussion:
Surprisingly, the performance of the scoring systems varied considerably but was largely unaffected by location, and none of them performed close to what clinicians would normally require for predicting 30-day all-cause mortality.
Conclusion:
All scores performed similarly across both centers, with poor prediction of 30-day all-cause mortality. Based on these findings, we believe that clinical scores must be supplemented by either biochemical values or global markers of physiological reserve to reflect reality and to be of true value.
Introduction
There is an abundance of early warning scores 1,2 designed to identify patients at risk of deterioration so that acute care can be optimized. 1 A substantial number of scores have undergone local validation, and many have undergone international scrutiny. 2 Some scores have demonstrated utility while others are less than optimal. 2
While most scores contain identical physiological variables, the assignments of their weightings vary. While one early warning score assigns a score of 1 for a systolic blood pressure of 70 mmHg, another will opt for a score of 3, affecting the performance of the score. 3
Our group has recently published a study from Denmark validating six clinical scores. 4 Apart from being a single-center study, it was also affected by missing data. To address these shortcomings, we designed this post hoc study to analyze already collected data from two very different emergency departments (EDs) on two different continents and to test anew the performance of the six clinical scores (the Goodacre score, 5 Groarke score, 6 Worthing Physiological Score, 7 Rapid Acute Physiology Score (RAPS), 8 Rapid Emergency Medicine Score (REMS), 9 and the United Kingdom National Early Warning Score (NEWS) 1 ).
Methods
We performed a post hoc dual center analysis of already collected data from two EDs, one in Denmark and one in Hong Kong, to examine the performance of six early warning scores in common clinical use. Both cohorts were collected with another aim in mind (ClinicalTrials.gov Identifier: NCT03108807 and NCT02817581).
Settings
The Danish cohort was collected from the xxx, a 400-bed regional teaching hospital where all patients except women in labor and children are admitted through the ED. The cohort from Hong Kong was collected at the ED of the xxx, a 1500-bed tertiary university hospital with an open access ED. The Danish department sees approximately 40 patients per day, while the Hong Kong ED deals with approximately 400 patients per day.
At both centers, this analysis included all adult patients (age ⩾ 18 years) who, at their first visit in the study period, were neither in the highest triage level nor had obviously minor problems (i.e. the lowest triage level). Patients were not necessarily hospitalized after inclusion and some were discharged directly from the ED.
Scores
We included six scores 1,5–9 that were mostly similar but had subtle differences (Table 1). All scores were based on physiological parameters, but the weights varied.
Physiological parameters included in each of the scores.
NEWS: National Early Warning Score; RAPS: Rapid Acute Physiology Score; REMS: Rapid Emergency Medicine Score.
Outcome
Our primary outcome was 30-day all-cause mortality after inclusion. In Denmark, follow-up was conducted through the Danish Civil Personal Register, 10 while follow-up in Hong Kong was conducted via telephone calls and the local electronic healthcare system (Clinical Management System). Blinding was not performed.
Sample size
As this is a post hoc analysis, a sample size calculation has not been performed.
Statistical analysis
Data are presented descriptively as mean (standard deviation (SD)) or number (proportion) as appropriate. The discriminatory power, that is, the ability to discriminate between patients who do and do not meet the endpoint, is presented as area under the receiver operating characteristics curve (AUROC). Calibration, that is, the precision, was calculated according to Seymour et al. 11 Analyses were performed using Stata 15 (Stata Corp, College Station, TX, USA).
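To illustrate the discrimination measure, the AUROC can be read as the probability that a randomly chosen patient who died has a higher score than a randomly chosen survivor, with ties counted as one half. The following is a minimal sketch of that definition, not the Stata analysis used in this study; the function name and the toy data are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def auroc(scores: Sequence[float], died: Sequence[bool]) -> float:
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    Compares every non-survivor with every survivor and counts how often
    the non-survivor has the higher score; ties contribute one half.
    """
    non_survivors = [s for s, d in zip(scores, died) if d]
    survivors = [s for s, d in zip(scores, died) if not d]
    if not non_survivors or not survivors:
        raise ValueError("need at least one death and one survivor")

    concordant = 0.0
    for p in non_survivors:
        for n in survivors:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(non_survivors) * len(survivors))


# Hypothetical toy cohort (NOT study data): six patients, two deaths.
toy_scores = [0, 1, 2, 3, 5, 7]
toy_died = [False, False, True, False, False, True]
toy_auroc = auroc(toy_scores, toy_died)  # 6 of 8 pairs concordant
```

An AUROC of 0.5 corresponds to a score that separates non-survivors from survivors no better than chance, and 1.0 to perfect separation. The pairwise loop is quadratic in cohort size, which is acceptable for a sketch but slower than the rank-based routines in statistical packages.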
Results
We included 2952 people; 1482 (50.2%) were male and the mean age (SD) was 65.7 (18.3) years (Table 2). Within 30 days, 109 (3.7%) died, the majority in Hong Kong (Table 2).
Baseline characteristics of the participants.
SD: standard deviation; NEWS: National Early Warning Score; RAPS: Rapid Acute Physiology Score; REMS: Rapid Emergency Medicine Score.
The mortality rate increased steadily with increases in scores for all six scores in Hong Kong (Figure 1(a)), while this was less obvious in Denmark (Figure 1(b)).

Bar graphs of 30-day mortality rate with increasing scores for patients from (a) Hong Kong, (b) Denmark, and (c) all. The green color designates survivors while the red color shows fatalities.
The discriminatory power varied between the six scores. In all patients, RAPS had the lowest AUROC of 0.547, while NEWS had the highest at 0.701 (Table 3 and Figure 2). NEWS also had the best discriminatory power in Hong Kong, while Worthing performed marginally better in Denmark. All scores had acceptable calibration in both settings (Table 3).

Discrimination plots for the six clinical scores, used in (a) Hong Kong, (b) Denmark, and (c) all.
Discriminatory power and calibration of the six scores.
NEWS: National Early Warning Score; RAPS: Rapid Acute Physiology Score; REMS: Rapid Emergency Medicine Score.
Discussion
In this post hoc study from two EDs on two continents, we found that the performance of six early warning scores in common clinical use varied considerably, but not due to the location. While there were minor differences between the sites, all scores had very similar performances across both sites.
While all scores had a similar predictive power in an undifferentiated ED population, none of them had a performance close to what clinicians would normally require for predicting 30-day all-cause mortality. 2 Some of this is explained by the fact that the scores were developed for predicting short-term mortality, but we believe, as we have previously argued, 12 that scores based purely on vital signs are likely unable to stand on their own. They need to be supplemented by either biochemical values, such as D-dimer, 2 or global markers of physiological reserve, such as mobility, 13,14 to improve their performance and support clinical decision-making in an emergency setting.
The components of the six scores varied, but the main difference was in the weights assigned to the abnormal values. REMS 9 has the most components (seven), whereas the Goodacre score 5 has three and RAPS 8 has four. The scores with the most components generally performed better. NEWS 1 and Worthing 7 performed well at both sites and we believe the difference in their predictive power to be due to the weights. Interestingly though, the Goodacre score, 5 with only three components, also performed well. This could indicate that identifying patients at risk depends less on strict measurement of all vital signs than on discerning the key predictive variables.
The Goodacre score 5 contains few traditional vital signs and is similar to components of the Emergency Severity Index (ESI) triage system. 15 It only comprises age, consciousness, and oxygen saturation. Chronological age does give an indication of chronic health status and the potential co-morbidity burden while decreased consciousness can be considered a universal marker for a severe deterioration in health. 16
It is somewhat surprising that all six scores performed equally in both settings. The healthcare systems in Denmark and Hong Kong are different and the populations attending the ED are difficult to compare. While Denmark has a very strong primary care system, Hong Kong has a limited primary care system. In Denmark, patients are either seen by a general practitioner before coming to the ED, or they attend after having contacted the EMS because of a suspected acute, life-threatening situation. In Hong Kong, anyone can visit the ED and register to be seen by an ED doctor. The reason for this surprising result is probably a difference in the populations included in the cohorts. In Denmark, we actively sought inclusion of all patients arriving at the ED, while in Hong Kong, we actively sought inclusion of intermediate risk patients and thus excluded a lower risk population. Moreover, our group has previously shown that patients in Denmark and Uganda with near normal vital signs have comparable outcomes. 17
Our study has some limitations. While we tried to keep rates of missing data as low as possible, we still had up to 6% missing values. In addition, none of the scores were designed to predict clinical outcomes 30 days into the future. Any prediction system will become increasingly inaccurate as it tries to reach further into the future. As the vital signs were mostly collected by clinical staff and then copied by the research staff, we cannot rule out inaccurate measurements or errors in data entry.
Conclusion
In this post hoc study from two EDs in Denmark and Hong Kong, we found that six early warning scores in common clinical use varied in performance as measured using discriminatory power and calibration. All scores performed equally across both sites. The best scores were NEWS and the Worthing score, while RAPS had the poorest performance. Surprisingly, the Goodacre score, with only three components (age, consciousness, and oxygen saturation), performed almost as well as the best score.
Footnotes
Acknowledgements
The authors wish to thank Dr John Kellett for his invaluable help with writing the manuscript.
Author contributions
M.B., L.Y.L., K.K.C.H., C.A.G., and C.H.N. conceived the study. M.B., C.A.G., T.C., and C.H.N. were involved in protocol development. R.S.L.L. wrote the initial article. L.E.L., R.S.L.L., L.Y.L., K.K.C.H., S.L., S.P., and T.C. collected and analyzed the data. All authors approved the final version of the article and edited and reviewed the article.
Availability of data and materials
The Hong Kong data set is available from the corresponding author. The Danish data set cannot be shared due to Danish law.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Informed consent
Written consent was obtained either from the patient or from a relative in all cases.
Ethical approval
The Danish part of the study was approved by Danish Regional Committee of Health Research Ethics (Identifier: S-20170005) and the Danish Data Protection Agency (Identifier: Region Syddanmark 2452). The data from the Hong Kong cohort were obtained from a prospective study that was approved by the Institutional Review Board of The Joint Chinese University of Hong Kong—New Territories East Cluster Clinical Research Ethics Committee (CRE-2016.236).
