Sage Journals: Discover world-class research

Abstract

Objective

To compare prediction accuracy of rule-based early warning tools (EWTs) using a large healthcare electronic medical record (EMR) dataset and to re-evaluate using a novel hospital workload capacity evaluation method.

Materials and methods

Adult inpatient admissions to 11 Australian hospitals were included in a retrospective analysis of four EWTs: National Early Warning Score (NEWS), Between the Flags (BTF), Modified Early Warning Score (MEWS) and Queensland Adult Deterioration Detection Systems (Q-ADDS). Using death and unplanned transfer to the intensive care unit (UICU) as composite outcome, each EWT was evaluated with area under the receiver operating curve (AUROC), sensitivity and positive predictive value (PPV). A second analysis was performed with clinician workload capacity indicators.

Results

A total of 683,617 admissions were analysed, including 4954 deaths and 3400 UICU. NEWS2 AUROC was superior to Q-ADDS (1.6%, p < .001), MEWS (3.1%, p < .001) and BTF (28%, p < .001). At each alert threshold, Q-ADDS had superior PPV. Q-ADDS and MEWS operated at the lowest alert burden (1.0–3.8 alerts per 100 patient days) across all alert thresholds [low, moderate and Medical Emergency Team (MET)], followed by NEWS2 (1.9–5.5) and BTF (4.1–18).

Conclusion

Precision-recall workload capacity analysis provides a visual means of displaying the operational characteristics of EWTs in terms of EWT alert thresholds, resultant alert rates and traditional EWT accuracy (PPV and sensitivity). It may be helpful for healthcare organisations to consider clinician workload capacity, in addition to traditional evaluation metrics such as sensitivity and PPV, when selecting EWTs or setting escalation thresholds.

Keywords

Clinical deterioration prediction early warning tool health informatics digital health

Introduction

In-hospital patient deterioration can result in death or serious morbidity and affects 0.1–4.0% hospital admissions.¹ Over the past two decades, healthcare organisations in many countries have adopted vital sign rule-based early warning tools (EWTs), the most common being the National Early Warning Score (NEWS),² Between the Flags (BTF),³ the Modified Early Warning Score (MEWS)⁴ and Adult Deterioration Detection Systems (ADDS).⁵ These tools use vital sign trigger thresholds, derived from expert consensus as indicators of severity of illness, to trigger appropriately skilled interventions to the right patient at the right time. In many organisations these tools use paper observation charts and sit within broader early warning ‘systems’ (EWS) that include severity dependent, tiered escalation pathways, which may involve Medical Emergency Team (MET) review. As digital capability has grown, most EWTs have become embedded into electronic medical records (EMRs).

Despite their common goal, obtaining reliable comparisons of the impact of these different EWTs and EWSs (Appendix A in the supplementary materials) on patient outcomes has proved challenging. However, multiple comparisons of the predictive accuracy of the underlying EWTs are available, with more recent studies leveraging the large healthcare datasets now available.^6,7 Validation studies, along with clinical experience, show rule-based EWTs as having low predictive accuracy. Despite this, EWTs have been implemented at scale, with several, although not all, studies showing favourable clinical impact in reducing cardiorespiratory arrests and unplanned transfers to intensive care units (ICUs).⁸ In contrast, AI early warning tools (AI-based EWTs) have greater predictive accuracy, but less implementation history, with mixed evidence of their clinical efficacy and on a small scale.^1,7

AI-based approaches use many more variables than vital signs, including laboratory and demographic variables, but require sophisticated, real-time data pipelines to transform data from EMR systems to generate deterioration predictions. Hospitals often lack the funding, infrastructure and expertise to implement such tools.⁹ Additionally, AI-based EWTs are often considered ‘black-boxes’ by clinicians who do not understand how their outputs are derived, leading to lack of trust.¹⁰ They also attract privacy and bias concerns that regulation and governance struggle to keep pace with.^11,12 For these reasons, rule-based EWTs are likely to remain the dominant method for identifying deteriorating patients in the short to medium term.

Comparative studies of existing EWTs help decide whether and how to consider new solutions for identifying deteriorating patients. While data-based EWT evaluations do not consider variations in the EWS efferent limb (clinician and patient intervention response), validation on large datasets can provide information on the predictive accuracy of EWTs and the clinician workload associated with different EWT trigger thresholds. The predictive accuracy of rule-based EWTs is traditionally evaluated using the area under the receiver operating curve (AUROC) with secondary comparisons of sensitivity, specificity and positive predictive value (PPV) at each alert threshold.^2,5 However, AUROC can be misleading when applied to highly imbalanced datasets,¹³ such as those pertaining to critical patient deterioration which usually affects less than 1% of hospitalised patients.¹¹ Tabular comparisons of sensitivity, specificity and PPV at each alert threshold are also difficult to compare directly. For example, comparing the value of an EWT at a MET level threshold with 70% sensitivity and a PPV of 0.4 with another that is 65% sensitive with a PPV of 0.45 is problematic, because it lacks explicit valuations of the consequences of misclassification. A metric is needed that enables EWTs, rule-based or AI, to be compared directly and can demonstrate which EWT and trigger thresholds align best with both patient safety needs and workload capacity of the hospital.

Considerable evidence suggests that keeping clinician workloads generated by EWT alerts to a manageable level are crucial to the effectiveness of EWTs and, in turn, of EWSs. As a result, some data-capable organisations have adjusted EWT trigger thresholds according to predicted alert frequencies regarded as manageable given their capacity to respond.¹ In one study of 19 AI-based EWT implementation studies, 16 selected alert trigger thresholds based on manageable numbers of alerts, including false alerts that generate unnecessary clinical workload or alert fatigue. Of the six studies that quantified this variable, the alert rates were set between 3–12 alerts per 100 patient days. As patient safety concerns and clinician workload capacity heavily influence AI-based EWT alert thresholds, we developed a new method for comparing EWTs that considers both these determinants.

The objectives of this study were to:

Compare the prediction accuracy of digitally operational, rule-based EWTs with traditional evaluation methods using a large EMR retrospective dataset; and

Re-evaluate the same EWTs using a novel hospital workload capacity evaluation method.

Methods

Study cohort and data collection

Vital sign and outcome data from all adult ward inpatient admissions was collected between January 1, 2016 and June 30, 2020 at 11 digital hospitals within Queensland Health, a state-wide public healthcare provider in Australia. Sites included four city hospitals and seven regional hospitals ranging in size from 245 to 1074 and 25 to 928 beds respectively. During this period, a state-wide Cerner Millenium EMR was being rolled out to eight of the eleven sites, with the other three sites completing their implementation prior to the study start date. The first 1000 admissions were excluded from each of these new EMR sites to avoid any invalid data entries during EMR implementation. Other exclusions comprised admissions lacking at least one complete vital set, patients < 18 years of age, inter-hospital transfers within the same admission and non-inpatient admissions. Entire admissions were divided into episodes of care within a single location [ward, emergency department (ED), ICU and operating theatres (OT) or post-anaesthetic care units (PACU)]. Episodes of care within ED, ICU and OT/PACU were excluded as this analysis focussed on EWT performance in the general ward environment rather than acute care areas with higher monitoring intensity and nurse-to-patient ratios. For this research, the Townsville Hospital and Health Service Human Research Ethics Committee (HREC/QTHS/67897) granted a waiver of patient consent with Public Health Act approval.

EWTs

The latest, standardised release of the Queensland Adult Deterioration Detection System (Q-ADDS), the National Early Warning Score-2 (NEWS2), the Australian Capital Territory MEWS and the BTF protocol were evaluated. Other than MEWS, which was included as an in-use Australian EWT, Q-ADDS, BTF and NEWS2 are evidenced by many comparative studies.^2,5,14,15 The different tiers of alert thresholds and escalations required are shown in Table 1. EWT definitions and live environment variances are provided in Appendix A in the supplementary materials. Early warning scores were used throughout all hospitals for the entire study duration within the dataset, with Q-ADDS used at nine sites and BTF used at two sites.

Table 1.

EWT scores at each alert threshold level showing the escalation expectation at each level.

Alert threshold	Escalation required	Q-ADDS	NEWS2	BTF	MEWS
Low risk	Junior ward doctor review/ increase observations	4–5	5–6	Yellow alert	4–5
Low risk	Junior ward doctor review/ increase observations	4–5	5–6+^a	Yellow alert	4–5
Moderate risk	Senior ward doctor review/increase observations	6–7			6–7
MET level	Acute or Critical care team review	≥8 or a single vital sign in the purple zone	≥7	Red alert	≥ 8 or a single vital sign in the purple zone

Note. BTF entries are the applicable alert colour, as it is not a score-based system. Most have an additional tier for cardiac arrest response.

NEWS2-plus alert including any single vital sign triggering a score of 3 on its own.

Requirements for heart rate (HR), respiratory rate (RR), systolic blood pressure (SBP), temperature and oxygen saturation (SpO₂) were common across the EWTs. Patient oxygen use or quantity was required by NEWS2 and Q-ADDS respectively, and was captured through two variables in the EMR: either (i) fraction of inspired oxygen as a % (FiO₂); or (ii) oxygen flow rate (O₂ L/min). Appendix B in the supplementary materials provides conversion methods for calculating the equivalent NEWS2 and Q-ADDS oxygen flow scores.

Vital sign sets

Data capture for HR, RR, SBP, temperature and SpO₂ was fairly consistent across all sites, with level of consciousness and oxygen requirement more often missing, the former captured using either AVPU (alert, verbal, pain and unresponsive) or Glasgow Coma Scores (GCS), depending on the site. GCS scores were converted to AVPU scores and sedation score (utilised by MEWS) to AVPU according to the table in Appendix C in the supplementary materials. There are variations in how these EWTs function (Appendix A in the supplementary materials), which impacts data collection. In particular, single trigger tools (BTF) alert on any single vital sign threshold breach, whereas aggregate scoring tools (Q-ADDS, MEWS, NEWS2) primarily escalate on a cumulation of vital sign scores. This influences vital sign entry behaviour of the ward staff, with simultaneous full vital set entry essential for generating the aggregate score versus single or partial vital sign entry being adequate for the BTF single trigger tools. For aggregate scoring sites (9 of 11 sites using Q-ADDS), a minimum vital sign set was defined as that needed to generate a score; HR, RR, SBP, temperature, AVPU/GCS, oxygen requirement and SpO₂, with data carried forward up to 5 min to ensure the core vital signs were freshly inputted. The 5-min carry forward was required to provide a full set of vital signs because vital signs were captured individually and typically entered across several minutes. For single trigger sites (2 of 11 sites using BTF), HR, RR, SBP, temperature, and SpO₂ were required for the minimum vital set (with the 5-min carry forward), whereas AVPU/GCS and oxygen requirement data were carried forward up to 24 h, and, if absent, were given a value of ‘normal’ (AVPU = A, O₂ L/min = 0). Consistent with prior studies, this reasonably assumed nursing staff would not record these values for patients with normal level of consciousness and not receiving oxygen.⁵

Outcome measures

The primary outcome was deaths or unplanned transfers to ICU (UICU), in-line with previous studies,^14,16 and obtained from the EMR. Each episode of care could have one adverse outcome which ended the episode when it occurred, be that death or UICU, whichever came first. Otherwise, the episodes were labelled with ‘NO_OUTCOME’, meaning the patient was discharged alive or converted to a surgical care episode (defined as transfer to OT/PACU) and constituting a non-adverse outcome. Therefore, a single admission could have one or more episodes, each with either an adverse or non-adverse outcome, as used in prior work.^17,18

Precision-recall workload capacity calculations

As a novel approach, precision-recall workload capacity lines were developed using the following method. For the dataset, the total number of death or UICU outcome episodes (called α) was calculated and the total number of patient days (i.e., length of stay) across all episodes was divided by 100, to equal the total number of ‘100 patient days’ (called β). A single workload capacity line was then calculated for a given alert rate equivalent to N alerts per 100 patient days. For example, a hospital with a low workload capacity might be able to support 2 alerts per 100 patient days (i.e., N = 2) whereas a high workload capacity hospital might support 12 alerts per 100 patient days (i.e., N = 12). By setting sensitivity (S) to a range between 0 and 1, incremented by 0.01, the PPV was then calculated across the sensitivity range using:

PPV = \frac{S \cdot α}{β \cdot N} Number of outcomes / Number of alerts

For example, for a dataset of episode lengths summing to 1000 patient days (β = 1000/100 = 10) and an outcome prevalence of 1% (α = 0.01 × 1000 = 10), to build the low workload capacity line, set N = 2 alerts per 100 patient days. At 100% sensitivity, all patients with death/UICU are identified (S x α = 10) and at 2 alerts per 100 patient days, we have a total of 2 × 1000/100 = 20 alerts (β x N = 20). Using the above formula, PPV is calculated as 0.50 (10/20). This is repeated across a range of sensitivity values for each workload capacity line according to the value of N. See full worked example for our dataset in Appendix D in the supplementary materials.

Performance analyses

Patient characteristics are reported including medians and inter-quartile range where suitable. Two EWT performance analyses were performed. First, in line with prior work,⁵ AUROC, sensitivity, PPV and specificity were calculated on a vital sign set basis, where all sets within 24 h of death or UICU were considered to have that outcome event and all other vital sets did not. Sub-analyses of AUROC, AUPRC and F1-score were performed for the individual outcomes of death and UICU and the composite outcome. Further, AUROC was compared using DeLong's method, implemented in R.^19,20 Other than AUROC comparison, all analyses were carried out using Python and the scikit-learn library.

Second, the area under the precision recall curve (AUPRC) was evaluated as the primary metric for each EWT and plotted on the PR curve together with three alert rate lines, reflecting three workload capacity scenarios: (i) A low workload capacity able to support two alerts per 100 patient days, such as where a MET has to cover the whole hospital; (ii) A medium workload capacity able to support six alerts per 100 patient days, such as a senior ward doctor responsible for 20–30 patients; and (iii) A high workload capacity able to support 12 alerts per 100 patient days, such as ward nursing and doctor team. These workload capacity settings were based on real-world implementation settings, as previously discussed.

Results

Study cohort

The original dataset comprised 1,537,270 admissions of which 853,653 were excluded because they occurred in the initial EMR roll-out period (n = 11,745), involved non-adult patients (n = 141,216), had administrative inter-hospital transfer errors (n = 66), or had no complete vital sign sets (n = 700,626). This left 683,617 admissions for analysis, comprising 750,381 care episodes involving 316,667 patients and 7,432,121 complete vital sign sets, or one vital set on average every 5.3 h. Patient and dataset characteristics are summarised in Table 2 (and by site in Appendix E in the supplementary materials).

Table 2.

Patient and dataset characteristics.

	Patients (N = 316,667)	Admissions (N = 683,617)	Episodes (N = 750,381)
Median age at admission (IQR)		60 (43,73)
Male (%)	155,082 (49.0)
Female (%)	161,585 (51.0)
Median length of stay in hrs (IQR)		31.8 (6,96)	17.6 (2,66)
Outcomes: n (%)
In-hospital death		4954 (0.72)	4954 (0.66)
Transfer to ICU		3400 (0.50)	3717 (0.50)
ICU transfers per admission (% of transfers)
1 Transfer		3167 (93)
2 TO 4 Transfers		227 (6.8)
> 4 transfers		6 (0.2)
Number of vital sign sets			10,753,736

Early warning tool scores: median (inter-quartile range, maximum)	Episodes with ICU transfer outcome	Episodes with death outcome	All episodes
NEWS2	3 (4,18)	3 (5,18)	1 (2,19)
MEWS	1 (3, 34)	2 (4,36)	0 (1,40)
Q-ADDS	1 (3, 32)	2 (3, 38)	0 (1,40)
BTF – median only			0

ICU = intensive care unit, IQR = interquartile range.

Traditional comparison of EWTs

The EWT discrimination accuracy, as measured by AUROC, for the composite outcome and each individual outcome of death and UICU occurring within 24 h, is provided in Table 3; as are the differences between the tools for the composite outcome using DeLong's test. For the composite outcome, AUROC was highest for NEWS2 (0.8275) compared with Q-ADDS (0.8144, p = .001), MEWS (0.8027, p < .001) and BTF red category (0.6348, p = .001). Of the similar hybrid EWTs (single trigger + aggregated score), Q-ADDS performed significantly better than MEWS (1.5% difference, p < .001). BTF red was the lowest performing EWT by at least 20.9% (p < .001). The order of EWT performance was the similar for the individual outcomes of death and UICU with all EWTs more able to identify patients with UICU than those who died.

Table 3.

Evaluation results for early warning tools.

AUROC	MEWS	BTF-red	NEWS2	Q-ADDS
Death	0.7716	0.6161	0.7928	0.7861
Transfer to ICU	0.8476	0.6605	0.8779	0.8555
Composite^a	0.8027	0.6348	0.8275	0.8144
Delta (composite AUROC outcome)
From MEWS to		−20.9%	3.1%	1.5%
From BTF-red to			30.3%	28.3%
From NEWS2 to				−1.6%
AUPRC
Death	0.0246	0.0085	0.0315	0.0292
Transfer to ICU	0.0288	0.0097	0.0404	0.0369
Composite	0.0465	0.0168	0.0602	0.0559
F1-score
Death	0.066	0.044	0.071	0.076
Transfer to ICU	0.057	0.044	0.072	0.076
Composite	0.100	0.077	0.119	0.120

Note. The highest metric value for each outcome category are in bold.

All AUROC differences are statistically significant with p < .001 using DeLong test.

The performance of each EWT for the composite outcome at each alert threshold is shown in Table 4. At all thresholds, the hybrid EWTs, Q-ADDS and MEWS, operated at similar levels of specificity (within 0.3%), although the difference in PPV between the two progressively increased from 1.0% at the low risk level to 20% at the MET level. Q-ADDS had the highest PPV amongst all EWTs at all alert thresholds. On a by-alert-threshold basis (see Table 1): (1) At the low risk threshold , tool sensitivity ranged from 0.456 (MEWS) to 0.696 (BTF yellow), specificity from 0.896 (BTF yellow) to 0.954 (MEWS) and PPV from 0.014 (BTF yellow) to 0.047 (MEWS, Q-ADDS). Adding individual vital sign alerts (NEWS2-plus) improved sensitivity over NEWS2 by 9.4% (0.563 to 0.616), however the PPV dropped by 22% (0.037 to 0.029); (2) At the moderate risk threshold, Q-ADDS compared to MEWS had superior sensitivity (+9.6%) and PPV (+11%) for the same specificity; and (3) At the MET level threshold, NEWS2 and BTF red had markedly higher sensitivity (0.377, 0.302 respectively) when compared to Q-ADDS (0.221) and MEWS (0.205).

Table 4.

Traditional comparison table of tools at each threshold point for the composite outcome. It shows the number and percentage of admissions and individual alerts meeting each threshold, as well as the tool sensitivity, specificity and positive predictive value.

Tools	Score	Admissions with alerts above threshold (%)N = 683,617	Vital sign sets above threshold (%)N = 10,753,736	Sensitivity	Specificity	Positive predictive value (PPV)
Low risk threshold
BTF	Yellow	324,896 (48%)	2,718,484 (25.3%)	0 . 696	0.749	0.014
MEWS	≥ 4	83,700 (12%)	521,411 (4.8%)	0.456	0.954	0.047
Q-ADDS	≥ 4	91,276 (13%)	549,929 (5.1%)	0.485	0.951	0.047
NEWS2	≥ 5	94,773 (14%)	806,041 (7.5%)	0.563	0.927	0.037
NEWS2-plus	≥ 5	143,710 (21%)	1,149,253 (10.7%)	0.616	0.896	0.029
Moderate risk threshold
MEWS	≥ 6	44,679 (6,5%)	221,394 (2.1%)	0.288	0.981	0.069
Q-ADDS	≥ 6	39,974 (5.8%)	218,435 (2.0%)	0.319	0.981	0.078
MET level threshold
BTF	Red	70,281 (10%)	364,323 (3.4%)	0.302	0.967	0.044
MEWS	≥ 8	37,662 (5.5%)	165,932 (1.5%)	0.205	0.986	0.067
Q-ADDS	≥ 8	29,334 (4.3%)	143,588 (1.3%)	0.221	0.988	0.082
NEWS2	≥ 7	37,867 (5.5%)	285,287 (2.7%)	0.377	0.975	0.070

Note. The highest value of sensitivity, specificity and positive predictive values at each alert threshold are in bold font.

Comparison of EWTs using precision-recall workload capacity/alert rate graphs

Precision-recall (PR) curves plot precision (PPV) against sensitivity (recall), to provide a pictorial impression of EWT accuracy (see Figure 1) with a curve moving into the top right corner capturing maximum outcomes for minimum false alerts. Plot [a] reveals the PR curve for NEWS2 is always to the right and above all other EWT curves, indicating superior sensitivity and PPV at all possible threshold levels. Subplots [b], [c] and [d] focus in on each alert threshold: MET level, moderate risk (Q-ADDS and MEWS only) and low risk respectively.

Figure 1.

Precision-recall plots showing the precision recall curves (in blue shades) for MEWS (square marker), NEWS2 (star marker) and Q-ADDS (circle marker). Also shown are the two PR points for BTF red and yellow tools (inverted triangle markers) and single PR point for NEWS2-plus (cross marker), as these are binary rather than score-based. Please note that plots B, C and D are sub-plots of A and therefore scaled differently to focus in on the MET, moderate and low risk points on the EWT series. Graph [a] shows the overall PR chart depicting the PR curves for each EWT and the three workload capacity lines (black dashed) for low (short dash), medium (medium dash) and high (long dash) workload capacity, representing 2, 6 and 12 alerts per 100 patient days respectively. [VC1] [VC2] Subplots [b], [c] and [d] focus on each alert threshold: MET level, moderate risk (Q-ADDS and MEWS only) and low risk respectively. The markers on the EWT curves represent the tool's range of possible threshold scores, e.g., for Q-ADDS, the score range is from 0 to 48, whereas for NEWS2 it is from 0 to 20.

In Subplot [b], the MET level score thresholds are highlighted in red for each EWT. All thresholds, except BTF red, generate between 1.0 (Q-ADDS) and 2.2 (NEWS2) alerts per 100 patient days, i.e., close to the low workload capacity line. The deviation from the curve seen for the hybrid tools (Q-ADDS and MEWS) represents the single (vital sign) trigger components of those tools. While this reduces PPV, the single trigger components confer safety netting within the clinical environment by escalating to a critical care review, even when a complete vital sign set required to generate a score is not entered. For hospitals considering a rule-based EWT, Q-ADDS and NEWS2 have superior accuracy (PPV and sensitivity) to MEWS. However, each utilises different operating characteristics (i.e., PPV, sensitivity and alert frequency) for MET level escalation: Q-ADDS has nearly half the alert rate of NEWS2 (0.98 v 1.9 alerts per 100 patient days) but with 41% less sensitivity. The plot also suggests MEWS and Q-ADDS could have very similar MET alert operating characteristics to NEWS2, if the MET alert thresholds were shifted to a score of 5, although incurring a higher alert rate.

Subplot [c] identifies the moderate risk alert thresholds (Q-ADDS and MEWS only) with Q-ADDS having greater sensitivity to MEWS and conferring higher PPV, but at a slightly higher alert frequency (1.5 vs. 1.2 alerts per 100 patient days).

Subplot [d] identifies low-risk alert thresholds for all EWTs, with MEWS and Q-ADDS generating 3.6 and 3.8 alerts per 100 patient days respectively. NEWS2-plus (denoted by a green X) alerts at a score threshold of five as well as for any single vital sign reaching a score of 3, manifesting as higher sensitivity of NEWS2-plus compared to NEWS2 (0.616 vs. 0.523), but at a 40% increase in alert rate (7.7 vs. 5.5 alerts per 100 patient days). Similarly, BTF-yellow sensitivity was highest at 0.696, but generated 18 alerts per 100 patient days, or 5 times more alerts than MEWS. Any of the remaining three EWTs could be reconfigured to different operating thresholds to yield better accuracy than BTF-yellow at a similar alert rate.

Discussion

This evaluation of current digitally implemented rule-based EWTs using a large Australian EMR dataset has yielded similar predictive accuracy results to prior studies.^3,5,15 NEWS2 performed better than the other tools in terms of AUROC and AUPRC, but Q-ADDS had the highest PPV across all alert trigger thresholds. Unsurprisingly, the two hybrid (single trigger + aggregated scoring) EWTs (Q-ADDS, MEWS), tracked together, both with the lowest alert rates, with Q-ADDS being slightly superior. All three aggregated scoring tools performed better than the single trigger tool, at each point of clinical escalation. These results, however, must be interpreted in the context of the variance in their digital application within real clinical environments, such as alert suppression methods to reduce alert fatigue and unnecessary escalations (Appendix A in the supplementary materials).

The precision-recall workload capacity graphs provide a visual means of displaying the operational characteristics of EWTs in terms of alert thresholds, resultant alert rates and traditional accuracy measures (PPV and sensitivity), with two key findings. Firstly, ranking of EWTs in terms of alert rate was very similar across every level of alert threshold (MET, moderate, low): Q-ADDS and MEWS always had the lowest alert rate, then NEWS2 and finally BTF. For all these tools the lower alert rates are a trade-off for sensitivity. Some hospitals may allow alert mitigation methods (e.g., patient-specific threshold adjustments, or system alert suppressions) to reduce alert fatigue and unnecessary escalations, rendering a higher overall alert level system more manageable. Other hospitals may have generous staffing ratios and/or ample dedicated MET responders (high workforce capacity), while some have the opposite. Some hospitals may have multiple tiers of escalation, sharing the alert workload burden and improving overall EWS efficiency. What constitutes the ‘best efferent limb’ approach remains debatable and subject to ongoing research.

Secondly, although the PR curves (and AUROCs) were very similar between NEWS2 and Q-ADDS, their applied operating characteristics (PPV, sensitivity and alert burden) at each alert threshold (MET level and low risk) were quite different. This likely reflects evolution over time in clinical need, evidence base of EWS effectiveness, international expert recommendations and governmental mandates. Q-ADDS and NEWS thresholds have been revised over time after demonstrating that lower alert rates could be achieved without any increased patient risk. BTF, Q-ADDS and MEWS operate primarily in Australian public hospitals, whereas NEWS2 operates across the NHS in the UK, with objectives and resource considerations differing across each country.

The novel precision-recall workload capacity graphs provide a needed extension to the standard EWT metrics of AUROC, AUPRC, F1 and threshold levels of sensitivity and PPV. For example, if a hospital was seeking to upgrade their rule-based EWT to an AI-based tool with a better AUROC, the workload capacity graphs would provide a means of understanding whether the clinician workload would change from their existing EWT and what threshold they would need to set the new EWT in order to maintain the same or lower workload levels to meet their current resourcing situation.

This study has several strengths. Firstly, it is the first analysis of a large Australian public hospital network EMR dataset, where clinical workflows align with the EWTs being compared. Secondly, this is the first study to focus specifically on digitally implemented rule-based EWTs where comparisons of performance reflect their current digital functionality, and to describe key sensing and escalation differences. Thirdly, this novel PR workload capacity method provides a means for comparing EWTs, including alert rates and potential workload, at different escalation thresholds. For example, NEWS2 is used across Wales, but is configured to trigger low-risk alerts at a score of 6 and MET-level alerts at a score of 9. Figure 2 shows how, in both cases, this change decreases the sensitivity of NEWS2 to similar levels to Q-ADDS and MEWS whilst maintaining a similar or even lower alert burden. Finally, the proposed methodology is generalisable to any hospital with EMR-captured vital signs and would be similarly useful for any hospital, irrespective of whether they employ a lean or rich staffing model, as it provides a means for understanding the impact of changing the EWT threshold or model to a sensitivity, PPV and alert burden to suit any hospital's local requirements.

Figure 2.

Provides the same two precision-recall charts as that shown in Figure 1b and d, except now showing the impact of Wales shifting their NEWS alert thresholds. Subplot [a] demonstrates the shift from the regular NEWS2 MET alert threshold of 7–9. This reduces the sensitivity in line with Q-ADDDS and MEWS, but at a reduced alert workload capacity. Subplot[b] demonstrates the shift from the regular NEWS2 low risk alert threshold of 5–6, which again reduces the sensitivity in line with Q-ADDS and MEWS at similar workload capacity.

Limitations

Our study has some limitations. Analysing the predictive accuracy of each EWT should not be conflated with assessment of overall EWS performance, as the effectiveness of efferent arm design and implementation may dominate over the afferent capabilities of EWT alerts.²¹ For example, not every escalation alert predicted in this analysis will or needs to be acted upon or translated into an escalation call, and not every need for escalation is preceded by an alert (e.g., sudden, unexpected cardiac arrest or massive blood loss). Moreover, escalation for clinician or carer concern alone makes up 5–15% of MET calls in the live environment.²¹ Finally, the variability between hospitals in escalation culture, clinician intervention, and patient response,²² and between health services right down to the level of each individual in the system, limit our ability to directly link EWT accuracy with EWS success in improving patient outcomes.

A high number of patients were excluded due to incomplete vital sign sets, with half having no vital signs collected throughout their stay. These patients had a median stay of 5.3 h (3.1–7.0 IQR) and comprised ambulatory (‘walk-in’) patients who were not overly sick (‘worried well’, minor trauma), had no evidence of haemodynamic or cardiorespiratory compromise, were admitted to ambulatory or short stay wards and then discharged and did not have an oxygen pulsimeter attached to give a Sp02. Our study focuses on EWTs as applied to sick admitted patients and therefore we do not feel excluding patients lacking a single complete set of vital signs invalidates our results. Death or unplanned ICU admission are commonly used but known flawed outcome measures, death reflecting a very late and irreversible degree of deterioration, and ICU admission not always reflecting critical illness. Both outcomes also exclude patients who deteriorated on the ward and were subsequently treated and recovered without having to be moved to the ICU or dying. These outcome shortcomings are common across our EWT evaluations. MET calls as outcomes are sometimes used, but they are skewed in the opposite direction, with many called for patients who have not significantly deteriorated. Unfortunately, reliable and nuanced MET call and cardiac arrest data that reflect illness severity were not available. Having accurate ward-based severity of illness indicators recorded in EMR could allow a useful ‘critically-deteriorated patient’ metric to be developed as an outcome.

Implications for practice

This PR workload capacity method of EWT comparison can enable hospitals to identify EWTs with the highest accuracy at a given level of alert frequency, and apply them according to their needs for patient safety and the capacity of staff and the organisation. This method also helps to better understand and plan for expected workloads when implementing a particular EWT and its associated alert thresholds, and supporting real-time hospital resource management by making visible the alert burden for a given ward or clinical unit which can inform staffing ratios across all tiers of escalation. In relation to the current rules-based EWTs, this can guide either selection of a different EWT, or adjustments to alert thresholds of an already implemented EWT.

Changing an EWT within a widely implemented, well-embedded EWS, will be a non-trivial challenge for countries like Australia and the UK if considering a transition to AI-based EWT solutions. The first step is to employ the methodology set out in this study and evaluate the as-is and to-be EWTs using a recent retrospective dataset. This information can inform the impact and suitability in terms of sensitivity, PPV and alert burden of any proposed change. While clearly understanding the comparative performance of each tool can inform threshold settings, in the absence of a comparative clinical trial the decisions around the risk/benefit for any threshold adjustment remains a clinical one. Any threshold change to an EWT should be carefully piloted and assessed with clinical outcome metrics such as cardiac arrest and UICU rates.

Conclusion

The accuracy of these four rules-based digital EWTs in predicting death or UICU using an Australian dataset is consistent with prior comparisons, with NEWS2 slightly superior to Q-ADDS and MEWS, and the aggregated scoring tools significantly superior to the single trigger tool.

The performance of EWTs in relation to their sensitivity and PPV at each alerting threshold is only valid for a hospital if it has the capacity to clinically support the alert workload associated with those thresholds. Conventional measures of AUROC, AUPRC and F1 score provide no indication of this workload, yet we found that the alert workloads for commonly employed EWTs varied significantly. We demonstrated a novel evaluation methodology that allows hospitals to review both conventional EWT performance (i.e., sensitivity and PPV) in conjunction with their associated alert workloads, thereby informing EWS implementation decision makers about the right EWT, alerting thresholds, and implementation strategy for their patients, staff and overall care delivery. Future research should evaluate AI-based EWT performance using this method with the goal of further optimising EWSs.

Supplemental Material

sj-docx-1-dhj-10.1177_20552076251404509 - Supplemental material for Using workload capacity indicators to evaluate rule-based early warning tools and their relationship to escalation events

Supplemental material, sj-docx-1-dhj-10.1177_20552076251404509 for Using workload capacity indicators to evaluate rule-based early warning tools and their relationship to escalation events by Anton H van der Vegt, Victoria Campbell, Imogen Mitchell, Oliver C Redfern, Christian Subbe, Roger Conway, Arthas Flabouris, Robin Blythe, Rudolf Schnetler, Christopher Perkins, Naitik Mehta MHServMgt and Ian A Scott in DIGITAL HEALTH

Footnotes

Acknowledgements

We acknowledge the ongoing data governance and ethics application support of Vikrant Kalke and commitment to this project by Clinical Excellence Queensland Patient Safety Unit. We also acknowledge Andreas Bollinger for supporting the identification and acquisition of patient data. Finally, we thank the Australia & New Zealand Artificial Intelligence Keeping You Safer (AI-KEYS) working group for their guidance and expertise.

ORCID iDs

Anton H van der Vegt

Oliver C Redfern

Robin Blythe

Rudolf Schnetler

Ian A Scott

Ethical considerations

Townsville Hospital and Health Service Human Research Ethics Committee (HREC/QTHS/67897) granted a waiver of patient consent with Public Health Act approval.

Author contributions

A.H.V. and V.C conceptualized the study. A.H.V. conducted the study with statistical support from R.S. and drafted the manuscript. V.C managed clinical input from clinical authors and drafted clinical aspects of the manuscript. All other authors provided clinical feedback throughout the study and reviewed and updated the manuscript. A.H.V is guarantor for the study.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Department of Science, Information Technology and Innovation, Queensland Government (grant number Advanced Queensland Industry Research Fellowship g).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability

The data that support the findings of this study are available from Queensland Health but restrictions apply to the availability of these data, which were obtained under a Public Health Act approval for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Queensland Health.

Supplemental material

Supplemental material for this article is available online.

References

Van Der Vegt

Campbell

Mitchell

, et al. Systematic review and longitudinal analysis of implementing artificial intelligence to predict clinical deterioration in adult hospitals: what is known and what remains uncertain. J Am Med Inform Assoc 2024; 31: 509–524.

Subbe

. Validation of a modified Early Warning Score in medical admissions. QJM 2001; 94: 521–526.

Hughes

Pain

Braithwaite

, et al. Between the flags’: implementing a rapid response system at scale. BMJ Qual Saf 2014; 23: 714–717.

Canberra Hospital and Health Services. Canberra Hospital and Health Services Clinical Procedure: Vital signs and early warning scores. Published online March 2018. Canberra Hospital and Health Services Clinical Procedure.

Campbell

Conway

Carey

, et al. Predicting clinical deterioration with Q-ADDS compared to NEWS, between the flags, and eCART track and trigger tools. Resuscitation 2020; 153: 28–34.

Gerry

Bonnici

Birks

, et al. Early warning scores for detecting deterioration in adult hospital patients: systematic review and critical appraisal of methodology. BMJ 2020; 369: m1501.

Blythe

Parsons

White

, et al. A scoping review of real-time automated clinical deterioration alerts and evidence of impacts on hospitalised patient outcomes. BMJ Qual Saf 2022; 31: 725–734.

Credland

Dyson

Johnson

. Do early warning track and trigger tools improve patient outcomes? A systematic synthesis without meta-analysis. J Adv Nurs 2021; 77: 622–634.

Van Der Vegt

Campbell

Zuccon

. Why clinical artificial intelligence is (almost) non-existent in Australian hospitals and how to fix it. Med J Aust 2024; 220: 172–175.

10.

Rudin

. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell 2019; 1: 206–215.

11.

Harrison

Despotou

Arvanitis

. Hazards for the implementation and use of artificial intelligence enabled digital health interventions, a UK perspective. In: Mantas

Hasman

Househ

, et al. (eds) Studies in health technology and informatics. Amsterdam, Netherlands: IOS Press, 2022, pp.14–17.

12.

Rajkomar

Hardt

Howell

, et al. Ensuring fairness in machine learning to advance health equity. Ann Intern Med 2018; 169: 866–872.

13.

Berrar

Flach

. Caveats and pitfalls of ROC analysis in clinical microarray research (and how to avoid them). Brief Bioinform 2012; 13: 83–97.

14.

Green

Lander

Snyder

, et al. Comparison of the between the flags calling criteria to the MEWS, NEWS and the electronic Cardiac Arrest Risk Triage (eCART) score for the identification of deteriorating ward patients. Resuscitation 2018; 123: 86–91.

15.

Smith

Prytherch

Meredith

, et al. The ability of the National Early Warning Score (NEWS) to discriminate patients at risk of early cardiac arrest, unanticipated intensive care unit admission, and death. Resuscitation 2013; 84: 465–470.

16.

Churpek

Yuen

Winslow

, et al. Multicenter development and validation of a risk stratification tool for ward patients. Am J Respir Crit Care Med 2014; 190: 649–655.

17.

Kipnis

Turk

Wulf

, et al. Development and validation of an electronic medical record-based alert score for detection of inpatient deterioration outside the ICU. J Biomed Inform 2016; 64: 10–19.

18.

Romero-Brufau

Whitford

Johnson

, et al. Using machine learning to improve the accuracy of patient deterioration predictions: mayo Clinic Early Warning Score (MC-EWS). J Am Med Inform Assoc 2021; 28: 1207–1215.

19.

Robin

Turck

Hainard

, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 2011; 12: 77.

20.

DeLong

Clarke-Pearson

. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988; 44: 37.

21.

Mullins

Psirides

,., et al. Activities of a medical emergency team: a prospective observational study of 795 calls. Anaesth Intensive Care 2016; 44: 1–1.

22.

Flabouris

Nandal

Vater

, et al. Multi-Tiered observation and response charts: prevalence and incidence of triggers, modifications and calls, to acutely deteriorating adult patients. Bouchama A, ed. PLoS One 2015; 10: e0145339.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.33 MB