Abstract
Background:
The Acute Physiology and Chronic Health Evaluation II (APACHE II) is used to quantify disease severity and hospital mortality risk in critically ill patients. It is widely used in intensive care units (ICUs) in Singapore, but its prognostic validity remains questionable as it has not been thoroughly assessed by established statistical methods.
Objectives:
This study aimed to: (a) evaluate the discrimination and calibration accuracy of the APACHE II in the prediction of hospital mortality in a mixed ICU, and (b) customise the APACHE II in an effort to maximise its prognostic performance.
Methods:
A prospective cohort study was conducted and all adult patients with >24 h of ICU admission in a tertiary care institution in Singapore were included. The outcome measure was hospital mortality, and all patients were followed up until hospital discharge or death for up to one year after ICU admission.
Results:
There were 503 patients, and their mean (SD) age and APACHE II score were 61.2 (15.8) years and 24.5 (8.2), respectively. Hospital mortality was 31%, and no patients were lost to follow-up. The APACHE II had good discrimination (area under the receiver operating characteristic curve: 0.76) but poor calibration (Hosmer–Lemeshow C test: p < 0.001). Customisation did not significantly improve calibration accuracy.
Conclusions:
The APACHE II and its customised version should not be used in the local setting as they both have poor calibration. There is an urgent need for larger studies to perform second-level customisation or to develop a new prognostic model tailored to the Singapore critical care setting.
Introduction
The Acute Physiology and Chronic Health Evaluation II (APACHE II) was developed in 1985 to objectively quantify disease severity and predict hospital mortality risk. 1 Despite newer versions such as the APACHE III and IV,2,3 the APACHE II continues to be widely used in research and clinical practice. This is due in part to its ease of calculation, and in part to its long history of use, which allows results to be compared consistently.
The hospital mortality rates of critically ill patients are used to assess the performance of intensive care units (ICUs) because they reflect important characteristics that are associated with good clinical practices (e.g. accurate diagnosis and timely therapies). 4 Since some hospitals will inherently admit patients with higher disease severity and thus have higher mortality rates than others, the APACHE II plays an essential role in the adjustment of mortality risk. That is, the observed mortality rate can be compared with the predicted mortality rate derived from the APACHE II; the ratio of observed to predicted mortality is termed the standardised mortality ratio (SMR). 4 Of note, the accuracy of the SMR in assessing ICU performance is underpinned by the accuracy of the predicted hospital mortality risk, since underestimation of such risk raises the SMR and understates the actual performance of the ICU, whereas overestimation lowers the SMR and inflates it.
All the ICUs of the public hospitals in Singapore use the APACHE II for quality audit. Despite the pervasive use of the APACHE II, its prognostic accuracy in Singapore remains questionable since it has never been validated locally with established statistical methods. Therefore, this study primarily aims to determine the performance of the APACHE II in the prediction of hospital mortality in a mixed ICU. The secondary aim is to customise the APACHE II and evaluate the performance of this new model.
Methods
Patients and setting
This was a prospective cohort study conducted in a 35-bed mixed ICU at Ng Teng Fong General Hospital. The ICU functions as a closed unit in which board-certified intensivists and residents provide care for both medical and surgical patients. Between August 2015 and October 2016, all adult ICU patients ⩾21 years old who had ⩾24 h length-of-stay were enrolled. For patients who were readmitted to the ICU during the same hospitalisation, only the data on the first admission were included. The Domain Specific Review Board approved this study (NHG DSRB Ref: 2014/00878), and informed consent was not required as this study was deemed a clinical audit.
Data collection
All data required to calculate the APACHE II score and predicted mortality risk (i.e. demography, physiological parameters, admission diagnoses and comorbidities) were prospectively recorded in the electronic medical records. Calculation of the APACHE II was carried out by methods described by Knaus et al. 1 However, several established modifications were also carried out. In most cases, the lowest Glasgow Coma Score (GCS) during the first 24 h of ICU admission was used to calculate the APACHE II. However, in patients who were anaesthetised before ICU admission, the GCS recorded before anaesthesia was used. 5 The diagnosis of acute kidney injury (AKI) was in accordance with the latest definition, that is, increase in serum creatinine by ⩾26.5 μmol/L within 48 h or by ⩾1.5 times baseline, or urine volume <0.5 mL/kg/h for 6 h. 6 For missing data, parameters not measured in the first 24 h of ICU admission were considered normal. 2
The predicted hospital mortality was calculated using the formula outlined in Knaus et al., 1 which comprises a constant, the APACHE II score multiplied by a coefficient, a coefficient for exposure to emergency surgery and the admission diagnostic coefficient, that is, ln(R/(1 − R)) = −3.517 + (APACHE II score × 0.146) + admission diagnostic coefficient + 0.603 if exposed to emergency surgery, where ln = natural logarithm and R = risk of hospital mortality. 1 In the circumstance of multiple admission diagnoses, the condition with the worst prognosis (e.g. haemorrhagic shock rather than sepsis) was taken. 4 For observed hospital mortality, patients were followed until hospital discharge or death for up to one year after ICU admission.
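As an illustration of the equation above, the following minimal Python sketch converts an APACHE II score into a predicted hospital mortality risk. The function name and the example patient are hypothetical, and the diagnostic coefficient of 0.0 is a placeholder rather than a value taken from Knaus et al.

```python
import math

# Constants quoted in the equation above (Knaus et al.).
INTERCEPT = -3.517
SCORE_COEF = 0.146
EMERGENCY_SURGERY_COEF = 0.603


def predicted_mortality(apache_score, diagnostic_coef, emergency_surgery=False):
    """Return the predicted hospital mortality risk R as a proportion (0 to 1)."""
    logit = INTERCEPT + SCORE_COEF * apache_score + diagnostic_coef
    if emergency_surgery:
        logit += EMERGENCY_SURGERY_COEF
    # Invert ln(R / (1 - R)) = logit to recover R.
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical patient: score 25, placeholder diagnostic coefficient 0.0.
example_risk = predicted_mortality(25, 0.0)
print(round(example_risk, 3))  # 0.533
```

Under these assumptions the logit is −3.517 + 3.650 = 0.133, which corresponds to a predicted risk of about 53%.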
Statistical analysis
Performance of the APACHE II
Performance was assessed by discriminative ability and calibration accuracy. Discrimination refers to the ability of the APACHE II to distinguish between discrete outcomes (e.g. died/survived). This was measured by the area under the receiver operating characteristic (ROC) curve, in which perfect, excellent, very good, good, moderate and poor discrimination are defined as an area of 1.00, 0.90–0.99, 0.80–0.89, 0.70–0.79, 0.60–0.69 and <0.60, respectively. 7 In contrast to discrimination, calibration accuracy refers to the ability of the APACHE II to quantify risk across the continuum of mortality risk. Calibration was measured using two methods. First, by the Hosmer–Lemeshow C test, in which accurate calibration is defined as p-value > 0.05, indicating no significant difference between the observed and predicted mortality. 8 Second, by a calibration curve, in which the observed and predicted mortality across all risk ranges were presented in a graphical plot.
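The two measures described above can be sketched in plain Python as follows. This is an illustrative sketch rather than the Stata routines used in the study: the function names are hypothetical, the Hosmer–Lemeshow groups are assumed to be ten equal-sized groups of patients ordered by predicted risk, the degrees of freedom are assumed to be the number of groups minus two, and predicted risks are assumed to lie strictly between 0 and 1.

```python
import math


def auc(risks, died):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [r for r, d in zip(risks, died) if d]
    neg = [r for r, d in zip(risks, died) if not d]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def hosmer_lemeshow_c(risks, died, groups=10):
    """Return (chi-square statistic, p-value) over equal-sized risk groups."""
    pairs = sorted(zip(risks, died), key=lambda pair: pair[0])
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(r for r, _ in chunk)   # predicted deaths in the group
        observed = sum(d for _, d in chunk)   # observed deaths in the group
        # Deaths and survivors both contribute to the statistic.
        stat += (observed - expected) ** 2 / expected
        stat += ((size - observed) - (size - expected)) ** 2 / (size - expected)
    return stat, chi2_sf_even_df(stat, groups - 2)


def chi2_sf_even_df(x, df):
    """Chi-square upper-tail probability; closed form valid for even df only."""
    if df % 2 != 0:
        raise ValueError("closed form requires even degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))
```

With ten groups the test has eight degrees of freedom, so a statistic above roughly 15.5 gives p < 0.05, which under the definition above would be read as poor calibration.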
The SMR (the ratio of the observed hospital mortality to the hospital mortality predicted by the APACHE II) and its 95% confidence interval (CI) were also calculated for the purpose of future comparisons. The 95% CI was derived by dividing the 95% CI of the observed mortality by the predicted mortality. 9 An SMR whose 95% CI includes 1.0 indicates that observed mortality did not differ significantly from predicted mortality, that is, an overall good ICU performance.
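The SMR calculation can be sketched as below. The text does not state how the 95% CI of the observed mortality was obtained, so this sketch assumes a normal-approximation (Wald) interval for a proportion; the function name and the example figures are hypothetical and are not results from the study.

```python
import math


def smr_with_ci(deaths, n, mean_predicted_risk, z=1.96):
    """Return (SMR, lower, upper) using a Wald interval for observed mortality."""
    observed = deaths / n
    # Assumption: normal approximation to the binomial for the observed rate.
    se = math.sqrt(observed * (1.0 - observed) / n)
    low, high = observed - z * se, observed + z * se
    # Divide the observed rate and its CI limits by the predicted rate.
    return (observed / mean_predicted_risk,
            low / mean_predicted_risk,
            high / mean_predicted_risk)


# Hypothetical cohort: 31 deaths in 100 patients, mean predicted risk 0.40.
smr, lower, upper = smr_with_ci(31, 100, 0.40)
print(round(smr, 3), round(lower, 3), round(upper, 3))
```

In this hypothetical cohort the SMR is 0.775 and the interval just includes 1.0, so observed mortality would not be judged significantly different from predicted mortality.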
Customisation and validation of the customised APACHE II
The study population was randomly split into equal training and validation groups. The training group was used to customise the APACHE II: a new constant and new coefficients for the APACHE II score and exposure to emergency surgery were computed by logistic regression with hospital mortality as the dependent variable. Thereafter, in the validation group, the discriminative ability and calibration accuracy of the customised model were determined by the methods described above.
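A minimal sketch of this customisation procedure is given below. It is not the Stata code used in the study: the function names and the record layout are hypothetical, and it assumes that the admission diagnostic coefficient is kept fixed as an offset while the constant and the two coefficients are re-estimated by Newton–Raphson maximum likelihood.

```python
import math
import random


def split_half(records, seed=0):
    """Randomly split records into equal-sized training and validation halves."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_first_level(rows, iterations=25):
    """Fit [constant, score coefficient, emergency-surgery coefficient].

    Each row is (score, emergency_surgery, diagnostic_coef, died). The
    diagnostic coefficient enters as a fixed offset and is not re-estimated.
    """
    beta = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        grad = [0.0] * 3
        info = [[0.0] * 3 for _ in range(3)]
        for score, surgery, offset, died in rows:
            x = (1.0, float(score), float(surgery))
            eta = offset + sum(b * xi for b, xi in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            for i in range(3):
                grad[i] += (died - p) * x[i]
                for j in range(3):
                    info[i][j] += p * (1.0 - p) * x[i] * x[j]
        # Newton-Raphson update: beta + (information matrix)^-1 * gradient.
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    return beta
```

The fitted coefficients from the training half would then be applied unchanged to the validation half, where discrimination and calibration are assessed. The sketch assumes that the outcome is not perfectly separated by the predictors and that both surgery categories are present; otherwise the update does not converge.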
Patient characteristics of the training and validation groups were reported as means and standard deviations, medians and inter-quartile ranges or counts and percentages, and the Student’s t-test, Mann–Whitney U-test or Chi-square test was used as appropriate to compare them. All statistical analyses were performed using STATA 14.2 (Stata Corp, College Station, TX, USA), and significance was assumed at p < 0.05.
Results
There were 844 admissions, of which 503 patients were enrolled (Figure 1). The majority came from the emergency department and were admitted to the ICU for medical reasons. Other characteristics of the enrolled patients in the overall, training and validation groups are summarised in Table 1. Hospital mortality was 31% in the overall group, and no patients were lost to follow-up since the longest hospital length of stay was 255 days, which was within the one-year follow-up period. A small number of patients had missing data: 6.6% had missing haematocrit and white blood count, and 6.0% had missing serum sodium and potassium.

Figure 1. Patient enrolment.
Table 1. Characteristics of the enrolled patients in the overall, training and validation groups.
Values are mean (SD), median (q1, q3) or count [percentage]. APACHE II: Acute Physiology and Chronic Health Evaluation II, LOS: length of stay, ICU: intensive care unit.
Overall sample
Discrimination was good as evidenced by the area under the ROC curve, but calibration accuracy measured by the Hosmer–Lemeshow C test was poor (Table 2). This was consistent with the calibration curve, which showed that hospital mortality risk was overestimated in nearly all deciles (Figure 2).
Table 2. Discriminative ability and calibration accuracy of the APACHE II in all patients, training and validation groups.
CI: confidence interval, HL-C: Hosmer–Lemeshow C test, ROC: receiver operating characteristic, SMR: standardised mortality ratio.

Figure 2. Calibration curves for all patients, the training and the validation group.
Customisation
The new customised equation to quantify predicted hospital mortality risk was as follows: ln(R/(1 − R)) = −4.587 + (APACHE II score × 0.143) + existing diagnostic weight outlined in Knaus et al., 1 where R = risk of hospital mortality.
Exposure to emergency surgery was not significantly associated with hospital mortality (p-value: 0.324) and hence was omitted in this new model.
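To illustrate the effect of the new constant and coefficient, the worked example below applies the original and the customised equations to a hypothetical patient with an APACHE II score of 24.5 (the cohort mean), no emergency surgery and a diagnostic weight set to 0 purely for illustration. The resulting risks are arithmetic illustrations and not results from the study.

```latex
\begin{align*}
\text{Original:}\quad \ln\frac{R}{1-R} &= -3.517 + (24.5 \times 0.146) = 0.060, & R &= \frac{1}{1+e^{-0.060}} \approx 0.515\\
\text{Customised:}\quad \ln\frac{R}{1-R} &= -4.587 + (24.5 \times 0.143) = -1.0835, & R &= \frac{1}{1+e^{1.0835}} \approx 0.253
\end{align*}
```

Because the score coefficient is almost unchanged (0.146 versus 0.143), the lower predicted risk in this example is driven mainly by the more negative constant.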
Discrimination was good in the validation group and very good in the training group (Table 2 and Figure 3). Although customisation of the APACHE II considerably improved the accuracy of the predicted hospital mortality risk in all deciles (Figure 2: overall versus validation group), there were still significant inaccuracies: predicted hospital mortality risks were underestimated in patients with ⩽40%, and overestimated in patients with >40%, observed hospital mortality risk. Of note, calibration accuracy was good for medical patients in both the training and validation groups, whereas it was poor for surgical patients in the validation group. Similarly, the SMR in medical patients, as opposed to surgical patients, appeared to be more reliable, as evidenced by the tighter CI.

Figure 3. Receiver operating characteristic curves for all patients, the training and the validation group.
Discussion
To our knowledge, this is the largest study conducted in Singapore to evaluate the validity of the APACHE II in predicting hospital mortality. The APACHE II has demonstrated good discrimination but poor calibration accuracy for hospital mortality, and customisation of the APACHE II did not significantly improve its calibration accuracy in the local setting.
In 1985, Knaus et al. used data (i.e. 12 physiological parameters, comorbidities and emergency surgery, age and admission diagnosis) from a reference population of 5815 patients from 13 hospitals in the USA to develop the APACHE II, 1 which quantifies the predicted hospital mortality risk of critically ill patients via an equation. Therefore, all subsequent evaluations of ICU performance using the APACHE II are in effect comparisons against this reference population. Given the advances in ICU treatment modalities since 1985, it is crucial to validate the APACHE II before using it in local settings.
This is the fifth validation study performed in Singapore, but the results of our study cannot be compared with those of three previous studies as they did not report the discrimination and calibration accuracy of the APACHE II.10–12 Nevertheless, we were able to estimate the SMR and calibration accuracy from the crude results reported by Lee et al. 10 These authors prospectively calculated the APACHE II scores of 131 patients in the medical ICU; the SMR was estimated to be 0.89, and there was good correlation (r = 0.95, p-value: 0.001) between observed and predicted mortality, suggestive of good calibration. Similar results were demonstrated in the surgical ICU, in which there was very good discrimination and likely good calibration (correlation between observed and predicted mortality was 0.97 (p-value unreported)). 13 The good prognostic performance of the APACHE II in these studies was likely due to the close proximity of the APACHE II development (i.e. 1985) and the validation period (1991), during which treatment modalities were likely similar.10,13 Evidently, the reduction in observed hospital mortality with time due to advances in treatment modalities gradually reduces the discrimination and calibration accuracy of the APACHE II.14,15 This may account for the discordance of results between our study and those of Lee et al. and Chen et al.10,13
Compared to recent studies conducted in other countries, patients in our study had higher disease severity, as evidenced by the higher mean APACHE II score (24.5 versus 17–21).14,16–21 Similar to recent studies, the APACHE II in our study also had good discriminative validity, that is, 0.756 versus 0.729 to 0.805,14,16–21 and poor calibration accuracy.14,16–19 It is established that the latter will have a negative impact on the statistical risk adjustment in research studies and the SMR used in clinical audits. 4 Therefore, customisation of the APACHE II is often carried out in the literature in an effort to improve calibration accuracy.
There are two levels of customisation. First-level customisation refers to computing a new constant and new coefficients for the APACHE II score and exposure to emergency surgery. Second-level customisation involves all steps described above and additionally computes new coefficients for the admission diagnoses. 22 In our study, second-level customisation was not performed because it should only be performed in studies with a large sample size, which reduces the risk of overfitting. 23 Although first-level customisation brought predicted risks closer to observed mortality, the improvement was insufficient to achieve good calibration. This is similar to the results of Brinkman et al. and Mann et al.,15,24 in which customisation also did not improve calibration accuracy.
This study has some strengths. Selection, attrition and treatment biases were respectively minimised by the use of consecutive recruitment, complete follow-up and blinding of the treatment team to the objectives of the study. However, there are some limitations. This was a single-centre study and hence lacks generalisability. Although our study was the largest in the local setting, the sample size did not allow robust subgroup analyses.
Future direction
Poor calibration after customisation is indicative of the dire need to either conduct a local multicentre study to perform robust second-level customisation or develop a new prognostic model with the addition of strong prognostic variables such as exposure to cardiopulmonary resuscitation before ICU admission and baseline nutritional status.25,26 This will allow comparison of ICU performance among the local hospitals as well as internal quality assessment and benchmarking based on historical baseline performance data (e.g. baseline SMR).
The performance of an ICU is commonly measured by the SMR. For future studies, it is best practice to stratify the SMR by low-, medium- and high-risk patients to better understand the performance of the APACHE II in different risk groups. This is because the proportion of high-risk patients within a cohort will disproportionately affect the aggregated SMR since most high-risk patients will die. 4 In our study, we did not stratify the SMR by risk groups because the customised APACHE II did not demonstrate good calibration.
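A stratified SMR of the kind recommended above could be computed along the following lines. This is a sketch only: the function name is hypothetical, and the cut-offs of 0.2 and 0.5 that separate low-, medium- and high-risk patients are illustrative placeholders, not thresholds proposed in this study.

```python
def stratified_smr(risks, died, cutoffs=(0.2, 0.5)):
    """Return {band: SMR} for low, medium and high predicted-risk bands.

    The SMR in each band is observed deaths divided by the sum of predicted
    risks (expected deaths). Bands with no expected deaths return None.
    """
    totals = {"low": [0, 0.0], "medium": [0, 0.0], "high": [0, 0.0]}
    for risk, outcome in zip(risks, died):
        if risk < cutoffs[0]:
            band = "low"
        elif risk < cutoffs[1]:
            band = "medium"
        else:
            band = "high"
        totals[band][0] += outcome   # observed deaths
        totals[band][1] += risk      # expected deaths
    return {band: (obs / exp if exp > 0 else None)
            for band, (obs, exp) in totals.items()}
```

Reporting the three ratios side by side shows whether an acceptable aggregated SMR conceals over- or underprediction in a particular risk band.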
Conclusion
The APACHE II demonstrated good discrimination but poor calibration accuracy in the prediction of hospital mortality in a mixed ICU in Singapore. Customisation was attempted to improve calibration accuracy, but calibration remained poor. Therefore, there is an urgent need for future studies to recruit a larger sample of patients from multiple hospitals to perform second-level customisation or develop a new prognostic model that will better predict hospital mortality.
Footnotes
Acknowledgements
We are grateful for the statistical support provided by Wong Chiew Meng Johnny, Biostatistician, Clinical Research Unit, Ng Teng Fong General Hospital.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Availability of data
The datasets generated and/or analysed during the current study are not publicly available due to the data confidentiality requirements of the ethics committee, but are available from the corresponding author on reasonable request and approval from the ethics committee.
Authors’ contributions
CCH Lew, GJY Wong, CK Tan and M Miller contributed equally to the conception and design of the research; CCH Lew and GJY Wong contributed to the acquisition of the data; and CCH Lew contributed to the analysis and interpretation of the data and drafted the manuscript. All authors critically revised the manuscript, agree to be fully accountable for ensuring the integrity and accuracy of the work, and read and approved the final manuscript.
Conflict of interest
The authors declare that there is no conflict of interest.
Ethical approval
Ethical approval was obtained from the Domain Specific Review Board (NHG DSRB Ref: 2014/00878).
Informed consent
Informed consent was not sought for the present study because this was an observational study where no attempt was made to change the standard of care.
