
Abstract

Current mortality prediction models and scoring systems for intensive care unit patients are generally usable only after at least 24 or 48 h of admission, as some parameters are unclear at admission. However, some of the most relevant measurements are available shortly following admission. It is hypothesized that outcome prediction may be made using information available in the earliest phase of intensive care unit admission. This study aims to investigate how early hospital mortality can be predicted for intensive care unit patients. We conducted a thorough time-series analysis on the performance of different data mining methods during the first 48 h of intensive care unit admission. The results showed that the discrimination power of the machine-learning classification methods after 6 h of admission outperformed the main scoring systems used in intensive care medicine (Acute Physiology and Chronic Health Evaluation, Simplified Acute Physiology Score and Sequential Organ Failure Assessment) after 48 h of admission.

Keywords

critically ill, missing values, mortality prediction, patient mortality, time-series analysis

Introduction

Early physiological monitoring and laboratory surveillance can aid clinicians in making effective interventions to improve patient outcome. Existing severity scoring systems and machine-learning approaches face challenges in integrating a comprehensive panel of physiologic variables and in presenting interpretable models to clinicians early in a hospital admission. This problem has particular importance in the intensive care unit (ICU), as patients are necessarily very unwell and there is considerable complexity. Early hospital mortality prediction for ICU (EMPICU) patients remains an open challenge, as the majority of the severity of illness scores developed provide risk assessments for ICU patients based on the first 24, 48 or 72 h of a patient’s ICU stay.¹⁻¹¹ According to Luo et al.,¹ many measurements are not yet available during the first half of the first day (i.e. the first 12 h); as a result, data from this period are usually missing and are therefore excluded from the analysis. However, patients receive a great deal of intervention in this period, imposing a burden upon them and conferring a cost. It is in the interest of both patients and providers that intensive care intervention is delivered only where it is likely to be effective. The early identification of patients who are more likely to survive, and more likely therefore to benefit, may help both patients and providers to make informed choices about their care.

Therefore, this study presents a thorough time-series analysis for hospital mortality prediction during the first 48 h of ICU admission together with examining the impact of missing values on the performance of mortality prediction in order to establish the most effective model for EMPICU patients. The question that emerges is as follows:

Given the ICU patients’ medical records, how early in the ICU admission can data mining (DM) methods help in predicting hospital mortality considering the impact of missing measurements, and what are the most effective data mining methods for EMPICU?

This article is organized as follows. The ‘Related work in ICU mortality prediction’ section introduces previous work that has been done in ICU mortality prediction. The ‘Challenges in ICU data’ section presents challenges in ICU data. The ‘Time-series analysis for mortality prediction using data mining (DM) techniques’ section introduces the time-series analysis for ICU mortality prediction presented in this research. The ‘A framework for early ICU mortality prediction’ section introduces a framework for early mortality prediction in the ICU. The ‘Results’ discussion’ section discusses the results, and finally the ‘Conclusion’ section concludes the work done in this research.

Related work in ICU mortality prediction

This section highlights some DM challenges in ICU mortality prediction facing medical doctors and data scientists. It provides a review of similar solutions for mortality prediction, including severity scoring systems, real-time models, daily models and DM approaches.

Scoring systems for mortality prediction

Traditional scoring systems for mortality prediction

In this section, we will discuss the following traditional ICU scoring systems: (1) Acute Physiology and Chronic Health Evaluation (APACHE),⁹ (2) Simplified Acute Physiology Score (SAPS)¹² and (3) Sequential Organ Failure Assessment (SOFA).¹³

Several publications in the literature have discussed and compared mortality prediction models for ICU patients that rely on panels of experts or statistical models.⁸⁻¹⁰,¹²,¹⁴⁻¹⁷ For example, APACHE⁹ and SAPS¹² assess disease severity to predict outcome. The objective of these models is to characterize disease severity from patient demographics and physiological variables obtained within the first 24 h after ICU admission in order to assess ICU performance. These models have been refined for use within specified geographical areas, such as France, Southern Europe and Mediterranean countries, and Central and Western Europe.³,¹⁶⁻²¹ Using a very different strategy, Hoogendoorn and others²² built two prediction models. The methods used were as follows: (1) extraction of high-level (temporal) features from electronic medical records (EMRs) to build a predictive model; (2) definition of a patient similarity metric, with prediction based on the outcome observed for similar patients. Neither approach gave optimal discrimination, but the first model, using temporal features (area under the receiver operating characteristic curve (AUROC) = 0.84), was superior to the patient similarity model (AUROC = 0.68). In a recent study,²³,²⁴ the authors looked at the use of Random Forest (RF) in early ICU mortality prediction; however, the study does not provide a time analysis of the proposed framework.

Prediction systems have evolved since their inception but have not always led to improved discrimination. APACHE-III¹⁸ was developed in 1991, and APACHE IV²⁵ was developed in 2002/2003, which provides length of stay prediction equations, in addition to the prediction capability of earlier iterations. A more detailed comparison of the current APACHE scoring systems is available in Vincent and Singer.¹⁶ Research in Le Gall et al.¹⁹ introduced an expanded SAPS-II by adding six admission variables: age, gender, length of pre–ICU hospital stay, patient location before ICU, clinical category and presence of drug overdose. Results show that the expanded SAPS-II performed better than the original and a customized SAPS-II, with an AUROC of 0.879. However, a study conducted by Gilani et al.,¹⁷ comparing APACHE scores and SAPS-II score, showed that the discrimination of APACHE-II (as measured by the AUROC) was excellent (AUROC = 0.828) and acceptable for APACHE-III (AUROC = 0.782) and SAPS-II (AUROC = 0.778) scores. In addition, Kramer and others²⁶ found that the discrimination of APACHE IVa was superior with AUROC (0.88) compared with Mortality Probability Model–III (MPM)²⁷ (0.81) and ICU Outcomes Model/National Quality Forum (0.80).²⁶

Another traditional scoring system is the SOFA score,¹³ which is limited to six organ systems: respiration, coagulation, liver, cardiovascular, central nervous system and renal. For each organ system, the score provides an assessment of derangement between 0 (normal) and 4 (highly deranged).
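As a rough illustration of this structure, the sketch below sums six per-organ sub-scores, each between 0 and 4, into a total between 0 and 24. The function name `sofa_total` and the dictionary keys are illustrative assumptions; the rules that map raw measurements to sub-scores are defined in the SOFA publication¹³ and are not reproduced here.

```python
# Hypothetical helper: combines SOFA sub-scores that have already been assigned.
ORGAN_SYSTEMS = ("respiration", "coagulation", "liver",
                 "cardiovascular", "cns", "renal")


def sofa_total(sub_scores):
    """Sum six organ sub-scores (each 0-4) into a total SOFA score (0-24)."""
    total = 0
    for organ in ORGAN_SYSTEMS:
        score = sub_scores[organ]
        if not 0 <= score <= 4:
            raise ValueError(f"{organ} sub-score must be between 0 and 4")
        total += score
    return total


# Made-up sub-scores for one patient, for illustration only.
example = {"respiration": 2, "coagulation": 0, "liver": 1,
           "cardiovascular": 3, "cns": 1, "renal": 0}
print(sofa_total(example))
```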

According to the clinical review conducted by Vincent and Singer,¹⁶ the different types of score should be seen as complementary, rather than competitive and mutually exclusive. Scoring systems have focused on providing increasingly refined methods for benchmarking ICU performance and have laid the foundation for robust systems of quality control, but the use of such tools for individual decision support remains unproven.

Early scoring systems for mortality prediction

The MPM²⁸ was described by Lemeshow et al.²⁸ Initially, 137 variables were considered; using statistical techniques, the relative importance of each variable was determined and only those with a strong association with outcome retained. This resulted in seven variables collected at admission and seven at 24 h. Unlike APACHE and SAPS, this model could be applied at the time of admission. Furthermore, the physiological variables are recorded as affirmative or negative rather than as an actual number. Lemeshow published an updated form of the model, the MPM-II in 1993.¹⁰ This resulted in two models: $\mathrm{MPM}_{0}$ at admission and $\mathrm{MPM}_{24}$ at 24 h. $\mathrm{MPM}_{0}$ requires the collection of 15 variables and $\mathrm{MPM}_{24}$ a further 8 variables. Both models were shown to be good systems for reliably estimating hospital mortality. At that time (1993), $\mathrm{MPM}_{0}$ was, by definition, the only model for estimating early hospital mortality which was independent of treatment.

Another scoring system for early mortality prediction is SAPS-III.²¹ The objective of the development of SAPS-III was the evaluation of the effectiveness of ICU practices; therefore, the focus of the model was on data available at ICU admission or within a day of admission. Missing values were coded as the reference or ‘normal’ category for each variable. In data collection, maximum and minimum values were recorded during a certain time period; missing maximum values of a variable were replaced by the minimum and vice versa. Some regression imputations were performed if noticeable correlations of available values could be exploited. Selection of variables was done according to their association with hospital mortality, together with expert knowledge and definitions used in other severity of illness scoring systems. The objective of using this combination of techniques rather than regression-based criteria alone was to reach a compromise between over-sophistication of the model and knowledge from sources beyond the sample with its specific case mix and ICU characteristics. The study conducted by Poole et al.¹⁴ compared the predictive ability of SAPS-II (originally developed from data collected in 1991/1992) and SAPS-III (developed from data collected in 2002) scores on a sample of critically ill patients. Both scores provided unreliable predictions, but unexpectedly SAPS-III turned out to overpredict mortality compared to SAPS-II.

The MPM and SAPS-III attempt early mortality prediction; however, they were not used in comparison with our model as most of their attributes are not available in the Multiparameter Intelligent Monitoring in Intensive Care–II (MIMIC-II) database and are complex to calculate. On the contrary, the traditional scoring systems – APACHE-II, SAPS-I and SOFA scores – were used.

DM techniques for mortality prediction

Various studies have advocated the use of DM techniques for predicting ICU mortality, such as the one proposed by Calvert et al.,²⁹ which attempts to predict mortality 12 h before in-hospital death. Although that work shows strong predictive accuracy, we question the practical utility of a tool that predicts at a point 12 h from the sampling. It is not clear at what stage in the evolution of a critical care episode this tool should be employed to best effect. If it were used continuously until such time as a death, it would be very high risk for patients, and for many of them, there would already have been a protracted ICU course with the attendant burdens of treatment. While this delay is acceptable where the intended purpose is unit quality benchmarking, it is too slow for the purpose of decision support. In contrast, the model proposed in our study attempts to predict in-hospital mortality shortly after ICU admission. It is our hypothesis that accurate prediction of hospital mortality is possible using data collected in the earliest phase of admission.

Another study that attempted early mortality prediction was proposed by Sadeghi et al.,³⁰ which focuses on specific patient diagnosis. The study proposed a novel method to predict mortality using 12 features extracted from the heart signals of patients within the first hour of ICU admission using the MIMIC-III database. Similar to our work, their study showed that the random forest classifier satisfies both accuracy and interpretability better than the other classifiers (linear discriminant, logistic regression, support vector machine (SVM), random forest, boosted trees, Gaussian SVM and K-nearest neighbours), producing an F1-score and AUROC of 0.91 and 0.93, respectively. The study indicates that heart rate (HR) signals can be used for predicting mortality in patients in the ICU. In addition, Crawford and others⁷ concluded that a decision tree (DT) used in their study provided a clinically acceptable mining result in predicting susceptibility of prostate carcinoma patients at low risk for lymph node spread. On the contrary, Ramon and others³¹ reported that the AUROCs of DT-based algorithms (DT learning, 65%; first-order RF, 81%) were smaller than those of naive Bayes (NB) networks (AUROC, 85%) and tree-augmented NB networks (AUROC, 82%) in their study on a small dataset containing 1548 mechanically ventilated ICU patients. Also, the work conducted by Yakovlev et al.³² showed that overall prediction accuracy was highest (90.0%) for NB in predicting in-hospital mortality for patients with acute coronary syndrome.

Unlike these models, the framework proposed by our study attempts mortality prediction from physiological data, including chart variables, lab tests, vital signs and patient demographics, that are not necessarily related to one specific organ/diagnosis as the ICU is a very complex environment and normally patients get admitted suffering from several conditions. Early mortality prediction is motivated by the intention to assist clinicians and patients in the assessment of the risks and benefits attending intensive care admission. We hold that it is in the interests of patients, or their advocates, to be informed of a quantitative mortality risk, as early as possible, and preferably before committing to burdensome critical care interventions, whenever that is possible.

Similarly, Pirracchio and others³ reported that Bayesian Additive Regression Trees (BARTs) is the best candidate when using transformed variables, while random forests outperformed all other candidates when using untransformed variables. Other authors achieved improved mortality prediction using a method based on SVMs.³³ Davoodi and Moradi³⁴ proposed a Deep Rule-Based Fuzzy System (DRBFS) to develop an accurate in-hospital mortality prediction in the ICU employing a large number of input variables. The method developed was evaluated against several common classifiers, including naïve Bayes, DTs, Gradient Boosting and Deep Belief Networks. The AUROC for NB, DT, Gradient Boosting, Dynamic Bayesian Networks and proposed method were 73.51, 61.81, 72.98, 70.07 and 73.90 per cent, respectively.

Many studies show that customized models perform better than traditional scoring systems.²³ Lee and Maslove³⁵ conducted a retrospective analysis using data from the MIMIC-II database; the study concluded that customized models trained on ICU-specific data provided better mortality prediction than traditional SAPS scoring using the same predictor variables. However, ICU is a very complex environment where patients may suffer from more than one condition, which makes it difficult to specify which customized model to use. Therefore, there is a need for general mortality prediction models, which is the focus of this study.

Challenges in ICU data

There are a number of challenges due to the characteristics of typically available ICU data: (1) attribute selection, (2) missing values in data and (3) the class imbalance problem. In this section, we will briefly discuss each challenge, and the details of how we addressed each of these challenges will be discussed further in the article.

Attribute selection: It is often difficult to decide which attributes in a dataset should be used to construct the model. Therefore, one of the core stages is to select the appropriate attributes; several manual and automatic methods are used to select the attributes.

Missing values: Not all medical variables/tests are measured for all patients within the first few hours of admission; therefore (for each patient), there may be some missing data. Missing values can be handled either by ignoring those records from the dataset that are not complete or by filling in missing values by a number of techniques.³⁶

Class imbalance: Class imbalance is a major problem in EMPICU, because the number of patients who die inside the hospital is relatively small in comparison with the number who survive. Techniques for dealing with class imbalanced datasets include modifying the dataset (resampling),³⁶ making the classifier ‘cost sensitive’³⁷ or a hybrid method that combines both.

Time-series analysis for mortality prediction using DM techniques

The aim of this section is to establish how early hospital mortality can be predicted, considering the impact of missing measurements in the early hours of ICU admission. This research performs an experimental investigation on ICU patient data using DM classification techniques to predict mortality. Earlier studies¹ have defined early as the first 12 h of admission; others have defined it as 24, 48 or 72 h after admission.¹⁻¹¹ These differing definitions motivated us to perform a time-series analysis for mortality prediction over the first 48 h of ICU admission, in order to determine how early mortality can be predicted effectively in the ICU. The algorithms are evaluated on the PhysioNet/CinC Challenge 2012 dataset (Citi and Barbieri³³). The study considered 4000 subjects with single ICU stays whose age at ICU admission was 16 years or over in Medical ICU (MICU), Surgical ICU (SICU), Coronary ICU (CICU) or Cardiac Surgery ICU (CSICU), and whose initial ICU stay lasted at least 48 h. The data used for the challenge consisted of five general descriptors: age, gender, height, ICU type and initial weight. The remaining variables are 36 time series (measurements of vital signs and laboratory results) from the first 48 h of the first available ICU stay of a patient’s admission, published previously in Citi and Barbieri.³³

We employ the RF, PART and Bayesian Networks (BN) algorithms. Random forest is an ensemble learning method for classification that operates by constructing multiple DTs at training time and outputs the class that gets the majority vote of the individual trees. PART uses partial DTs (feature subset selection) to generate the decision list shown in the output. Only the final decision list is used in classification. It produces rules from pruned partial DTs. A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph.³⁶
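The majority-vote step of a random forest can be illustrated with a minimal sketch. The three "trees" below are hand-written threshold rules with made-up cut-offs and field names (`gcs_min`, `hr_max`, `age`), standing in for trained DTs; they are assumptions for illustration and are not the models trained in this study.

```python
from collections import Counter


def majority_vote(trees, record):
    """Return the class predicted by most trees for one patient record.

    Ties are resolved in favour of the class that was voted for first.
    """
    votes = Counter(tree(record) for tree in trees)
    return votes.most_common(1)[0][0]


# Toy stand-ins for trained decision trees (thresholds are illustrative only).
trees = [
    lambda r: "died" if r["gcs_min"] < 8 else "survived",
    lambda r: "died" if r["hr_max"] > 130 else "survived",
    lambda r: "died" if r["age"] > 80 else "survived",
]

record = {"gcs_min": 6, "hr_max": 140, "age": 55}
print(majority_vote(trees, record))
```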

The primary outcome was hospital mortality. Performance measures were calculated using cross-validated AUROC to minimize bias. All experiments were done using Weka (Version 3.7.13; University of Waikato, Hamilton, New Zealand).³⁸ The results noted in Table 3 are AUROC of the average of 10 runs; each run is 10-fold cross-validated. The results are presented in detail in the following sections.
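For readers less familiar with the metric, AUROC can be read as the probability that a randomly chosen patient who died receives a higher predicted risk than a randomly chosen survivor. The sketch below computes it directly from that definition on made-up labels and scores; it is an illustrative implementation, not the Weka routine used in the experiments, and the function name `auroc` is an assumption.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for hospital death, 0 for survival.
    scores: predicted risk, higher meaning more likely to die.
    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: five patients, two of whom died.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
print(round(auroc(labels, scores), 3))
```

In a cross-validated setting, this quantity would be computed on each held-out fold and then averaged.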

Experiment setting

This section presents the results for the top performing DM algorithms – RF, BN and PART. It is important to note here that we have also evaluated a larger set of algorithms, such as DTs – J48 (DT), SVMs and JRip; however, they were outperformed by the reported methods. Random forest is one of the most accurate learning algorithms available. For many datasets, it produces a highly accurate classifier. It runs efficiently on large databases and it has an effective method for estimating missing data and maintaining accuracy when a large proportion of the data is missing. PART uses partial DTs to generate the decision list shown in the output. Only the final decision list is used in classification. Bayesian networks are an increasingly popular method for modelling uncertain and complex domains, such as medical diagnoses and the evaluation of scientific evidence. They provide a natural way to handle missing data, allow combination of data with domain knowledge, facilitate learning about causal relationships between variables and provide a method for avoiding over-fitting of data.³⁶

Methods

A total of 4000 ICU patients and 37 time-series variables were selected from every hour over the first 48 h of a patient’s admission for modelling.

We evaluated each of the three DM algorithms on each of the six versions of the dataset:

Original datasets (original),

Datasets modified by applying the Synthetic Minority Oversampling Technique (SMOTE),³⁹ an oversampling technique that increases the size of the minority class by inserting synthetic data (original + smote),

Datasets after replacing missing values with the mean (rep1) to handle the issue of missing values,

Datasets after replacing missing values with mean and then applying SMOTE (rep1 + smote),

Datasets after replacing missing values using the EMImputation algorithm (rep2),

Datasets after replacing missing values using EMImputation algorithm and applying SMOTE (rep2 + smote).
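The mean-replacement step (rep1) in the list above can be sketched as follows. This is a minimal illustration on a made-up table in which `None` marks a missing measurement; the function name `impute_with_mean` and the handling of columns with no observed values are assumptions, and the experiments themselves were run in Weka rather than with this code.

```python
def impute_with_mean(rows):
    """Replace None entries in each column with that column's observed mean."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Assumption: a column with no observed values at all stays None.
        means.append(sum(observed) / len(observed) if observed else None)
    return [[means[j] if row[j] is None else row[j] for j in range(n_cols)]
            for row in rows]


# Made-up rows: [heart rate, temperature]; None marks a missing value.
rows = [[80.0, 36.5], [None, 38.0], [100.0, None]]
print(impute_with_mean(rows))
```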

Results

The performance of RF, PART and BN for the six versions of the dataset is displayed in Table 1 for simplicity. The performance of the three algorithms on all the original 48 datasets is displayed in Figure 1(a).

Table 1.

Performance of early mortality prediction models developed using 10-fold cross-validated RF, PART and BN in the different experiment settings at 2, 4, 6, 8, 12, 24 and 48 h after ICU admission, measured with AUROC.

Physionet dataset	Hours	RF	PART	BN
Original	2	0.72 ± 0.03	0.64 ± 0.04	0.69 ± 0.04
Original	4	0.75 ± 0.03	0.69 ± 0.04	0.69 ± 0.04
Original	6	0.76 ± 0.03	0.70 ± 0.04	0.71 ± 0.04
Original	8	0.78 ± 0.03	0.71 ± 0.04	0.73 ± 0.03
Original	12	0.78 ± 0.03	0.73 ± 0.03	0.75 ± 0.03
Original	24	0.80 ± 0.03	0.75 ± 0.03	0.77 ± 0.03
Original	48	0.82 ± 0.03	0.76 ± 0.03	0.79 ± 0.03
Original + smote	2	0.70 ± 0.04	0.65 ± 0.04	0.67 ± 0.04
Original + smote	4	0.73 ± 0.03	0.68 ± 0.04	0.70 ± 0.04
Original + smote	6	0.75 ± 0.03	0.68 ± 0.04	0.71 ± 0.04
Original + smote	8	0.76 ± 0.03	0.70 ± 0.04	0.73 ± 0.04
Original + smote	12	0.77 ± 0.03	0.71 ± 0.04	0.74 ± 0.04
Original + smote	24	0.79 ± 0.03	0.73 ± 0.04	0.76 ± 0.03
Original + smote	48	0.81 ± 0.03	0.75 ± 0.04	0.77 ± 0.03
Rep1	2	0.73 ± 0.03	0.62 ± 0.05	0.70 ± 0.04
Rep1	4	0.76 ± 0.03	0.62 ± 0.05	0.71 ± 0.04
Rep1	6	0.77 ± 0.03	0.62 ± 0.05	0.73 ± 0.04
Rep1	8	0.78 ± 0.03	0.61 ± 0.06	0.74 ± 0.04
Rep1	12	0.79 ± 0.03	0.62 ± 0.05	0.75 ± 0.04
Rep1	24	0.81 ± 0.03	0.64 ± 0.05	0.77 ± 0.03
Rep1	48	0.83 ± 0.03	0.64 ± 0.05	0.77 ± 0.03
Rep1 + smote	2	0.72 ± 0.04	0.61 ± 0.04	0.68 ± 0.04
Rep1 + smote	4	0.76 ± 0.03	0.63 ± 0.04	0.70 ± 0.04
Rep1 + smote	6	0.77 ± 0.03	0.63 ± 0.05	0.71 ± 0.04
Rep1 + smote	8	0.78 ± 0.03	0.64 ± 0.05	0.71 ± 0.04
Rep1 + smote	12	0.78 ± 0.03	0.63 ± 0.05	0.72 ± 0.04
Rep1 + smote	24	0.80 ± 0.03	0.65 ± 0.05	0.75 ± 0.03
Rep1 + smote	48	0.82 ± 0.03	0.66 ± 0.05	0.76 ± 0.03
Rep2	2	0.71 ± 0.04	0.63 ± 0.05	0.68 ± 0.04
Rep2	4	0.73 ± 0.03	0.64 ± 0.04	0.71 ± 0.04
Rep2	6	0.75 ± 0.03	0.64 ± 0.04	0.72 ± 0.04
Rep2	8	0.76 ± 0.03	0.65 ± 0.04	0.73 ± 0.04
Rep2	12	0.78 ± 0.02	0.65 ± 0.04	0.75 ± 0.03
Rep2	48	0.82 ± 0.02	0.68 ± 0.05	0.78 ± 0.03
Rep2 + smote	2	0.71 ± 0.03	0.62 ± 0.04	0.69 ± 0.04
Rep2 + smote	4	0.74 ± 0.03	0.63 ± 0.04	0.71 ± 0.04
Rep2 + smote	6	0.76 ± 0.03	0.64 ± 0.05	0.73 ± 0.03
Rep2 + smote	8	0.76 ± 0.03	0.63 ± 0.05	0.74 ± 0.03
Rep2 + smote	12	0.78 ± 0.03	0.64 ± 0.04	0.76 ± 0.03
Rep2 + smote	24	0.80 ± 0.03	0.66 ± 0.04	0.78 ± 0.03
Rep2 + smote	48	0.82 ± 0.03	0.67 ± 0.04	0.78 ± 0.03

ICU: intensive care unit; AUROC: area under the receiver operating characteristic curve; RF: Random Forest; BN: Bayesian Networks; PART: name of the algorithm.

Figure 1.

(a) The performance of all algorithms on the Yes class (patients at risk of dying inside the hospital) per hour during the first 48 h of ICU admission together with the percentage of available measurements during the first 48 h of ICU admission. (b) The percentage of missing values of all attributes and vital signs attributes during the first 48 h of ICU admission.

Performance analysis

Table 1 shows the performance of the three machine-learning algorithms (at 0.05 confidence level) in predicting hospital mortality among this patient cohort. Results were obtained on the original, original + smote, rep1, rep1 + smote, rep2 and rep2 + smote datasets, as shown in column 1 of Table 1. Among the six experiment categories, RF performed best, followed by BN then PART. The most effective RF performance model was obtained on the rep1 with AUROC = 0.83 ± 0.03 at hour 48, followed by the original, rep1 + smote and rep2 datasets with AUROC = 0.82 ± 0.03 at hour 40, then rep2 + smote with AUROC = 0.82 ± 0.03 at hour 48.

As shown in Figure 1(a), there is a dramatic increase in available measurements at hour 6 of ICU admission, making it a suitable point for in-depth analysis of our proposed framework. We compared the performance of RF, PART and BN on patient data after 6 h of ICU admission with the performance of SOFA and SAPS scores on patient data after 24 h of ICU admission to determine whether our proposed early mortality prediction framework (EMPICU) is effective. Because the number of figures is limited, we display only Figure 2, which shows the performance of all algorithms against SAPS and SOFA scores on one dataset setting (original dataset). The figure shows that our models after 6 h of admission outperformed the main scoring systems used in intensive care medicine (APACHE, SAPS-I and SOFA) after 24 h of admission. The best performing classifier is RF, followed by BN, then PART. As represented in the graph of Figure 2, the RF model outperforms the main scoring systems (APACHE-II, SAPS-I and SOFA) both in terms of mortality prediction performance (AUROC) and in terms of time (i.e. early prediction; higher prediction performance at 6 h after admission compared to that of the scoring systems at 24 h after admission). Table 2 displays the AUROC and the standard deviation of the best performing model RF at 6 h after admission and the scoring systems at 24 h after admission.

Figure 2.

The performance of RF, BN and PART after 6 h of ICU admission compared to SAPS-I, APACHE and SOFA on Rep1 + smote dataset after 24 h of ICU admission on the Yes class (patients at risk of dying inside the hospital).

Table 2.

The AUROC and the standard deviation of the best performing model RF at 6 h after admission and the scoring systems at 24 h after admission.

Scoring system	AUROC	SD
RF at 6 h	0.82	0.04
SAPS at 24 h	0.650	0.012
APACHE at 24 h	0.650	0.017
SOFA at 24 h	0.623	0.013

AUROC: area under the receiver operating characteristic curve; SD: standard deviation; SAPS: Simplified Acute Physiology Score; APACHE: Acute Physiology and Chronic Health Evaluation; SOFA: Sequential Organ Failure Assessment.

Missing values analysis

We also analysed the missing values over the 48-h time interval. Results displayed in Figure 1(a) show the percentage of available measurements during the first 48 h of ICU admission. As noted in the graph, a dramatic increase in available measurements is shown at hour 6 of ICU admission and no major increase between hours 24 and 48. In addition, Figure 1(b) displays the percentage of missing measurements of all attributes and vital signs attributes during the first 48 h of ICU admission. As noted on the graph, respiratory rate (RespRate) has the highest percentage of missing values, followed by invasive systolic arterial blood pressure (SysABP), then partial pressure of arterial oxygen (PaO2); HR, Glasgow Coma Scale (GCS) and temperature (Temp) have the lowest percentage of missing values, and creatinine is in the middle.
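An availability curve of the kind plotted in Figure 1(a) could be derived along the following lines. This is a sketch under stated assumptions: the input `first_seen` (mapping each patient and attribute to the hour of its first recorded value) and the function name `availability_by_hour` are hypothetical, and the example values are made up rather than taken from the dataset.

```python
def availability_by_hour(first_seen, n_patients, n_attributes, max_hour=48):
    """Percentage of patient-attribute pairs with at least one value by each hour.

    first_seen maps (patient_id, attribute) to the hour of the first measurement;
    pairs that were never measured are simply absent from the mapping.
    """
    total = n_patients * n_attributes
    curve = []
    for hour in range(1, max_hour + 1):
        available = sum(1 for h in first_seen.values() if h <= hour)
        curve.append(100.0 * available / total)
    return curve


# Made-up example: two patients, two attributes.
first_seen = {(1, "HR"): 1, (2, "HR"): 1, (1, "PaO2"): 5, (2, "PaO2"): 30}
curve = availability_by_hour(first_seen, n_patients=2, n_attributes=2)
print(curve[0], curve[5], curve[47])  # availability at hours 1, 6 and 48
```

The percentage of missing values at a given hour is then 100 minus the corresponding entry of the curve.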

A framework for early ICU mortality prediction

In this section, we present the general framework for dealing with early ICU mortality prediction. Figure 3 illustrates how we handle EMPICU patients in this study. From the previous time-series analysis, it is clear that there are a number of challenges in ICU data. The framework addresses three of these: (1) attribute selection, (2) missing values in data and (3) the class imbalance problem. In this section, we investigate different attribute selections, different methods of handling missing values and the class imbalance problem. The focus of the framework in this section is EMPICU patients; by early, we mean the first few hours of admission (i.e. 6 h). We selected the first 6 h because it is clear from Figure 1(a) that the percentage of available measurements increases markedly at the 6-h threshold. In addition, after consulting several intensivists and considering gaps in the literature, it appeared that the 6-h threshold is a sensible time point for analysing patient data, balancing the need for information early in the admission against data adequacy.

Figure 3.

Proposed framework of an early mortality prediction model in the ICU.

In this study, we used the MIMIC-II⁴⁰ database for analysis and modelling. In preparing the data for use, an extensive examination of data variables was conducted, which meant making a variety of choices and assumptions. Only patients with a single ICU stay at the age of 16 years old and above in MICU, SICU or CSICU are considered in the analysis; this cohort included 11,722 patients. Also patient mortality is defined as death inside the hospital.

The structure of data in the MIMIC-II database had to undergo some initial preprocessing and conversion in order to prepare it for use in this study, as shown in Figure 3. We initially combined patient chart and lab test variables in one relational database in order to facilitate variable extraction. Data extraction was conducted in two stages, as shown in Figure 4. In the first stage, we extracted all variables for all ICU patients in the database; then in the second stage, we filtered the variables based on the required time window, which is the first 6 h of a patient’s admission. The final process of attribute selection was based on three main criteria: (1) attribute coverage (measured for more than 10 per cent of the patients), (2) expertise of ICU consultants and (3) proposed variables from previous literature, as shown in Figure 4. We calculated the coverage of each chart attribute and lab test for patients within the first 6 h to select those variables/tests with high coverage. We ignored only attributes with coverage below 10 per cent. This explains why some common variables in the literature might not be included in this study, as they had low coverage in the first 6 h of admission. In addition to the initial statistical experiments on the chart attributes and lab tests, direct consultation with subject matter experts in intensive care medicine, data proposed in previous work and DM algorithms were also considered in attribute selection.
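The coverage criterion can be sketched as a simple filter. The input format (a mapping from attribute name to the set of patients with at least one value in the first 6 h), the function name `select_by_coverage` and the example attributes and counts are assumptions for illustration; the text does not specify how an attribute at exactly 10 per cent was treated, so the strict comparison below is also an assumption.

```python
def select_by_coverage(measured_patients, n_patients, min_coverage=0.10):
    """Keep attributes measured for more than `min_coverage` of patients.

    measured_patients maps attribute name -> set of patient ids with at least
    one value inside the time window (here, the first 6 h of admission).
    """
    return sorted(attr for attr, pids in measured_patients.items()
                  if len(pids) / n_patients > min_coverage)


# Made-up coverage counts for a cohort of 100 patients.
measured = {
    "HR": set(range(95)),       # 95% coverage, kept
    "GCS": set(range(60)),      # 60% coverage, kept
    "Lactate": set(range(8)),   # 8% coverage, dropped (hypothetical attribute)
}
print(select_by_coverage(measured, n_patients=100))
```

Coverage is only the first of the three criteria; the surviving attributes would still be reviewed against clinical expertise and the literature.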

Figure 4.

Data extraction process.

The following section discusses thoroughly which attributes are considered in this study. Finally, we extracted the values (maximum and/or minimum) for each patient variable within the required time window (6 h after admission). It is important to note that there are several methods for selecting variable values. Each variable may have more than one value within the specified time window. For example, HR may have been measured seven times within the first few hours of admission. In this case, the minimum and maximum HR values within the specified time window are both considered, as very low or very high HR values indicate severity. On the contrary, some variables are one-directional, such as GCS, for which only the minimum value indicates severity of illness; that is why the maximum value of the variable is excluded from our attribute selection. However, in the case of RespRate, for instance, only the maximum value is considered, as it indicates a more critical patient condition than low RespRate. In summary, we used three strategies for value selection of the attributes: (1) minimum value, (2) maximum value or (3) minimum and maximum values.
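The three value-selection strategies can be sketched as follows, using only the three variables named above as examples. The `STRATEGY` table, the feature naming (`HR_min`, `HR_max` and so on), the reading format and the example values are illustrative assumptions rather than the study's actual extraction code.

```python
# Which summary values to keep per variable, following the examples in the text.
STRATEGY = {"HR": ("min", "max"), "GCS": ("min",), "RespRate": ("max",)}


def summarise_window(readings, window_hours=6):
    """Reduce (hour, attribute, value) readings to min/max features.

    Readings taken after `window_hours` or for unlisted attributes are ignored.
    """
    values = {}
    for hour, attr, value in readings:
        if attr in STRATEGY and hour <= window_hours:
            values.setdefault(attr, []).append(value)
    features = {}
    for attr, vals in values.items():
        for stat in STRATEGY[attr]:
            features[f"{attr}_{stat}"] = min(vals) if stat == "min" else max(vals)
    return features


# Made-up readings for one patient: (hours since admission, attribute, value).
readings = [
    (0.5, "HR", 88), (2.0, "HR", 132), (4.0, "HR", 61),
    (1.0, "GCS", 14), (3.0, "GCS", 9),
    (1.5, "RespRate", 18), (5.0, "RespRate", 27),
    (8.0, "HR", 150),  # outside the 6 h window, so ignored
]
print(summarise_window(readings))
```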

Following attribute selection, we used two methods to handle missing values in the data: (1) replacing missing values with the mean (Rep1) and (2) replacing missing values using EMImputation (Rep2). SMOTE³⁹ was used to handle the issue of class imbalance. SMOTE is one of the most effective and widely used oversampling techniques and has been used in several works in the literature^41–46 to handle the class imbalance problem. SMOTE increases the number of records of patients who die inside the hospital (the minority class) by inserting synthetic patient records. We employed the RF, PART and BN algorithms to build the models. The models were built iteratively using different attribute selections, as shown in Figure 3 and discussed in the following section.
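To make the two preprocessing steps concrete, the sketch below shows mean replacement and a simplified form of SMOTE-style oversampling. It is an illustration only, not the Weka implementation used in this study: the function names, the random choice of base record and the toy values are assumptions, and EMImputation is not shown.

```python
import math
import random


def impute_mean(rows):
    """Replace each None with the mean of the observed values in its column.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority records by interpolation (simplified SMOTE).

    Each synthetic record lies on the line segment between a randomly chosen
    minority record and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority records")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: math.dist(base, m),
        )[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (t - b) for b, t in zip(base, target)])
    return synthetic


# Toy records: [HR_max, GCS_min], with None marking a missing value.
rows = [[80.0, None], [100.0, 14.0], [None, 10.0]]
imputed = impute_mean(rows)
print(imputed)  # [[80.0, 12.0], [100.0, 14.0], [90.0, 10.0]]

# Hypothetical minority-class (in-hospital death) records after imputation.
minority = [[120.0, 6.0], [130.0, 5.0], [110.0, 7.0]]
synthetic = smote(minority, n_synthetic=4, k=2)
```

Because every synthetic record is an interpolation between two real minority records, it always stays within the range of values already observed for that class.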

Selected attributes

We selected 33 chart attributes and 25 lab tests from the initially identified attributes with high coverage. Attributes with higher coverage were considered, resulting in a total of 20 unique variables (age, Temp, HR, RespRate, SysABP, PaO2, GCS, creatinine, fractional inspired oxygen, serum urea nitrogen, potassium, sodium, haematocrit, white blood cells, blood clotting – International Normalized Ratio (INR), platelets count, bilirubin, AIDS, metastatic cancer and type of admission); 29 if we count maximums and minimums.

Results

This section presents the results for the top performing EMPICU DM models – EMPICU-RF, EMPICU-DTs, EMPICU-NB and EMPICU-PART. It is important to note here that we have also evaluated a larger set of algorithms, such as SVMs and JRip; however, they were outperformed by the reported methods. DTs are extremely fast at classifying unknown records. They are quite robust in the presence of noise. They also provide a clear indication of which fields are most important for prediction. The NB algorithm affords fast, highly scalable model building and scoring. It scales linearly with the number of predictors and rows.⁴⁷

We conducted three different experiments. In the first experiment, we used all original 20 attributes and in the second we used only vital signs (age in addition to Temp, HR, RespRate, SysABP, PaO2, GCS and creatinine). We evaluated each of the four DM algorithms on each of six versions of the dataset mentioned earlier.

In the third experiment, we used filtered top 10 attributes that provide the highest information gain (IG) (i.e. those variables that contribute to better classification); we eliminated records missing any of the top 10 attributes. The InfoGainAttributeEval algorithm in Weka³⁸ evaluates the worth of an attribute by measuring the IG with respect to the class. We used four versions of the dataset:

Dataset with eliminated records and the 20 unique variables (original),

Dataset with eliminated records and the 20 unique variables, followed by SMOTE (original + smote),

Dataset with eliminated records and the top filtered ranked variables only (filtered top 10), and

Dataset with eliminated records and the top filtered ranked variables only, followed by SMOTE (filter + smote).
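The IG ranking used to pick the top 10 attributes can be sketched as below. This is a minimal illustration of the information gain definition, IG = H(class) - H(class | attribute), and not Weka's InfoGainAttributeEval itself; it assumes attributes that are already discrete, and the attribute names and values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """IG of a discrete attribute with respect to the class."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k(attributes, labels, k):
    """Return the k attribute names with the highest IG."""
    ranked = sorted(
        attributes,
        key=lambda name: information_gain(attributes[name], labels),
        reverse=True,
    )
    return ranked[:k]


labels = ["died", "died", "survived", "survived"]
attributes = {
    "gcs_low": [1, 1, 0, 0],            # separates the classes perfectly
    "admitted_at_night": [1, 0, 1, 0],  # carries no information about the class
}
ig_gcs = information_gain(attributes["gcs_low"], labels)
ig_night = information_gain(attributes["admitted_at_night"], labels)
best = top_k(attributes, labels, k=1)
print(ig_gcs, ig_night, best)
```

In the toy data the first attribute has an IG of 1 bit and the second an IG of 0, so only the first would survive a top-1 filter.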

All experiments were done using Weka (Version 3.7.13). The results are noted in AUROC of the average of 10 runs; each run is 10-fold cross-validated. Table 3 ranks the experiments that showed the best performance (highest AUROC) using the best performing model, RF.

Table 3.

The table shows the top AUROC results shown in Table 1

Attributes	Experiment	AUROC
VS attributes	Rep1 + Smote	0.90 ± 0.01
Top 10 attributes	Original	0.89 ± 0.02
Top 10 attributes	Original + Smote	0.89 ± 0.02
Top 10 attributes	Filter	0.87 ± 0.03
Top 10 attributes	Filter + Smote	0.87 ± 0.03
All attributes	Rep1 + Smote	0.85 ± 0.02
All attributes	Rep1	0.85 ± 0.01
All attributes	Rep2	0.84 ± 0.02
All attributes	Rep2 + Smote	0.84 ± 0.02

AUROC: area under the receiver operating characteristic curve; DM: data mining; EMPICU: Early hospital mortality prediction for ICU patients; VS: vital signs.
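For readers less familiar with the metric, the sketch below shows one way an AUROC value and a mean ± spread over repeated runs can be computed. It is illustrative only: it does not reproduce Weka's evaluation, the scores and run values are hypothetical, and treating the ± figure as a sample standard deviation is an assumption.

```python
import statistics


def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative.

    `labels` uses 1 for the positive class (in-hospital death) and 0 otherwise;
    tied scores count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def summarise_runs(run_aurocs):
    """Mean and sample standard deviation of per-run AUROC values."""
    return statistics.mean(run_aurocs), statistics.stdev(run_aurocs)


# Hypothetical predicted risks for five patients, two of whom died.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
value = auroc(scores, labels)  # 5 of 6 positive-negative pairs ranked correctly

# Hypothetical AUROC values from three repeated cross-validation runs.
mean_auc, sd_auc = summarise_runs([0.88, 0.90, 0.92])
print(f"{value:.3f}", f"{mean_auc:.2f} ± {sd_auc:.2f}")
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation of patients who died from those who survived.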

Results’ discussion

Referring to Figure 1(a), when comparing the performance of all algorithms on the Yes class (patients at risk of dying inside the hospital) per hour during the first 48 h of ICU admission, we find an abrupt improvement in performance at the 6th hour of ICU admission, after which the increase in performance is relatively small until the 48th hour. Similarly, the percentage of available values in the dataset increases dramatically at the 6th hour of ICU admission and continues to increase gradually until the 48th hour. In general, as time proceeds, the performance of the RF, BN and PART models increases, as shown in Figure 1(a) and Table 1. As displayed in Table 1, replacing the missing values with the mean (rep1) and replacing them with EMImputation (rep2) gave almost the same performance. In addition, the SMOTE oversampling technique did not enhance the classification performance.

In contrast, as shown in Table 3, when comparing the performance of all three experiments in the proposed early mortality prediction models (EMPICU), applying the SMOTE oversampling technique in general significantly enhances the classification performance. Replacing the missing values with the mean (rep1) and replacing them with EMImputation (rep2) again gave almost the same performance. We also find that the prediction performance is better when using the vital signs or the filtered top 10 attributes than when using all 20 unique attributes. Within the top 10 experiment categories, the models developed with the original attributes (without any filtering) performed better than those with filtering. Among the experiments without filtering, top 10 (original) and (original + smote) performed best (AUROC = 0.89 ± 0.02). Among the filtered experiments, top 10 (filter) and (filter + smote) performed equally (AUROC = 0.87 ± 0.03).

Conclusion

The ICU is an information-rich environment, uniquely suited to data analysis. Several scoring systems and DM methods have been developed to predict clinical deterioration and mortality in the ICU. However, most of these methods are designed for prediction after 1 or more days of admission. To our knowledge, there have been no definitive studies comparing mortality prediction per hour during the first 48 h of a patient’s admission in order to indicate to clinicians the ideal time for ICU data analysis. This article aims to draw the attention of the medical and data science communities to the importance of time-series analysis in the ICU, taking into consideration the challenge of missing values in early patient data. This research evaluated a wide range of DM methods on 4000 patients (from the MIMIC-II database). We acknowledge that the specific findings are particular to this database, but the methodology we have used is transferable. We intend to validate this work on the MIMIC-III⁴⁸ database, which was released in August 2015, 1 year after this research project had started.

From a DM perspective, the best performing model in this study is the EMPICU-RF, followed by EMPICU-BN and EMPICU-PART. In all experiments, EMPICU-RF performed significantly better than EMPICU-BN and EMPICU-PART (at a 5% significance level). As mentioned earlier, other algorithms were tested, such as DTs, SVM and JRip; however, their performance was relatively poor. This finding supports the work of Ramon et al.,³¹ who reported that a DT yielded a smaller AUROC than an RF (DT, 65%; first-order RF, 81%).

Our results show the following:

There is a sharp improvement in performance at the 6th hour of ICU admission, after which the increase in performance is relatively smaller till the 48th hour of ICU admission.

The percentage of missing values in the dataset drops dramatically at the 6th hour of ICU admission and continues to decrease gradually till the 48th hour of ICU admission.

The discrimination power of the machine-learning classification methods after 6 h of admission outperformed the main scoring systems used in intensive care medicine (APACHE, SAPS-I and SOFA) after 24 h of admission. The best performing classifier was RF, followed by BN, then PART on different experimental settings.

Both replacing the missing values with mean (rep1) and replacing the missing values with EMImputation (rep2) gave almost similar performance results.

SMOTE oversampling technique did not enhance the classification performance when the dataset was 4000 patients only, while it did enhance the classification performance with the larger dataset of 11,722 patients.

For clinicians, this research draws attention to the problem of missing values in variables over time in order to emphasize the importance of collecting certain measurements early on, as this will influence the predictive performance of mortality prediction models. While we fully acknowledge that we have not developed a usable clinical tool in this work, we have shown that a rich information signal exists early in a critical care admission, which can provide guidance about likely individual outcome. We have shown this on a database with incomplete data. It is our view that this signal may in future be further strengthened by refinements to the methodology we have used, in order to assist both clinicians and patients in early outcome prediction.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: A PhD scholarship from the Arab Academy for Science and Technology, Egypt.

ORCID iD

Mohamed Bader-El-Den

References

1. Luo, Xin, Joshi, et al. Predicting ICU mortality risk by grouping temporal trends from a multivariate panel of physiologic measurements. In: Proceedings of the thirtieth AAAI conference on artificial intelligence, Phoenix, AZ, 12–17 February 2016.

2. Celi, Galvin, Davidzon, et al. A database-driven decision support system: customized mortality prediction. J Pers Med 2012; 2(4): 138–148.

3. Pirracchio, Petersen, Carone, et al. Mortality prediction in intensive care units with the Super ICU Learner Algorithm (SICULA): a population-based study. Lancet Resp Med 2015; 3(1): 42–52.

4. Ribas, López, Ruiz-Sanmartín, et al. Severe sepsis mortality prediction with relevance vector machines. In: Proceedings of the annual international conference of the IEEE in Engineering in Medicine and Biology Society (EMBC), Boston, MA, 30 August–3 September 2011, pp. 100–103. New York: IEEE.

5. Kim, Park RW. A comparison of intensive care unit mortality prediction models through the use of data mining techniques. Healthc Inform Res 2011; 17(4): 232–243.

6. Walker, Kadam. Predicting breast cancer survivability: a comparison of three data mining methods. Artif Intell Med 2005; 34(2): 113–127.

7. Crawford, Batuello, Snow, et al. The use of artificial intelligence technology to predict lymph node spread in men with clinically localized prostate carcinoma. Cancer 2000; 88(9): 2105–2109.

8. Le Gall J-R, Loirat, Alperovitch, et al. A Simplified Acute Physiology Score for ICU patients. Crit Care Med 1984; 12(11): 975–977.

9. Knaus, Draper, Wagner, et al. APACHE II: a severity of disease classification system. Crit Care Med 1985; 13(10): 818–829.

10. Lemeshow, Teres, Klar, et al. Mortality Probability Models (MPM II) based on an international cohort of intensive care unit patients. JAMA 1993; 270(20): 2478–2486.

11. Vincent J-L, Moreno, Takala, et al. The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. Intens Care Med 1996; 22(7): 707–710.

12. Le Gall J-R, Lemeshow, Saulnier. A new Simplified Acute Physiology Score (SAPS II) based on a European/North American multicenter study. JAMA 1993; 270(24): 2957–2963.

13. Vincent J-L, De Mendonça, Cantraine, et al. Use of the SOFA score to assess the incidence of organ dysfunction/failure in intensive care units: results of a multicenter, prospective study. Crit Care Med 1998; 26(11): 1793–1800.

14. Poole, Rossi, Latronico, et al. Comparison between SAPS II and SAPS 3 in predicting hospital mortality in a cohort of 103 Italian ICUs. Is new always better? Intens Care Med 2012; 38(8): 1280–1288.

15. Rosenberg AL. Recent innovations in intensive care unit risk-prediction models. Curr Opin Crit Care 2002; 8(4): 321–330.

16. Vincent J-L, Singer. Critical care: advances and future perspectives. Lancet 2010; 376(9749): 1354–1361.

17. Gilani, Razavi, Azad AM. A comparison of Simplified Acute Physiology Score II, Acute Physiology and Chronic Health Evaluation II and Acute Physiology and Chronic Health Evaluation III scoring system in predicting mortality and length of stay at surgical intensive care unit. Nigerian Med J 2014; 55(2): 144.

18. Knaus, Wagner, Draper, et al. The APACHE III prognostic system. Risk prediction of hospital mortality for critically ill hospitalized adults. Chest J 1991; 100(6): 1619–1636.

19. Le Gall, Neumann, Hemery, et al. Mortality prediction using SAPS II: an update for French intensive care units. Crit Care 2005; 9(6): R645.

20. Metnitz, Schaden, Moreno, et al. Austrian validation and customization of the SAPS 3 admission score. Intens Care Med 2009; 35(4): 616–622.

21. Moreno, Metnitz PGH, Almeida, et al. SAPS 3 – from evaluation of the patient to evaluation of the intensive care unit. Part 2: development of a prognostic model for hospital mortality at ICU admission. Intens Care Med 2005; 31(10): 1345–1355.

22. Hoogendoorn, El Hassouni, Mok, et al. Prediction using patient comparison vs. modeling: a case study for mortality prediction. In: Proceedings of the 38th annual international conference of the Engineering in Medicine and Biology Society (EMBC), Orlando, FL, 16–20 August 2016, pp. 2464–2467. New York: IEEE.

23. Awad, Bader-El-Den, McNicholas. Patient length of stay and mortality prediction: a survey. Health Serv Manage Res 2017; 30(2): 105–120.

24. Awad, Bader-El-Den, McNicholas, et al. Early hospital mortality prediction of intensive care unit patients using an ensemble learning approach. Int J Med Inform 2017; 108: 185–195.

25. Zimmerman, Kramer, McNair, et al. Acute Physiology and Chronic Health Evaluation (APACHE) IV: hospital mortality assessment for today’s critically ill patients. Crit Care Med 2006; 34(5): 1297–1310.

26. Kramer, Higgins, Zimmerman JE. Comparison of the Mortality Probability Admission Model III, National Quality Forum, and Acute Physiology and Chronic Health Evaluation IV hospital mortality models: implications for national benchmarking. Crit Care Med 2014; 42(3): 544–553.

27. Higgins, Teres, Copes, et al. Assessing contemporary intensive care unit outcome: an updated Mortality Probability Admission Model (MPM0-III). Crit Care Med 2007; 35(3): 827–835.

28. Lemeshow, Teres, Pastides, et al. A method for predicting survival and mortality of ICU patients using objectively derived weights. Crit Care Med 1985; 13(7): 519–525.

29. Calvert, Mao, Hoffman, et al. Using electronic health record collected clinical variables to predict medical intensive care unit mortality. Ann Med Surg 2016; 11: 52–57.

30. Sadeghi, Banerjee, Romine. Early hospital mortality prediction using vital signals, 2018, https://arxiv.org/abs/1803.06589

31. Ramon, Fierens, Güiza, et al. Mining data from intensive care patients. Adv Eng Inform 2007; 21(3): 243–256.

32. Yakovlev, Metsker, Kovalchuk, et al. Prediction of in-hospital mortality and length of stay in acute coronary syndrome patients using machine-learning methods. J Am Coll Cardiol 2018; 71(11): A242.

33. Citi, Barbieri. Physionet 2012 challenge: predicting mortality of ICU patients using a cascaded SVM-GLM paradigm. In: Proceedings of the computing in cardiology (CINC), Krakow, 9–12 September 2012, pp. 257–260. New York: IEEE.

34. Davoodi, Moradi MH. Mortality prediction in intensive care units (ICUs) using a deep rule-based fuzzy classifier. J Biomed Inform 2018; 79: 48–59.

35. Lee, Maslove DM. Customization of a severity of illness score using local electronic medical record data. J Intensive Care Med 2017; 32(1): 38–47.

36. Berry, Linoff. Data mining techniques: for marketing, sales, and customer support. New York: John Wiley & Sons, Inc., 1997.

37. Perry, Bader-El-Den, Cooper. Imbalanced classification using genetically optimized cost sensitive classifiers. In: Proceedings of the IEEE congress on evolutionary computation (CEC), Sendai, Japan, 25–28 May 2015, pp. 680–687. New York: IEEE.

38. Hall, Frank, Holmes, et al. The WEKA data mining software: an update. ACM SIGKDD Explor Newsl 2009; 11(1): 10–18.

39. Chawla, Bowyer, Hall, et al. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 2002; 16: 321–357.

40. Saeed, Villarroel, Reisner, et al. Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): a public-access intensive care unit database. Crit Care Med 2011; 39(5): 952.

41. Imbalanced classification based on active learning SMOTE. Res J Appl Sci Eng Technol 2013; 5: 944–949.

42. Bader-El-Den. Self-adaptive heterogeneous random forest. In: Proceedings of the 2014 IEEE/ACS 11th international conference on computer systems and applications (AICCSA), Doha, Qatar, 10–13 November 2014, pp. 640–646. New York: IEEE.

43. Blagus, Lusa LL. SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics 2013; 14(1): 106.

44. Wang, et al. Classification of imbalanced data by using the SMOTE algorithm and locally linear embedding. In: Proceedings of the 2006 8th international conference on signal processing, Beijing, China, 16–20 November 2006, vol. 3.

45. Bader-El-Den, Teitei, Perry. Biased random forest for dealing with the class imbalance problem. IEEE T Neur Net Lear 2019; 30(7): 2163–2172.

46. Mohasseb, Bader-El-Den, Cocea. Question categorization and classification using grammar based approach. Inform Process Manag 2018; 54(6): 1228–1243.

47. Hill, Lewicki. Statistics: methods and applications: a comprehensive reference for science, industry, and data mining. Tulsa, OK: StatSoft, Inc., 2006.

48. Johnson, Pollard, Shen, et al. MIMIC-III, a freely accessible critical care database. Sci Data 2016; 3: 160035.

Predicting hospital mortality for intensive care unit patients: Time-series analysis
