Abstract

We compare the performance of logistic regression with several alternative machine learning methods to estimate the risk of death for patients following an emergency admission to hospital based on the patients’ first blood test results and physiological measurements using an external validation approach. We trained and tested each model using data from one hospital (n = 24,696) and compared the performance of these models in data from another hospital (n = 13,477). We used two performance measures – the calibration slope and area under the receiver operating characteristic curve. The logistic model performed reasonably well – calibration slope: 0.90, area under the receiver operating characteristic curve: 0.847 compared to the other machine learning methods. Given the complexity of choosing tuning parameters of these methods, the performance of logistic regression with transformations for in-hospital mortality prediction was competitive with the best performing alternative machine learning methods with no evidence of overfitting.

Keywords

classification and prediction, computationally intensive methods, databases and data mining, electronic health records, modelling healthcare services, statistical modelling

Introduction

Several predictive models are in widespread use to predict the risk of death for patients in hospital. Prominent examples include Acute Physiology and Chronic Health Evaluation (APACHE II),¹ Mortality Probability Model (MPM II)² and the Physiological and Operative Severity Score for the enUmeration of Mortality and morbidity (POSSUM).³

The development of such risk prediction models is less than straightforward, involving a number of important modelling choices,⁴ which require consideration of candidate covariates (e.g. the patients’ age, gender and comorbidities), the linearity of covariates, interaction effects and the choice of model (e.g. logistic regression).⁵ Model development is usually guided by a number of model diagnostics and performance statistics such as model calibration and model discrimination.⁶

Our motivation stems from attempting to predict the risk of dying for acutely ill patients who are admitted to hospital as unplanned or emergency medical admissions.⁷ The response variable is whether the patient died in hospital (yes/no) and the covariate set is based on previous work⁸ which identified the patients’ routine blood tests (seven blood tests, see later) and National Early Warning Score (NEWS;⁹ see later) as appropriate predictor variables along with the patients’ age (years) and gender (male/female).

A fundamental issue is the choice of model. Here we consider the more traditional approach (which tends to produce models which are more understandable by humans) versus more modern machine learning approaches: (1) logistic regression without transformations of continuous covariates (LOGIT), (2) logistic regression with transformations of continuous covariates (LOGIT†), (3) logistic regression with multivariable fractional polynomials (MFP),¹⁰ (4) logistic regression with restricted cubic splines (RCS) for continuous covariates,⁵ (5) recursive partitioning and regression trees (RPART),¹¹ (6) random forest (RF),¹² (7) generalised boosted regression modelling (GBM),¹³ (8) support vector machine (SVM)¹⁴ and (9) neural network (NNET).¹⁴

The rationale for investigating these alternative approaches is as follows. Logistic regression is widely used in medical applications and the model coefficients can be interpreted as odds ratios (and, using a modified approach, as risk ratios) which are clinically meaningful.⁴ For the logistic model, covariates can be included on the untransformed scale, with transformation, and with/without the aid of RCS (which are advocated for use with continuous covariates^5,15) and MFP (which has also been advocated for continuous covariates¹⁶). Furthermore, modern statistical machine learning methods have been advocated by several authors,^17–25 including decision trees, boosted models, SVM and NNET.

For this study, we consider logistic regression with/without transformation as being the more traditional approach and the use of RCS, MFP, RF, RPART, GBM and SVM as the more modern computationally intensive approaches.

Our aim is to compare the above modelling strategies and identify the model with the best performance statistics using external model validation to assess the performance of these models in terms of calibration and discrimination. The use of external validation to make these comparisons has become an important methodological development.^6,26

As prediction models inform patients and carers about prognosis, it is essential that predictions should be well calibrated.⁶ While interest in the development and validation of prediction models in clinical settings is growing, only a quarter of the studies reported prediction models with internal and external validation.^27,28 Usually, internal validation is done by splitting the development data into training and testing sets; however, cross-validation and bootstrapping can also be used.²⁹ External validation aims to address the performance of a model in patients from a different but possibly related setting, and it is a key step before disseminating the prediction model in clinical settings.^26,30

For discrimination, we use the area under the receiver operating characteristic (ROC) curve (AUC) or concordance (c)-statistic. The AUC is the probability that the model will predict a higher risk of death for a randomly selected patient who died, compared to a randomly selected patient who survived. Calibration is the relationship between the observed and predicted risk of death and can be usefully seen on a scatter plot (y-axis: observed risk, x-axis: predicted risk). Perfect predictions should be on the 45° line. The intercept ($a$) and slope ($b$) of this line give an assessment of ‘calibration-in-the-large’.⁶ In model development, $a = 0$ and $b = 1$, but, in external validation, calibration-in-the-large problems are indicated if $a$ is not 0 and if $b$ is more/less than 1 as this reflects problems of under/overprediction. Specifically, for each modelling strategy, we determined the AUC or c-statistic, the scaled Brier score and the Hosmer–Lemeshow (HL) deciles of risk goodness-of-fit test.
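The pairwise definition of the AUC given above can be made concrete with a minimal Python sketch. This is an illustration only, not the pROC implementation used in the study; the function name `concordance_auc`, the half-credit rule for tied risks and the toy data are assumptions introduced here.

```python
from itertools import product


def concordance_auc(risks, died):
    """AUC as the proportion of (died, survived) patient pairs in which
    the patient who died received the higher predicted risk.
    Tied risks count as half a concordant pair (an assumed convention)."""
    dead = [r for r, d in zip(risks, died) if d == 1]
    alive = [r for r, d in zip(risks, died) if d == 0]
    if not dead or not alive:
        raise ValueError("need at least one death and one survivor")
    score = 0.0
    for r_dead, r_alive in product(dead, alive):
        if r_dead > r_alive:
            score += 1.0
        elif r_dead == r_alive:
            score += 0.5
    return score / (len(dead) * len(alive))


# Hypothetical toy data, not taken from the study.
toy_risks = [0.02, 0.10, 0.35, 0.40, 0.80]
toy_died = [0, 0, 1, 0, 1]
toy_auc = concordance_auc(toy_risks, toy_died)  # 5 of 6 pairs are concordant
```

The loop visits every pair, so its cost grows with the product of the number of deaths and survivors; rank-based formulas give the same value more efficiently on cohorts of the size used here.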

Materials and methods

Data set

Our cohorts of emergency admissions are from two acute hospitals, approximately 100 km apart in the Yorkshire & Humberside region of England: the Diana, Princess of Wales Hospital (managed by the Northern Lincolnshire and Goole NHS Foundation Trust (NLAG)) and York Hospital (managed by York Teaching Hospitals NHS Foundation Trust). All adult (age > 16 years) emergency admissions during the year 2014 (i.e. 1 January to 31 December 2014) were included. We obtained the following information for each admission: the patients’ age, gender and discharge status (alive/dead). We considered only admissions with no missing data. We excluded 5138 (17%) admissions for NLAG Hospital and 4267 (24%) admissions for York Hospital because of incomplete data (albumin and creatinine test results were the most frequently missing; Table 1). The covariate set comprised age (years), gender (male/female), albumin (g/L), creatinine (µmol/L), haemoglobin (g/dL), potassium (mmol/L), sodium (mmol/L), white cell count (10⁹ cells/L), urea (mmol/L) and NEWS. The NEWS ranged from 0 (indicating the lowest severity of illness) to 19 (the maximum NEWS value possible is 20).

Table 1.

Characteristics of the emergency admissions in the two hospitals.

Characteristic	Development data (NLAG Hospital)	Validation data (York Hospital)
N	24,696	13,477
Died	1159 (4.7%)	876 (6.5%)
Male	11,571 (46.9%)	6413 (47.6%)
Mean age [years] (SD)	63.1 (21.1)	68.3 (19.2)
Mean NEWS [1–19] (SD)	1.9 (2.1)	2.6 (2.6)
Mean albumin [g/L] (SD)	34 (6.2)	38 (5.8)
Mean creatinine [µmol/L] (SD)	100.1 (75.2)	104 (93.7)
Mean haemoglobin [g/dL] (SD)	128.8 (21.7)	125.1 (22.1)
Mean potassium [mmol/L] (SD)	4.1 (0.6)	4.3 (0.6)
Mean sodium [mmol/L] (SD)	137 (4.7)	136.7 (4.7)
Mean white cell count [10⁹ cells/L] (SD)	9.8 (5.1)	10.3 (7.2)
Mean urea [mmol/L] (SD)	7.3 (5.7)	8.2 (6.1)

NLAG: Northern Lincolnshire and Goole NHS Foundation Trust; SD: standard deviation; NEWS: National Early Warning Score.

Statistical analysis

We started with an exploratory analysis of the NEWS and the blood test results.

We truncated extreme observations of blood test results (very high (>99.9%) or very low (<0.1%)) to moderate the influence of outliers in the modelling process. We excluded admissions with incomplete data: 17 per cent (5138/29,834) for NLAG Hospital and 24 per cent (4267/17,744) for York Hospital. We produced scatter plots showing the relationship between mortality and continuous covariates (grouped into deciles). We modelled the risk of death using the same set of covariates in every model: age, gender, albumin, creatinine, haemoglobin, potassium, sodium, white cell count, urea and NEWS.
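The truncation step can be sketched as follows, assuming that the thresholds quoted above are percentiles of each blood test's distribution and that values beyond them are clamped to the threshold rather than dropped. The function name `truncate_extremes` and the linear-interpolation percentile rule are assumptions made for illustration; the article does not state which percentile definition was used.

```python
def truncate_extremes(values, lower_pct=0.1, upper_pct=99.9):
    """Clamp values below the lower percentile or above the upper
    percentile to those percentiles (a form of winsorising)."""
    ordered = sorted(values)
    n = len(ordered)

    def percentile(pct):
        # Linear interpolation between the two closest ranks.
        pos = (n - 1) * pct / 100.0
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        return ordered[lo] + frac * (ordered[hi] - ordered[lo])

    low, high = percentile(lower_pct), percentile(upper_pct)
    return [min(max(v, low), high) for v in values]


# Hypothetical toy data with one extreme value; wider cut-offs are used
# here only so that the effect is visible on eleven observations.
toy_values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
toy_truncated = truncate_extremes(toy_values, lower_pct=10, upper_pct=90)
```

Clamping keeps every admission in the analysis while limiting the leverage of implausible laboratory values.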

We used the qladder function (Stata version 13), which displays the quantiles of a transformed variable against the quantiles of a normal distribution according to the ladder of powers $(x^{3}, x^{2}, x^{1}, x, \sqrt{x}, \log (x), x^{- 1}, x^{- 2}, x^{- 3})$ for each variable $x$. We randomly divided our development data (NLAG Hospital) into a training set (70%, n = 17,288) and a testing set (30%, n = 7408) for internal model validation.⁴ We further validated these models on an external validation dataset from York Hospital. Three commonly used performance measures were used to assess model performance: HL test, scaled Brier score and AUC.⁶ The 95% confidence interval (CI) for the c-statistic was derived using DeLong’s method as implemented in the pROC library.³¹ Discrimination relates to how well a model can separate (or discriminate between) those who died and those who did not. Calibration relates to the agreement between observed mortality and predicted risk. Overall statistical performance was assessed using the scaled Brier score, which incorporates both discrimination and calibration.⁴ The Brier score is the mean squared difference between actual outcomes and predicted risk of death; it is scaled by the maximum Brier score such that the scaled Brier score ranges from 0 to 100 per cent. Higher values indicate superior models.
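A minimal sketch of the scaled Brier score follows. It assumes the common convention in which the maximum Brier score is that of a non-informative model predicting the observed death rate for every patient; the article does not spell out the reference it used, and the function name `scaled_brier` is introduced here for illustration.

```python
def scaled_brier(risks, died):
    """Scaled Brier score: 1 - Brier / Brier_max, where Brier_max is the
    Brier score of a model that predicts the observed prevalence for
    every patient (an assumed reference model)."""
    n = len(died)
    brier = sum((p - y) ** 2 for p, y in zip(risks, died)) / n
    prevalence = sum(died) / n
    brier_max = prevalence * (1 - prevalence)
    if brier_max == 0:
        raise ValueError("outcome has no variation; score is undefined")
    return 1 - brier / brier_max


# Hypothetical toy outcome: one death among four admissions.
toy_outcome = [0, 0, 0, 1]
uninformative = scaled_brier([0.25, 0.25, 0.25, 0.25], toy_outcome)
perfect = scaled_brier([0.0, 0.0, 0.0, 1.0], toy_outcome)
```

Under this convention a model that only reproduces the overall death rate scores 0 and a model with perfect predictions scores 1 (100 per cent), which matches the direction of interpretation given above.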

These analyses were undertaken in R.³² We used default tuning parameters for MFP in the R package mfp³³ but RCS with three knots in the R package rms.³⁴ We used the caret R package³⁵ for the machine learning algorithms (RPART, RF, GBM, SVM, NNET) and optimised their tuning parameters, with AUC as the optimisation criterion, using 10-fold cross-validation repeated five times (see Supplementary Material). We used a linear kernel with two parameters (i.e. cost and gamma) for the SVM method.
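The resampling scheme behind the tuning step, 10-fold cross-validation repeated five times, can be sketched in Python as below. This is not the caret implementation: the generator name `repeated_kfold` is hypothetical, and the folds here are formed by simple shuffling without stratifying on the outcome, which is an assumption made to keep the example short.

```python
import random


def repeated_kfold(n_rows, k=10, repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold
    cross-validation on row indices 0..n_rows-1 (unstratified)."""
    rng = random.Random(seed)
    for rep in range(repeats):
        indices = list(range(n_rows))
        rng.shuffle(indices)
        # Deal the shuffled rows into k folds of near-equal size.
        folds = [indices[i::k] for i in range(k)]
        for held_out in range(k):
            test_idx = folds[held_out]
            train_idx = [
                i
                for f, fold in enumerate(folds)
                if f != held_out
                for i in fold
            ]
            yield rep, held_out, train_idx, test_idx


# 5 repeats x 10 folds gives 50 train/test splits per candidate setting.
toy_splits = list(repeated_kfold(n_rows=25))
```

For each candidate set of tuning parameters, a model would be fitted on every `train_idx`, scored by AUC on the matching `test_idx`, and the setting with the best average AUC retained.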

Ethical approval

Although this type of study does not require ethical approval because it meets the exemption criteria (‘Research limited to secondary use of information previously collected in the course of normal care (without an intention to use it for research at the time of collection), provided that the patients or service users are not identifiable to the research team in carrying out the research’.³⁶), we obtained ethical approval for the main research project of which this is a substudy from Yorkshire & The Humber–Leeds West Research Ethics Committee (reference no. 15/YH/0348).

Results

There were 24,696 emergency admissions in the development data (NLAG Hospital) and 13,477 in the validation data (York Hospital). We further divided the development data into a training set (70%, n = 17,288) and a testing set (30%, n = 7408). For both hospitals, we have 12 months of data, with patient discharges from 1 January to 31 December 2014. Descriptive statistics for the covariates are shown in Table 1. The risk of death in NLAG was 4.7 per cent (1159/24,696) compared with 6.5 per cent (876/13,477) in York Hospital. Patients in NLAG Hospital had a mean age of 63.1 years compared with 68.3 years in York Hospital and a lower mean NEWS (1.9) compared to York Hospital (2.6).

Figure 1 shows boxplots of each covariate with respect to patient discharge status (alive/dead) in NLAG Hospital. In general, patients who died were older, had higher NEWS, lower albumin, higher creatinine, lower haemoglobin, higher potassium, higher urea, higher white cell counts and lower sodium levels.

Figure 1.

Boxplot without outliers for continuous covariates with respect to patient’s discharge status (alive/dead) in NLAG Hospital.

It can be seen from Figure 2 that the relationship between the continuous covariates and mortality in NLAG Hospital is generally non-linear. Using quantile–quantile (qq) plots, we arrived at the following transformations: $(\text{creatinine})^{-0.5}$, $\log(\text{potassium})$, $\log(\text{sodium})$, $\log(\text{white cell count})$ and $\log(\text{urea})$.
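Applied to a single admission, these transformations could look like the Python sketch below. The dictionary keys, the function name `transform_covariates` and the use of the natural logarithm are assumptions for illustration (the article does not state the base of the logarithm); covariates that were not transformed are passed through unchanged.

```python
import math


def transform_covariates(row):
    """Return a copy of one admission's covariates with the
    transformations used for the LOGIT-with-transformations model."""
    out = dict(row)
    out["creatinine"] = row["creatinine"] ** -0.5
    for name in ("potassium", "sodium", "white_cell_count", "urea"):
        out[name] = math.log(row[name])  # natural log assumed
    return out


# Hypothetical admission with values close to the Table 1 means.
toy_row = {
    "age": 63,
    "news": 2,
    "albumin": 34.0,
    "creatinine": 100.0,
    "haemoglobin": 128.8,
    "potassium": 4.1,
    "sodium": 137.0,
    "white_cell_count": 9.8,
    "urea": 7.3,
}
toy_transformed = transform_covariates(toy_row)
```

All five transformed covariates must be strictly positive for these functions to be defined, which holds for the blood test results described here.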

Figure 2.

Scatter plots showing the observed risk of death with continuous covariates in NLAG Hospital.

Statistical modelling results

We predicted the risk of in-hospital mortality using the following modelling approaches – LOGIT (no transformations), LOGIT† (with transformations), MFP, RCS, RPART, RF, GBM, SVM and NNET. The model performance statistics are shown in Table 2 and plotted in Figure 3.

Table 2.

Model performance statistics using regression methods for all models (LOGIT, LOGIT†, MFP, RCS, RPART, RF, GBM, SVM and NNET).

Model	Split	HL chi-square (df = 8)	HL p	Brier	AUC [95% CI]	Slope [95% CI]
LOGIT	Training	28.0	0.000	0.160	0.8832 [0.8736–0.8928]	−
LOGIT	Testing	15.3	0.053	0.111	0.8712 [0.8557–0.8868]	−
LOGIT	Validation	16.7	0.033	0.139	0.8470 [0.8351–0.8589]	0.90 [0.85–0.96]
LOGIT†	Training	20.1	0.010	0.162	0.8835 [0.8739–0.8931]	−
LOGIT†	Testing	14.9	0.061	0.118	0.8714 [0.8558–0.8871]	−
LOGIT†	Validation	11.9	0.155	0.149	0.8491 [0.8372–0.8610]	0.92 [0.84–1.00]
RCS	Training	19.0	0.015	0.169	0.8882 [0.8757–0.8947]	−
RCS	Testing	17.5	0.026	0.080	0.8715 [0.8560–0.8871]	−
RCS	Validation	27.3	0.001	0.138	0.8476 [0.8356–0.8596]	0.93 [0.85–1.02]
MFP	Training	19.1	0.014	0.164	0.8850 [0.8756–0.8945]	−
MFP	Testing	11.5	0.173	0.115	0.8714 [0.8559–0.8870]	−
MFP	Validation	16.1	0.041	0.145	0.8506 [0.8389–0.8624]	0.91 [0.86–0.97]
RPART	Training	0.0	1.000	0.261	0.8694 [0.8557–0.8830]	−
RPART	Testing	47.4	0.000	0.025	0.8137 [0.7898–0.8377]	−
RPART	Validation	65.7	0.000	0.048	0.7854 [0.7700–0.8007]	0.70 [0.65–0.75]
RF	Training	−	−	0.884	1.0000 [1.0000–1.0000]	−
RF	Testing	−	−	0.164	0.8569 [0.8397–0.8741]	−
RF	Validation	27.9	0.001	0.119	0.8044 [0.7899–0.8189]	0.93 [0.86–0.99]
GBM	Training	59.3	0.000	0.244	0.9058 [0.8968–0.9148]	−
GBM	Testing	30.0	0.000	0.116	0.8719 [0.8563–0.8875]	−
GBM	Validation	47.3	0.000	0.142	0.8483 [0.8365–0.8601]	0.99 [0.93–1.04]
SVM	Training	28.6	0.000	0.158	0.8840 [0.8744–0.8936]	−
SVM	Testing	15.9	0.041	0.131	0.8724 [0.8569–0.8880]	−
SVM	Validation	113.5	0.000	0.135	0.8470 [0.8351–0.8590]	0.89 [0.83–0.94]
NNET	Training	11.4	0.182	0.159	0.8842 [0.8747–0.8938]	−
NNET	Testing	14.8	0.064	0.123	0.8722 [0.8566–0.8877]	−
NNET	Validation	38.3	0.000	0.143	0.8475 [0.8357–0.8594]	0.86 [0.80–0.91]

LOGIT: logistic regression without transformations of continuous covariates; LOGIT†: logistic regression with transformations of continuous covariates; MFP: logistic regression with multivariable fractional polynomials; RCS: logistic regression with restricted cubic splines; RPART: recursive partitioning and regression trees; RF: random forest; GBM: generalised boosted regression modelling; SVM: support vector machine; NNET: neural network; HL: Hosmer–Lemeshow; AUC: area under the receiver operating characteristic curve; CI: confidence interval.

† Covariate set is transformed using the qladder function as follows: (creatinine)^–1/2, log(potassium), log(sodium), log(white cell count) and log(urea).

Figure 3.

Model performance statistics using regression methods for all models (LOGIT, LOGIT†, MFP, RCS, RPART, RF, GBM, SVM and NNET).

In the training phase, the AUC ranged from 0.87 to 1. RF had a perfect AUC (1), which reflects the overfitting that usually occurs when RF trees are grown to the maximum size in training datasets using the default (and recommended) settings. GBM had the next highest AUC (0.906). RPART had the lowest AUC (0.869). The other five methods (LOGIT, LOGIT†, MFP, NNET and SVM) had AUCs that ranged from 0.883 to 0.885. In the training phase, RF had the highest Brier score (0.884, which is also due to overfitting) followed by RPART and GBM (0.261 and 0.244, respectively). The remaining six methods (LOGIT, LOGIT†, RCS, MFP, SVM and NNET) had Brier scores that ranged from 0.158 to 0.169.

In the testing phase, all methods had a reduction in their AUC (range: 0.814–0.872) and Brier scores (range: 0.025–0.164). RPART now had the lowest AUC (0.814) followed by RF (0.857). The remaining seven methods (LOGIT, LOGIT†, RCS, MFP, NNET, GBM and SVM) had very similar AUCs that ranged from 0.871 to 0.872. In the testing phase, RPART had the lowest Brier score (0.025) followed by RCS (0.080). The highest Brier score was seen in RF (0.164). The remaining six methods (LOGIT, LOGIT†, MFP, GBM, NNET and SVM) had Brier scores that ranged from 0.111 to 0.131.

In the external validation phase, all methods had a reduction in their AUC (range: 0.785–0.851) and Brier scores (range: 0.048–0.149). The highest AUC (0.851) and Brier score (0.149) were seen in the LOGIT† model. The lowest AUC was seen in RPART (0.785) followed by RF (0.804). The remaining models (LOGIT, RCS, MFP, GBM, SVM and NNET) had AUCs that ranged from 0.847 to 0.851. The lowest Brier score was seen in RPART (0.048) followed by RF (0.119), while the remaining methods (LOGIT, LOGIT†, RCS, MFP, GBM, NNET and SVM) had Brier scores which ranged from 0.135 to 0.149.

The external validation calibration slope (Figure 3, lower panel) ranged from 0.70 to 0.99, with RPART having the lowest value which showed considerable overfitting (slope < 1). Three methods (RCS, LOGIT† and GBM) had a 95% CI which included one. GBM had an external validation slope nearest to 1, that is, 0.99.
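One common way to obtain a calibration slope is to regress the observed outcome on the logit of the predicted risk in the validation data; the fitted coefficient is the slope and the constant is the intercept. The article does not give the exact procedure used, so the Python sketch below is an assumption-laden illustration: the function name `calibration_intercept_slope`, the Newton-Raphson fit and the toy data are all introduced here. Predicted risks must lie strictly between 0 and 1.

```python
import math


def calibration_intercept_slope(risks, died, iterations=25):
    """Fit logit(P(death)) = a + b * logit(predicted risk) by
    Newton-Raphson and return (a, b)."""
    x = [math.log(p / (1 - p)) for p in risks]
    a, b = 0.0, 1.0  # start from perfect calibration
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, died):
            mu = 1 / (1 + math.exp(-(a + b * xi)))
            w = mu * (1 - mu)
            g0 += yi - mu          # score for the intercept
            g1 += (yi - mu) * xi   # score for the slope
            h00 += w               # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0:
            raise ValueError("slope is not identifiable from these data")
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b


# Hypothetical toy data: predictions of 10% and 90% where the observed
# death rates are 20% and 80%, i.e. the predictions are too extreme.
toy_risks = [0.1] * 10 + [0.9] * 10
toy_died = [1, 1] + [0] * 8 + [1] * 8 + [0, 0]
toy_a, toy_b = calibration_intercept_slope(toy_risks, toy_died)
```

In this toy case the fitted slope is log(4)/log(9), about 0.63, well below 1, which is the pattern of overly extreme predictions described above for RPART.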

The LOGIT model without transformations performed reasonably well in the external validation phase: AUC 0.847, Brier score 0.139 and slope 0.90. The LOGIT† model also performed well: AUC 0.849, Brier score 0.149 and slope 0.92. The RCS and MFP models had similar AUCs (0.85) and Brier scores (0.138 and 0.145, respectively). The RCS slope was higher (0.93) with a wider 95% CI that included one, while MFP had a slope of 0.91 and a narrower 95% CI which did not include one. As the sample size is the same for all methods, differences in the width of the CIs arise from the linear predictors produced by each method; RCS had the widest CI for its estimate of the external validation slope.

Discussion

Using a high-quality electronically collected data set with large sample sizes and non-linear relationships between covariates and mortality, we examined the performance of nine methods for predicting the risk of in-hospital mortality by developing the model in one hospital and externally validating it in another hospital. This approach to model testing is infrequent^27,28 but methodologically more rigorous than simply considering internal validation.²⁶

We did not find any consistent evidence to suggest that modern machine learning approaches (RPART, RF, GBM, SVM and NNET) were superior to more conventional statistical modelling methods based on the logistic regression model. While there was no clear overall winner, GBM and LOGIT† exhibited the best overall performance. However, we did find that two methods (RPART and RF) exhibited sufficiently poor performance in the external validation phase to undermine their use. Furthermore, given the complexity of choosing tuning parameters for the alternative machine learning methods, logistic regression with transformations has good performance characteristics and is less complex.

A few studies have used external validation to benchmark machine learning against logistic regression methods in the following areas: detecting prostate cancer,^37,38 simulated data,³⁹ predicting mortality risk after acute ischemic stroke⁴⁰ and predicting mortality risk after brain injury.^23,41 In contrast, we predicted the risk of in-hospital mortality in acutely ill medical admissions. Our findings are consistent with the recently published study on predicting the risk of mortality after traumatic brain injury,⁴¹ which also found that logistic regression performs just as well as modern machine learning methods. A key reason for this may be that non-linear and non-additive signals are not strong enough to make modern machine learning methods advantageous.

While the extent to which our findings are generalisable is not clear, we suggest that the candidate models should include a simple logistic model as a benchmark for comparison with other more sophisticated models and that external validation (not internal validation) be used to compare and contrast model performance. Furthermore, the use of the AUC alone as a summary performance measure is limited and not necessarily a good discriminator between models. The Brier score, which combines calibration and discrimination, and the external calibration slope are also useful performance characteristics which merit consideration when comparing models.

Conclusion

Given the complexity of choosing tuning parameters of the modern machine learning methods considered above, the performance of logistic regression with transformations for in-hospital mortality prediction was competitive with the best performing alternative machine learning methods with no evidence of overfitting. The use of RPART and RF in our data is not supported. Our models were developed (using training and testing datasets) in one hospital and validated in a second (different) hospital within the region which increases the likelihood of generalisability to other hospitals. Having established the validity of the logistic regression modelling approach, we plan to evaluate its use in routine clinical practice to see if it can support clinical decision making to enhance the quality of care.

Supplemental Material

Supplemental material, revised_supplementary_material_(1) for A comparison of logistic regression models with alternative machine learning methods to predict the risk of in-hospital mortality in emergency medical admissions via external validation by Muhammad Faisal, Andy Scally, Robin Howes, Kevin Beatson, Donald Richardson and Mohammed A Mohammed in Health Informatics Journal

Footnotes

Acknowledgements

M.A.M. and M.F. had the original idea for this work and undertook the statistical analyses. R.H. and K.B. extracted the necessary data frames. D.R. gave a clinical perspective. A.S. contributed to study design and interpretation of results. M.A.M. and M.F. wrote the first draft of this paper and all authors subsequently assisted in redrafting and have approved the final version.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Health Foundation. The Health Foundation is an independent charity working to improve the quality of healthcare in the United Kingdom. This research was supported by the National Institute for Health Research (NIHR) Yorkshire and Humberside Patient Safety Translational Research Centre (NIHR YHPSTRC). The views expressed in this article are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.

Supplemental material

Supplemental material for this article is available online.

ORCID iD

Muhammad Faisal

References

1. Knaus, Draper, Wagner, et al. APACHE II: a severity of disease classification system. Crit Care Med 1985; 13: 818–829.

2. Lemeshow, Teres, Klar, et al. Mortality Probability Models (MPM II) based on an international cohort of intensive care unit patients. JAMA 1993; 270: 2478–2486.

3. Neary, Heather, Earnshaw JJ. The Physiological and Operative Severity Score for the enUmeration of Mortality and morbidity (POSSUM). Br J Surg 2003; 90: 157–165.

4. Steyerberg EW. Clinical prediction models. A practical approach to development, validation and updating. New York: Springer, 2008.

5. Harrell FE. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. New York: Springer, 2001.

6. Steyerberg, Vickers, Cook, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology 2010; 21: 128–138.

7. Hippisley-Cox, Coupland. Predicting risk of emergency admission to hospital using primary care data: derivation and validation of QAdmissions score. BMJ Open 2013; 3: e003482.

8. Prytherch, Sirl, Schmidt, et al. The use of routine laboratory data to predict in-hospital death in medical admissions. Resuscitation 2005; 66: 203–207.

9. Royal College of Physicians. National Early Warning Score (NEWS): standardising the assessment of acute illness severity in the NHS, 2012, https://www.rcplondon.ac.uk/sites/default/files/documents/national-early-warning-score-standardising-assessment-acute-illness-severity-nhs.pdf

10. Royston, Sauerbrei. Fractional polynomials for one variable multivariable model building: a pragmatic approach to regression analysis based on fractional polynomials for modelling continuous variables. Hoboken, NJ: John Wiley & Sons, 2008.

11. Breiman, Friedman, Stone, et al. Classification and regression trees. 1st ed. Wadsworth, Davidson, NC: Wadsworth International Group, 1984.

12. Breiman. Random forests. Mach Learn 2001; 45: 5–32.

13. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat 2001; 29: 1189–1232.

14. Hastie, Tibshirani, Friedman HJ. The elements of statistical learning: data mining, inference, and prediction. 2nd ed. New York: Springer, 2009.

15. Marrie, Dawson, Garland. Quantile regression and restricted cubic splines are useful for exploring relationships between continuous variables. J Clin Epidemiol 2009; 62: 511–517.e1.

16. Royston, Sauerbrei. Building multivariable regression models with continuous covariates in clinical epidemiology – with an emphasis on fractional polynomials. Methods Inf Med 2005; 44: 561–571.

17. Kim, Park RW. A comparison of intensive care unit mortality prediction models through the use of data mining techniques. Healthc Inform Res 2011; 17: 232–243.

18. Scott, Colborn. Machine learning for predicting sepsis in-hospital mortality: an important start. Acad Emerg Med 2016; 23: 1307.

19. Churpek, Yuen, Winslow, et al. Multicenter comparison of machine learning methods and conventional regression for predicting clinical deterioration on the wards. Crit Care Med 2016; 44: 368–374.

20. Badriyah, Briggs, Prytherch DR. Decision trees for predicting risk of mortality using routinely collected data. Int J Soc Hum Sci 2012; 6: 660–663.

21. Wang, Lam, Deng, et al. Prediction of mortality after radical cystectomy for bladder cancer by machine learning techniques. Comput Biol Med 2015; 63: 124–132.

22. Motwani, Dey, Berman, et al. Machine learning for prediction of all-cause mortality in patients with suspected coronary artery disease: a 5-year multicentre prospective registry analysis. Eur Heart J 2016; 52: 468–476.

23. Stylianou, Akbarov, Kontopantelis, et al. Mortality risk prediction in burn injury: comparison of logistic regression with machine learning approaches. Burns 2015; 41: 925–934.

24. Colombet, Ruelland, Chatellier, et al. Models to predict cardiovascular risk: comparison of CART, multilayer perceptron and logistic regression. Proc AMIA Symp 2000; 156–160: PMC2244093.

25. Ross, Shah, Dalman, et al. The use of machine learning for the identification of peripheral artery disease and future mortality risk. J Vasc Surg 2016; 64: 1515–1522.e3.

26. Bleeker, Moll, Steyerberg, et al. External validation is necessary in prediction research: a clinical example. J Clin Epidemiol 2003; 56: 826–832.

27. Collins, De Groot, Dutton, et al. External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Med Res Methodol 2014; 14: 40.

28. Siontis GCM, Tzoulaki, Castaldi, et al. External validation of new risk prediction models is infrequent and reveals worse prognostic discrimination. J Clin Epidemiol 2015; 68: 25–34.

29. Steyerberg, Harrell, Borsboom GJJ, et al. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol 2001; 54: 774–781.

30. Justice, Covinsky, Berlin JA. Assessing the generalizability of prognostic information. Ann Intern Med 1999; 130: 515–524.

31. Robin, Turck, Hainard, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 2011; 12: 77.

32. R Development Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing, 2015, http://www.r-project.org/

33. Ambler, Benner. mfp: Multivariable Fractional Polynomials, 2015, http://cran.r-project.org/package=mfp

34. Harrell FE. rms: Regression Modeling Strategies, 2015, http://cran.r-project.org/package=rms

35. Kuhn, Wing, Weston, et al. caret: Classification and Regression Training. 2015, http://cran.r-project.org/package=caret

36. NHS Health Research Authority. Governance arrangements for research ethics committees, http://www.hra.nhs.uk/resources/research-legislation-and-governance/governance-arrangements-for-research-ethics-committees/ (accessed 10 August 2017).

37. Ecke, Hallmann, Koch, et al. External validation of an artificial neural network and two nomograms for prostate cancer detection. ISRN Urol 2012; 2012: 1–6.

38. Nieboer, Vergouwe, Roobol, et al. Nonlinear modeling was applied thoughtfully for risk prediction: the Prostate Biopsy Collaborative Group. J Clin Epidemiol 2015; 68: 426–434.

39. Terrin, Schmid, Griffith, et al. External validity of predictive models: a comparison of logistic regression, classification trees, and neural networks. J Clin Epidemiol 2003; 56: 721–729.

40. König, Malley, Weimar, et al. Practical experiences on the necessity of external validation. Stat Med 2007; 26: 5499–5511.

41. Van Der Ploeg, Nieboer, Steyerberg EW. Modern modelling techniques had limited external validity in predicting mortality from traumatic brain injury. J Clin Epidemiol 2016; 78: 83–89.
