Abstract

Machine learning algorithms are increasingly used in the clinical literature, often with claimed advantages over logistic regression. However, they are generally designed to maximize the area under the receiver operating characteristic curve. While area under the receiver operating characteristic curve and other measures of accuracy are commonly reported for evaluating binary prediction problems, these metrics can be misleading. We aim to give clinical and machine learning researchers a realistic medical example of the dangers of relying on a single measure of discriminatory performance to evaluate binary prediction questions. Prediction of medical complications after surgery is a frequent but challenging task because many post-surgery outcomes are rare. We predicted post-surgery mortality among patients in a clinical registry who received at least one aortic valve replacement. Estimation incorporated multiple evaluation metrics and algorithms typically regarded as performing well with rare outcomes, as well as an ensemble and a new extension of the lasso for multiple unordered treatments. Results demonstrated high accuracy for all algorithms, with moderate cross-validated area under the receiver operating characteristic curve. False positive rates were <1%; however, true positive rates were <7%, even when paired with a 100% positive predictive value, and graphical representations of calibration were poor. Similar results were seen in simulations, where high area under the receiver operating characteristic curve (>90%) additionally accompanied low true positive rates. Clinical studies should not rely primarily on area under the receiver operating characteristic curve or accuracy when reporting performance.
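The pattern described in the abstract can be reproduced on a small toy example. The sketch below is illustrative only: the data are invented (1,000 hypothetical patients with a 2% event rate), the 0.5 classification threshold is an assumption, and the helper names `auc` and `threshold_metrics` are not taken from the article. It computes the area under the receiver operating characteristic curve as the probability that a randomly chosen event is scored above a randomly chosen non-event, alongside the threshold-based metrics named in the abstract.

```python
# Toy illustration with invented data; not the registry or simulations from the article.

def auc(y_true, y_score):
    """Area under the ROC curve as the probability that a random event
    is scored higher than a random non-event (ties count as one half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def threshold_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, true/false positive rate and positive predictive value
    after classifying scores at or above `threshold` as events."""
    pred = [int(s >= threshold) for s in y_score]
    tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        # PPV is undefined when no one is predicted positive.
        "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),
    }


# Hypothetical predicted risks: 980 non-events with low scores (0.0001 to 0.098).
neg_scores = [i / 10000 for i in range(1, 981)]
# 20 events: one with a high score, 18 ranked above every non-event but still
# below the 0.5 threshold, and one ranked below every non-event.
pos_scores = [0.6] + [0.10 + 0.01 * k for k in range(1, 19)] + [0.00005]

y_true = [0] * len(neg_scores) + [1] * len(pos_scores)
y_score = neg_scores + pos_scores

metrics = threshold_metrics(y_true, y_score, threshold=0.5)
print("AUC:", round(auc(y_true, y_score), 3))
for name, value in metrics.items():
    print(name + ":", round(value, 3))
```

On this toy data the scores rank events well (area under the curve of 0.95) and accuracy is 98.1%, yet only 1 of the 20 events is identified (true positive rate of 5%), with a 0% false positive rate and a 100% positive predictive value. Reporting accuracy or area under the curve alone would hide how few events are actually detected.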

Keywords

Prediction, classification, machine learning, ensembles, mortality

Revisiting performance metrics for prediction with rare outcomes
