MISL: Multiple imputation by super learning

Abstract

Multiple imputation techniques are commonly used when data are missing, but there are many options to choose from. Multivariate imputation by chained equations is a popular method for generating imputations, but it relies on the analyst specifying a model for each variable with missing values. In this work, we introduce multiple imputation by super learning (MISL), an update to the multivariate imputation by chained equations method that generates imputations with ensemble learning. Ensemble methodologies have recently gained attention for use in inference and prediction because they optimally combine a variety of user-specified parametric and non-parametric models and perform well when estimating complex functions, including those with interaction terms. Through two simulations, we compare inferences made using MISL with those made using other commonly used multiple imputation methods and show that MISL is a superior option with respect to bias, confidence interval coverage rate, and confidence interval width.

Keywords

Fully conditional specification, machine learning, missing data, multiple imputation, super learning
