Sage Journals: Discover world-class research

Abstract

Variable selection in the presence of both missing covariates and outcomes is an important statistical research topic. Parametric regression are susceptible to misspecification, and as a result are sub-optimal for variable selection. Flexible machine learning methods mitigate the reliance on the parametric assumptions, but do not provide as naturally defined variable importance measure as the covariate effect native to parametric models. We investigate a general variable selection approach when both the covariates and outcomes can be missing at random and have general missing data patterns. This approach exploits the flexibility of machine learning models and bootstrap imputation, which is amenable to nonparametric methods in which the covariate effects are not directly available. We conduct expansive simulations investigating the practical operating characteristics of the proposed variable selection approach, when combined with four tree-based machine learning methods, extreme gradient boosting, random forests, Bayesian additive regression trees, and conditional random forests, and two commonly used parametric methods, lasso and backward stepwise selection. Numeric results suggest that, extreme gradient boosting and Bayesian additive regression trees have the overall best variable selection performance with respect to the $F_{1}$ score and Type I error, while the lasso and backward stepwise selection have subpar performance across various settings. There is no significant difference in the variable selection performance due to imputation methods. We further demonstrate the methods via a case study of risk factors for 3-year incidence of metabolic syndrome with data from the Study of Women’s Health Across the Nation.

Keywords

Missing at random variable importance tree ensemble bootstrap imputation,variable selection

Get full access to this article

View all access options for this article.

References

Bleich

Kapelner

George

et al. Variable selection for BART: an application to gene regulation. Ann Appl Stat 2014; 8: 1750–1781.

Liu

et al. Tree-based machine learning to identify and understand major determinants for stroke at the neighborhood level. J Am Heart Assoc 2020; 9: e016745.

Hapfelmeier

Ulm

. A new variable selection approach using random forests. Comput Stat Data Anal 2013; 60: 50–69.

et al. Quantile regression forests to identify determinants of neighborhood stroke prevalence in 500 cities in the USA: implications for neighborhoods with high prevalence. J Urban Health 2021; 98: 259–270.

. Machine learning to identify and understand key factors for provider-patient discussions about smoking. Prev Med Rep 2020; 20: 101238.

Liu

et al. Identifying and assessing the impact of key neighborhood-level determinants on geographic variation in stroke: a machine learning and multilevel modeling approach. BMC Public Health 2020; 20: 1–12.

Chipman

George

McCulloch

. BART: Bayesian additive regression trees. Ann Appl Stat 2010; 4: 266–298.

Breiman

. Random forests. Mach Learn 2001; 45: 5–32.

Chen

Guestrin

. XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 785–794.

10.

Breiman

Friedman

Stone

et al. Classification and Regression Trees. Boca Raton, FL: CRC Press, 1984.

11.

Hothorn

Zeileis

. partykit: a modular toolkit for recursive partytioning in R. J Mach Learn Res 2015; 16: 3905–3909.

12.

. Estimation of causal effects of multiple treatments in healthcare database studies with rare outcomes. Health Serv Outcomes Res Methodol 2021; 21: 287–308.

13.

. Estimating heterogeneous survival treatment effect in observational data using machine learning. Stat Med, 2021; 40: 4691–4713.

14.

Lin

JYJ

Sigel

et al. Estimating heterogeneous survival treatment effects of lung cancer screening approaches: A causal machine learning analysis. Ann Epidemiol 2021; 62: 36–42.

15.

Lopez

et al. Estimation of causal effects of multiple treatments in observational studies with a binary outcome. Stat Methods Med Res 2020; 29: 3218–3214.

16.

Mazumdar

Lin

JYJ

Zhang

et al. Comparison of statistical and machine learning models for healthcare cost data: a simulation study motivated by oncology care model (OCM) data. BMC Health Serv Res 2020; 20: 350.

17.

Liu

. Ranking sociodemographic, health behavior, prevention, and environmental factors in predicting neighborhood cardiovascular health: a Bayesian machine learning approach. Prev Med 2020; 141: 106240.

18.

Díaz-Uriarte

De Andres

. Gene selection and classification of microarray data using random forest. BMC Bioinform 2006; 7: 3.

19.

Little

D’Agostino

Cohen

et al. The prevention and treatment of missing data in clinical trials. N Engl J Med 2012; 367: 1355–1360.

20.

Sterne

White

Carlin

et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009; 338: b2393.

21.

National Research Council. The prevention and treatment of missing data in clinical trials. Washington, DC: The National Academies Press, 2010.

22.

Hogan

Daniels

. A Bayesian perspective on assessing sensitivity to assumptions about unobserved data. In Molenberghs

Fitzmaurice

Kenward

Tsiatis

Verbeke

(eds.). Handbook of Missing Data Methodology. chapter 18. Boca Raton, FL: CRC Press, 2014. pp. 405–434.

23.

Hogan

Mwangi

et al. Modeling the causal effect of treatment initiation time on survival: application to HIV/TB co-infection. Biometrics 2018; 74: 703–713.

24.

Tsiatis

. Semiparametric theory and missing data. Berlin: Springer, 2007.

25.

Garcia

Ibrahim

Zhu

. Variable selection for regression models with missing data. Stat Sin 2010; 20: 149.

26.

Long

Johnson

. Variable selection in the presence of missing data: resampling and imputation. Biostatistics 2015; 16: 596–610.

27.

Wood

White

Royston

. How should variable selection be performed with multiply imputed data?. Stat Med 2008; 27: 3227–3246.

28.

Meinshausen

Bühlmann

. Stability selection. J R Stat Soc: Series B (Stat Methodol) 2010; 72: 417–473.

29.

Kapelner

Bleich

. Prediction with missing data via bayesian additive regression trees. Can J Stat 2015; 43: 224–239.

30.

Tang

Ishwaran

. Random forest missing data algorithms. Statistical Analysis and Data Mining: The ASA Data Science Journal 2017; 10: 363–377.

31.

Hothorn

Hornik

Zeileis

. Unbiased recursive partitioning: A conditional inference framework. J Comput Graph Stat 2006; 15: 651–674.

32.

Efron

. Missing data, imputation, and the bootstrap. J Am Stat Assoc 1994; 89: 463–475.

33.

Breiman

. Bagging predictors. Mach Learn 1996; 24: 123–140.

34.

Strobl

Boulesteix

Zeileis

et al. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinform 2007; 8: 25.

35.

Friedman

Hastie

Tibshirani

et al. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). Ann Stat 2000; 28: 337–407.

36.

Tibshirani

. Regression shrinkage and selection via the lasso. J R Stat Soc: Series B (Methodological) 1996; 58: 267–288.

37.

Friedman

Hastie

Tibshirani

. The elements of statistical learning. New York: 2nd ed.: Springer Series in Statistics, 2001.

38.

Schouten

Lugtig

Vink

. Generating missing values for simulation purposes: a multivariate amputation procedure. J Stat Comput Simul 2018; 88: 2909–2930.

39.

Greendale

Huang

Leung

et al. Dietary phytoestrogen intakes and cognitive function during the menopause transition: results from the SWAN phytoestrogen study. Menopause (New York, NY) 2012; 19: 894.

40.

Van Buuren

. Flexible imputation of missing data. 2nd ed. Boca Raton, FL: Chapman & HallCRC, 2018.

41.

Van Buuren

Groothuis-Oudshoorn

CGM

. mice: multivariate imputation by chained equations in R. J Stat Softw 2010; 45: 1–67.

42.

Stekhoven

Bühlmann

. Missforest–non-parametric missing value imputation for mixed-type data. Bioinformatics 2012; 28: 112–118.

43.

Van Buuren

Boshuizen

Knook

. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med 1999; 18: 681–694.

44.

Waljee

Mukherjee

Singal

et al. Comparison of imputation methods for missing laboratory data in medicine. BMJ Open 2013; 3: e002847.

45.

Chen

Benesty

et al. XGboost: Extreme Gradient Boosting, 2021. R package version 1.4.

46.

Friedman

Hastie

Tibshirani

. Regularization paths for generalized linear models via coordinate descent. J Stat Softw 2010; 33: 1.

47.

Madley-Dowd

Hughes

Tilling

et al. The proportion of missing data should not be used to guide decisions on multiple imputation. J Clin Epidemiol 2019; 110: 63–73.

48.

Hughes

Heron

Sterne

et al. Accounting for missing data in statistical analyses: multiple imputation is not always the answer. Int J Epidemiol 2019; 48: 1294–1304.

49.

Janssen

Powell

Crawford

et al. Menopause and the metabolic syndrome: the Study of Women’s Health Across the Nation. Arch Intern Med 2008; 168: 1568–1575.

50.

Kazlauskaite

Janssen

Wilson

et al. Is midlife metabolic syndrome associated with cognitive function change? The Study of Women’s Health Across the Nation. J Clin Endocrinol Metabol 2020; 105: e1093–e1105.

51.

Carnethon

Loria

Hill

et al. risk factors for the metabolic syndrome: the coronary artery risk development in young adults (CARDIA) study, 1985–2001. Diabetes Care 2004; 27: 2707–2715.

52.

Wei

Liu

Lin

et al. Dietary fiber intake and risk of metabolic syndrome: a meta-analysis of observational studies. Clin Nutr 2018; 37: 1935–1942.

53.

Han

Fang

et al. Dietary calcium intake and the risk of metabolic syndrome: A systematic review and meta-analysis. Sci Rep 2019; 9: 1–7.

54.

Feng

Gao

Yao

et al. Low apoA-I is associated with insulin resistance in patients with impaired glucose tolerance: a cross-sectional study. Lipids Health Dis 2017; 16: 1–7.

55.

Rao

Disraeli

McGregor

. Impaired glucose tolerance and impaired fasting glucose. Am Fam Physician 2004; 69: 1961–1968.

56.

Lin

JYJ

Huang

et al. Strategies for variable selection in large-scale healthcare database studies with missing covariate and outcome data. arXiv preprint 2021; arXiv:2107.09730.

57.

Hogan

. Causal comparative effectiveness analysis of dynamic continuous-time treatment initiation rules with sparsely measured outcomes and death. Biometrics 2019; 75: 695–707.

58.

Zou

et al. A flexible sensitivity analysis approach for unmeasured confounding with multiple treatments and a binary outcome with application to SEER-Medicare lung cancer data. Ann Appl Stat 2021; In press.

59.

Ishwaran

. Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Stat Med 2019; 38: 558–582.

60.

Williamson

Gilbert

Carone

et al. Nonparametric variable importance assessment using machine learning techniques. Biometrics 2021; 77: 9–22.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

13.25 MB

Variable selection with missing data in both covariates and outcomes: Imputation and machine learning

Abstract

Keywords

Get full access to this article

References

Supplementary Material