Bi-level variable selection for case-cohort studies with group variables

Abstract

The case-cohort design is an economical approach to estimating the effect of risk factors on a survival outcome when collecting exposure information or covariates on all patients in a large cohort study is expensive. Variables often have a group structure, as with categorical variables and highly correlated continuous variables. The existing literature for case-cohort data is limited to identifying non-zero variables at the individual level only. In this article, we propose a bi-level variable selection method that selects non-zero groups and non-zero within-group variables for case-cohort data when variables have a group structure. The proposed method allows the number of variables to diverge as the sample size increases. We establish the asymptotic properties of the estimator, including bi-level variable selection consistency and asymptotic normality. We also conduct simulations to compare the proposed method with some existing methods and apply them to the Busselton Health data.

Keywords

Case-cohort design, efficiency, multiple diseases, survival analysis, variable selection
