A generalization of moderated statistics to data adaptive semiparametric estimation in high-dimensional biology

Abstract

The widespread availability of high-dimensional biological data has made the simultaneous screening of many biological characteristics a central problem in computational and high-dimensional biology. As the dimensionality of datasets continues to grow, so too does the complexity of identifying biomarkers linked to exposure patterns. The statistical analysis of such data often relies upon parametric modeling assumptions motivated by convenience, inviting opportunities for model misspecification. While estimation frameworks incorporating flexible, data adaptive regression strategies can mitigate this, their standard variance estimators are often unstable in high-dimensional settings, resulting in inflated Type-I error even after standard multiple testing corrections. We adapt a shrinkage approach compatible with parametric modeling strategies to semiparametric variance estimators of a family of efficient, asymptotically linear estimators of causal effects, defined by counterfactual exposure contrasts. Augmenting the inferential stability of these estimators in high-dimensional settings yields a data adaptive approach for robustly uncovering stable causal associations, even when sample sizes are limited. Our generalized variance estimator is evaluated against appropriate alternatives in numerical experiments, and an open source R/Bioconductor package, biotmle, is introduced. The proposal is demonstrated in an analysis of high-dimensional DNA methylation data from an observational study on the epigenetic effects of tobacco smoking.
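The variance moderation described above can be illustrated with a minimal sketch. It assumes that each biomarker's variance is estimated by the sample variance of its estimated influence function values, and that this variance is shrunk toward a common prior value with the limma-style posterior formula (d0 * s0_sq + d * s_sq) / (d0 + d). The function names, the toy inputs, and the choice to pass the hyperparameters `d0` and `s0_sq` in directly are assumptions made for illustration. In practice the hyperparameters would be estimated from the data by empirical Bayes, and this sketch does not reproduce the interface of the biotmle package.

```python
from math import sqrt
from statistics import variance


def moderated_variances(eif_rows, d0, s0_sq):
    """Shrink each biomarker's influence-function variance toward s0_sq.

    eif_rows: one list of estimated influence function values per biomarker.
    d0, s0_sq: prior degrees of freedom and prior variance (assumed given
    here; an empirical Bayes step would normally estimate them).
    """
    shrunk = []
    for row in eif_rows:
        d = len(row) - 1          # degrees of freedom of the sample variance
        s_sq = variance(row)      # unbiased sample variance of the EIF values
        shrunk.append((d0 * s0_sq + d * s_sq) / (d0 + d))
    return shrunk


def moderated_statistics(estimates, eif_rows, d0, s0_sq, null_value=0.0):
    """Test statistics that use the shrunken variance in the standard error."""
    variances = moderated_variances(eif_rows, d0, s0_sq)
    stats = []
    for psi, row, var in zip(estimates, eif_rows, variances):
        n = len(row)
        stats.append((psi - null_value) / sqrt(var / n))
    return stats


# Hypothetical toy inputs: two biomarkers, four observations each.
toy_eif = [[1.0, -1.0, 0.5, -0.5], [0.1, -0.1, 0.05, -0.05]]
toy_estimates = [0.4, 0.4]

raw = moderated_statistics(toy_estimates, toy_eif, d0=0, s0_sq=0.5)
moderated = moderated_statistics(toy_estimates, toy_eif, d0=3, s0_sq=0.5)
print(raw, moderated)
```

With `d0 = 0` the statistics reduce to their unmoderated form. Increasing `d0` pulls small, unstable variance estimates toward `s0_sq`, which tempers the extreme test statistics that such variances would otherwise produce in small samples.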

Keywords

Variance shrinkage, semiparametric estimation, nonparametric inference, efficient estimation, causal machine learning, differential expression, differential methylation
