Abstract

Many prediction methods have been proposed in the literature, but most of them ignore heterogeneity between populations. Either data from only a single study or population are available for model building and evaluation, or, when the training dataset comprises data from multiple studies, the studies are pooled before model building. As a result, prediction models may perform worse than expected when applied to new subjects from new study populations. We propose a linear method for building prediction models with high-dimensional data from multiple studies. Our method explicitly addresses between-population variability and tends to select predictors that are predictive in most of the study populations. We employ empirical Bayes estimators and hence avoid selection bias during the variable selection process. Simulation results demonstrate that the new method outperforms other linear prediction methods that ignore the between-study variability. Our method is developed for classification into two groups.
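The general idea of down-weighting predictors whose effects vary between studies can be illustrated with a minimal Python sketch. It is not the estimator proposed in the article. As simplifying assumptions, it uses per-study standardized mean differences, a method-of-moments between-study variance, and a normal-prior empirical Bayes shrinkage across features, and it classifies with a naive-Bayes-style linear score. All function names (`study_effects`, `fit_multistudy_weights`, `predict`, `simulate_study`) are hypothetical.

```python
import numpy as np

def study_effects(X, y):
    """Per-feature standardized mean difference (class 1 minus class 0) in one study."""
    X1, X0 = X[y == 1], X[y == 0]
    n1, n0 = len(X1), len(X0)
    pooled_var = ((n1 - 1) * X1.var(axis=0, ddof=1)
                  + (n0 - 1) * X0.var(axis=0, ddof=1)) / (n1 + n0 - 2)
    d = (X1.mean(axis=0) - X0.mean(axis=0)) / np.sqrt(pooled_var)
    v = np.full(d.shape, 1.0 / n1 + 1.0 / n0)  # rough large-sample variance of d
    return d, v

def fit_multistudy_weights(studies):
    """Combine per-study effects into one shrunken weight per feature.

    `studies` is a list of (X, y) pairs with at least two studies.
    Assumed stand-in for the article's method, not a reproduction of it.
    """
    D, V = map(np.array, zip(*(study_effects(X, y) for X, y in studies)))
    # Between-study variance per feature (unweighted method of moments, truncated at 0).
    tau2 = np.maximum(D.var(axis=0, ddof=1) - V.mean(axis=0), 0.0)
    w = 1.0 / (V + tau2)                      # random-effects weights
    mu = (w * D).sum(axis=0) / w.sum(axis=0)  # combined effect per feature
    se2 = 1.0 / w.sum(axis=0)                 # its variance; large when studies disagree
    # Empirical Bayes: prior N(0, A) on the true effects, A estimated from all features.
    A = max(float(np.mean(mu ** 2 - se2)), 0.0)
    return mu * A / (A + se2)                 # posterior mean shrinks unstable effects

def predict(X_new, weights):
    """Linear score on features standardized within the new study (assumes balanced classes)."""
    Z = (X_new - X_new.mean(axis=0)) / X_new.std(axis=0, ddof=1)
    return (Z @ weights > 0).astype(int)

def simulate_study(rng, n, shift):
    """Toy data: n subjects (n even), half per class; class 1 is shifted by `shift`."""
    y = np.repeat([0, 1], n // 2)
    X = rng.normal(size=(n, shift.size)) + np.outer(y, shift)
    return X, y
```

In this sketch a feature that is strongly predictive in only one study receives a large estimated between-study variance, so its combined effect is shrunk more heavily than that of a feature with a moderate but consistent effect across studies.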

Keywords

Empirical Bayes, high-dimensional data, multiple studies, heterogeneity, naive Bayes

High-dimensional prediction of binary outcomes in the presence of between-study heterogeneity
