Reproducible feature selection in heterogeneous multicenter datasets via sign-consistency criteria

Abstract

The identification of risk features associated with disease plays a crucial role in biomedical research, and these features are often used as evidence for clinical decision-making. However, in the presence of between-center heterogeneity, covariate effects may differ in direction across data centers, which makes feature selection challenging. In this work, we propose a novel framework to select reproducible risk features whose underlying effects are consistent across centers. We quantify feature reproducibility with a sign-consistency criterion, which tolerates an acceptable level of heterogeneity in effect sizes while ensuring that reproducible signals are reasonably similar. Compared with existing feature selection methods, the proposed method protects data privacy and does not rely on the assumption of data homogeneity. Extensive simulations demonstrate that the proposed method has greater power than existing methods. We apply the proposed approach to data from the China Health and Retirement Longitudinal Study (CHARLS) and identify nine important risk factors that show reproducible associations with depression.

Keywords

Data heterogeneity, replicability, multicenter research, distributed inference, feature selection
