Decision variables to be discovered in modelling high-dimensional omics data for cancer studies

Abstract

High-dimensional omics data are often contaminated by unwanted variation arising from platforms, batches, or other external factors. This interference and noise can obscure critical cancer-related signals. Contaminated data are modeled here as a combination of variables derived from the phenotype of interest (POI) and from confounding factors. To identify these variables, a novel method called Decision Variable Analysis (DVA) is proposed. The novelty of DVA lies in iteratively extracting independent decision variables for modeling the data. Specifically, a priori knowledge, introduced as the known variable linked with the POI, is removed from the data through a residual operation. The number of variables is then estimated from the residual matrix based on the zero gradient of its singular values, rather than relying on random matrix theory or principal component analysis, which can produce unreliable results when the number of features exceeds the number of samples. Applications of DVA to both synthetic and real data demonstrate superior performance in identifying variables compared with conventional approaches. The improvements are illustrated across high-dimensional omics datasets from different platforms, particularly those in which the sample size is small relative to the number of features. The results indicate that DVA is an effective method for dissecting sources of variation in high-dimensional data with disturbances.
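
The two steps named in the abstract, the residual operation and the singular-value criterion, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the residual operation is an ordinary least-squares regression of each feature on the POI, and that "zero gradient" can be approximated by the first drop between consecutive normalized singular values that falls below a tolerance. The function names `residual_after_poi` and `estimate_n_variables`, the parameter `tol`, and the synthetic data are all hypothetical.

```python
import numpy as np

def residual_after_poi(X, poi):
    """Remove the part of X (features x samples) explained linearly by the POI.

    Assumption: the 'residual operation' is an ordinary least-squares fit.
    """
    # Design matrix with an intercept and the POI, one row per sample.
    D = np.column_stack([np.ones_like(poi, dtype=float), poi])
    # Fit every feature on the design matrix in one call.
    coef, *_ = np.linalg.lstsq(D, X.T, rcond=None)
    return X - (D @ coef).T

def estimate_n_variables(R, tol=0.05):
    """Count singular values that precede the first near-flat step.

    Assumption: a drop below `tol` (relative to the largest singular
    value) stands in for a zero gradient.
    """
    s = np.linalg.svd(R, compute_uv=False)
    s = s / s[0]
    drops = -np.diff(s)  # non-negative, since s is sorted descending
    flat = np.nonzero(drops < tol)[0]
    return int(flat[0]) if flat.size else len(s)

# Synthetic example: 2000 features, 40 samples, two hidden confounders.
rng = np.random.default_rng(0)
p, n = 2000, 40
poi = np.tile([0.0, 1.0], n // 2)        # phenotype of interest
batch = np.repeat([-1.0, 1.0], n // 2)   # hidden confounder 1 (batch)
other = rng.normal(size=n)               # hidden confounder 2
X = (rng.normal(scale=2.0, size=(p, 1)) * poi
     + rng.normal(scale=3.0, size=(p, 1)) * batch
     + rng.normal(scale=1.5, size=(p, 1)) * other
     + rng.normal(size=(p, n)))

R = residual_after_poi(X, poi)
k_hat = estimate_n_variables(R)
print("estimated number of hidden variables:", k_hat)
```

A rule of this kind will undercount when two leading singular values are nearly equal, so the tolerance and the stopping criterion would need to be checked against the definition given in the full article.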

Keywords

High-dimensional data, decision variable analysis, confounding factors, deconvolution
