Abstract
High-dimensional omics data are often contaminated by unwanted variation arising from platforms, batches, or other external factors. Such interference and noise can obscure critical cancer-related signals. Contaminated data are modeled as a combination of variables derived from the phenotype of interest (POI) and from confounding factors. To identify these variables, a novel method called Decision Variable Analysis (DVA) is proposed. The novelty of DVA lies in iteratively extracting independent decisive variables for modeling the data. Specifically, a priori knowledge, introduced as the definite variable linked with the POI, is removed from the data through a residual operation. The number of remaining variables is then estimated from the residual matrix based on the zero gradient of its singular values, rather than relying on random matrix theory or principal component analysis, which can produce unreliable results when the number of features exceeds the number of samples. Applications of DVA to both synthetic and real data demonstrate superior performance in identifying variables compared with conventional approaches. The improvements offered by DVA are illustrated across high-dimensional omics datasets from different platforms, particularly those in which the sample size is small relative to the number of features. The results indicate that DVA is an effective method for dissecting sources of variation in high-dimensional data with disturbances.
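The two steps named above, the residual operation and the singular-value criterion, can be illustrated with a minimal sketch. This is one possible reading of the abstract, not the authors' implementation: the abstract does not specify the residual operation or the exact zero-gradient rule, so the least-squares projection, the function names `residualize` and `estimate_num_variables`, and the tolerance `tol` are all assumptions made for illustration.

```python
import numpy as np

def residualize(X, poi):
    """Remove the part of X (features x samples) explained by the POI.

    Assumption: the 'residual operation' is an ordinary least-squares fit
    of each feature on an intercept plus the POI covariate(s).
    """
    D = np.column_stack([np.ones(X.shape[1]), poi])   # (samples, k) design
    beta, *_ = np.linalg.lstsq(D, X.T, rcond=None)    # (k, features)
    return X - (D @ beta).T

def estimate_num_variables(R, tol=0.05):
    """Count the singular values that precede the flat part of the spectrum.

    Assumption: 'zero gradient' is read as the point after which successive
    normalized singular values differ by less than `tol` (hypothetical value).
    """
    s = np.linalg.svd(R, compute_uv=False)
    if s.size == 0 or s[0] == 0:
        return 0
    s = s[s > s[0] * 1e-10] / s[0]        # drop the numerical null space
    grad = np.abs(np.diff(s))             # discrete gradient of the spectrum
    steep = np.where(grad >= tol)[0]      # positions of non-negligible drops
    return int(steep[-1]) + 1 if steep.size else 0

# Synthetic example: 500 features, 30 samples, one POI, two hidden confounders.
rng = np.random.default_rng(0)
n_features, n_samples = 500, 30
poi = np.repeat([0.0, 1.0], n_samples // 2)
hidden = rng.standard_normal((2, n_samples))
X = (np.outer(rng.standard_normal(n_features), poi)
     + 3.0 * rng.standard_normal((n_features, 2)) @ hidden
     + rng.standard_normal((n_features, n_samples)))

R = residualize(X, poi)
k = estimate_num_variables(R)
print("estimated number of confounding variables:", k)
```

In this sketch the estimate depends on `tol` and on how strongly the hidden factors stand out from the noise, so the threshold would need tuning on real data.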
