Abstract
Robust interpretation of experimental results measuring discrete biological domains remains a significant challenge in the face of complex biochemical regulation processes such as organismal versus tissue versus cellular metabolism, epigenetics, and protein post-translational modification. Integration of analyses carried out across multiple measurement or omic platforms is an emerging approach to help address these challenges. This review focuses on select methods and tools for the integration of metabolomic with genomic and proteomic data using a variety of approaches including biochemical pathway-, ontology-, network-, and empirical-correlation-based methods.
Introduction
Over the past decade, major advancements in omic technologies (eg, genomics, proteomics, and metabolomics) have enabled high-throughput monitoring of a variety of molecular and organismal processes. These techniques have been widely applied to identify biological variants (eg, biomarkers), to characterize complex biochemical systems, and to study pathophysiological processes. While many omic platforms target comprehensive analysis of genes (genomics), mRNA (transcriptomics), proteins (proteomics), and metabolites (metabolomics), 1 challenges remain for data integration both within and between omic domains.
Biological interpretation of changes in discrete omic domains is challenging in the face of complex biochemical regulation such as organismal versus tissue versus cellular-level processes, epigenetics, 2 and mRNA or protein post-translational modification.3,4 Combining experimental results from multiple omic platforms is an emerging approach that aims to identify latent biological relationships that may become evident only through holistic analyses integrating measurements across multiple biochemical domains. This article focuses on select methods and tools for the integration of metabolomic with genomic and proteomic data.
Metabolomics, the analysis of small molecules (eg, <1200 Da) and biochemical intermediates (metabolites), has been widely used to study interactions between gene and protein downstream products and environmental stimuli. Over the past decade, it has been applied to various pathophysiological processes including type 1 diabetes and cancer, with typical goals involving identification of biomarkers predictive of disease onset, prognosis, and treatment efficacy.5–8 The metabolome is highly responsive to both environmental and biological regulatory mechanisms (eg, epigenetics, transcription, post-translational modification), and its analysis presents a unique approach to characterizing the organismal phenotype. However, metabolomics by itself may not be sufficient to fully characterize complex biological systems or pathologies (eg, cancer). For example, many researchers focus on the analysis of circulating metabolites (eg, serum or plasma), but this pool is the integrated input and output of many biological systems, making it challenging to derive insights into tissue- and cellular-level mechanisms. Other challenges include effective integration of metabolomic-based analyses in cases of limited biochemical domain knowledge, which may result in sparse and disconnected biological interpretations. 9
To date, a variety of software tools have been developed to help integrate multiple omic datasets based on biochemical pathway, ontology, network or empirical correlation (Table 1). A selection of approaches and tools for omic data integration are discussed below.
Table 1. Key features of a selection of tools for omic data analysis and integration.
Pathway- or Biochemical-Ontology-Based Integration
It is becoming increasingly evident that integrative analyses across multiple omic platforms are required to interrogate complex biological systems. Over the past several years, enrichment analysis methods such as gene set enrichment analysis (GSEA) 10 have been widely used to help interpret gene expression data. These methods facilitate biological interpretation by integrating biological domain knowledge (eg, biochemical pathways, biological processes) with gene expression results. Even though these approaches are highly sensitive to the expert definitions of what constitutes a biochemical pathway or a set of related molecular functions, they remain key methods for omic data integration. Existing tools such as IMPALA, 11 iPEAP, 12 and the integrated pathway analysis in MetaboAnalyst 3.0 13 support integration of different omic platforms through pathway enrichment and overrepresentation analyses. However, pathway-based approaches rely on predefined pathways, which may not accurately represent the complexity of biological systems and could bias the analysis results.
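As a minimal sketch of the overrepresentation idea described above, the following example scores a pathway with a one-sided hypergeometric test over a shared background of gene and metabolite identifiers. The identifiers, the pathway definitions, and the choice to pool genes and metabolites into one background are illustrative assumptions; this is not the procedure implemented by IMPALA, iPEAP, or MetaboAnalyst.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K belong to the pathway."""
    upper = min(K, n)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def overrepresentation(hits, background, pathways):
    """Return {pathway: (overlap, p_value)} for a set of significant features."""
    background = set(background)
    hits = set(hits) & background
    results = {}
    for name, members in pathways.items():
        members = set(members) & background
        overlap = len(hits & members)
        p = hypergeom_sf(overlap, len(background), len(members), len(hits))
        results[name] = (overlap, p)
    return results


# Hypothetical data: 20 measured features (genes g*, metabolites m*).
background = [f"g{i}" for i in range(10)] + [f"m{i}" for i in range(10)]
pathways = {"pathway_A": ["g0", "g1", "m0", "m1", "m2"]}  # assumed definition
hits = ["g0", "m0", "m1", "g9"]  # features flagged as significantly changed

ora = overrepresentation(hits, background, pathways)
print(ora["pathway_A"])  # (overlap, p-value)
```

The result depends entirely on how the pathway and the background are defined, which is the sensitivity to expert definitions noted above. In practice the p-values would also need correction for the number of pathways tested.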
Biological-Network-Based Integration
Network-based analyses are another set of promising tools used to study a variety of organismal and cellular mechanisms. 14 Biological networks represent complex connections among diverse types of cellular components such as genes, proteins, and metabolites. These networks can be used to integrate or map results from multiple omic experiments and help identify altered graph neighborhoods, which do not depend on any predefined biochemical pathways. For example, SAMNetWeb 15 and pwOmics 16 support integration of transcriptomic, proteomic, and interactomic data for biological network computation, visualization, and functional enrichment analysis. Metscape, 17 a plug-in for the widely used network analysis software Cytoscape, 18 supports calculation, analysis, and visualization of gene-to-metabolite networks in the context of metabolism. 17 Another tool, MetaMapR, 9 leverages the KEGG 19 and PubChem 20 databases to provide methods for integration and visualization of complex metabolomic results even in cases where biochemical domain knowledge or molecular annotations are unavailable. 9 For example, MetaMapR has been used to integrate biochemical reaction information with molecular structural and mass spectral similarity to identify pathway-independent relationships, including relationships between molecules with unknown structure or biological function.7,8,21 However, biological-network-based methods alone may yield limited insight in cases of insufficient domain knowledge of gene, protein, and metabolite interactions, and are often extended through the incorporation of empirical relationships or correlations between measured species.
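The notion of an altered graph neighborhood can be illustrated with a small sketch: given a network of known interactions and a set of features that changed in an experiment, group the changed features that are directly connected to one another. The edge list and node names below are hypothetical, and connected components of the induced subgraph are only one simple stand-in for the more elaborate methods used by the tools named above.

```python
from collections import deque


def altered_neighborhoods(edges, altered):
    """Connected components of the subgraph induced by the altered nodes."""
    altered = set(altered)
    adjacency = {node: set() for node in altered}
    for a, b in edges:
        # Keep only interactions where both partners changed.
        if a in altered and b in altered:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, components = set(), []
    for start in sorted(altered):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:  # breadth-first search from each unseen node
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


# Hypothetical gene-protein-metabolite interactions.
edges = [
    ("geneA", "proteinA"),
    ("proteinA", "metabolite1"),
    ("metabolite1", "metabolite2"),
    ("geneB", "proteinB"),
    ("proteinB", "metabolite3"),
]
altered = {"geneA", "proteinA", "metabolite1", "metabolite3"}

modules = altered_neighborhoods(edges, altered)
for module in modules:
    print(sorted(module))
```

Here the changed gene, protein, and metabolite form one module without reference to any pathway definition, while metabolite3 remains isolated because its known neighbors did not change.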
Empirical Correlation Analysis
Correlation-based analyses are useful for omic data integration when biochemical domain knowledge is lacking and for integrating biological data with other metadata (eg, clinical outcomes). The R package 22 mixOmics supports correlation analysis between two high-dimensional datasets through methods such as sparse principal component analysis (sPCA), regularized canonical correlation analysis (rCCA), and sparse PLS discriminant analysis (sPLS-DA). 23 The weighted gene correlation network analysis (WGCNA) R package extends the concept of correlations to also include measures of graph topology and has been widely used to analyze gene coexpression networks. 24 WGCNA can be used to relate clusters of highly connected genes to additional information such as single-nucleotide polymorphisms (SNPs) as well as proteomic and clinical data. Other correlation-based approaches, such as the R package DiffCorr, can be used to focus on differences in patterns of relationships between two physiological conditions. 25 Other tools such as MetaMapR combine correlation analysis with other relationships such as biochemical reactions and molecular structural and mass spectral similarity. 9 The recently developed R package Grinn 26 implements a Neo4j graph database 27 to provide a dynamic interface to rapidly integrate gene, protein, and metabolite data using both biological-network-based and correlation-based approaches.
While correlation-based analyses are relatively simple to implement and widely used for multi-omic data integration, these approaches may provide limited insight in cases of highly multicollinear systems (eg, hairball graphs). Gaussian graphical models, partial correlation, and Bayesian networks are more sophisticated approaches that are gaining favor over simple correlations due to their ability to decouple direct from indirect variable associations. For example, the R packages glasso, 28 qpgraph, 29 and huge 30 have been used to identify conditionally independent pairwise relationships (ie, adjusting for all other possible relationships), which can greatly simplify network interpretation. However, these methods may be computationally challenging to implement on typical omic data, which contain far more measured variables than samples. Bayesian-network-based analyses have been used to robustly integrate multiple high-dimensional datasets even in cases of low sample sizes.31,32 However, one potential limitation of this approach is the need to use prior knowledge to estimate probabilistic interactions between modeled variables, 31 which may lead to biased conclusions.
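A minimal sketch of empirical correlation between two omic blocks follows. It computes Pearson correlations for every gene-metabolite pair measured on the same samples and keeps pairs above an absolute threshold as network edges. The feature names, values, and the 0.9 threshold are illustrative assumptions, and the sketch does not reproduce the sparse or regularized methods of mixOmics or the topology measures of WGCNA.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant feature: correlation is undefined, treat as none
    return sxy / sqrt(sxx * syy)


def cross_omic_edges(block_a, block_b, threshold=0.9):
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    edges = []
    for name_a, values_a in block_a.items():
        for name_b, values_b in block_b.items():
            r = pearson(values_a, values_b)
            if abs(r) >= threshold:
                edges.append((name_a, name_b, r))
    return edges


# Hypothetical measurements on the same five samples.
genes = {"geneA": [1, 2, 3, 4, 5], "geneB": [5, 3, 4, 1, 2]}
metabolites = {"met1": [2, 4, 6, 8, 10], "met2": [1, 3, 2, 5, 4]}

edges = cross_omic_edges(genes, metabolites)
for gene, metabolite, r in edges:
    print(gene, metabolite, round(r, 2))
```

With thousands of features per block, the number of pairs grows quickly, so the threshold would normally be replaced or supplemented by p-values adjusted for multiple testing.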
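To illustrate how partial correlation separates direct from indirect associations, the sketch below inverts a small correlation matrix and rescales the resulting precision matrix, using the standard relation that the partial correlation between variables i and j given all others equals the negated precision entry divided by the square root of the product of the two diagonal entries. The three-variable matrix is a constructed example in which x and z are related only through y. Plain matrix inversion is an assumption made for brevity; it is not the penalized estimation used by glasso or huge, and it fails when the matrix is singular, as happens when variables outnumber samples.

```python
from math import sqrt


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; a regularized estimator is needed")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partial_correlations(corr):
    """Partial correlation of each pair, adjusting for all other variables."""
    omega = invert(corr)  # precision matrix
    n = len(corr)
    return [
        [
            1.0 if i == j else -omega[i][j] / sqrt(omega[i][i] * omega[j][j])
            for j in range(n)
        ]
        for i in range(n)
    ]


# Constructed example: x-y and y-z correlate at 0.8; x-z at 0.8 * 0.8 = 0.64,
# which is what a purely indirect x -> y -> z relationship would produce.
corr = [
    [1.00, 0.80, 0.64],
    [0.80, 1.00, 0.80],
    [0.64, 0.80, 1.00],
]
pcor = partial_correlations(corr)
print("x-y given z:", round(pcor[0][1], 3))
print("x-z given y:", round(pcor[0][2], 3))
```

A simple correlation network would draw an x-z edge at 0.64, whereas the partial correlation for that pair is essentially zero, so only the two direct edges remain.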
Future Directions
Development of methods that can deal with both large, complex, high-dimensional data and sparse biological domain knowledge is required to effectively integrate the massive amounts of biochemical information produced by current and next-generation omic platforms. Developers of future tools need to consider the variety of steps required to effectively integrate multi-omic experiments. Incorporating scalable and quickly searchable databases, machine-learning methods, and scientific application programming interfaces (APIs) is a promising way to meet the rapidly growing need to support current and future omic data analysis and integration pipelines. For example, current state-of-the-field metabolomic experiments may require integration of multiple analytical instruments, data processing methods, robust statistical analyses, machine-learning-based predictive modeling, pathway enrichment, and network-based analyses to fully interrogate and interpret the biological systems in question (Fig. 1). 7 Comprehensive omic analysis tools that combine statistical and multivariate analysis with biochemical domain knowledge, such as MetaboAnalyst 13 or DeviumWeb, 33 are required to enable efficient omic data analysis and integration. Moreover, advanced statistical methods, computational packages, and tools must be easily accessible and well documented in order to gain wide adoption by the scientific community. As omic technologies become higher throughput and grow in coverage and complexity, the bottleneck for omic data analysis will increasingly shift to effective integration and interpretation. To meet this need, it will become increasingly necessary to expand currently used data integration approaches, including pathway analysis and biochemical and empirical networks, to include scalable databases, intuitive user interfaces, interactive visualizations, machine-learning tools, and scientific APIs.

Figure 1. Example of a modern metabolomic data analysis workflow integrating three discrete mass spectral analysis platforms. 7 Data from three independent analytical platforms were merged and evaluated using statistical and machine-learning methods to identify significant metabolomic differences and top 10% discriminants between experimental treatments. Partial correlation networks, biochemical enrichment analysis, hierarchical clustering, and biochemical network integration were used to visualize and integrate the high-dimensional omic data within a biological context.
Author Contributions
Conceived and designed experiments: DG. Analyzed the data: KW, JF, DG. Wrote the first draft of the manuscript: KW, JF, DG. Contributed to the writing of the manuscript: KW, JF, DG. Agree with the manuscript's results and conclusions: KW, JF, DG. Jointly developed the structure and arguments for the paper: KW, JF, DG. Made critical revisions and approved final version: DG. All authors reviewed and approved the final manuscript.
Footnotes
Acknowledgements
The authors thank Prof Oliver Fiehn for his support.
