CoMM: A Collaborative Mixed Model That Integrates GWAS and eQTL Data Sets to Investigate the Genetic Architecture of Complex Traits

Abstract

Genome-wide association study (GWAS) analyses have identified thousands of associations between genetic variants and complex traits. However, it is still a challenge to uncover the mechanisms underlying the association. With the growing availability of transcriptome data sets, it has become possible to perform statistical analyses targeted at identifying influential genes whose expression levels correlate with the phenotype. Methods such as PrediXcan and transcriptome-wide association study (TWAS) use the transcriptome data set to fit a predictive model for gene expression, with genetic variants as covariates. The gene expression levels for the GWAS data set are then ‘imputed’ using the prediction model, and the imputed expression levels are tested for their association with the phenotype. These methods fail to account for the uncertainty in the GWAS imputation step, and we propose a collaborative mixed model (CoMM) that addresses this limitation by jointly modelling the multiple analysis steps. We illustrate CoMM’s ability to identify relevant genes in the Northern Finland Birth Cohort 1966 data set and extend the model to handle the more widely available GWAS summary statistics.

Keywords

Transcriptome-wide association studies Linear mixed model Probabilistic model EM algorithm

Comment on: Yang C, Wan X, Lin X, Chen M, Zhou X, Liu J. CoMM: a collaborative mixed model to dissecting genetic contributions to complex traits by leveraging regulatory information. Bioinformatics. 2018;35:1644-1652. doi:10.1093/bioinformatics/bty865. PubMed PMID: 30295737. https://www.ncbi.nlm.nih.gov/pubmed/30295737

Introduction

In the last decade, genome-wide association studies (GWASs) have identified thousands of genetic variants associated with complex traits. However, an understanding of the mechanisms linking a genetic variant to a complex trait is relatively limited. As many GWAS loci are located outside coding regions and regulatory variation plays an important role in shaping observed traits, gene expression has been proposed as an informative intermediate phenotype.¹ Incorporating regulatory information into statistical analyses provides us with a principled approach to study the genetic contribution to complex traits through the regulation of gene expression.

One approach to incorporating functional information is to do so without explicitly modelling the relationship between gene expression and phenotype. The Sequence Kernel Association Test (SKAT),² for instance, can be used to prioritize genetic variants based on functional annotation. It is based on the idea that genetic variants with known biological functions are more likely to be associated with a trait. Hence, when testing for association between genetic variants and a trait, the genetic variants are prioritized by placing larger weights on those with known functions.²

Recently, the growing availability of transcriptome data has given rise to methods that evaluate genetically regulated gene expression using both GWAS and transcriptome data sets. Large-scale transcriptome data sets, which contain information on genotypes and gene expression levels, include the Genotype-Tissue Expression Consortium (GTEx),³ the Genetic European Variation in Health and Disease (GEUVADIS) Project,⁴ Braineac,⁵ Depression Genes and Networks⁶ and eQTLGen.⁷

Methods that have leveraged both GWAS and transcriptome data include PrediXcan⁸,⁹ and transcriptome-wide association study (TWAS);¹⁰ they generally proceed in 3 steps. First, using transcriptome data, they fit predictive models for gene expression with genetic variants near a gene as covariates. PrediXcan proposed the use of elastic net regression or ridge regression to build a predictive model, while TWAS proposed the use of a linear mixed model. The fitted models are then used to predict the gene expression levels for individuals in the GWAS data set. Finally, a simple linear regression is used to examine the association between the predicted expression levels and the complex trait in the GWAS data set.
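The 3 steps described above can be sketched in a few lines of NumPy. This is a minimal illustration only, assuming ridge regression for the predictive model and a normal approximation for the final test; the function names `ridge_fit` and `two_stage_test`, the penalty `lam`, and the simulated data are all hypothetical and are not taken from PrediXcan or TWAS software.

```python
import math
import numpy as np

def ridge_fit(W1, y, lam):
    """Ridge estimate of SNP effects on expression: (W1'W1 + lam*I)^{-1} W1'y."""
    m = W1.shape[1]
    return np.linalg.solve(W1.T @ W1 + lam * np.eye(m), W1.T @ y)

def two_stage_test(W1, y, W2, z, lam=1.0):
    """Fit on transcriptome data, impute in GWAS data, then regress trait on imputed expression."""
    gamma_hat = ridge_fit(W1, y, lam)       # step 1: predictive model for expression
    expr_hat = W2 @ gamma_hat               # step 2: imputed expression in the GWAS sample
    x = expr_hat - expr_hat.mean()          # step 3: simple linear regression of z on x
    zc = z - z.mean()
    sxx = x @ x
    alpha_hat = (x @ zc) / sxx
    resid = zc - alpha_hat * x
    se = math.sqrt((resid @ resid) / (len(z) - 2) / sxx)
    t_stat = alpha_hat / se
    # Two-sided p-value using a normal approximation to the t distribution.
    p_value = math.erfc(abs(t_stat) / math.sqrt(2.0))
    return alpha_hat, t_stat, p_value

# Simulated (hypothetical) data: 200 transcriptome samples, 1000 GWAS samples, 20 cis-SNPs.
rng = np.random.default_rng(0)
n1, n2, m = 200, 1000, 20
W1 = rng.standard_normal((n1, m))
W2 = rng.standard_normal((n2, m))
gamma = rng.normal(0.0, 0.3, m)
y = W1 @ gamma + rng.standard_normal(n1)
z = 0.5 * (W2 @ gamma) + rng.standard_normal(n2)
alpha_hat, t_stat, p_value = two_stage_test(W1, y, W2, z)
```

Note that `gamma_hat` is treated as fixed once step 1 is complete, so its estimation error never enters the standard error computed in step 3.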

Methods that proceed in such a stage-wise manner do not account for the uncertainty that arises when imputing the gene expression levels in the GWAS data set, which may lead to a loss in statistical power. To address this limitation, we proposed the collaborative mixed model (CoMM),¹¹ which accounts for the uncertainty in the ‘imputation’ model by jointly fitting the imputation and the association analysis models.

Method

Suppose that the gene expression levels for $G$ genes and the allele counts for $M$ single-nucleotide polymorphisms (SNPs) are measured for $n_{1}$ samples in the transcriptome data set and that the phenotype values and the allele counts for the same $M$ SNPs are measured for $n_{2}$ samples in the GWAS data set. The transcriptome data set consists of the $n_{1}$ by $G$ gene expression data matrix $Y$ and the $n_{1}$ by $M$ genotype data matrix $W_{1}$. The GWAS data set consists of the phenotype vector $z$ and the $n_{2}$ by $M$ genotype data matrix $W_{2}$. We predict the gene expression levels one gene at a time and denote the gene expression levels at the $g$th gene by $y_{g}$. As we are interested in the variation in gene expression attributable to variation in its cis-SNPs, we model the gene expression levels and phenotype value using only nearby SNPs. Let $W_{1 g}$ and $W_{2 g}$ denote the genotype matrices corresponding to the gene’s nearby SNPs in the transcriptome data set and the GWAS data set, respectively. Let $M_{g}$ denote the number of SNPs corresponding to the $g$th gene. We assume that $y_{g}$ is mean centred, and $W_{1 g}$ and $W_{2 g}$ are standardized (columns have zero mean and unit variance).
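The preprocessing assumed here (mean-centred expression, standardized genotype columns) might look as follows. This is a sketch under our own assumptions: the helper names `standardize_genotypes` and `centre` are hypothetical, the population standard deviation (`ddof=0`) is used, and monomorphic SNPs are dropped because they cannot be scaled to unit variance.

```python
import numpy as np

def standardize_genotypes(W):
    """Centre and scale each SNP column; drop monomorphic SNPs (zero variance)."""
    W = np.asarray(W, dtype=float)
    sd = W.std(axis=0)                 # population standard deviation per SNP
    keep = sd > 0                      # boolean mask of SNPs that vary in the sample
    Wk = W[:, keep]
    return (Wk - Wk.mean(axis=0)) / sd[keep], keep

def centre(y):
    """Subtract the sample mean from an expression vector."""
    y = np.asarray(y, dtype=float)
    return y - y.mean()

# Toy allele-count matrix (4 samples, 3 SNPs); the middle SNP is monomorphic.
G = np.array([[0, 1, 2],
              [1, 1, 0],
              [2, 1, 1],
              [0, 1, 2]])
Ws, keep = standardize_genotypes(G)
yc = centre([3.0, 5.0, 4.0, 8.0])
```

Because the same SNP effects are shared between the two data sets, the columns retained in the transcriptome genotype matrix and in the GWAS genotype matrix would need to be the same SNPs in the same order.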

In CoMM, we first model the relationship between gene expression $y_{g}$ and genotype $W_{1 g}$ in the transcriptome data set as

$$ y_{g} = W_{1 g} γ_{g} + e_{1} \tag{1} $$

where $γ_{g}$ is an $M_{g}$ vector of SNP effects on the gene expression level, and $e_{1} \sim N (0, σ_{1}^{2} I_{n_{1}})$ is an $n_{1}$ vector representing the error associated with the expression level. Next, we model the relationship between phenotype $z$ and genotype $W_{2 g}$ in the GWAS data set as

$$ z = X β + α_{g} W_{2 g} γ_{g} + e_{2} \tag{2} $$

where $X$ contains covariates that control for population stratification and other confounding variables, $β$ is a vector of fixed-effects coefficients, and $e_{2} \sim N (0, σ_{2}^{2} I_{n_{2}})$ is an $n_{2}$ vector representing the error associated with the phenotype.

The quantity of interest is $α_{g}$, the effect of gene $g$’s expression level on the phenotype. In addition, we assume the prior distribution on $γ_{g}$

$$ γ_{g} \sim N (0, σ_{g}^{2} I_{M_{g}}) \tag{3} $$

effectively treating the effects of genotype on gene expression as random. An accelerated expectation–maximization (EM) algorithm using parameter expansion¹² is used to estimate all parameters in the joint model given by equations (1) and (2). Figure 1 summarizes the data input, the joint model, and output for CoMM.

Figure 1.

Schematic of CoMM. The transcriptome and GWAS data sets are used to fit the parameters in the model given by equations (1) and (2). The parameter estimates are used to evaluate the likelihood ratio test statistic, which tests for association between the phenotype and genetically regulated gene expression. CoMM indicates collaborative mixed model; GWAS, genome-wide association studies.
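To make the generative structure of equations (1) to (3) concrete, the following sketch draws one data set from the model. It is illustrative only: the function name `simulate_comm` and all parameter values are hypothetical, and the covariate term $X β$ in equation (2) is omitted (equivalently, assumed to have been regressed out).

```python
import numpy as np

def simulate_comm(W1, W2, alpha, sigma_g2, sigma1_2, sigma2_2, rng):
    """Draw (y, z, gamma) from equations (1)-(3), with the covariate term X*beta omitted."""
    n1, m = W1.shape
    n2 = W2.shape[0]
    # Equation (3): random SNP effects on expression.
    gamma = rng.normal(0.0, np.sqrt(sigma_g2), size=m)
    # Equation (1): expression in the transcriptome sample.
    y = W1 @ gamma + rng.normal(0.0, np.sqrt(sigma1_2), size=n1)
    # Equation (2): phenotype in the GWAS sample, driven by the same gamma.
    z = alpha * (W2 @ gamma) + rng.normal(0.0, np.sqrt(sigma2_2), size=n2)
    return y, z, gamma

rng = np.random.default_rng(2024)
W1 = rng.standard_normal((100, 10))   # stand-in for standardized transcriptome genotypes
W2 = rng.standard_normal((500, 10))   # stand-in for standardized GWAS genotypes
y, z, gamma = simulate_comm(W1, W2, alpha=0.5, sigma_g2=0.1,
                            sigma1_2=1.0, sigma2_2=1.0, rng=rng)
```

The point of the sketch is that a single draw of $γ_{g}$ enters both $y_{g}$ and $z$, which is what couples the two data sets in the joint model.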

As our objective is to evaluate whether a gene is associated with the phenotype via gene expression, we perform the hypothesis test

$$ ℋ_{0} : α_{g} = 0 \quad \text{vs.} \quad ℋ_{1} : α_{g} \neq 0 \tag{4} $$

The likelihood ratio test is used, and the test statistic is given by

$$ Λ_{g} = 2 \left( \log \Pr (y_{g}, z, γ_{g} \mid \hat{θ}) - \log \Pr (y_{g}, z, γ_{g} \mid {\hat{θ}}_{0}) \right) $$

where $\hat{θ}$ and ${\hat{θ}}_{0}$ contain the parameter estimates obtained under the full model and $ℋ_{0}$, respectively. The test statistic $Λ_{g}$ is asymptotically distributed as $χ_{d f = 1}^{2}$ under the null hypothesis. The key thing to note is that the likelihood reflects the uncertainty in both the imputation and association analysis models (equations (1) and (2)). As such, the CoMM test statistic for expression-trait association takes into account the uncertainty in the imputation model.
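As a sketch of how such a likelihood could be evaluated, suppose $γ_{g}$ is integrated out and the covariate term $X β$ is dropped (both are simplifying assumptions made here for illustration). The stacked vector $(y_{g}, z)$ is then Gaussian with mean zero and covariance $σ_{g}^{2} B B^{T} + \mathrm{diag}(σ_{1}^{2} I_{n_{1}}, σ_{2}^{2} I_{n_{2}})$, where $B$ stacks $W_{1 g}$ on top of $α_{g} W_{2 g}$. The function names below are hypothetical, and the code only evaluates the likelihood at given parameter values; the maximization step (the parameter-expanded EM algorithm in the text) is not shown.

```python
import math
import numpy as np

def joint_marginal_loglik(y, z, W1, W2, alpha, sigma_g2, sigma1_2, sigma2_2):
    """Log-density of the stacked vector (y, z) with gamma integrated out and X*beta omitted."""
    n1, n2 = len(y), len(z)
    B = np.vstack([W1, alpha * W2])                      # stacked design for the shared gamma
    noise = np.concatenate([np.full(n1, sigma1_2), np.full(n2, sigma2_2)])
    Sigma = sigma_g2 * (B @ B.T) + np.diag(noise)        # joint covariance of (y, z)
    v = np.concatenate([y, z])
    _, logdet = np.linalg.slogdet(Sigma)
    quad = v @ np.linalg.solve(Sigma, v)
    return -0.5 * (len(v) * math.log(2.0 * math.pi) + logdet + quad)

def lrt_pvalue(stat):
    """Upper tail of a chi-square with 1 df: P(X > stat) = erfc(sqrt(stat / 2))."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

# Hypothetical small example; the parameter values are arbitrary, not estimates.
rng = np.random.default_rng(1)
W1 = rng.standard_normal((30, 5))
W2 = rng.standard_normal((40, 5))
y = rng.standard_normal(30)
z = rng.standard_normal(40)
ll_null = joint_marginal_loglik(y, z, W1, W2, 0.0, 0.5, 1.0, 2.0)
p = lrt_pvalue(6.0)   # 6.0 is a made-up statistic, used only to show the call
```

When `alpha` is zero the off-diagonal block of the covariance vanishes, so the joint log-likelihood splits into a term for the expression data and a term for the phenotype; this gives a simple check on the implementation.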

Results in the Northern Finland Birth Cohort 1966

We analysed the GWAS data set from the Northern Finland Birth Cohort 1966 (NFBC1966)¹³ with the aid of transcriptome data from GTEx (tissue: subcutaneous adipose).³ The NFBC1966 data set records traits such as body mass index (BMI), low-density lipoprotein cholesterol (LDL), triglycerides (TGs), total cholesterol (TC), systolic blood pressure (SysBP), and diastolic blood pressure (DiaBP).

CoMM returns a larger number of significant findings than PrediXcan and SKAT, as indicated by the QQ-plots of the P values (Figure 2). In particular, CoMM reported 12 significant genes associated with triglyceride (TG) levels, whereas PrediXcan:Enet, PrediXcan:Ridge, and SKAT reported 2, 1, and 0 significant genes, respectively. Among the 12 identified genes, 2 (OST4 and EIF2B4) have nonnegligible cellular heritability ($h_{C}^{2} = 2.03\%$ and $1.33\%$, respectively) and have reported associations with TG in previous studies.¹⁴,¹⁵ In this instance, CoMM performs better than SKAT due to the use of gene regulation information in the GTEx data, and outperforms PrediXcan by taking into account the uncertainty in the imputation model.

Figure 2.

The QQ-plots of P-values for the quantitative traits in NFBC1966.

Extension of CoMM to Analyse GWAS Summary Data

A limitation of CoMM is that it requires individual-level data and is unable to make use of large-scale GWAS that provide only summary statistics. To capitalize on these GWAS, we extend CoMM so that summary statistics, in the form of estimated SNP effect sizes and their variances, can replace the role of individual-level GWAS data.

We adapt the method of Zhu and Stephens,¹⁶ which makes use of summary statistics by introducing a regression with summary statistics (RSS) likelihood in a Bayesian framework. As gene expression levels are modelled using multiple SNPs, we additionally require information on the correlations among SNPs (linkage disequilibrium). Fortunately, such information is available from public resources such as the 1000 Genomes Project.¹⁷ In CoMM for GWAS summary statistics (CoMM-S²),¹⁸ the association between phenotype and genetically regulated gene expression is evaluated by combining the distribution for individual-level transcriptome data with an RSS distribution for GWAS summary statistics, while taking into account linkage disequilibrium as estimated from a reference panel.
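For orientation, the RSS likelihood is commonly written as $\hat{β} \sim N (S R S^{-1} β, S R S)$, where $\hat{β}$ holds the single-SNP effect estimates, $S$ is the diagonal matrix of their standard errors, $R$ is the linkage disequilibrium (correlation) matrix, and $β$ holds the joint SNP effects. The sketch below evaluates this log-likelihood; the function name `rss_loglik` and the toy numbers are hypothetical, and how $β$ is tied to $α_{g}$ and $γ_{g}$ in CoMM-S² is not shown here.

```python
import math
import numpy as np

def rss_loglik(beta_hat, se, R, beta):
    """Log-density of beta_hat under N(S R S^{-1} beta, S R S), with S = diag(se)."""
    S = np.diag(se)
    mean = S @ R @ (beta / se)          # S R S^{-1} beta
    cov = S @ R @ S
    diff = beta_hat - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(beta_hat) * math.log(2.0 * math.pi) + logdet + quad)

# Toy summary statistics for 3 SNPs (made-up values).
beta_hat = np.array([0.05, 0.00, -0.10])
se = np.array([0.10, 0.20, 0.30])
beta = np.array([0.00, 0.10, -0.20])
R = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.0]])         # hypothetical LD matrix from a reference panel
ll = rss_loglik(beta_hat, se, R, beta)
```

An LD matrix estimated from a reference panel may be singular or poorly conditioned, in which case some form of regularization would be needed before solving against it; the toy matrix above is positive definite, so none is applied.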

Even though CoMM-S² uses only GWAS summary statistics, its performance is comparable to that of CoMM. To illustrate the performance of CoMM-S² relative to CoMM, we use NFBC1966 as the GWAS data set, GTEx as the transcriptome data set and the 1000 Genomes Project as a reference panel to estimate linkage disequilibrium. In general, the test statistic values from CoMM-S² are close to their corresponding values from CoMM: the regression slope is around 1 and R² ranges from 0.91 to 0.99 (Figure 3). The close correspondence in test statistic values is most apparent in the null region. In the nonnull region, the test statistics for CoMM-S² may be inflated (Figure 3). One possible reason for this is linkage disequilibrium misspecification,¹⁸ due to the genetic differences between the Finnish cohort in NFBC1966 and the European sample in the 1000 Genomes Project.¹⁹ Nonetheless, as strong inflation occurs only when the CoMM statistic is large, CoMM-S² maintains a reasonable false-positive rate in the presence of misspecified linkage disequilibrium.

Figure 3.

Scatter plot of test statistics from the likelihood ratio test for CoMM-S² versus CoMM, using GTEx (tissue: subcutaneous adipose) transcriptome data, NFBC1966 GWAS data, and 1000 Genomes Project reference panel data.
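The slope and R² summaries quoted for this comparison can be computed from two vectors of test statistics as sketched below. This assumes an ordinary least-squares line with an intercept, which may differ from the fit used for the figure; the function name `slope_and_r2` and the numbers are hypothetical.

```python
import numpy as np

def slope_and_r2(x, y):
    """Least-squares slope of y on x (with intercept) and the coefficient of determination."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    fitted = slope * x + intercept
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return slope, 1.0 - ss_res / ss_tot

# Made-up test statistics for five genes from the two methods.
stat_individual = np.array([0.2, 1.1, 3.5, 8.0, 15.0])
stat_summary = np.array([0.3, 1.0, 3.9, 8.4, 17.1])
slope, r2 = slope_and_r2(stat_individual, stat_summary)
```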

Finally, we note that both CoMM and CoMM-S² are designed for single-tissue analysis. When a multi-tissue transcriptome data set is available, an approach that takes into account genetic correlation across tissues may be preferable. Such an approach would be better equipped to identify biologically relevant tissues for each gene, and may also provide an increase in statistical power for tissues that are difficult to obtain.²⁰ Recently, 2 multi-tissue approaches, UTMOST²⁰ and MultiXcan,²¹ have been proposed. They are more powerful than single-tissue approaches in expression-trait association analyses. However, they ignore the uncertainty due to the imputation step, and similar to what has been proposed for CoMM and CoMM-S², the ability to detect relevant genes can be further improved by combining the imputation model and association analysis model via a unified likelihood framework. This remains a promising avenue for further research.

Footnotes

Acknowledgements

The authors thank the National Supercomputing Centre, Singapore, for providing computational resources for the project.

Funding:

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Duke-NUS Medical School (grant no. R-913-200-098-263) and Singapore’s Ministry of Education Academic Research Fund (AcRF) Tier 2 (grant nos. MOE2016-T2-2-029, MOE2018-T2-1-046, and MOE2018-T2-2-006).

Declaration of conflicting interests:

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Author Contributions

JL and CY conceived this commentary. YY performed the analysis and computation. K-FY wrote the manuscript in consultation with JL and CY.

Data Availability and Implementation

The developed R package is available at .

ORCID iDs

Can Yang

Jin Liu

References

1. Gallagher, Chen-Plotkin AS. The post-GWAS era: from association to function. Am J Hum Genet. 2018;102:717-730. doi:10.1016/j.ajhg.2018.04.002.

2. Lee, Cai, Boehnke, Lin. Rare-variant association testing for sequencing data with the Sequence Kernel Association Test. Am J Hum Genet. 2011;89:82-93. doi:10.1016/j.ajhg.2011.05.029.

3. GTEx Consortium. Genetic effects on gene expression across human tissues. Nature. 2017;550:204-213. doi:10.1038/nature24277.

4. Lappalainen, Sammeth, Friedländer, et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501:506-511. doi:10.1038/nature12531.

5. Ramasamy, Trabzuni, Guelfi, et al. Genetic variability in the regulation of gene expression in ten regions of the human brain. Nat Neurosci. 2014;17:1418-1428. doi:10.1038/nn.3801.

6. Battle, Mostafavi, Zhu, et al. Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Res. 2014;24:14-24. doi:10.1101/gr.155192.113.

7. Võsa, Claringbould, Westra, et al. Unraveling the polygenic architecture of complex traits using blood eQTL meta-analysis. bioRxiv. 2018:447367. doi:10.1101/447367.

8. Gamazon, Wheeler, Shah, et al. A gene-based association method for mapping traits using reference transcriptome data. Nat Genet. 2015;47:1091-1098. doi:10.1038/ng.3367.

9. Barbeira, Dickinson, Bonazzola, et al. Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat Commun. 2018;9:1825. doi:10.1038/s41467-018-03621-1.

10. Gusev, Shi, et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat Genet. 2016;48:245-252. doi:10.1038/ng.3506.

11. Yang, Wan, Lin, Chen, Zhou, Liu. CoMM: a collaborative mixed model to dissecting genetic contributions to complex traits by leveraging regulatory information. Bioinformatics. 2018;35:1644-1652. doi:10.1093/bioinformatics/bty865.

12. Liu, Rubin, YN. Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika. 1998;85:755-770. doi:10.1093/biomet/85.4.755.

13. Sabatti, Service, Hartikainen, et al. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nat Genet. 2009;41:35-46. doi:10.1038/ng.271.

14. Rundblad, Larsen, Myhrstad, et al. Differences in peripheral blood mononuclear cell gene expression and triglyceride composition in lipoprotein subclasses in plasma triglyceride responders and non-responders to omega-3 supplementation. Genes Nutr. 2019;14:10. doi:10.1186/s12263-019-0633-y.

15. Chen, Zhang, et al. Multivariate analysis of genomics data to identify potential pleiotropic genes for type 2 diabetes, obesity and dyslipidemia using meta-CCA and gene-based approach. PLoS ONE. 2018;13:e0201173. doi:10.1371/journal.pone.0201173.

16. Zhu, Stephens. Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. Ann Appl Stat. 2017;11:1561-1592.

17. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56-65. doi:10.1038/nature11632.

18. Yang, Shi, Jiao, et al. CoMM-S2: a collaborative mixed model using summary statistics in transcriptome-wide association studies. bioRxiv. 2019:652263. doi:10.1101/652263.

19. Salmela. Genetic Structure in Finland and Sweden: Aspects of Population History and Gene Mapping [dissertation]. Helsinki, Finland: University of Helsinki; 2012.

20. et al. A statistical framework for cross-tissue transcriptome-wide association analysis. Nat Genet. 2019;51:568-576. doi:10.1038/s41588-019-0345-7.

21. Barbeira, Pividori, Zheng, Wheeler, Nicolae, HK. Integrating predicted transcriptome from multiple tissues improves association detection. PLoS Genet. 2019;15:e1007889. doi:10.1371/journal.pgen.1007889.