Interactions between genes and environmental factors may play a key role in the etiology of many common disorders. Several regularized generalized linear models have been proposed for hierarchical selection of gene by environment interaction effects, where a gene-environment interaction effect is selected only if the corresponding genetic main effect is also selected in the model. However, none of these methods allow to include random effects to account for population structure, subject relatedness and shared environmental exposure. In this article, we develop a unified approach based on regularized penalized quasi-likelihood estimation to perform hierarchical selection of gene-environment interaction effects in sparse regularized mixed models. We compare the selection and prediction accuracy of our proposed model with existing methods through simulations under the presence of population structure and shared environmental exposure. We show that for all simulation scenarios, including and additional random effect to account for the shared environmental exposure reduces the false positive rate and false discovery rate of our proposed method for selection of both gene-environment interaction and main effects. Using the score as a balanced measure of the false discovery rate and true positive rate, we further show that in the hierarchical simulation scenarios, our method outperforms other methods for retrieving important gene-environment interaction effects. Finally, we apply our method to a real data application using the Orofacial Pain: Prospective Evaluation and Risk Assessment (OPPERA) study, and found that our method retrieves previously reported significant loci.
Genome-wide association studies (GWASs) have led to the identification of hundreds of common genetic variants, or single nucleotide polymorphisms (SNPs), associated with complex traits1 and are typically conducted by testing association on each SNP independently. However, these studies are plagued with the multiple testing burden that limits discovery of potentially important predictors, as genome-wide significance -value threshold of has become the standard. Moreover, GWASs have brought to light the problem of missing heritability, that is, identified variants only explain a low fraction of the total observed variability for traits under study.2 Beyond the identified genetic variants, interactions between genes and environmental factors may play a key role in the multifactorial etiology of many complex diseases that are subject to both genetic and environmental risk factors. For example, in assessing interactions between a polygenic risk score (PRS) and non-genetic risk factors for young-onset breast cancers (YOBC), Shi et al.3 showed a decreased association between the PRS and YOBC risk for women who had ever used hormonal birth control, suggesting that environmental exposure might result in risk stratification by interacting with genetic factors. Thus, there is a rising interest for discovering gene-environment interaction (GEI) effects as they are fundamental to better understand the effect of environmental factors in disease and to increase risk prediction accuracy.4
Several regularized generalized linear models (GLMs) have been proposed for selection of both genetic and GEI effects in genetic association studies,5–7 but currently no such method allows to include any random effect to account for genetic similarity between subjects. Indeed, one can control for population structure and/or closer relatedness by including in the model a polygenic random effect with variance-covariance structure proportional to a kinship or genetic similarity matrix (GSM).8 However, because kinship is a high-dimensional process, it cannot be fully captured by including only a few principal components (PCs) as fixed effects in the model.9 Hence, while both PC analysis (PCA) and mixed models (MMs) share the same underlying model, MMs are more robust in the sense that they do not require distinguishing between the different types of confounders.10 Moreover, MMs alleviate the need to evaluate the optimal number of PCs to retain in the model as fixed effects.
Except for normal responses, the joint estimation of variance components and fixed effects in regularized models is challenging both from a computational and analytical point of view, as the marginal likelihood for a generalized linear mixed model (GLMM) has no analytical form. To address these challenges, penalized quasi-likelihood (PQL) estimation is conceptually attractive as under this method, random effects can be treated as fixed effects, which allows to perform regularized estimation of both fixed and random effects as in the GLM framework. The computational efficiency of multivariable methods for high-dimensional MMs rely on performing a single spectral or Cholesky decomposition of the covariance matrix to rotate the phenotype and design matrix such that the transformed data become uncorrelated. For very large sample sizes, computing these decompositions can be very burdensome, with complexity of , where is the sample size. Secondly, to obtain regularized estimates for the genetic predictors and GEI effects in linear mixed models (LMMs), we need to perform matrix multiplications with complexity of and to rotate the phenotype and genotype matrices respectively, with the number of genetic predictors. Even for moderately small cohorts, the number of predictors in GWAS is often greater than 1 million, such that the genotype matrix itself will require around one terabyte of space to be loaded in memory in a normal double-precision format.11 In PQL regularized models, by minimizing the objective function with respect to the fixed effects vector only, we need not rotate the genotype matrix as we are conditioning on the random effects vector estimate.
Several authors have proposed to combine PQL estimation in presence of sparsity by inducing regularization to perform joint selection of fixed and/or random effects in multivariable GLMMs.12–14 However, these methods were not developed to specifically address selection of GEI effects. Although it is possible to perform naive selection of fixed and GEI effects by simply considering interaction terms as additional predictors, the aforementioned methods are not tailored to perform hierarchical selection, where interaction terms are only allowed to be selected if their corresponding main effects are active (i.e. non-zero) in the model.15 Hierarchical variable selection of GEI effects is appealing both for increasing statistical power16 and for enhancing model interpretability because interaction terms that have large main effects are more likely to be retained in the model.
Population structure and closer relatedness may also cause dependence between gene and environment, leading to selection of spurious GEI effects.17 In the context of GWAS, Sul et al.18 showed that under the polygenic model, ignoring this dependence may largely increase the false positive rate of GEI statistics. They proposed introducing an additional random effect that captures the similarity of individuals due to polygenic GEI effects to account for the fact that individuals who are genetically related and who share a common environmental exposure are more closely related. To our knowledge, the spurious selection of GEI effects in regularized models due to the dependence between gene and shared environmental exposure has not been explored yet. Thus, further work is needed to develop sparse regularized GLMMs for hierarchical selection of GEI effects in genetic association studies, while explicitly accounting for the complex correlation structure between individuals that arises from both genetic and environmental factors.
In this article, we develop a unified approach based on regularized PQL estimation to perform hierarchical selection of GEI effects in sparse regularized logistic mixed models. Similar to Sul et al.,18 we use a random effect that captures population structure and closer relatedness through a genetic kinship matrix, and shared environmental exposure through a GxE kinship matrix. We propose to use a composite absolute penalty (CAP) for hierarchical variable selection19 to seek a sparse subset of genetic and GEI effects that gives an adequate fit to the data. We derive a proximal Newton-type algorithm with block coordinate descent for PQL estimation with mixed lasso and group lasso penalties, relying on our previous work to address computational challenges associated with regularized PQL estimation in high-dimensional data.12 We compare the prediction and selection accuracy of our proposed model with existing methods through simulations under the presence of population structure and environmental exposure. Finally, we also apply our method to a real data application using the Orofacial Pain: Prospective Evaluation and Risk Assessment (OPPERA) study cohort20 to study the sex-specific association between temporomandibular disorder (TMD) and genetic predictors.
Methodology
Model
We have the following GLMM
for , where , is a row vector of covariates for subject , is a row vector of genotypes for subject taking values as the number of copies of the minor allele, is a column vector of fixed covariate and additive genotype effects including the intercept, is the exposure of individual to a binary or continous environmental factor with fixed effect , and is the vector of fixed GEI effects. Thus, we have a total of coefficients. We assume that is an column vector of random effects, with the variance components that account for the relatedness between individuals. is a known GSM or kinship matrix and is an additional kinship matrix that describes how individuals are related both genetically and environmentally, because a pair of individuals who are genetically related and share the same environment exposure have a non-zero kinship coefficient. The kinship matrix corrects for the spurious association of GEI effects due to population structure and subjects relatedness, in the same way that the kinship matrix corrects for population structure and subjects relatedness on the main effects. Thus, the matrix can be interpreted as the covariance matrix between individuals that captures the residual variance explained by the sum of many small GEI effects across the genome. For a binary exposure, we define if , and otherwise. For a continuous exposure, one possibility is to set , where is a metric with range . The phenotypes ’s are assumed to be conditionally independent and identically distributed given and follow any exponential family distribution with canonical link function , mean and variance where is a dispersion parameter, are known weights, and is the variance function.
Regularized PQL estimation
In order to estimate the model parameters and perform variable selection, we use an approximation method to obtain an analytical closed form for the marginal likelihood of model (1). We propose to fit (1) using a PQL method,12,21 from where the log integrated quasi-likelihood function is equal to
where is a diagonal matrix containing weights for each observation, is the quasi-likelihood for the th individual given the random effects , and is the solution which maximizes .
In typical genome-wide studies, the number of genetic predictors is much greater than the number of observations (), and the fixed effects parameter vector becomes underdetermined when modeling SNPs jointly. Moreover, we would like to induce a hierarchical structure, that is, a GEI effect can be present only if both exposure and genetic main effects are also included in the model. Thus, we propose to add a sparse group lasso penalty22 to the negative quasi-likelihood function in (2) to seek a sparse subset of genetic and GEI effects that gives an adequate fit to the data. Indeed, the sparse group lasso is part of the family of CAPs that can induce hierarchical variable selection.19 We define the following objective function which we seek to minimize with respect to :
where controls the strength of the overall regularization and controls the relative sparsity of the GEI effects for each SNP. In our modeling approach, we do not penalize the environmental exposure fixed effect . Thus, a value of is equivalent to a group lasso penalty where we only include a predictor in the model if both its main effect and GEI effect are non-zero. A value of is equivalent to a sparse group lasso penalty where main effects can be selected without their corresponding GEI effects due to the different strengths of penalization, but a GEI effect is still only included in the model if the corresponding main effect is non-zero.
Estimation of variance components
Jointly estimating the variance components and scale parameter with the regression effects vector and random effects vector is a computationally challenging non-convex optimization problem. Updates for and based on a majorization-minimization (MM) algorithm23 would require inverting three different matrices, with complexity , at each iteration. Thus, even for moderately small sample sizes, this is not practicable for genome-wide studies. Instead, we propose a two-step method where variance components and scale parameter are estimated only once under the null association of no genetic effect, that is assuming , using the average information restricted maximum likelihood (AI-REML) algorithm.12,24
Spectral decomposition of the random effects covariance matrix
Given , and estimated under the null, spectral decomposition of the random effects covariance matrix yields
where is an orthonormal matrix of eigenvectors and is a diagonal matrix of eigenvalues when both and are positive definite. In practice, if is rank-deficient, one can replace it by for small, to ensure that both and are positive definite.
Using (4) and assuming that the weights in vary slowly with the conditional mean,25 minimizing (3) is now equivalent to
where is the minimizer of . Thus, iteratively solving (5) also requires updating the solution at each step until convergence. Conditioning on the previous solution for , is obtained by minimizing a generalized ridge weighted least-squares (WLS) problem with as the regularization matrix. Then, conditioning on , is found by minimizing a WLS problem with a sparse group lasso penalty. We present in Appendix A our proposed proximal Newton-type algorithm that cycles through updates of and .
Simulation study
We first evaluated the performance of our proposed method, called pglmm, against that of a standard logistic lasso, using the Julia package GLMNet which wraps the Fortran code from the original R package glmnet.26 Then, among logistic models that impose hierarchical interactions, we compared our method with the glinternet7 and gesso6 models which are both implemented in R packages. The glinternet method relies on overlapping group lasso, and even though it is optimized for selection of gene by gene interactions in high-dimensional data, it is applicable for selection of GEI effects. An advantage of the method is that it only requires tuning a single parameter value. On the other hand, gesso uses a CAP penalty with a group norm (iCAP) to induce a hierarchical structure, and the default implementation fits solutions paths across a two-dimensional grid of tuning parameter values. For all methods, selection of the tuning parameters is performed by cross-validation. The default implementation for glmnet, glinternet and pglmm is to find the smallest value of the tuning parameter such that no predictor are selected in the model, and then to solve the penalized minimization problem over a grid of decreasing values of . For these three methods, we used a grid of 50 values of on the log10 scale with , where is chosen such that no predictors are selected in the model. In addition, for pglmm, we solved the penalized minimization problem over a grid of 10 values of the tuning parameter evenly spaced from 0 to 0.9, fitting a total of 500 models. The default implementation for gesso is to solve the minimization problem over a 2020 two-dimensional grid of the tuning parameters values , starting from the smallest value such that all coefficients are zero, and setting . Finally, for glmnet, gesso, and glinternet, population structure and environmental exposure is accounted for by adding the top 10 PCs of the kinship matrix as additional covariates.
Simulation model
We performed a total of 100 replications for each of our simulation scenarios, drawing anew genotypes and simulated traits, using real genotype data from a high quality harmonized set of 4097 whole genomes from the Human Genome Diversity Project (HGDP) and the 1000 Genomes Project (1000 G).27 At each replication, we sampled 10,000 candidate SNPs from the chromosome 21 and randomly selected 100 () to be causal. Let be the set of candidate causal SNPs, with , then the causal SNPs fixed effects were generated from a Gaussian distribution , where is the fraction of variance on the logit scale that is due to total additive genetic fixed effects. Let be the set of candidate causal SNPs, not necessarily overlapping with , that have a non-zero GEI effect, with , then the GEI effects were generated from a Gaussian distribution , where is the fraction of variance on the logit scale that is due to total additive GEI fixed effects. Further, we simulated a random effect from a Gaussian distribution , where and are the fractions of variance explained by the polygenic and polygenic by environment effects, respectively. The kinship matrices and were calculated using a set of 50,000 randomly sampled SNPs excluding the set of candidate SNPs, and PCs were obtained from the singular value decomposition of . We simulated a covariate for age using a Normal distribution and used the sex covariate provided with the data as a proxy for environmental exposure. Then, for , binary phenotypes were generated using the following model:
where , for , was simulated using a distribution to specify a different prevalence for each population in Table 1 under the null, and is the th column of the standardized genotype matrix and is the minor allele frequency (MAF) for the th predictor.
Number of samples by population for the high quality harmonized set of 4097 whole genomes from the Human Genome Diversity Project (HGDP) and the 1000 Genomes Project (1000 G).
Population
1000 genomes
HGDP
Total
African
879 (28%)
110 (12%)
989 (24%)
Admixed American
487 (15%)
62 (7%)
549 (13%)
Central/South Asian
599 (19%)
184 (20%)
783 (19%)
East Asian
583 (18%)
234 (25%)
817 (20%)
European
618 (20%)
153 (16%)
771 (19%)
Middle Eastern
0
158 (17%)
158 (4%)
Oceanian
0
30 (3%)
30 (1%)
Total
3166
931
4097
Unrelated individuals
2520
880
3400
In all simulation scenarios, we set and such that each of the main effects () or GEI effects () explains 0.2% of the total variability on the logit scale. We compared the methods when and (i.e. low polygenic effects with ), and when and (i.e. high polygenic effects with ), respectively. In the first simulation scenario, we induced a hierarchical structure for the simulated data by imposing for , such that the total number of causal SNPs is equal to 100, with half of them having non-zero GEI effects. In the second simulation scenario, we repeated the simulations from the first scenario, but without enforcing any hierarchical structure, such that the number of causal SNPs is equal to 150, with 100 of them having non-zero main effects, and 50 having non-zero GEI effects.
Metrics
To compare the performance of all methods in discovering important genetic predictors and estimating their main and interaction effects, we define in this section the performance metrics that will be used. First, we define the model size as simply the number of non-zero coefficients estimated by a model, that is, for the main effects, and for the GEI effects. The false positive rate (FPR) is defined as the number of non-causal predictors that are falsely identified as causal (false positives), divided by the total number of non-causal predictors. The true positive rate (TPR), also known as sensitivity or recall, is defined as the number of true causal predictors that are correctly identified (true positives), divided by the total number of causal predictors. The false discovery rate (FDR) is defined as the number of false positives divided by the total number of selected predictors in the model. Thus, while FPR and TPR measure the ability of a model to distinguish between causal and non-causal predictors, the FDR actually measures the proportion of predictors that are not causal among those declared significant. Moreover, in genetic association studies where the number of non-causal predictors is very high, we are more interested in controlling the FDR rather than the FPR. Alternatively, we can define the precision as 1 minus the FDR, which measures the proportion of causal predictors among those declared significant. The score is defined as the harmonic mean of the precision and TPR, and it can be used to take into account that methods with a large number of selected predictors will likely have a higher TPR, and inversely that methods with a lower number of selected predictors will likely have a higher precision. Finally, the area under the curve (AUC) is used as a measure of the predictive performance of all methods when predicting the binary status of individuals. It takes into account the TPR of all methods at various FPR values when making individual predictions. A higher AUC means that a method has a better capacity at distinguishing between cases and controls.
Results
We obtained solutions paths across a one dimensional (glmnet, glinternet) or two-dimensional grid of tuning parameter values (gesso, pglmm) for the hierarchical and non-hierarchical simulation scenarios and reported the mean precision, that is, the proportion of selected predictors that are causal, over 100 replications for the selection of GEI effects (Figure 1) and main genetic effects (Figure 2) respectively. We see from Figure 1 that in the hierarchical simulation scenario, gesso and pglmm retrieve important GEI effects with better precision than glmnet and glinternet. When we simulate a low random polygenic GEI effect, gesso slightly outperforms pglmm, but when we increase the heritability of the two random effects, both methods perform similarly. When we simulate data under no hierarchical assumption, precision for all hierarchical models fall drastically, although they still perform better than the standard lasso model. We note that gesso retrieves important GEI effects with equal or better precision than other methods in all simulation settings. This is explained by the fact that gesso is using a CAP penalty with group norm which has been shown to perform better than the sparse group lasso for retrieving interaction effects.19 On the other hand, we see from Figure 2 that pglmm outperforms all methods for retrieving important main effects for both hierarchical and non-hierarchical simulation scenarios. When we simulate low polygenic effects, pglmm and glmnet perform comparably. We also note that gesso retrieves main effects with less precision than glmnet and pglmm in all scenarios. At last, the precision of glinternet is considerably lower than all other methods until the number of selected main genetic effects in the model is large.
Precision of compared methods averaged over 100 replications as a function of the number of active gene-environment interaction (GEI) effects.
Precision of compared methods averaged over 100 replications as a function of the number of active main effects in the model.
In practice, we often do not have any a priori knowledge for the number of main effects and/or GEI effects that we want to include in the final model. Thus, instead of comparing methods at a fixed number of selected predictors along their regularization paths, we used cross-validation to compare how each method performs when having to select an optimal number of predictors in the model for the same two simulation scenarios that we previously described. We randomly split the data () into training and test subjects, using a 80/20 ratio, and fitted the full lasso solution path on the training set for 100 replications. We report the model size, FPR, TPR, FDR, and score on the training sets, and the area under the ROC curve (AUC) when making predictions on the independent test subjects. To assess the potential spurious association of both main and GEI effects due to shared environmental exposure, we compare our method when including only the kinship matrix (plmm (1 Random effect (RE))) and when including both and matrices (pglmm (2 REs)).
With respect to selection of the GEI effects (Table 2), the comparative performance of each method varies depending on the simulation scenario. As expected, we see that including an additional random effect reduces the FPR for all simulation scenarios for our proposed method. Unsurprisingly, glinternet and glmnet have the lowest FPRs of all methods since they always select the least number of GEI effects in the final models. Consequently, they have the smallest TPRs in all scenarios. By using the score to account for the trade-off between FDR and TPR, we have that pglmm performs the best in hierarchical simulation scenarios, while gesso performs better in the non-hierarchical scenarios.
Results for the GEI effects .
Non-hierarchical model
Hierarchical model
Metric
Method
Low
High
Low
High
Model size
pglmm (1 RE)
84.6
99.2
95.5
103
pglmm (2 REs)
58.4
57.8
63.9
59.8
glmnet
21.6
38.8
22.7
38.6
glinternet
17.5
39.2
19.5
39.6
gesso
64.3
102
78.6
110
FPR
pglmm (1 RE)
pglmm (2 REs)
glmnet
glinternet
gesso
TPR
pglmm (1 RE)
0.039
0.043
0.126
0.124
pglmm (2 REs)
0.030
0.031
0.084
0.091
glmnet
0.016
0.020
0.018
0.025
glinternet
0.016
0.030
0.024
0.038
gesso
0.052
0.068
0.104
0.103
FDR
pglmm (1 RE)
0.966
0.965
0.926
0.908
pglmm (2 REs)
0.936
0.968
0.921
0.889
glmnet
0.962
0.974
0.955
0.967
glinternet
0.948
0.956
0.923
0.949
gesso
0.945
0.948
0.912
0.930
pglmm (1 RE)
0.035
0.033
0.091
0.084
pglmm (2 REs)
0.039
0.036
0.082
0.090
glmnet
0.036
0.033
0.035
0.036
glinternet
0.040
0.038
0.045
0.048
gesso
0.052
0.048
0.083
0.067
Note: For each simulation scenario, we report the mean value over 100 replications when we simulate only one random effect with low heritability (low ) and we simulate two random effects with high heritability (high ). Bolded values indicate the method with the best performance according to each metric.
With respect to the genetic main effects (Table 3), pglmm selects the lowest number of predictors in the model, and thus has the lowest FPR and FDR in all simulation scenarios. Again, adding an additional random effect reduces the FPR for pglmm, but to a lower extent than for selection of GEI effects. On the other hand, glinternet always selects the largest number of predictors in all scenarios, and hence has the highest TPR and FPR values. Using the score to balance FDR and TPR, we see that pglmm performs the best for retrieving the important main genetic effects in all simulation scenarios. Also, we see that gesso and pglmm perform similarly when the heritability of the polygenic random effects is low, but when we increase the heritability, the FDR for gesso increases drastically, and the number of selected main effects becomes on average more than 1.5 times higher than for pglmm. Results for the accuracy of predicting binary outcomes in independent test sets are included in Table 4. We see that pglmm with two random effects outperforms all other methods for all simulation scenarios.
Results for the genetic predictors main effects .
Non-hierarchical model
Hierarchical model
Metric
Method
Low
High
Low
High
Model size
pglmm (1 RE)
227
212
220
204
pglmm (2 REs)
206
190
214
179
glmnet
278
444
286
450
glinternet
299
481
312
480
gesso
212
361
224
367
FPR
pglmm (1 RE)
pglmm (2 REs)
glmnet
glinternet
gesso
TPR
pglmm (1 RE)
0.195
0.181
0.208
0.196
pglmm (2 REs)
0.188
0.177
0.206
0.188
glmnet
0.215
0.244
0.226
0.257
glinternet
0.216
0.246
0.237
0.271
gesso
0.190
0.220
0.210
0.238
FDR
pglmm (1 RE)
0.895
0.895
0.891
0.883
pglmm (2 REs)
0.888
0.885
0.885
0.870
glmnet
0.920
0.944
0.917
0.942
glinternet
0.925
0.948
0.922
0.943
gesso
0.906
0.937
0.903
0.932
pglmm (1 RE)
0.123
0.120
0.136
0.134
pglmm (2 REs)
0.126
0.124
0.138
0.140
glmnet
0.115
0.091
0.119
0.094
glinternet
0.109
0.085
0.116
0.094
gesso
0.123
0.096
0.131
0.103
Note: For each simulation scenario, we report the mean value over 100 replications when we simulate two random effects with low heritability (low ) and high heritability (high ). Bolded values indicate the method with the best performance according to each metric.
Results for the prediction accuracy of a binary outcome on test sets.
Non-hierarchical model
Hierarchical model
Metric
Method
Low
High
Low
High
AUC
pglmm (1 RE)
0.719
0.786
0.728
0.788
pglmm (2 REs)
0.723
0.790
0.730
0.792
glmnet
0.688
0.753
0.695
0.751
glinternet
0.702
0.760
0.710
0.761
gesso
0.695
0.750
0.707
0.751
Note: For each simulation scenario, we report the mean AUC value over 100 replications when we simulate two random effects with low heritability (low ) and high heritability (high ). Bolded values indicate the method with the best performance according to each metric.
RE: random effect; AUC: area under the curve.
Discovering sex-specific genetic predictors of painful temporomandibular disorder
Significant associations between temporomandibular disorder (TMD), which is a painful disease of the jaw, and four distinct loci have been previously reported in combined or sex-segregated analyses on the OPPERA study cohort.28 Moreover, TMD has much greater prevalence in females than in males and is believed to have some sex-specific pathophysiologic mechanisms.29 In this analysis, we wanted to explore the comparative performance of our proposed method pglmm in selecting important sex-specific predictors of TMD and its performance predicting the risk of painful TMD in independent subjects from two replication cohorts, the OPPERA II Chronic TMD Replication case-control study, and the Complex Persistent Pain Conditions (CPPC): Unique and Shared Pathways of Vulnerability study, using the OPPERA cohort as discovery cohort. Sample sizes and distribution of sex, cases and ancestry for the three studies are shown in Table 5, and further details on study design, recruitment, subject characteristics, and phenotyping for each study are provided in the Supplemental Material of Smith et al.28 (available at http://links.lww.com/PAIN/A688).
Demographic data for the OPPERA training cohort, and for the OPPERA2 and CPPC test cohorts.
We used the imputed data described by Smith et al.28 Genotypes were imputed to the 1000 Genomes Project phase 3 reference panel using the software packages SHAPEIT30 for prephasing and IMPUTE version 2.31 For each cohort independently, we assessed imputation quality taking into account the number of minor alleles as well as the information score such that a SNP with rare MAF must pass a higher quality information threshold for inclusion. After merging all three cohorts, we tested for significant deviations of the Hardy-Weinberg equilibrium (HWE) separately in cases and controls, using a more strict p-value threshold for hypothesis testing among cases to avoid discarding disease-associated SNPs that are possibly under selection32 ( in controls, in cases). We filtered using a SNP call rate > 95% on the combined dataset to retain imputed variants present in all cohorts, which resulted in a total of 4.8 million imputed SNPs. PCs and kinship matrices were calculated on the merged genotype data using the --pca and --make-rel flags in PLINK,33 after using the same HWE p-value threshold and SNP call rate as for the imputed data. To reduce the number of candidate predictors in the regularized models, we performed a first screening by testing genome-wide association with TMD for subjects in the OPPERA discovery cohort using PLINK. We fitted a logistic regression for additive SNP effects, with age, sex, and enrollment site as covariates and the first 10 PCs to account for population stratification, and retained all SNPs with a p-value below 0.05, which resulted in a total of 243 thousand predictors.
We present in Table 6 the estimated odds ratios (OR) by each method, pglmm, gesso, and glmnet, for the selected SNPs for both main and GEI effects. Of note, it was not possible to use the glinternet package due to computational considerations, its memory requirement being too large for the joint analysis of the 243 thousand preselected predictors. All three methods selected the imputed insertion/deletion (indel) polymorphism on chromosome 4 at position 146,211,844 (rs5862730), which was the only reported SNP that reached genome-wide significance in the full OPPERA cohort (, confidence interval (CI): , ).28 In a females-only analysis, rs5862730 was likewise associated with TMD (, CI: , ), and both pglmm and gesso selected the GEI term between rs5862730 and sex.
Selected SNPs by each method with their estimated odds ratios (OR) for the main effects () and GEI effects () from the TMD real data analysis.
pglmm
gesso
glmnet
Chromosome
Position
OR
OR
OR
OR
OR
OR
OR
3
5,046,726
–
–
–
1.0042
1.0087
1.0129
–
3
153,536,154
1.0020
–
–
–
–
–
–
4
42,549,777
1.0068
1.0042
1.0110
1.0029
1.0060
1.0089
–
4
146,211,844
1.0252
1.0448
1.0712
1.0261
1.0553
1.0829
1.0312
11
17,086,381
1.0076
–
–
1.0014
1.0029
1.0042
–
11
132,309,606
0.9965
–
–
–
–
–
–
12
19,770,625
–
–
–
1.0045
1.0094
1.0140
–
12
47,866,802
1.0184
1.0001
1.0184
–
–
–
1.0140
12
47,870,741
–
–
–
1.0152
1.0320
1.0477
–
14
24,345,235
1.0013
–
–
–
–
–
–
16
81,155,867
–
–
–
1.0039
1.0082
1.0122
–
17
46,592,346
–
–
–
1.0025
1.0052
1.0077
–
17
52,888,414
–
–
–
1.0005
1.0011
1.0017
–
17
69,061,947
–
–
–
1.0021
1.0043
1.0064
–
18
36,210,549
–
–
–
1.0186
1.0392
1.0585
–
19
37,070,882
–
–
–
1.0020
1.0042
1.0062
–
21
32,760,615
–
–
–
1.0051
1.0107
1.0159
–
Note: All three methods selected the imputed insertion/deletion (indel) polymorphism on chromosome 4 at position 146,211,844 (rs5862730), which was the only reported SNP that reached genome-wide significance in the full OPPERA cohort.
SNPs: single nucleotide polymorphisms; GEI: gene-environment interaction; TMD: temporomandibular disorder; OPPERA: Orofacial Pain: Prospective Evaluation and Risk Assessment.
Moreover, we present in Table 7, the AUC in the training and test cohorts, the number of predictors selected in each model and the total computation time to fit each method. We see that pglmm has the highest AUC on the training data, as well as the best predictive performance on the CPPC cohort alone. On the other hand, glmnet and gesso both have a greater predictive performance in the OPPERA2 cohort compared to pglmm. When combining the predictions for OPPERA2 and CPPC cohorts, all three methods have similar predictive performance. In term of the number of predictors selected by each model, glmnet has selected two SNPs with important main effects and no GEI effects, while gesso has selected the highest number of predictors, that is a total of 13 SNPs with both main and GEI effects. On the other hand, our proposed method pglmm has selected a total of seven SNPs, among which three had a selected GEI effect with sex. Finally, we report for each method the computational time to fit the model on the training cohort using 10-folds cross-validation. While glmnet only took two hours to fit, it failed to retrieve any potentially important GEI effects between TMD and sex, albeit we note that it had a similar predictive performance than the hierarchical methods on the combined test sets. On the other hand, pglmm had the highest computational time required to fit the model, because it requires iteratively estimating a random effects vector of size , while both glmnet and gesso only require to estimate a vector of fixed effects of size 10 for the PCs. However, pglmm had the highest AUC on the train set, and was able to retrieve potentially important GEI effects for some of the select SNPs in the model, while selecting half as many predictors than gesso.
Area under the ROC curve (AUC), model size and computational time for the analysis of TMD.
We have developed a unified approach based on regularized PQL estimation, for selecting important predictors and GEI effects in high-dimensional GWAS data, accounting for population structure, close relatedness, shared environmental exposure and binary nature of the trait. We proposed to combine PQL estimation with a CAP for hierarchical selection of main genetic and GEI effects, and derived a proximal Newton-type algorithm with block coordinate descent to find coordinate-wise updates. We showed that for all simulation scenarios, including and additional random effect to account for the shared environmental exposure reduced the FPR of our proposed method for selection of both GEI and main effects. Using the score as a balanced measure of the FDR and TPR, we showed that in the hierarchical simulation scenarios, pglmm outperformed all other methods for retrieving important GEI effects. Moreover, using real data from the OPPERA study to explore the comparative performance of our method in selecting important predictors of TMD, we found that our proposed method was able to retrieve a previously reported significant loci in a combined or sex-segregated GWAS.
A limitation of pglmm compared to a logistic lasso or group lasso with PC adjustment is the computational cost of performing multiple matrix calculations that comes from incorporating a GSM to account for population structure and relatedness between individuals. These computations become prohibitive when the sample size increases, and this may hinder the use of random effects in hierachichal selection of both genetic and GEI fixed effects in genetic association studies. Solutions to explore in order to increase computation speed and decrease memory usage would be the use of conjugate gradient methods with a diagonal preconditioner matrix, as proposed by Zhou et al.,34 and the use of sparse GSMs to adjust for the sample relatedness.35
In this study, we focused solely on the sparse group lasso as a hierarchical regularization penalty. Although previous work has shown that using a CAP penalty with a group norm (iCAP) might perform better than a sparse group lasso penalty for retrieving important interaction terms,19 substantive work is needed to develop an efficient algorithm to fit the iCAP penalty in the presence of random effects. It is also important to highlight that for selection of main effects, the sparse group lasso penalty might perform better than the iCAP penalty. Thus, the choice of which group penalty to use should reflect this trade off between improving the selection of main effects versus selection of important GEI effects. Moreover, it is known that estimated effects by lasso will have large biases because the resulting shrinkage is constant irrespective of the magnitude of the effects. Alternative regularizations like the Smoothly Clipped Absolute Deviation (SCAD)36 and Minimax Concave Penalty (MCP)37 could be explored, although we note that both SCAD and MCP require tuning an additional parameter which controls the relaxation rate of the penalty. Another alternative includes refitting the sparse group lasso penalty on the active set of predictors only, similarly to the relaxed lasso, which has shown to produce sparser models with equal or lower prediction loss than the regular lasso estimator for high-dimensional data.38
Another interesting question to address in the context of high-dimensional GLMMs would be to assess the goodness of fit of the selected sparse model. In the context of high-dimensional GLMs, a recent methodology has been proposed to test for any signal left in the residuals after fitting a sparse model in order to assess whether a sparse non-linear model would be more appropriate.39 Although there exist graphical and numerical methods for checking the adequacy of GLMMs,40 to our knowledge no such procedure has been extended to high-dimensional mixed models. Finally, it would also be of interest to explore if joint selection of fixed and random effects could result in better selection and/or predictive performance. Future work includes tuning the generalized ridge regularization on the random effects,41 or replacing it by a lasso regularization to perform selection of individual random effects.14,42
Footnotes
Acknowledgements
This study was enabled in part by support provided by Calcul Québec (https://www.calculquebec.ca) and Compute Canada (https://www.computecanada.ca). The authors would like to recognize the contribution from SB Smith, L Diatchenko, and the analytical team at McGill University, in particular M Parisien, for providing support with the data from OPPERA, OPPERA II, and CPPC studies. OPPERA was supported by the National Institute of Dental and Craniofacial Research (NIDCR; https://www.nidcr.nih.gov/): grant number U01DE017018. The OPPERA program also acknowledges resources specifically provided for this project by the respective host universities: University at Buffalo, University of Florida, University of Maryland–Baltimore, and University of North Carolina–Chapel Hill. Funding for genotyping was provided by NIDCR through a contract to the Center for Inherited Disease Research at Johns Hopkins University (HHSN268201200008I). Data from the OPPERA study are available through the NIH dbGaP: phs000796.v1.p1 and phs000761.v1.p1. L Diatchenko and the analytical team at McGill University were supported by the Canadian Excellence Research Chairs (CERC) Program grant (http://www.cerc.gc.ca/home-accueil-eng.aspx, CERC09). The Complex Persistent Pain Conditions: Unique and Shared Pathways of Vulnerability Program Project were supported by NIH/National Institute of Neurological Disorders and Stroke (NINDS; ) grant NS045685 to the University of North Carolina at Chapel Hill, and genotyping was funded by the Canadian Excellence Research Chairs (CERC) Program (grant CERC09). The OPPERA II study was supported by the NIDCR under Award Number U01DE017018, and genotyping was funded by the Canadian Excellence Research Chairs (CERC) Program (grant CERC09).
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Fonds de recherche Québec-Santé [267074 to K.O.]; and the Natural Sciences and Engineering Research Council of Canada [RGPIN-2019-06727 to K.O., RGPIN-2020-05133 to S.B.].
ORCID iD
Julien St-Pierre
Supplemental material
Supplemental material for this article is available online: Our Julia package PenalizedGLMM and codes for simulating data are available on github. The data from the real data application that supports the findings of this study are available from the corresponding author upon reasonable request.
A Appendix
References
1.
VisscherPMWrayNRZhangQ, et al. 10 years of GWAS discovery: biology, function, and translation. Am J Hum Genet2017; 101: 5–22.
2.
ManolioTACollinsFSCoxNJ, et al. Finding the missing heritability of complex diseases. Nature2009; 461: 747–753.
3.
ShiMO’BrienKMWeinbergCR. Interactions between a polygenic risk score and non-genetic risk factors in young-onset breast cancer. Sci Rep2020; 10. DOI: https://doi.org/10.1038/s41598-020-60032-3.
4.
MukherjeeBAhnJGruberSB, et al. Case-control studies of gene-environment interaction: Bayesian design and analysis. Biometrics2009; 66: 934–948.
5.
FangKLiJZhangQ, et al. Pathological imaging-assisted cancer gene–environment interaction analysis. Biometrics2023. DOI: https://doi.org/10.1111/biom.13873.
6.
ZemlianskaiaNGaudermanWJLewingerJP. A scalable hierarchical lasso for gene–environment interactions. J Comput Graph Stat2022; 1–13. DOI: https://doi.org/10.1080/10618600.2022.2039161.
YuJPressoirGBriggsWH, et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet2005; 38: 203–208.
9.
HoffmanGE. Correcting for population structure and kinship using the linear mixed model: theory and extensions. PLoS ONE2013; 8: e75707.
10.
PriceALZaitlenNAReichD, et al. New approaches to population stratification in genome-wide association studies. Nat Rev Genet2010; 11: 459–463.
11.
QianJTanigawaYDuW, et al. A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank. PLoS Genet2020; 16: e1009141.
12.
St-PierreJOualkachaKBhatnagarSR. Efficient penalized generalized linear mixed models for variable selection and genetic risk prediction in high-dimensional data. Bioinformatics2023; 39. DOI: https://doi.org/10.1093/bioinformatics/btad063.
13.
HuLLuWZhouJ, et al. MM algorithms for variance component estimation and selection in logistic linear mixed model. Stat Sin2019. DOI: https://doi.org/10.5705/ss.202017.0220.
14.
HuiFKCMüllerSWelshAH. Joint selection in mixed models using regularized PQL. J Am Stat Assoc2017; 112: 1323–1333.
SulJHBilowMYangWY, et al. Accounting for population structure in gene-by-environment interactions in genome-wide association studies using mixed models. PLoS Genet2016; 12: e1005849.
19.
ZhaoPRochaGYuB. The composite absolute penalties family for grouped and hierarchical variable selection. Ann Stats2009; 37. DOI: https://doi.org/10.1214/07-aos584.
20.
MaixnerWDiatchenkoLDubnerR, et al. Orofacial pain prospective evaluation and risk assessment study – the OPPERA study. J Pain2011; 12: T4–T11.e2.
21.
ChenHWangCConomosMP, et al. Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. Am J Hum Genet2016; 98: 653–666.
22.
SimonNFriedmanJHastieT, et al. A sparse-group lasso. J Comput Graph Stat2013; 22: 231–245.
23.
ZhouHHuLZhouJ, et al. MM algorithms for variance components models. J Comput Graph Stat2019; 28: 350–361.
24.
GilmourARThompsonRCullisBR. Average information REML: an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics1995; 51: 1440.
25.
BreslowNEClaytonDG. Approximate inference in generalized linear mixed models. J Am Stat Assoc1993; 88: 9–25.
26.
FriedmanJHastieTTibshiraniR. Regularization paths for generalized linear models via coordinate descent. J Stat Softw2010; 33: 1–22. https://www.jstatsoft.org/v33/i01/.
SmithSBParisienMBairE, et al. Genome-wide association reveals contribution of MRAS to painful temporomandibular disorder in males. Pain2018; 160: 579–591.
29.
BuenoCHPereiraDDPattussiMP, et al. Gender differences in temporomandibular disorders in adult populational studies: a systematic review and meta-analysis. J Oral Rehabil2018; 45: 720–729.
30.
DelaneauOMarchiniJZaguryJF. A linear complexity phasing method for thousands of genomes. Nat Methods2011; 9: 179–181.
31.
HowieBNDonnellyPMarchiniJ. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet2009; 5: e1000529.
32.
MareesATde KluiverHStringerS, et al. A tutorial on conducting genome-wide association studies: quality control and statistical analysis. Int J Methods Psychiatr Res2018; 27: e1608.
33.
ChangCCChowCCTellierLC, et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience2015; 4. DOI: https://doi.org/10.1186/s13742-015-0047-8.
34.
ZhouWNielsenJBFritscheLG, et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet2018; 50: 1335–1341.
35.
JiangLZhengZQiT, et al. A resource-efficient tool for mixed model association analysis of large-scale data. Nat Genet2019; 51: 1749–1755.
36.
FanJLiR. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc2001; 96: 1348–1360.
37.
ZhangCH. Nearly unbiased variable selection under minimax concave penalty. Ann Stats2010; 38: 894–942.
38.
MeinshausenN. Relaxed lasso. Comput Stat Data Anal2007; 52: 374–393.
39.
JankováJShahRDBühlmannP, et al. Goodness-of-fit testing in high dimensional generalized linear models. J R Stat Soc Ser B: Stat Methodol2020; 82: 773–795.
40.
PanZLinDY. Goodness-of-fit methods for generalized linear mixed models. Biometrics2005; 61: 1000–1009.
41.
ShenXAlamMFikseF, et al. A novel generalized ridge regression method for quantitative genetics. Genetics2013; 193: 1255–1268.
42.
BondellHDKrishnaAGhoshSK. Joint variable selection for fixed and random effects in linear mixed-effects models. Biometrics2010; 66: 1069–1077.
43.
ØdegårdJIndahlUStrandénI, et al. Large-scale genomic prediction using singular value decomposition of the genotype matrix. Genet Sel Evol2018; 50. DOI: https://doi.org/10.1186/s12711-018-0373-2.
44.
KooijA. Prediction accuracy and stability of regression with optimal scaling transformations. PhD Thesis, Faculty of Social and Behavioural Sciences, Leiden University, 2007.
45.
TibshiraniRBienJFriedmanJ, et al. Strong rules for discarding predictors in lasso-type problems. J R Stat Soc: Ser B (Stat Methodol)2011; 74: 245–266.
46.
LiangXCohenAHeinsfeldAS, et al. sparsegl: an R package for estimating sparse group lasso. arXiv preprint arXiv:220802942 2022.
47.
BhatnagarSRYangYLuT, et al. Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models. PLoS Genet2020; 16: e1008766.
48.
WuTTLangeK. Coordinate descent algorithms for lasso penalized regression. Ann Appl Stat2008; 2. DOI: https://doi.org/10.1214/07-aoas147.