Abstract
Over the years, a more prominent role has been given to gene–gene interaction (epistasis) detection, in view of precision medicine and the hunt for novel drug targets and biomarkers for complex diseases. Acknowledging data complexity as embodied by epistasis potentially increases the power of genome-wide association studies (GWAS) and may reveal relevant biological and biochemical pathways previously undetected. Although confounding of GWAS due to shared genetic ancestry has been well recognized, the extent and impact of such confounding in gene–gene interaction association epistasis studies is much less understood. In the same spirit, the role of population substructure in epistasis detection is largely under-investigated, especially outside a regression framework. This is surprising as inadequate handling of such confounding is likely to lead to spurious epistatic associations, hampering replication and the identification of causal loci in synergy. To improve interpretability and replicability of epistasis results, we introduce “MB-MDR for Structured Populations” (Model-Based Multifactor Dimensionality Reduction-Principal Component [MBMDR-PC]). It extends classic MBMDR that was developed for samples sharing the same genetic ancestry, by replacing the original phenotypes with new phenotypes that have been adjusted for population structure as captured by PCs. The method is applied on International Inflammatory Bowel Disease Genetics Consortium Crohn's disease (CD) data from 15 countries. Significant interacting single nucleotide polymorphisms are found within NOD2 and CYLD genes among others that suggest a potential synergetic effect of NOD2 and CYLD on CD. The study highlights the value of examining epistasis effects and indicates the need for further studies to understand the epistasis effects on CD.
Introduction
In the context of genome-wide association studies (GWAS), population stratification refers to genetically distinct subgrouping.1,2 Several causes exist for population stratification. The basic one being shared genetic ancestry as a result of nonrandom mating between subgroups in a population due to various reasons, which may include social, cultural, or geographical ones. Potential consequences of population stratification are confounding, cryptic relatedness (i.e., unobserved ancestral relationships between individual cases and controls causing them to be nonindependent) and selection bias.1,2
Several strategies exist for protecting against population structure in case–control GWA studies, whether due to population stratification or admixture, the most popular one being based on principal component (PC) analysis. 3 The situation is quite less obvious for genome-wide association interaction studies (GWAIS). One of the main issues is that the increased complexity that comes within reach by performing a GWAIS goes along with a plethora of methodologies (let alone software tools) to detect epistasis or gene–gene interactions. The growing belief in the importance of gene–gene interactions in the development and progression of complex diseases has led to an exponential growth in such tools; to name but a few: generalized linear regression models (GLM), BOOST, 4 Model-Based Multifactor Dimensionality Reduction (MBMDR),5,6 MDR, 7 Random Forest, 8 PLINK, 9 BiForce, 10 Bayesian Models (e.g., BEAM), 11 and several others. For extensive reviews and appropriate references, please refer to Refs.12–16
The literature on epistasis detection in structured populations is very limited, apart from scenarios using a regression framework for association testing. A few exceptions include those strategies that allow submitting population structure corrected phenotypes (i.e., residuals derived from regression models for the original phenotype on selected PCs). 8 Ideally, a general framework and software tool for epistasis detection is available that can offer flexible maneuvering between different measurement scales for phenotypes and genomic predictors. MBMDR offers one such framework and tool.5,6,17 Interestingly, the MDR-SP method, 18 to our knowledge the only MDR-inspired method that can deal with structured populations, combines MDR 7 with ideas implemented in the EIGENSTRAT software. 3 It adds high/low risk labels to multilocus genotypes after adjusting original phenotypes and genotypes by the first few PCs. For feasibility, the same PCs are used for every multilocus genotype and (pairs, triples, etc.) of loci.
In this article, we incorporate strategy to account for population structure in GWAIS using the MBMDR framework with PCs referred to as MBMDR-PC. In particular, for the remainder of this article, we restrict attention to case–control study designs (binary original traits) and biallelic single nucleotide polymorphisms (SNPs) as genetic markers. We fully describe the MBMDR-PC approach and real-life data on inflammatory bowel disease (IBD) from the International IBD Genetics Consortium, 19 are used to further validate the MBMDR-PC approach to correct for confounding due to population structure in epistasis detection. The genome-wide data used for epistasis detection for Crohn's disease (CD) include 15 countries.
CD, a subcategory of IBD, is a complex disorder that can affect any part of the digestive system. Early genome-wide search based on families with multiple affected members with CD reported strong linkage on chromosome 16 and identified the first gene NOD2 as a possible causative locus.20–22 Genetic studies have identified 163 susceptibility loci for IBD, mostly shared between CD and ulcerative colitis as another major form of IBD. 23 The most significant signals were found in the NOD2 region. This region has been shown to exhibit substantial genetic heterogeneity. 24 Elding et al. 24 demonstrated the independent involvement of a neighboring gene on chromosome 16, called CYLD, also playing a role in the immune system. Moreover, Hrdinka et al. 25 demonstrated that CYLD restricts deposition of Lys63-Ub and Met1-Ub on the Linear Ubiquitin Assembly Complex (LUBAC) substrate RIPK2 to limit NOD2-dependent inflammatory signaling. Since both NOD2 and CYLD are involved in the regulation of immune response, following Elding et al. it is interesting to explore whether the combined mutations of these two genes have a synergistic effect on CD. 24 We will, however, consider epistasis screening, including all available curated genetic markers on chromosome 16.
Our study is important in the light of an increasing number of consortium-based epistasis studies that are marked by heterogeneous sample collections due to complex fine-scale or large-scale population structure.
Materials and Methods
The proposed genome-wide epistasis screening strategy in structured populations is built on the MBMDR.5,6,26,27 Even though the MBMDR framework can be used for higher level interaction detection and various outcome measurement scales and study designs, in this study we restrict attention to pairwise interactions. A graphical overview of the newly introduced methods is provided in Figure 1 and explained in more detail as follows.

MBMDR-PC: accounting for genomic structure by PCs
Similar to EIGENSTRAT, 3 we use either linear or nonlinear (kernel) PCs to correct for population structure, and coin our strategy MBMDR-PC. PCs as confounders of population structure due to shared genetic ancestry are assumed to relate to both phonotype and genetic markers. In case–control studies to avoid PCs not to reflect case–control differences, PCs can be first computed from within the controls, and then projected onto the cases to generate PCs for the entire data set.28–30 This approach has been preferred over generating PCs on the pooled case–control data and used in consortium-based IBD GWAS. 19 In MBMDR-PC we append the classic data organization step in MBMDR with one that computes new phenotypes, adjusted for population structure. The new phenotypes are taken as input to classic MBMDR, in an attempt to capture genetic interactions that are not spurious due to inadequate handling of population structures. In what follows, we give a detailed outline of MBMDR-PC (Fig. 1).
Step 1: data organization
As in classic MBMDR, for every two SNP loci j and k, individuals with nonmissing genotype data will exhibit one of nine possible two-locus genotypes {00, 01, 02, 10, 11, 12, 20, 21, 22}, for an individual's genotype Gj at locus j and Gk at locus k taking on 0, 1, 2 as possible values (i.e., number of minor alleles). Specific to MBMDR-PC, the regression framework is adhered to create new phenotypes that are adjusted for population structure captured by a number of PCs. In case–control epistasis studies, where the phenotype Y represents disease status (1 affected, 0 unaffected), and Gj and Gk refer to genetic information at two SNP loci j and k (e.g., using additive encoding), epistasis can be investigated by making inference on the interaction parameter θ in the following logistic regression model:
where
This can be accomplished in R using the package glm.
32
The fitted model should be adapted according to the measurement type of the original phenotype, by selecting an appropriate link function linking the linear predictor
Step 2: multilocus prioritization and dimensionality reduction
The adjusted phenotype obtained in Step 1 is subsequently investigated for distributional differences between multilocus genotypes (
for which
The result is thus a new categorical variable with values H, L, and O, which captures information about the importance of the pair of SNPs with respect to the adjusted phenotype. Subsequently, a final aggregate association test is performed on the new construct and the adjusted phenotype. In particular, we consider the maximum of two t-test statistics denoted by
Step 3: significance assessment
For every pair of SNPs in the data, we obtain a single test statistic given by
from Step 2. This maximum test statistic will no longer follow a t-distribution, due to the compounding of evidences in Step 2. The significance of T per SNP pair is, therefore, assessed through resampling-based strategies. In particular, in this study we generate 999 permutation-based replicates by permuting adjusted trait labels, yet keeping the correlation structure between SNPs intact. Multiple testing is taken care of by a step-down maxT approach as described in-adjusted p-values, 33 hereby ensuring partial strong control of family-wise error rate at 5%. Strong control properties can be stated under the assumption of the subset pivotality property. 30 For large samples, such as in the real-life data application, we adhered to an approximated step-down maxT adjusted p-values, as described in Westfall and Young. 33
Application on real-life data from International IBD Genetics Consortium (CD)
In particular, we considered data from the International IBD Genetics Consortium, comprising 68,427 samples from 15 countries. This data set underwent rigorous quality control by the IBD consortium. 19 In addition, in view of the intended epistasis analyses on chromosome 16, we furthermore filtered SNPs with linkage disequilibrium (LD) pruning by PLINK 9 using a sliding window of size 50, moving step of 5, and r2=0.75 following epistasis protocol recommendations in Gusareva and Van Steen. 14 This resulted in a final data set of 1434 SNPs on chromosome 16 and 52,277 samples of which 18,227 CD cases and 34,050 controls. We used the first five PCs as computed by the IBD consortium, based on Immunochip data with 19,111 SNPs that remained after LD pruning that was performed three times within European controls and after removing SNPs with minor allele frequencies <5%. 19 The PC axes were first generated within the controls, and then projected onto the cases to generate PCs for all samples. These components have proven to be sufficient to control type I error in consortium-based IBD GWAS. 19
MBMDR was used for the epistasis analysis either without correcting for population structure or with correction using the five aforementioned PCs and MBMDR-PC (codominant main effects correction in the MBMDR software). The resulting multiple testing adjusted p-values indicate statistically significant SNP interactions when <0.05.
Epistasis analysis results obtained by MBMDR-PC and screening chromosome 16 SNPs, with and without correcting for population structure are shown in Figure 2A. These epistasis results are adjusted for potential main effects (codominant encoding scheme in MBMDR). The statistical significance of each pair of SNPs is assessed based on 1000 permutations and adjusting for multiple testing. Without correction for population structure we found 152 significant interacting SNP pairs. With correction for population structure, this number was reduced to 109 SNP pairs (Supplementary Table S1). Of these, 90 interacting SNP pairs were common to both approaches (Fig. 2A). In addition, 19 of the 109 significant interacting SNP pairs, detected after correction for population structure, were not detected by the epistasis analysis without such correction.

Pairwise SNPs that were involved in statistically significant interactions with IBD Crohn's disease:
Many of the 109 significant SNP pairs resulted from variants from intergenic and long noncoding RNA (lncRNA) regions. In particular, 24 SNPs involved in the interactions are variants in lncRNAs (RP11-327F22.5 and RP11-21B23.2). Details about SNP–SNP interaction counts based on genomic position are shown in Figure 3.

SNP–SNP interaction counts relate to Crohn's disease based on genomic position. The highest frequency of interaction is found between SNPs in the intergenic and intron regions.
In contrast, 17 unique SNPs involved in 36 pairs of significantly interacting SNPs were from coding regions. Location-wise mapping of these 17 SNPs through the human genome browser at UCSC (http://genome.ucsc.edu/) linked to 7 known genes. The associated statistical epistasis network (obtained through the igraph R package) shown in Figure 4A defines an edge between two nodes (2 out of 17 SNPs) if and only if the interacting SNP pair was found statistically associated with CD. The corresponding gene-level network is displayed in Figure 4B and suggests the following interacting gene pairs: NOD2×CYLD, NOD2×SNX20, NOD2×ADCY7, NOD2×HEATR3, NOD2×NKD1, SNX20×NKD1, CYLD×SNX20, and SNX20×ADCY7. The suggestive interaction between NOD2 and CYLD was derived from rs1332952×rs4785226 (r2=0.16), rs718226×rs2302759 (r2=0.18) and rs7194886×rs3785142 (r2=0.32), among others (Fig. 2A). Notably, NOD2 is connected to all genes except CLEC16A. SNPs occurring in significant SNP×SNP interaction pairs generally did not exhibit strong LD patterns (Fig. 2B).

We next note on the computation time for the real data analysis. In general, the computation time for epistasis analysis using the MBMDR-PC method considered depends on both the number of SNPs and the sample size being analyzed. To analyze all 52,277 individuals and 1374 SNPs from the IBD CD data on a parallel workflow using 96-core computer cluster took 5 min. This computation time is only for epistasis analysis and does not include the prior analysis for extracting PCs.
Discussion
The MBMDR-PC method is a powerful tool to detect epistasis for structured populations when the population structure is adequately captured through the first few PCs. In addition, for this method, population structure can easily be accounted for in regression framework by adding PCs as covariates in the model. Because of the overall good performance (simulation results not shown in this study) of MBMDR-PC we applied it to data from the International IBD Genetics Consortium with the top five PCs, as advocated by the consortium. 19 We found 109 interacting pairs of SNPs of which 36 pairs were physically mapped to 7 pairs of potentially interacting genes, namely NOD2 versus CYLD, NOD2 versus SNX20, NOD2 versus ADCY7, NOD2 versus HEATR3, NOD2 versus NKD1, SNX20 versus NKD1, CYLD versus SNX20, and SNX20 versus ADCY7. Genes such as NOD2, CYLD, HEATR3, and ADCY7 have been reported to be involved in immune system as well as nuclear factor-κB (NF-KappaB) pathways. In particular, independent effects of NOD2 and CYLD have been demonstrated in many GWAS related to CD. In this study we have demonstrated that these genes have a potential synergistic effect on CD. The statistical interaction result between NOD2 and CYLD is supported by a recent study that indicated that CYLD restricts deposition of Lys63-Ub and Met1-Ub on the LUBAC substrate RIPK2 to limit NOD2-dependent inflammatory signaling. 25 Similarly, our finding related to the potential impact of the interaction between HEATR3 and NOD2 on CD is consistent with expression studies of HEATR3 that demonstrated a positive role in NOD2-mediated NF-κB signaling. 34 The NF-κB signaling pathway regulates the expression of hundreds of genes that are involved in diverse and key cellular and organismal processes, including cell proliferation, cell survival, the cellular stress response, innate immunity, and inflammation. 35 It is evident that unregulated activation of NF-κB plays a critical role in the pathogenesis of CD. In a recent investigation of the clinical characteristics and disease outcome of CD patients with varying levels of the NF-κB activation, Han et al. 36 demonstrated that the association of NF-κB activity with specific clinical manifestations in CD patients. However, our statistical evidence for interaction between NKD1 and SNX20 is new; to our knowledge there is no reported evidence for their physical interaction or their coexpression.
Our real-life epistasis analysis for chromosome 16 also indicated the involvement of 24 SNPs in lncRNAs to significantly interact with other SNPs in relation to CD. This links to a recent study that reported the contribution of lncRNAs in IBDs. 37 Moreover, among the significantly interacting SNPs in NOD2, rs5743291 is a nonsynonymous (missense) SNP. This SNP is also involved in changing amino acid V to I within NOD2 (UCSC Genome Browser). In contrast, based on SIFT software prediction, a V to I amino acid change is found to be deleterious possessing a tolerance index score of 0.02 (score <0.05 is predicted to be deleterious), which might result in a potential change in the interaction behavior of NOD2 with other genes for CD as well. Some reported evidence exists that the quantity and the quality of dietary protein consumption and amino acid supplementation may differentially influence the IBD course. 38
Conclusion
In this study, we have highlighted the importance of detecting and correcting population structure in epistasis studies of complex human diseases. It has been established that not accounting for population structure due to allele frequency differences among populations and subpopulations can result in high false-positive results or reduced power in GWAS. In this regard, the proposed MBMDR-PC approach is a powerful tool to detect epistasis in genome-wide association and interaction studies from heterogeneous populations. Since the performance of the MBMDR-PC approach is highly dependent on how well PCs capture population structure, we recommend to use MBMDR-PC with either linear or nonlinear versions of PC analysis, whenever possible. MBMDR analytics should be seen as part of an entire analysis pipeline that involves making marker selection choices and performing postanalysis steps to validate and replicate findings, as well as to seek biological evidence for flagged interacting regions with MBMDR. 14 Fast implementation for multiple testing correction in exhaustive epistasis screenings 10 makes these MBMDR-based methods efficient tools for GWAIS in structured populations.
Our study is important in view of ongoing initiatives of epistasis detection in large-scale heterogeneous consortium data, as we have shown that inadequate capturing of population structure is devastating in GWAIS for obtaining meaningful and replicable epistasis results.
Footnotes
Acknowledgments
This study makes use of data generated by the International IBD Genetics Consortium. A full list of investigators who contributed to the data collection is provided at
. We thank Myriam Nemry for suggestions regarding the implementation of ideas in the MBMDR software. This research was in part funded by the Fonds de la Recherche Scientifique (F.N.R.S.), in particular “Integrated complex traits epistasis kit” (Convention n° 2.4609.11) [K.V.S.]. We also acknowledge the research opportunities offered by F.N.R.S., including “Foresting in Integromics Inference” (Convention n° T.0180.13) [K.C.], and by the interuniversity research institute Walloon Excellence in Lifesciences and BIOtechnology (WELBIO) [F.A., K.V.S.].
Web Resources
MBMDR-PC is available through the MBMDR software (from version mbmdr-4.4.1 onward), which is downloadable from the BIO3 (University of Liege) website. Main options —binary–ac number of PCs–d 2D–a CODOMINANT–rc RESIDUALS
Author Disclosure Statement
No competing financial interests exist.
Abbreviations Used
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
