With the cost-effectiveness technology in whole-genome sequencing, more sophisticated statistical methods for testing genetic association with both rare and common variants are being investigated to identify the genetic variation between individuals. Several methods which group variants, also called gene-based approaches, are developed. For instance, advanced extensions of the sequence kernel association test, which is a widely used variant-set test, have been proposed for unrelated samples and extended for family data. Family data have been shown to be powerful when analyzing rare variants. However, most of such methods capture familial relatedness using a random effect component within the generalized linear mixed model framework. Therefore, there is a need to develop unified and flexible methods to study the association between a set of genetic variants and a trait, especially for a binary outcome. Copulas are multivariate distribution functions with uniform margins on the interval and they provide suitable models to capture familial dependence structure. In this work, we propose a flexible family-based association test for both rare and common variants in the presence of binary traits. The method, termed novel rare variant association test (NRVAT), uses a marginal logistic model and a Gaussian Copula. The latter is employed to model the dependence between relatives. An analytic score-type test is derived. Through simulations, we show that our method can achieve greater power than existing approaches. The proposed model is applied to investigate the association between schizophrenia and bipolar disorder in a family-based cohort consisting of 17 extended families from Eastern Quebec.
The identification of the association between single nucleotide variants (SNVs) with complex diseases is the main goal of genetic studies.1 In a genome-wide association studies (GWAS) framework, researchers generally focus on common causal variants with a minor allele frequency (MAF) . Now, because of the cost-effectiveness of next-generation sequencing (NGS) technologies, hundreds of millions of genetic variants with mostly low MAF have been identified. Rare variants can be defined as genetic mutations that occur at low frequency in a population. They are variants with MAF Single variant tests that have been proposed in GWAS have generally low power when using rare genetic variants in NGS studies. To avoid the power loss issue, several methods such as the burden test,2,3 sequence kernel association test (SKAT),4 and their various combinations have been proposed.6,5 Sequence kernel machine-based association methods test for association between a set of rare/common variants and quantitative or binary phenotypes, and most of them are based on generalized linear mixed models.
In this study, we focus on family-based designs. Sampling relatives in sequencing studies may sometimes be more advantageous than sampling unrelated subjects.7 It has the advantage of enriching data sets for familial rare disease variants because of the segregation of alleles within pedigrees, and it may protect data analysis against population stratification issues. Moreover, sequencing data are subject to read (or sequence) errors, and the Mendelian pedigree information available when related individuals are sequenced can help to identify technological artefacts in the data and improve the protection against sequencing errors.9,8,10 For such reasons, several whole-genome sequencing (WGS) projects have been conducted for family-based cohorts to identify rare variants’ effects. For example, the National Heart, Lung, and Blood Institute’s Trans-Omics for Precision Medicine (TOPMed) program and the National Human Genome Research Institute’s Genome Sequencing Project, have performed WGS from more than samples not only for population-based cohort but also for family studies.5 The Type 2 Diabetes (T2D-GENES) Consortium has been conducted in WGS from 20 large Mexican-American families containing multiple cases of Type 2 diabetes, which aims to detect rare coding variants associated with that disease.11,12
Several region-based sequence kernel association methods have been developed in the literature to analyse familial data. The methods can be broadly divided into two categories to handle the cluster dependencies induced by related subjects: The first category relies on marginal models linking the outcome to genetic predictors, and use the working-correlation trick to control for the within-family dependence structure through generalized estimating equations (GEE).13,14 GEE methods do not need to impose assumptions on the trait joint distribution within each family. Thus they are more robust to distribution misspecification and might be suitable for large-scale studies with complicated data design. However, the test from GEE can suffer from low power if the working covariance is not well calibrated, that is, if the family size is not uniform.14 The methods of the second category rely on conditional models, which capture the clustered structure by adding a random effect in the generalized linear mixed model (GLMM) framework.17,15,16,6 Most existing methods in this category are designed for continuous outcomes. To face this situation, a computationally efficient logistic linear mixed model association test (GMMAT) has been proposed by Chen et al.18 As GMMAT was originally developed for single-variant analysis, the authors, later extended the method to variant-set mixed model association tests (SMMAT) for binary traits.5
Other testing approaches that are computationally feasible regarding the family-based designs for binary traits in a region-based framework have been proposed by Saad and Wijsman.19 The method is termed allele frequency comparison (AFC) tests. Note that these tests were proposed firstly for single-variant analysis by Bourgain et al.20 and Choi et al.21 AFC tests are based on the difference in single nucleotide polymorphism (SNP) allele frequencies in the cases (affected) and controls (unaffected). They use multivariate random-effects with a covariance matrix implying the kinship matrix to account for family relationships between individuals. Unlike the linear mixed models, these tests do not need the kinship matrix to be positive semidefinite, which makes them interesting as they can allow the use of an estimated genetic relationship matrix (GRM) from genotype data, which might not be positive semidefinite. Although the AFC tests are useful, they cannot adjust for covariates. Moreover, they may suffer from type 1 error rate inflation when the correlation between SNPs is not estimated accurately (i.e. correlation computed using small sample sizes).
In this paper, we propose a flexible copula-based variant-set association test for binary phenotypes in family-based designs. Copulas are suitable models for modelling joint distributions since they allow for separation of the dependence and marginal distributions.22–24 That is, the proposed model, termed novel rare variant association test (NRVAT), uses (marginal) generalized link functions to relate the binary phenotype to both the covariates and genetic variants and models the dependence between relatives through a Gaussian copula with correlation matrix corresponding to the familial kinship matrix to describe the polygenic relationship between subjects of a same family. The proposed joint modelling approach has several advantages over current GLMM-based methods, such as GMMAT and SMMAT. For instance, the dependence structure can be captured using a different copula model (e.g. student or chi-square copula), which allows for more flexible models. Because the association between the binary phenotype and both the covariates and genotypes can be captured using usual (marginal) Generalized linear models (GLM), this allows for simple interpretability of the phenotype–genotype association parameters. Moreover, the trait polygenic heritability is ‘margin-free’, in the sense that it is characterized by the copula alone.
The remainder of this paper is structured as follows. In Section 2, we formally describe the proposed statistical framework. Numerical studies are conducted to compare different models in Section 3. In Section 4, we apply the proposed method to real data. We discuss and conclude this paper in Section 5.
Data and model
Consider families and for , let be the size of the family. The total sample size is . Let be the binary phenotype under investigation for individual in family , for and . We begin by specifying the (conditional) marginal distribution of , denoted , where is a vector of covariates including the intercept and is a vector of genotypes coded as () from biallelic variants. Since the response variable is dichotomous, is completely specified by . Thus, we relate the binary phenotype to and through a logistic regression model
where and and are sets of regression coefficients. In a matrix notation, one has
where , is a matrix with the row equal to and is a matrix with the row equal to . The logit was taken element-wise of the entries of
We then specify a joint distribution for based on a latent-variable model, which assumes a latent vector such that
with , , measures the polygenic heritability, is a matrix with entries reflecting the proportion of the genome that is shared identically by descent between subjects, and is the identity matrix of size . This leads to
where
is a Gaussian copula, with , and is the -dimensional Gaussian cumulative distribution function with mean zero and correlation matrix .
The role of the copula is to account for possible dependence between the residuals of the marginal models. In other words, the variance–covariance matrix of the vector of residuals, , is given as follows:
where is a correlation matrix with the out-of-diagonal element equals to the element of , .
Motivated by such advantages provided by the proposed joint modelling approach, we suggest building a score-type test statistic for phenotype–genotype association based on the marginal logistic regression model and adjusting for possible dependence between marginal residuals via the copula model. The proposed score-type test procedure is detailed in the next section.
Inference procedure
Under the proposed model, the association between the variants and the phenotype can be tested by evaluating the null hypothesis .
Deriving the test statistic from the complete log-likelihood function induces complex formulae that prevent the practical implementation of the test. In this work, we consider an alternative approach and propose to derive the test statistic under the independent working assumption.26,25 The log-likelihood function is then written as
We adopt the variance-component (VC) hypothesis testing technique, which is a standard approach in the region-based association framework. That is, we treat as a random vector following an arbitrary distribution with mean zero and variance–covariance matrix , where is a VC scalar and is a diagonal matrix of a priori weights to be used for the variants, here we used MAF-based weights. Thus, testing for is equivalent to testing for .
Under such a framework, one has
Following the same rationale as in the derivation of the SKAT[15 and VC Copula-based score statistics of Lakhal-Chaieb et al.7 (refer to Supplemental materials for details), we show that
where , , and and , . Here, is a Kernel matrix that can be written as
where is a sub-matrix with the entry equal to .
The variability of the second term of the right-hand side of (5) is negligible compared with the first one. This prompts us to consider the test statistic , where is the estimator of under the null hypothesis. Explicitly, one has , where , and . In Appendix A, we show that
where and . Therefore, one has , where can be consistently estimated by .
In Appendix B, we show that the variance–covariance matrix of is
and that covariance matrix of and is
Therefore, the variance–covariance matrix of is
Hence, the distribution of the test statistic under the null hypothesis can be approximated by a weighted mixture of chi-squared distributions , where are the eigenvalues of and are independent random variables. In practice, involves , which is unknown. In this work, we replace by its estimator obtained by maximizing, computed under the null hypothesis the pairwise log-likelihood which formulae is given by
See Appendix A for more details.
Generalizations
In this work, the dependence between the outcomes is modelled via a Gaussian copula. The latter is used to express the variance–covariance matrix and to estimate the polygenic heritability parameter . Therefore, one may generalize the derived test to any elliptic copula (23, chapter 4) without any loss of generality. The robustness of the derived test to the specification of the underlying copula is well-investigated in the simulation studies section. The Kernel matrix in equation (6) is known in the literature as the linear kernel matrix.4 However, several other choices such as the quadratic and Gaussian kernels can be incorporated in the derived test, without loss of generality.27 In the simulations, their impact is investigated on both the type I error and the power of the proposed test.
Simulations
Simulations were carried out to validate and compare the performance of the proposed method, NRVAT, with three existing set-based rare-variant association methods dealing with binary phenotype in presence of families, namely SMMAT5; gSKAT14 and AFC tests.19 As mentioned previously, SMMAT is a GLMM method which captures family relationships via a random effect; gSKAT is a GEE-based association test which controls for familial relationships via the working-correlation trick; AFC test is a score-type association test based on the difference of SNP allele frequencies in the cases (affected) versus controls (unaffected) and uses a multivariate random effect to account for family relationships between individuals. The methods comparison is based on empirical type I error rate and power.
Data generation
This section outlines the detailed steps pursued to generate simulated data for the methods’ evaluation.
Genotypes: we used SIMULATE3 computer program,28 which allows simulation of genotypes of a set of linked SNPs in family members. In all simulations, we set the number of families to be , with families of two parents and one child, families of two parents and two children, and families of three generations where there are eight subjects per family (i.e. cousins with their parents and grandparents). This leads to a total of individuals (240 males and 360 females; 320 founders and 280 nonfounders) per each generated data. We set the number of simulated SNPs for each subject to be , with MAFs ranging between and . We assumed linkage disequilibrium between two adjacent SNPs to be (i.e. is the squared correlation coefficient between two adjacent SNPs).
Covariates: in all simulations, we considered two covariates associated with the binary response: a continuous covariate, , and a binary covariate, . The marginal effects of and on the outcome were set to , respectively. The intercept coefficient was set to .
Binary phenotype: we considered three settings to generate in order to conduct type I error and power comparisons with the competing methods. In Setting 1, was generated based on our copula model (2). In Setting 2, was simulated based on the GLMM model of the SMMAT approach. Finally, Setting 3 aimed to assess the robustness of the proposed copula approach with respect to miss-specification of the true copula describing the dependence structure between subjects of the same family. The three settings are described in detail as follows:
Setting 1 (data generation from our model): For each family , we simulated the response variable of the subjects following our copula model (2). The data generation steps are given as follows:
1. generate matrix of genotypes from Simulate3;
2. set the first column of matrix to be the vector of ones, then generate entries of its additional two columns from the uniform distribution over and a Bernoulli distribution with probability of success equals to , respectively;
3. calculate the vector ;
4. generate with ;
5. generate
Setting 2 (data generation from a GLMM model): We generated the response variable based on the GLMM model for family , as follows:
where , and are defined as in Setting 1. is twice the family theoretical kinship matrix, and is the polygenic trait heritability.
Setting 3 (copula miss-specification): To investigate the robustness of the derived test to the miss-specification of the underlying/true copula, we conducted two scenarios in which the simulated data were generated from either the student or chi-square copula models. However, our proposed Gaussian copula model given in (3) was fitted to the simulated data in order to derive -values. More precisely, in both scenarios, for family , the response variable was generated following the same steps of Setting 1, except Step 4 where was generated following a multivariate student with 3 degrees of freedom (df) and correlation matrix , in Scenario 1. In Scenario 2, was simulated from a chi-square copula distribution, as follows: we first generated , then we set , and finally we calculated the vector such that , where is the standard normal cumulative distribution function. Following,29 has a multivariate chi-square copula distribution with a non-centrality parameter , and normal marginal distributions. In our simulations, we fixed .
In all three settings, we considered . Table 1 describes the parameters’ combinations used in our simulation studies.
Parameters combinations for the simulations studies under all the settings and scenarios family size; total of sample size; polygenic heritability; variance component (VC); intercept; covariate effects; genotypes effects; covariates.
Parameters
Values
(under null) — (under alternative)
discrete 0/1 with probability of succes
Under the null model, the parameter of interest is ; then random samples were generated according to each parameter-combination scenario to assess the type I error rate of all the methods. We again refer the reader to Table 1 for the specific values of the model parameter combination considered.
To evaluate the methods’ performance in terms of power, five SNPs were randomly chosen among the 20 studied SNPs as causal variants (i.e. of all studied SNPs). More precisely, for the five causal SNPs, we assumed their effect , with and is a diagonal matrix of a priori weights to be used for the variants. The effects of the remaining SNPs were set to zero. The power comparison was based on replications.
Simulation results
This section summarizes the simulation results for the Type I error rate for the scenario with and the power levels for all settings described above.
Type I error results
Results of Setting 1:Figure 1 and Table S1 (Supplemental material) show quantile–quantile (QQ)-plots of the -values and empirical type I error rate of all the considered methods, respectively, where data are generated under the Gaussian copula model. The empirical type I error of NRVAT, SMMAT, AFC (Xc) and gSKAT (perturbed) are around the nominal level (). By contrast, the empirical type I error of gSKAT (asymptotic) is lower than the nominal level whereas AFC_QLS has an inflated type I error rate. Our proposed approach showsa well-controlled type I error rate, for all the proposed Kernel-based tests. Figures S1 and S2 (Supplemental material) show similar results under respectively.
Results of Setting 2:Figure 2 and Table S2 (Supplemental material) show the results of all the methods when the data are generated from the GLMM model. As expected, the empirical type I error rate of SMMAT is well-controlled since it is a GLMM-based association test. The empirical type I error of NRVAT is controlled for (Figures S3 and S4 of Supplemental material, respectively), however, the method seems to be conservative for This might be due to the bias introduced in the NRVAT estimation of the marginal model parameters using the GLM model (1). In fact, from Table S6 of Supplemental material, one can see that the bias in the estimation of , , and , increased as increased. On the other hand, AFC (Xc) and gSKAT (perturbed) show no type I error rate inflation, for the three values of the polygenic heritability (). The gSKAT (asymptotic) is still conservative whereas AFC test (QLS) remains inflated. Figures S3 and S4 of Supplemental material show similar results under respectively.
Results of Setting 3:Figure 3 and Table S3 (Supplemental material), and Figure 4 and Table S4 (Supplemental material) describe the empirical type I error rate results for data generated under the student and the chi-square copula models, respectively. Under the student copula model (Figures 3 and SupplementalTable S3), NRVAT seems to be sensitive to the true copula miss-specification by showing a slightly higher empirical type I error rate compared to the nominal level . However, under the chi-square copula model (Figures 4 and Supplemental Table S4), NRVAT shows a well-controlled type I error. On the contrary, in both scenarios of Setting 3, all the competitors behave in a similar way as in Setting 1. That is, SMMAT, AFC (Xc) and gSKAT (perturbed) have valid type I error rates, whereas AFC test (QLS) remains inflated and gSKAT (asymptotic) remains conservative. Figures S5 to S8 show similar results under for the two scenarios of Setting 3, respectively.
We also show in Supplemental Tables S7 and S8, the results of empirical bias, for our method, of the nuisance parameters and the polygenic heritability under for the two scenarios of Setting 3.
QQ-plot under the null hypothesis of no SNPs/phenotype association with the heritability parameter where the data are generated under the G copula. Results are computed from 10,000 data sets generated under Setting 1. The compared methods are NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with (Xc), and (QLS); and gSKAT model with the asymptotic and perturbed. QQ: quantile–quantile; SNP: single nucleotide polymorphism; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.
QQ-plot under the null hypothesis of no SNPs/phenotype association with the heritability parameter where the data are generated under the GLMM. Results are computed from 10,000 data sets generated under Setting 2. The compared methods are NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with (Xc), and (QLS); and GSKAT model with the asymptotic and perturbed. QQ: quantile–quantile; SNP: single nucleotide polymorphism; GLMM: generalized linear mixed model; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.
QQ-plot under the null hypothesis of no SNPs/phenotype association with the heritability parameter where the data are generated under the student copula model (). Results are computed from 10,000 data sets generated under Scenario 1 of Setting 3. The compared methods are NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with (Xc), and (QLS); and GSKAT model with the asymptotic and perturbed. QQ: quantile–quantile; SNP: single nucleotide polymorphism; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.
QQ-plot under the null hypothesis of no SNPs/phenotype association with the heritability parameter where the data are generated under the chi-square copula model, with a non-centrality parameter . Results are computed from 10,000 data sets generated under Scenario 2 of Setting 3. The compared methods are NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with (Xc), and (QLS); and GSKAT model with the asymptotic and perturbed. QQ: quantile–quantile; SNP: single nucleotide polymorphism; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.
Of note, the proposed method, like its competitors, might be sensitive to selection bias and it does not handle missing values; that is, subjects with missing values are removed from the data analysis. Although this is beyond the scope of this work, because real familial data are subject to selection bias and/or missing genotypes/phenotypes, we have conducted a sensitivity analysis to emphasize the impact of selection bias and missing data on our method. In Supplemental materials, we also show the conducted additional simulations under the null hypothesis for Setting 1. The simulated data were generated under three scenarios : (a) selection bias, (b) missing at random, and (c) missing completely at random.
To emphasize the impact of linkage disequilibrium between SNPs (i.e. values) on the performance of the proposed method and its competitors, we have carried out additional simulation scenarios with different values of . More precisely, in Supplemental materials, we presented additional simulations under the null hypothesis of Setting 1 where the simulated data were generated under two scenarios: (1) , and (2) .
Power results
Figures 5 to 8 outline the empirical power levels for all the methods where data is generated from the Gaussian copula model (Setting 1), GLMM (Setting 2), and the student and chi-square copula models (Setting 3). The AFC (QLS) method was omitted in the power analysis comparison as it demonstrates severe type I error rate inflation (Figures 1 to 4 and Figures S1 to S8 of Supplemental material). Interestingly, in all settings, one can notice a substantial gain in the power of NRVAT with both the IBS and the Gaussian Kernel matrices. Figure 9 and Figures S9 and S10 of Supplemental material show the power levels as a function of a grid of values of the variance-component respectively, for and , under the Gaussian copula model (Setting 1). Again, these figures illustrate the important gain in power achieved by NRVAT with the IBS and the Gaussian Kernel similarity matrices.
Application to Real Data
Schizophrenia and bipolar disorder family study
The data in this analysis consists of 640 subjects belonging to 17 extended families from Eastern Quebec, with some family members known to have schizophrenia (SZ) or bipolar disorder (BP).30 We considered gene-based association analyses of SZ ( affected; non-affected and unknown) and BP ( affected; non-affected and unknown) binary phenotypes; we also analysed a ‘common locus’ (CL) binary phenotype, for which, diseased individuals are defined as subjects with SZ and/or BP ( affected; non-affected and unknown). We considered genomics regions significant with linkage finding in SZ and BP from Quebec Eastern family data.30 That is, for the gene-based analysis of the SZ trait, SNPs clustered within genes were considered. The analysis of the BP trait considered SNPs falling within genes. The CL phenotype analysis consisted of SNPs and genes. The whole-genome SNP genotyping was provided by OmniExpress24 Illumina and the genotype data was prepared by Chagnon et al.30 After removing both subjects and SNPs with missing values, a total of 433 subjects were available for analysis, with 57 affected (AF) and 376 non-affected relatives (NAR) for SZ, 83 AF versus 350 NAR for BP, and 153 AF against 280 NAR for CL. Finally, genes were retained for SZ analysis with a total of SNPs with gene size varying between and SNPs with of SNPs having their MAF BP analysis has considered genes for a total of SNPs, with gene-size varying from to SNPs with of their SNPs have MAF The genotypes of the SNPs corresponding to the CL genomics region were all available, and so, there was no reduction in the number of SNPs; the gene size, in this analysis, ranges between and SNPs with of SNPs having their MAF For the three traits, we fitted NRVAT, SMMAT, AFC, and gSKAT set-based association tests. In all analyses, we adjusted for the sex ( females vs. males) as a covariate in all fitted models.
Empirical power under the alternative hypothesis of SNPs/phenotype association and (respectively, for the first line and the second line) where the data are generated under the Gaussian copula. Results are computed from 1000 data sets generated with 25% of causal variants taken randomly from the regions’ size (20) under Setting 1. The compared methods are NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with (Xc); and GSKAT model with the asymptotic and perturbed. SNP: single nucleotide polymorphism; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.
Empirical power under the alternative hypothesis of SNPs/phenotype association and (respectively, for the first line and the second line) where the data are generated under the GLMM. Results are computed from 1000 data sets generated with 25% of causal variants taken randomly from the regions size (20) under Setting 2. The compared methods are NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with (Xc); and GSKAT model with the asymptotic and perturbed. SNP: single nucleotide polymorphism; GLMM: generalized linear mixed model; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.
Empirical power under the alternative hypothesis of SNPs/phenotype association and (respectively, for the first line and the second line) where the data are generated under the student copula (). Results are computed from 1000 data sets generated with 25% of causal variants taken randomly from the regions size (20) under scenario 1 of Setting 3. The compared methods are NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with (Xc); and GSKAT model with the asymptotic and perturbed. SNP: single nucleotide polymorphism; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.
Empirical power under the alternative hypothesis of SNPs/phenotype association and (respectively, for the first line and the second line) where the data are generated under the chi-square copula with a non-centrality parameter . Results are computed from 1000 data sets generated with 25% of causal variants taken randomly from the regions’ size (20) under Scenario 2 of Setting 3. The compared methods are NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with (Xc); and GSKAT model with the asymptotic and perturbed. SNP: single nucleotide polymorphism; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.
Power function under the alternative hypothesis of SNPs/phenotype association of grid of for the polygenic heritability parameter where the data are generated under the Gaussian copula. Results are computed from 1000 data sets generated with 25% of causal variants taken randomly from the regions size (20) under Setting 1. The compared methods are NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with (Xc); and gSKAT model with the asymptotic and perturbed. SNP: single nucleotide polymorphism; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.
Results of SZ and BP analysis
Figures 10 to 12 show the QQ plots of the obtained -values from the gene-based analyses of SZ, BP, and CL. Figure 13 shows the overall -value QQ-plot, in which, -values from the analysis of the three traits are put together as one set of -values. From these figures, one can see that all the methods have valid type I error rates under the analysis of both SZ and CL traits (Figures 10 and 12), except AFC (QLS), which has severe inflated type I error rate as it was also noticed in the simulation results. On the contrary, under the BP analysis, AFC (XC), SMMAT, and gSKAT (perturbed) present inflated type I error rate, however, NRVAT still has valid results in this analysis.
Table 2 reports genes/regions with -values that are significant at the nominal significance level of , after correcting for multiple testing using Bonferroni correction, for all the methods. Under SZ analysis, the strongest association is detected for gene SMYD3 by our method with the quadratic kernel matrix (NRVAT-Quadratic) and the SMMAT model with the hybrid test (O) method (-value ). Significant signals are also detected, under the BP analysis, for the genes C1ORF77, CGI-96, and TRIM24 for NRVAT-Quadratic, NRVAT-Gaussian, NRVAT-Polynomial, and AFC (Xc) (-value ). No significant signals were declared from the CL analysis.
Significant genes after Bonferroni correction of NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with (Xc), and (QLS); and GSKAT model with the asymptotic and perturbed for SZ, BP disorder, and their CL family study data at nominal level and the overall Bonferroni correction, which corrects for multiple testing of the total number of tests conducted in the three analyses, for each method.
NRVAT
SMMAT
AFC
GSKAT
Traits
L
Q
IBS
G
P
O
E
Xc
QLS
Asymp
Pert
SZ
-
SMYD3 ()
–
–
–
SMYD3 ()
–
–
*
–
–
BP
-
C1ORF77 ()
-
CGI-96 ()
CGI-96 ()
–
–
TRIM24 ()
*
–
–
CL
–
–
–
–
–
–
–
–
*
–
–
OverAll
–
–
–
–
CGI-96
SMYD3
–
–
*
–
–
NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; CL: common locus; SMMAT: variant-set mixed model association tests; AFC tests: allele frequency comparison tests; Gskat: burden and kernel-based gene set association tests for binary traits; BP: bipolar disorder; SZ: schizophrenia; *: method with inflated type I error rate; –: non-significative methods.
QQ-plot for NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with (Xc), and (QLS); and GSKAT model with the asymptotic and perturbed for schizophrenia family study data. QQ: quantile–quantile; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.
QQ-plot for NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with (Xc), and (QLS); and GSKAT model with the asymptotic and perturbed for bipolar disorder family study data. QQ: quantile–quantile; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.
QQ-plot for NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with (Xc), and (QLS); and GSKAT model with the asymptotic and perturbed for CL, for which, diseased individuals are defined as subjects with SZ and/or BP, family study data. QQ: quantile–quantile; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests: GSKAT: burden and kernel-based gene set association tests for binary traits; CL: common locus; SZ: schizophrenia; BP: bipolar disorder.
QQ-plot for NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with (Xc), and (QLS); and GSKAT model with the asymptotic and perturbed for schizophrenia, bipolar disorder and their CL family study data for overall (combining SZ, BP and CL). QQ: quantile–quantile; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests: gSKAT: burden and kernel-based gene set association tests for binary traits; CL: common locus; SZ: schizophrenia; BP: bipolar disorder.
Table 2 highlights also the significant results from the overall Bonferroni correction, which corrects for multiple testing of the total number of tests conducted in the three analyses, for each method; that is, Bonferroni-corrected threshold equals . We note that the genes CGI-96 and SMYD3 remain significant for NRVAT-Polynomial and SMMAT-O methods, respectively. Indeed, SMYD3 is one of the histone methyltransferases that catalyse methylation of histone H3 at K4 in mammalian cells,31 and it has been proven that one of it rare SNV namely rs6426297 is associated with the SZ trait. This rare SNVis associated with suicide attempts in people with SZ and BP (for more details, see https://www.ebi.ac.uk/gwas/variants/rs6426297).
Discussion
In this work, we have developed NRVAT, a flexible set-based association test for rare and common variants and binary phenotypes, in family-based designs. NRVAT uses marginal generalized linear mixed models to relate the outcome to both the covariates and a set of SNPs and uses copulas to account for possible dependence between subjects of the same family/pedigree through the kinship matrix. An advantage of NRVAT joint modelling is that the regression parameters linking both covariates and genotypes to the phenotype are marginally meaningful. Moreover, trait polygenic heritability is ‘margin-free’, in the sense that it is characterized by the copula alone. The NRVAT framework includes five different kernel matrices to capture genotype–phenotype relationships, namely, the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices. Through simulations, we have shown that NRVAT has a valid type I error rate and allows for more power than existing models in family-based designs, even when the true copula was not well specified. We have also evaluated the performance of the NRVAT model using schizophrenia and bipolar disorder data and we have found an association signal with the SZ binary trait for one gene SMYD3, and with the BP binary trait for two genes, C1ORF77 and CGI-96.
One of the main advantages of NRVAT is that it models the dependence structure between subjects regardless of their marginal distributions, which allows the use of different margins to link the binary response to the covariates and the genotypes. For instance, one can use latent Probit or Robit22 marginal models instead of the latent logistic model in (1). Our method is also valid in the presence of selection bias or missing genotypes.
It is necessary to note that the NRVAT model has some limitations. In fact, the derivation of -values and the asymptotic null distribution of the NRVAT score test statistic assumes that the families are independent. This is true when one uses a priori kinship matrix to describe subjects’ relatedness, which is a block diagonal matrix. However, in the absence of pedigree information and/or in the presence of admixture population effects, the estimation of subjects’ relatedness is more suitable using an empirical genetic similarity matrix, such as the GRM.34,32,33 Such matrices, by construction, are not block diagonals and thus their use within NRVAT might not be suitable. This means that NRVAT might not handle a population structure other than the well-defined familial structure. Also, people should be aware that the NRVAT model could give incorrect results when the number of families is small. Several extensions to our methodological proposal could also be investigated. Our approach could be extended to develop a functional association test for dichotomous traits in the presence of family data. To parallel what is done in Jiang et al.,35 genotype–phenotype functional relationship can be modelled in the marginal distributions based on generalized functional linear mixed models while, again, a copula model can be used to characterize the trait dependence between subjects of the same family. So far, our approach only handles a single binary trait for each subject, but an extension to a mixture of binary-continuous trait cases for familial data could be possible. In fact, an additional source of dependency, stemming from within-subject correlated phenotypic values, could be accounted for by choosing an appropriate copula model.
NRVAT integrates five kernel matrices within the association test to better capture the phenotype–genotype relationship. Although the optimal choice of the kernel matrix for real data analysis is a daunting question, here are some guidelines on how to choose a kernel matrix in NRVAT: when the underlying genotype–phenotype relationship is unknown, one may choose a linear kernel if there is a priori knowledge that relationships are linear and there are no interactions. In the presence of interactions, the quadratic kernel can be a good alternative as it implicitly assumes that the underlying relationship depends on the main and second-order effects.4 In the case where there is a priori knowledge of the existence of more complex relationships, the Gaussian and IBS kernels may be the better choice.
sj-pdf-1-smm-10.1177_09622802231197977 - Supplemental material for A novel rare variants association test for binary traits in family-based designs via copulas
Supplemental material, sj-pdf-1-smm-10.1177_09622802231197977 for A novel rare variants association test for binary traits in family-based designs via copulas by Houssou R. G. Dossa, Alexandre Bureau, Michel Maziade, Lajmi Lakhal-Chaieb and Karim Oualkacha in Statistical Methods in Medical Research
Footnotes
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
This work received received financial support from:
the Canadian Statistical Sciences Institute (CANSSI) through a collaborative reseach team (CRT) grant.
Fonds de recherche du Québec through the grant FRQS-CB-J1.
The Eastern Quebec Schizophrenia and Bipolar Disorder study was funded by the Canadian Institutes of Health research (CIHR, grants MOP-74430, MOP-119408, MOP-114988 and PCG-155471) and its data management system was supported by the Canada Foundation for Innovation Leadership Opportunity Fund (grant 27592).
Supplemental material
Supplemental material for this article is available online.
Appendix
Appendix B
Let,
under an extension of Taylor’s development of of order 1 around we have that
Suppose the variance of the residuals becomes
From (7) and (8)
with
Then
and
So
Hence, one has
Since the families are independents between themselves, we get
References
1.
PengZShenHZhaoY, et al. Genetic association analysis using sibship data: A multilevel model approach, 2012.
2.
LiBLealSM. Methods for detecting associations with rare variants for common diseases: Application to analysis of sequence data. Am J Human Genet2008; 83: 311–321.
3.
MadsenBEBrowningSR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet2009; 5: e1000384.
4.
WuMCLeeSCaiT, et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Human Genet2011; 89: 82–93.
5.
ChenHHuffmanJEBrodyJA, et al. Efficient variant set mixed model association tests for continuous and binary traits in large-scale whole-genome sequencing studies. Am J Human Genet2019; 104: 260–274.
6.
JiangDMcPeekMS. Robust rare variant association testing for quantitative traits in samples with related individuals. Genet Epidemiol2014; 38: 10–20.
7.
Lakhal-ChaiebLOualkachaKRichardsBJ, et al. A rare variant association test in family-based designs and non-normal quantitative traits. Stat Med2016; 35: 905–921.
8.
OttJKamataniYLathropM. Family-based designs for genome-wide association studies. Nat Rev Genet2011; 12: 465–474.
9.
RoachJCGlusmanGSmitAF, et al. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science2010; 328: 636–639.
10.
ZhouBWhittemoreAS. Improving sequence-based genotype calls with linkage disequilibrium and pedigree information. Ann Appl Stat2012; 457–475.
11.
JunGManningAAlmeidaM, et al. Evaluating the contribution of rare variants to type 2 diabetes and related traits using pedigrees. Proc Natl Acad Sci USA2018; 115: 379–384.
12.
XueAWuYZhuZ, et al. Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. Nat Commun2018; 9: 1–14.
13.
WangXLeeSZhuX, et al. Gee-based snp set association test for continuous and discrete traits in family-based association studies. Genet Epidemiol2013; 37: 778–786.
14.
WangXZhangZMorrisN, et al. Rare variant association test in family-based sequencing studies. Brief Bioinformatics2016; 18: 954–961.
15.
ChenHMeigsJBDupuisJ. Sequence kernel association test for quantitative traits in family samples. Genet Epidemiol2013; 37: 196–204.
16.
OualkachaKDastaniZLiR, et al. Adjusted sequence kernel association test for rare variants controlling for cryptic and family relatedness. Genet Epidemiol2013; 37: 366–376.
17.
SchifanoEDEpsteinMPBielakLF, et al. Snp set association analysis for familial data. Genet Epidemiol2012; 36: 797–810.
18.
ChenHWangCConomosMP, et al. Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. Am J Human Genet2016; 98: 653–666.
19.
SaadMWijsmanEM. Association score testing for rare variants and binary traits in family data with shared controls. Brief Bioinformatics2019; 20: 245–253.
20.
BourgainCHoffjanSNicolaeR, et al. Novel case-control test in a founder population identifies p-selectin as an atopy-susceptibility locus. Am J Human Genet2003; 73: 612–626.
21.
ChoiYWijsmanEMWeirBS. Case-control association testing in the presence of unknown relationships. Genetic Epidemiol2009; 33: 668–678.
22.
de LeonARWuB. Copula-based regression models for a bivariate mixed discrete and continuous outcome. Stat Med2011; 30: 175–185.
23.
JoeH. Dependence modeling with copulas. Chapman and Hall/CRC, 2014.
24.
TounkaraFLefebvreGGreenwoodC, et al. A flexible copula-based approach for the analysis of secondary phenotypes in ascertained samples. Stat Med2020; 39 (5): 517–543.
25.
BreslowNEClaytonDG. Approximate inference in generalized linear mixed models. J Am Stat Assoc1993; 88: 9–25.
26.
GilmourAAndersonRRaeA. The analysis of binomial data by a generalized linear mixed model. Biometrika1985; 72: 593–599.
27.
SchrammCJacquemontSOualkachaK, et al. Kspm: an r package for kernel semi-parametric models. R Journal2020; 12 (2): 82–106.
28.
TerwilligerJDSpeerMOttJ. Chromosome-based method for rapid computer simulation in human genetic linkage analysis. Genet Epidemiol1993; 10: 217–224.
29.
QuessyJ-FRivestL-PToupinM-H. On the family of multivariate chi-square copulas. J Multivar Anal2016; 152: 40–60.
30.
ChagnonYCMaziadeMPaccaletT, et al. A multimodal attempt to follow-up linkage regions using RNA expression, SNPs and CpG methylation in schizophrenia and bipolar disorder kindreds. Eur J Hum Genet2020; 28: 499–507.
31.
ThomasEA. Epigenetic mechanisms in huntington’s disease. Chromatin Signal Neurol Disorders2019; 73–95.
32.
GianolaDFarielloMINayaH, et al. Genome-wide association studies with a genomic relationship matrix: a case study with wheat and arabidopsis. G3: Genes, Genomes, Genet2016; 6: 3241–3256.
33.
WangJ. Pedigrees or markers: Which are better in estimating relatedness and inbreeding coefficient?Theor Popul Biol2016; 107: 4–13.
34.
YangJBenyaminBMcEvoyBP, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet2010; 42: 565–569.
35.
JiangYChiuC-YYanQ, et al. Gene-based association testing of dichotomous traits with generalized functional linear mixed models using extended pedigrees: Applications to age-related macular degeneration. J Am Stat Assoc2020; 1–15.
36.
LindsayBGYiGYSunJ. Issues and strategies in the selection of composite likelihoods. Stat Sin2011; 71–105.
37.
GenestCNikoloulopoulosAKRivestL-P, et al. Predicting dependent binary outcomes through logistic regressions and meta-elliptical copulas. Braz J Probab Stat2013; 265–284.
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.