A novel rare variants association test for binary traits in family-based designs via copulas

Abstract

As whole-genome sequencing has become more cost-effective, more sophisticated statistical methods for testing genetic association with both rare and common variants are being investigated to identify the genetic variation between individuals. Several methods that group variants, also called gene-based approaches, have been developed. For instance, advanced extensions of the sequence kernel association test, a widely used variant-set test, have been proposed for unrelated samples and extended to family data. Family data have been shown to be powerful when analyzing rare variants. However, most such methods capture familial relatedness using a random-effect component within the generalized linear mixed model framework. Therefore, there is a need to develop unified and flexible methods to study the association between a set of genetic variants and a trait, especially for a binary outcome. Copulas are multivariate distribution functions with uniform margins on the $[0, 1]$ interval, and they provide suitable models to capture familial dependence structure. In this work, we propose a flexible family-based association test for both rare and common variants in the presence of binary traits. The method, termed novel rare variant association test (NRVAT), uses a marginal logistic model and a Gaussian copula. The latter is employed to model the dependence between relatives. An analytic score-type test is derived. Through simulations, we show that our method can achieve greater power than existing approaches. The proposed model is applied to investigate the association between schizophrenia and bipolar disorder in a family-based cohort consisting of 17 extended families from Eastern Quebec.

Keywords

Score test, rare variants, association tests, region-based tests, copulas

1. Introduction

The identification of the association between single nucleotide variants (SNVs) and complex diseases is the main goal of genetic studies.¹ In a genome-wide association study (GWAS) framework, researchers generally focus on common causal variants with a minor allele frequency (MAF) $\geq 5\%$. Now, because of the cost-effectiveness of next-generation sequencing (NGS) technologies, hundreds of millions of genetic variants with mostly low MAF have been identified. Rare variants can be defined as genetic mutations that occur at low frequency in a population; here, they are variants with MAF $< 5\%$. Single-variant tests that have been proposed for GWAS generally have low power when applied to rare genetic variants in NGS studies. To avoid this power loss, several methods such as the burden test,^2,3 the sequence kernel association test (SKAT),⁴ and their various combinations have been proposed.^6,5 Sequence kernel machine-based association methods test for association between a set of rare/common variants and quantitative or binary phenotypes, and most of them are based on generalized linear mixed models.

In this study, we focus on family-based designs. Sampling relatives in sequencing studies may sometimes be more advantageous than sampling unrelated subjects.⁷ It has the advantage of enriching data sets for familial rare disease variants because of the segregation of alleles within pedigrees, and it may protect data analysis against population stratification issues. Moreover, sequencing data are subject to read (or sequence) errors, and the Mendelian pedigree information available when related individuals are sequenced can help to identify technological artefacts in the data and improve the protection against sequencing errors.^9,8,10 For such reasons, several whole-genome sequencing (WGS) projects have been conducted in family-based cohorts to identify rare variants’ effects. For example, the National Heart, Lung, and Blood Institute’s Trans-Omics for Precision Medicine (TOPMed) program and the National Human Genome Research Institute’s Genome Sequencing Project have performed WGS on more than $120{,}000$ samples, not only from population-based cohorts but also from family studies.⁵ The Type 2 Diabetes (T2D-GENES) Consortium has conducted WGS of 20 large Mexican-American families containing multiple cases of Type 2 diabetes, with the aim of detecting rare coding variants associated with that disease.^11,12

Several region-based sequence kernel association methods have been developed in the literature to analyse familial data. The methods can be broadly divided into two categories according to how they handle the cluster dependencies induced by related subjects. The first category relies on marginal models linking the outcome to genetic predictors, and uses the working-correlation trick to control for the within-family dependence structure through generalized estimating equations (GEE).^13,14 GEE methods do not need to impose assumptions on the joint distribution of the trait within each family. Thus, they are more robust to distribution misspecification and might be suitable for large-scale studies with complicated data designs. However, the GEE-based test can suffer from low power if the working covariance is not well calibrated, that is, if the family size is not uniform.¹⁴ The methods of the second category rely on conditional models, which capture the clustered structure by adding a random effect in the generalized linear mixed model (GLMM) framework.^17,15,16,6 Most existing methods in this category are designed for continuous outcomes. To address this limitation, a computationally efficient logistic linear mixed model association test (GMMAT) has been proposed by Chen et al.¹⁸ As GMMAT was originally developed for single-variant analysis, the authors later extended the method to variant-set mixed model association tests (SMMAT) for binary traits.⁵

Other testing approaches that are computationally feasible regarding the family-based designs for binary traits in a region-based framework have been proposed by Saad and Wijsman.¹⁹ The method is termed allele frequency comparison (AFC) tests. Note that these tests were proposed firstly for single-variant analysis by Bourgain et al.²⁰ and Choi et al.²¹ AFC tests are based on the difference in single nucleotide polymorphism (SNP) allele frequencies in the cases (affected) and controls (unaffected). They use multivariate random-effects with a covariance matrix implying the kinship matrix to account for family relationships between individuals. Unlike the linear mixed models, these tests do not need the kinship matrix to be positive semidefinite, which makes them interesting as they can allow the use of an estimated genetic relationship matrix (GRM) from genotype data, which might not be positive semidefinite. Although the AFC tests are useful, they cannot adjust for covariates. Moreover, they may suffer from type 1 error rate inflation when the correlation between SNPs is not estimated accurately (i.e. correlation computed using small sample sizes).

In this paper, we propose a flexible copula-based variant-set association test for binary phenotypes in family-based designs. Copulas are suitable models for modelling joint distributions since they allow for separation of the dependence and marginal distributions.^22–24 That is, the proposed model, termed novel rare variant association test (NRVAT), uses (marginal) generalized link functions to relate the binary phenotype to both the covariates and genetic variants, and models the dependence between relatives through a Gaussian copula whose correlation matrix corresponds to the familial kinship matrix, describing the polygenic relationship between subjects of the same family. The proposed joint modelling approach has several advantages over current GLMM-based methods, such as GMMAT and SMMAT. For instance, the dependence structure can be captured using a different copula model (e.g. Student $t$ or chi-square copula), which allows for more flexible models. Because the association between the binary phenotype and both the covariates and genotypes can be captured using the usual (marginal) generalized linear models (GLMs), the phenotype–genotype association parameters are simple to interpret. Moreover, the trait polygenic heritability is ‘margin-free’, in the sense that it is characterized by the copula alone.

The remainder of this paper is structured as follows. In Section 2, we formally describe the proposed statistical framework. Numerical studies are conducted to compare different models in Section 3. In Section 4, we apply the proposed method to real data. We discuss and conclude this paper in Section 5.

2. Data and model

Consider $I$ families and, for $i = 1, \dots, I$, let $n_{i}$ be the size of the $i$th family. The total sample size is $N = \sum_{i = 1}^{I} n_{i}$. Let $Y_{i j} \in \{0, 1\}$ be the binary phenotype under investigation for individual $j$ in family $i$, for $i = 1, \dots, I$ and $j = 1, \dots, n_{i}$. We begin by specifying the (conditional) marginal distribution of $Y_{i j}$, denoted $F (Y_{i j} | X_{i j}, G_{i j})$, where $X_{i j} = (1, X_{i j 1}, \dots, X_{i j s})^{⊤}$ is an $(s + 1) \times 1$ vector of covariates including the intercept and $G_{i j} = (G_{i j 1}, \dots, G_{i j r})^{⊤}$ is an $r \times 1$ vector of genotypes coded as $(0, 1, 2)$ from biallelic variants. Since the response variable is dichotomous, $F (Y_{i j} | X_{i j}, G_{i j})$ is completely specified by $μ_{i j} = P (Y_{i j} = 1 | X_{i j}, G_{i j})$. Thus, we relate the binary phenotype $Y_{i j}$ to $X_{i j}$ and $G_{i j}$ through a logistic regression model

$$\operatorname{logit} (μ_{i j}) = X_{i j}^{⊤} γ + G_{i j}^{⊤} β, \quad j = 1, \dots, n_{i}, \; i = 1, \dots, I, \tag{1}$$

where $\operatorname{logit} (u) = \log \{u / (1 - u)\}$, and $γ = (γ_{0}, γ_{1}, \dots, γ_{s})^{⊤}$ and $β = (β_{1}, \dots, β_{r})^{⊤}$ are sets of regression coefficients. In matrix notation, one has

$$\operatorname{logit} (μ_{i}) = X_{i} γ + G_{i} β,$$

where $μ_{i} = (μ_{i 1}, \dots, μ_{i n_{i}})^{⊤}$, $X_{i}$ is an $n_{i} \times (s + 1)$ matrix with the $j$th row equal to $X_{i j}^{⊤}$, and $G_{i}$ is an $n_{i} \times r$ matrix with the $j$th row equal to $G_{i j}^{⊤}$. The logit is taken element-wise over the entries of $μ_{i}$.

We then specify a joint distribution for $Y_{i} = (Y_{i 1}, \dots, Y_{i n_{i}})^{⊤}$ based on a latent-variable model, which assumes a $n_{i} \times 1$ latent vector $Z_{i} = (Z_{i 1}, \dots, Z_{i n_{i}})^{⊤}$ such that

$$Y_{i j} = \begin{cases} 1 & \text{if } Z_{i j} \leq Φ^{- 1} (μ_{i j}), \\ 0 & \text{otherwise,} \end{cases} \tag{2}$$

with $Z_{i} \sim N (0_{n_{i}}, Γ_{i})$ and $Γ_{i} = h^{2} Ψ_{i} + (1 - h^{2}) I_{n_{i}}$, where $h^{2}$ measures the polygenic heritability, $Ψ_{i}$ is a matrix with entries reflecting the proportion of the genome that is shared identically by descent between subjects, and $I_{n_{i}}$ is the identity matrix of size $n_{i}$. This leads to

$$\Pr (Y_{i 1} = 1, \dots, Y_{i n_{i}} = 1 \mid X_{i}, G_{i}) = \Pr (Z_{i 1} \leq Φ^{- 1} (μ_{i 1}), \dots, Z_{i n_{i}} \leq Φ^{- 1} (μ_{i n_{i}})) = C_{Γ_{i}} (μ_{i 1}, \dots, μ_{i n_{i}}),$$

where

$$C_{Γ} (u_{1}, \dots, u_{d}) = Φ_{d} (Φ^{- 1} (u_{1}), \dots, Φ^{- 1} (u_{d}) \mid Γ) \tag{3}$$

is a Gaussian copula, with $u_{j} \in [0, 1]$, $j = 1, \dots, d$, and $Φ_{d} (\cdot \mid Γ)$ is the $d$-dimensional Gaussian cumulative distribution function with mean zero and $d \times d$ correlation matrix $Γ$.
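The latent-variable construction in (2) can be checked numerically. The sketch below is only an illustration under stated assumptions: the trio structure, the value of $h^{2}$ and the marginal probabilities `mu` are hypothetical and are not taken from the paper. It draws $Z_{i} \sim N (0, Γ_{i})$, thresholds it at $Φ^{- 1} (μ_{i j})$, and verifies that the margins are preserved while related subjects become positively correlated.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

# Psi for a trio (father, mother, child): twice the kinship coefficients.
Psi = np.array([[1.0, 0.0, 0.5],
                [0.0, 1.0, 0.5],
                [0.5, 0.5, 1.0]])
h2 = 0.5                                     # hypothetical heritability
Gamma = h2 * Psi + (1.0 - h2) * np.eye(3)    # correlation matrix of Z_i

mu = np.array([0.10, 0.20, 0.30])            # hypothetical marginal probabilities
cut = np.array([NormalDist().inv_cdf(m) for m in mu])   # Phi^{-1}(mu_ij)

# Draw latent vectors Z_i ~ N(0, Gamma) and threshold them as in (2).
L = np.linalg.cholesky(Gamma)
Z = rng.standard_normal((200_000, 3)) @ L.T
Y = (Z <= cut).astype(int)

marginal_freq = Y.mean(axis=0)                            # close to mu
corr_parent_child = np.corrcoef(Y[:, 0], Y[:, 2])[0, 1]   # positive when h2 > 0
corr_parents = np.corrcoef(Y[:, 0], Y[:, 1])[0, 1]        # near zero (unrelated founders)
```

Because $Ψ_{i}$ has unit diagonal, $Γ_{i}$ is a correlation matrix for any $h^{2} \in [0, 1]$, which is why the marginal probabilities are unaffected by the choice of $h^{2}$.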

The role of the copula $C_{Γ_{i}}$ is to account for possible dependence between the residuals of the marginal models. In other words, the $n_{i} \times n_{i}$ variance–covariance matrix of the vector of residuals, $Σ_{i} := V a r (Y_{i} - μ_{i})$ , is given as follows:

$$Σ_{i}^{j j} = μ_{i j} (1 - μ_{i j}), \qquad Σ_{i}^{j k} = C_{{\tilde{Γ}}_{i}^{j k}} (μ_{i j}, μ_{i k}) - μ_{i j} μ_{i k}, \quad j \neq k,$$

where ${\tilde{Γ}}_{i}^{j k}$ is a $2 \times 2$ correlation matrix whose off-diagonal element equals the $(j, k)$ element of $Γ_{i}$, denoted $Γ_{i}^{j k}$.
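As a hedged illustration of how $Σ_{i}$ could be evaluated, the sketch below computes the bivariate Gaussian copula with Plackett's identity, $Φ_{2} (a, b; ρ) = Φ (a) Φ (b) + \int_{0}^{ρ} φ_{2} (a, b; r) \, d r$, using Gauss-Legendre quadrature. The function names, the trio matrix and the values of `mu` and `h2` are assumptions made for this example; the paper does not state which numerical routine it uses.

```python
import math
import numpy as np
from statistics import NormalDist

_nd = NormalDist()

def gaussian_copula_2d(u, v, rho, n_nodes=64):
    """C_rho(u, v) = Phi_2(Phi^{-1}(u), Phi^{-1}(v); rho), for |rho| < 1."""
    a, b = _nd.inv_cdf(u), _nd.inv_cdf(v)
    x, w = np.polynomial.legendre.leggauss(n_nodes)
    r = 0.5 * rho * (x + 1.0)                 # map nodes from [-1, 1] to [0, rho]
    dens = np.exp(-(a * a - 2 * r * a * b + b * b) / (2 * (1 - r * r)))
    dens /= 2 * math.pi * np.sqrt(1 - r * r)  # bivariate normal density at (a, b)
    return u * v + 0.5 * rho * np.dot(w, dens)

def residual_covariance(mu, Gamma):
    """Sigma_i = Var(Y_i - mu_i) under the Gaussian copula model."""
    n = len(mu)
    Sigma = np.diag(mu * (1.0 - mu))
    for j in range(n):
        for k in range(j + 1, n):
            c = gaussian_copula_2d(mu[j], mu[k], Gamma[j, k])
            Sigma[j, k] = Sigma[k, j] = c - mu[j] * mu[k]
    return Sigma

# Hypothetical trio (father, mother, child) with h2 = 0.5.
Psi = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.5, 0.5, 1.0]])
h2 = 0.5
Gamma = h2 * Psi + (1.0 - h2) * np.eye(3)
mu = np.array([0.10, 0.20, 0.30])
Sigma = residual_covariance(mu, Gamma)
```

The two unrelated parents get a zero covariance, while each parent-child pair gets a positive one, which is the behaviour the formula above is meant to encode.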

Motivated by such advantages provided by the proposed joint modelling approach, we suggest building a score-type test statistic for phenotype–genotype association based on the marginal logistic regression model and adjusting for possible dependence between marginal residuals via the copula model. The proposed score-type test procedure is detailed in the next section.

2.1. Inference procedure

Under the proposed model, the association between the $r$ variants and the phenotype can be tested by evaluating the null hypothesis $H_{0} : β = 0_{r}$ .

Deriving the test statistic from the complete log-likelihood function leads to complex formulae that prevent the practical implementation of the test. In this work, we consider an alternative approach and propose to derive the test statistic under the working independence assumption.^26,25 The log-likelihood function is then written as

$$l_{i n d} (β, γ) = \sum_{i = 1}^{I} \sum_{j = 1}^{n_{i}} \left[ y_{i j} (X_{i j}^{⊤} γ + G_{i j}^{⊤} β) - \log \left\{ 1 + e^{X_{i j}^{⊤} γ + G_{i j}^{⊤} β} \right\} \right]. \tag{4}$$
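Under the null hypothesis, (4) reduces to an ordinary logistic log-likelihood in $γ$. The following minimal sketch maximizes it by Newton-Raphson; the function names and the simulated data (independent subjects, with covariates mimicking those of the simulation study) are assumptions made for illustration only.

```python
import numpy as np

def loglik_ind(gamma, X, y):
    """l_ind(0, gamma): working-independence log-likelihood under H0."""
    eta = X @ gamma
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

def fit_null_logistic(X, y, n_iter=25, tol=1e-10):
    """Maximize l_ind(0, gamma) by Newton-Raphson; returns (gamma_hat, mu_hat)."""
    gamma = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ gamma))
        score = X.T @ (y - mu)                          # gradient
        info = X.T @ (X * (mu * (1.0 - mu))[:, None])   # X' Delta X
        step = np.linalg.solve(info, score)
        gamma = gamma + step
        if np.max(np.abs(step)) < tol:
            break
    mu = 1.0 / (1.0 + np.exp(-X @ gamma))
    return gamma, mu

# Hypothetical data: intercept, a uniform covariate and a Bernoulli(0.2) covariate.
rng = np.random.default_rng(2)
n = 20_000
X = np.column_stack([np.ones(n), rng.uniform(size=n), rng.binomial(1, 0.2, size=n)])
gamma_true = np.array([-2.0, 1.0, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ gamma_true))).astype(float)

gamma_hat, mu_hat = fit_null_logistic(X, y)
```

At convergence the score $X^{⊤} (y - \hat{μ})$ is numerically zero, which is the estimating equation that defines $\hat{γ}$ in the text.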

We adopt the variance-component (VC) hypothesis testing technique, which is a standard approach in the region-based association framework. That is, we treat $β$ as a random vector following an arbitrary distribution $H$ with mean zero and variance–covariance matrix $τ W$, where $τ$ is a VC scalar and $W = \operatorname{diag} (w_{1}, \dots, w_{r})$ is an $r \times r$ diagonal matrix of a priori weights to be used for the $r$ variants; here we used MAF-based weights. Thus, testing $H_{0} : β = 0_{r}$ is equivalent to testing $τ = 0$. Under such a framework, one has

$$l_{V C} (τ, γ) = \log \left\{ \int e^{l_{i n d} (β, γ)} \, d H (β \mid τ) \right\}.$$

Following the same rationale as in the derivation of the SKAT¹⁵ and VC copula-based score statistics of Lakhal-Chaieb et al.⁷ (refer to Supplemental materials for details), we show that

$$\left. \frac{\partial l_{VC} (τ, γ)}{\partial τ} \right|_{τ = 0} = \frac{1}{2} \left[ (Y - μ)^{⊤} K (Y - μ) - \operatorname{tr} (K Δ) \right], \tag{5}$$

where $Y = (Y_{1}^{⊤}, \dots, Y_{I}^{⊤})^{⊤}$, $μ = (μ_{1}^{⊤}, \dots, μ_{I}^{⊤})^{⊤}$, $Δ = \operatorname{diag} (Δ_{1}, \dots, Δ_{I})$ and $Δ_{i} = \operatorname{diag} [μ_{i} (1 - μ_{i})]$, $i = 1, \dots, I$. Here, $K$ is an $N \times N$ kernel matrix that can be written as

$$K = \begin{pmatrix} K_{11} & K_{12} & \dots & K_{1 I} \\ \vdots & \vdots & \ddots & \vdots \\ K_{I 1} & K_{I 2} & \dots & K_{I I} \end{pmatrix}, \tag{6}$$

where $K_{i i^{'}}$ is an $n_{i} \times n_{i^{'}}$ sub-matrix with the $(j, j^{'})$ entry equal to $\sum_{l = 1}^{r} w_{l} G_{i j l} G_{i^{'} j^{'} l}$.
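A small numerical sketch may help fix ideas about the linear kernel in (6) and the quadratic form $(Y - μ)^{⊤} K (Y - μ)$. Everything below is illustrative: the genotypes are independent binomial draws, the weight function `w` is an assumed MAF-based choice (the text does not specify which one is used), and the null fit is intercept-only.

```python
import numpy as np

rng = np.random.default_rng(1)
N, r = 30, 5
maf = np.array([0.01, 0.02, 0.05, 0.10, 0.30])       # hypothetical allele frequencies
G = rng.binomial(2, maf, size=(N, r)).astype(float)  # stacked genotypes, coded 0/1/2
w = 1.0 / np.sqrt(maf * (1.0 - maf))                 # assumed MAF-based weights

y = rng.binomial(1, 0.3, size=N).astype(float)       # hypothetical binary phenotype
mu_hat = np.full(N, y.mean())                        # intercept-only null fit

# Linear kernel: K[j, j'] = sum_l w_l * G[j, l] * G[j', l], i.e. K = G W G'.
K = (G * w) @ G.T

resid = y - mu_hat
Q = float(resid @ K @ resid)

# Equivalent variant-by-variant form: sum_l w_l * (G[:, l]' resid)^2.
Q_by_variant = float(np.sum(w * (G.T @ resid) ** 2))
```

The second expression shows that, with the linear kernel, the quadratic form is a weighted sum of squared single-variant score statistics, so it never requires forming the $N \times N$ matrix explicitly.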

The variability of the second term of the right-hand side of (5) is negligible compared with the first one. This prompts us to consider the test statistic $Q = (Y - \hat{μ})^{⊤} K (Y - \hat{μ})$ , where $\hat{μ}$ is the estimator of $μ$ under the null hypothesis. Explicitly, one has $\hat{μ} = ({\hat{μ}}_{1}^{⊤}, \dots, {\hat{μ}}_{I}^{⊤})^{⊤}$ , where ${\hat{μ}}_{i} = g_{i} (\hat{γ}), g_{i} (γ) = {logit}^{- 1} (X_{i} γ)$ , $i = 1, \dots, I$ and $\hat{γ} = {argmax}_{γ} l_{ind} (0, γ)$ . In Appendix A, we show that

$$\sqrt{I} (\hat{γ} - γ) = A^{- 1} \frac{1}{\sqrt{I}} \sum_{i = 1}^{I} S_{i} (γ) + o_{p} (1),$$

where $A = \lim_{I \to \infty} I^{- 1} \sum_{i = 1}^{I} X_{i}^{⊤} Δ_{i} X_{i}$ and $S_{i} (γ) = X_{i}^{⊤} [Y_{i} - g_{i} (γ)]$. Therefore, one has $\sqrt{I} (\hat{γ} - γ) \sim N (0, A^{- 1} B A^{- 1})$, where $B = \lim_{I \to \infty} I^{- 1} \sum_{i = 1}^{I} X_{i}^{⊤} Σ_{i} X_{i}$ can be consistently estimated by $I^{- 1} \sum_{i = 1}^{I} S_{i} (γ) S_{i} (γ)^{⊤}$.
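The sandwich form $A^{- 1} B A^{- 1}$ can be assembled family by family. The sketch below is a hedged illustration: the function name is hypothetical, the data are simulated with independent outcomes within families purely to exercise the code, and a fixed vector `gamma` stands in for $\hat{γ}$.

```python
import numpy as np

def sandwich_covariance(X_list, y_list, gamma):
    """Estimate Var(gamma_hat) as A^{-1} B A^{-1} / I from per-family scores."""
    n_fam = len(X_list)
    p = X_list[0].shape[1]
    A = np.zeros((p, p))
    B = np.zeros((p, p))
    for X_i, y_i in zip(X_list, y_list):
        mu_i = 1.0 / (1.0 + np.exp(-X_i @ gamma))
        A += X_i.T @ (X_i * (mu_i * (1.0 - mu_i))[:, None])   # X_i' Delta_i X_i
        S_i = X_i.T @ (y_i - mu_i)                            # family score
        B += np.outer(S_i, S_i)
    A /= n_fam
    B /= n_fam
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv / n_fam

# Hypothetical data: 120 families of size 4.
rng = np.random.default_rng(3)
gamma = np.array([-2.0, 1.0, 1.0])       # stands in for gamma_hat
X_list, y_list = [], []
for _ in range(120):
    X_i = np.column_stack([np.ones(4), rng.uniform(size=4),
                           rng.binomial(1, 0.2, size=4)])
    mu_i = 1.0 / (1.0 + np.exp(-X_i @ gamma))
    X_list.append(X_i)
    y_list.append(rng.binomial(1, mu_i).astype(float))

cov_gamma = sandwich_covariance(X_list, y_list, gamma)
```

Summing outer products of whole-family scores, rather than of individual scores, is what lets the estimate of $B$ absorb within-family correlation without modelling it.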

In Appendix B, we show that the variance–covariance matrix of $Y_{i} - {\hat{μ}}_{i}$ is

$$Ω_{i i} = E [(Y_{i} - {\hat{μ}}_{i}) (Y_{i} - {\hat{μ}}_{i})^{⊤}] = Σ_{i} - \frac{1}{I} Δ_{i} X_{i} A^{- 1} X_{i}^{⊤} Σ_{i} - \frac{1}{I} Σ_{i} X_{i} A^{- 1} X_{i}^{⊤} Δ_{i} + \frac{1}{I} Δ_{i} X_{i} A^{- 1} B A^{- 1} X_{i}^{⊤} Δ_{i},$$

and that the covariance matrix of $Y_{i} - {\hat{μ}}_{i}$ and $Y_{i^{'}} - {\hat{μ}}_{i^{'}}$, for $i \neq i^{'}$, is

$$Ω_{i i^{'}} = E [(Y_{i} - {\hat{μ}}_{i}) (Y_{i^{'}} - {\hat{μ}}_{i^{'}})^{⊤}] = - \frac{1}{I} Δ_{i} X_{i} A^{- 1} X_{i^{'}}^{⊤} Σ_{i^{'}} - \frac{1}{I} Σ_{i} X_{i} A^{- 1} X_{i^{'}}^{⊤} Δ_{i^{'}} + \frac{1}{I} Δ_{i} X_{i} A^{- 1} B A^{- 1} X_{i^{'}}^{⊤} Δ_{i^{'}}.$$

Therefore, the variance–covariance matrix of $Y - \hat{μ}$ is

$$Ω = \begin{pmatrix} Ω_{11} & Ω_{12} & \dots & Ω_{1 I} \\ \vdots & \vdots & \ddots & \vdots \\ Ω_{I 1} & Ω_{I 2} & \dots & Ω_{I I} \end{pmatrix}.$$

Hence, the distribution of the test statistic $Q$ under the null hypothesis can be approximated by a weighted mixture of chi-squared distributions $\sum_{n = 1}^{N} θ_{n} χ_{1, n}^{2}$, where $(θ_{1}, \dots, θ_{N})$ are the eigenvalues of $Ω^{1 / 2} K Ω^{1 / 2}$ and $χ_{1, n}^{2}$ are independent $χ_{1}^{2}$ random variables. In practice, $Ω$ involves $h^{2}$, which is unknown. In this work, we replace $h^{2}$ by its estimator obtained by maximizing, under the null hypothesis $H_{0} : β = 0_{r}$, the pairwise log-likelihood, whose formula is given by

$$\sum_{i = 1}^{I} \sum_{1 \leq j \neq j^{'} \leq n_{i}} \log [P (Y_{i j} = y_{i j}, Y_{i j^{'}} = y_{i j^{'}} \mid X_{i j}, X_{i j^{'}})].$$

See Appendix A for more details.
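To illustrate how a $p$-value could be obtained from the mixture $\sum_{n} θ_{n} χ_{1, n}^{2}$, the sketch below uses plain Monte Carlo sampling. This is a swapped-in technique chosen for transparency, not necessarily the numerical method used by the authors, and the function name, the number of draws and the toy matrices are assumptions.

```python
import numpy as np

def mixture_chi2_pvalue(q_obs, Omega, K, n_draws=100_000, seed=0):
    """Monte Carlo estimate of P(sum_n theta_n * chi2_1 >= q_obs)."""
    # Symmetric square root of Omega via its eigendecomposition.
    vals, vecs = np.linalg.eigh(Omega)
    root = (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    theta = np.linalg.eigvalsh(root @ K @ root)
    # Keep only non-negligible eigenvalues (at most r for a linear kernel).
    theta = theta[theta > 1e-10 * theta.max()]
    rng = np.random.default_rng(seed)
    draws = rng.chisquare(1, size=(n_draws, theta.size)) @ theta
    return float(np.mean(draws >= q_obs)), theta

# Toy check: Omega = I and K = diag(1, 1, 0) give Q ~ chi2 with 2 df,
# whose survival function is exp(-q / 2).
Omega = np.eye(3)
K = np.diag([1.0, 1.0, 0.0])
p_value, theta = mixture_chi2_pvalue(2.0, Omega, K)
```

Monte Carlo resolution is limited to roughly `1 / n_draws`, so very small $p$-values would need an analytic or saddlepoint-type approximation instead.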

2.2. Generalizations

In this work, the dependence between the outcomes $(Y_{i 1}, \dots, Y_{i n_{i}})$ is modelled via a Gaussian copula. The latter is used to express the variance–covariance matrix $Σ_{i}$ and to estimate the polygenic heritability parameter $h^{2}$. Therefore, one may generalize the derived test to any elliptical copula (²³, chapter 4). The robustness of the derived test to the specification of the underlying copula is investigated in the simulation studies section. The kernel matrix $K$ in equation (6) is known in the literature as the linear kernel matrix.⁴ However, several other choices, such as the quadratic and Gaussian kernels, can be incorporated in the derived test.²⁷ In the simulations, their impact on both the type I error and the power of the proposed test is investigated.
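For concreteness, the sketch below writes down forms of the linear, quadratic, identity-by-state (IBS) and Gaussian kernels that are common in the kernel-machine association literature. The exact definitions, weights and bandwidth used in this paper are not given in this section, so the formulas, the function names and the parameter `rho` should be read as assumptions.

```python
import numpy as np

def linear_kernel(G, w):
    """K[j, j'] = sum_l w_l G_jl G_j'l."""
    return (G * w) @ G.T

def quadratic_kernel(G, w):
    """K[j, j'] = (1 + sum_l w_l G_jl G_j'l)^2."""
    return (1.0 + (G * w) @ G.T) ** 2

def ibs_kernel(G, w):
    """Weighted IBS: K[j, j'] = sum_l w_l (2 - |G_jl - G_j'l|)."""
    diff = np.abs(G[:, None, :] - G[None, :, :])
    return ((2.0 - diff) * w).sum(axis=2)

def gaussian_kernel(G, w, rho=1.0):
    """K[j, j'] = exp(-sum_l w_l (G_jl - G_j'l)^2 / rho)."""
    d2 = (((G[:, None, :] - G[None, :, :]) ** 2) * w).sum(axis=2)
    return np.exp(-d2 / rho)

# Hypothetical genotypes and weights.
rng = np.random.default_rng(4)
G = rng.binomial(2, 0.1, size=(8, 6)).astype(float)
w = np.ones(6)
kernels = {name: f(G, w) for name, f in [("L", linear_kernel),
                                         ("Q", quadratic_kernel),
                                         ("IBS", ibs_kernel),
                                         ("G", gaussian_kernel)]}
```

Any of these matrices can replace $K$ in the quadratic form $Q$; only the eigenvalues entering the null distribution change.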

3. Simulations

Simulations were carried out to validate the proposed method, NRVAT, and to compare its performance with three existing set-based rare-variant association methods for binary phenotypes in family data, namely SMMAT,⁵ gSKAT,¹⁴ and the AFC tests.¹⁹ As mentioned previously, SMMAT is a GLMM method which captures family relationships via a random effect; gSKAT is a GEE-based association test which controls for familial relationships via the working-correlation trick; the AFC test is a score-type association test based on the difference of SNP allele frequencies in the cases (affected) versus controls (unaffected) and uses a multivariate random effect to account for family relationships between individuals. The comparison of methods is based on the empirical type I error rate and power.

3.1. Data generation

This section outlines the detailed steps pursued to generate simulated data for the methods’ evaluation.

Genotypes: we used the SIMULATE3 computer program,²⁸ which allows simulation of genotypes of a set of linked SNPs in family members. In all simulations, we set the number of families to $I = 120$, with $40$ families of two parents and one child, $40$ families of two parents and two children, and $40$ three-generation families of eight subjects each (i.e. $2$ cousins with their $4$ parents and $2$ grandparents). This leads to a total of $N = 600$ individuals (240 males and 360 females; 320 founders and 280 nonfounders) in each generated data set. We set the number of simulated SNPs for each subject to $r = 20$, with MAFs ranging between $0.003$ and $0.01$. We assumed linkage disequilibrium between two adjacent SNPs of $d^{2} = 0.16$ (i.e. $d^{2}$ is the squared correlation coefficient between two adjacent SNPs).
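The three pedigree structures above determine the matrices $Ψ_{i}$ (twice the kinship matrix). As an illustrative sketch, the standard recursive kinship algorithm is applied below to non-inbred founders; the function name, the pedigree encoding and the ordering of individuals within the eight-member family are assumptions made for this example.

```python
import numpy as np

def kinship_matrix(parents):
    """Kinship coefficients from a pedigree.

    parents: list of (father, mother) index pairs, (None, None) for founders;
    individuals must be ordered so that parents precede their children.
    """
    n = len(parents)
    phi = np.zeros((n, n))
    for j, (f, m) in enumerate(parents):
        if f is None:                  # founder: unrelated to earlier individuals
            phi[j, j] = 0.5
            continue
        for k in range(j):
            phi[j, k] = phi[k, j] = 0.5 * (phi[f, k] + phi[m, k])
        phi[j, j] = 0.5 * (1.0 + phi[f, m])
    return phi

F = (None, None)
trio = [F, F, (0, 1)]                          # two parents, one child
quartet = [F, F, (0, 1), (0, 1)]               # two parents, two children
# Grandparents 0-1; their children 2 and 4; married-in spouses 3 and 5;
# cousins 6 (child of 2 and 3) and 7 (child of 4 and 5).
three_gen = [F, F, (0, 1), F, (0, 1), F, (2, 3), (4, 5)]

Psi_trio = 2.0 * kinship_matrix(trio)
Psi_quartet = 2.0 * kinship_matrix(quartet)
Psi_three_gen = 2.0 * kinship_matrix(three_gen)

n_total = 40 * len(trio) + 40 * len(quartet) + 40 * len(three_gen)
n_founders = 40 * sum(p == F for ped in (trio, quartet, three_gen) for p in ped)
```

With this encoding the counts match the design described above: 600 individuals in total, 320 of them founders.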

Covariates: in all simulations, we considered two covariates associated with the binary response: a continuous covariate, $X_{1} \sim \mathrm{Uniform} [0, 1]$, and a binary covariate, $X_{2} \sim \mathrm{Bernoulli} (0.2)$. The marginal effects of $X_{1}$ and $X_{2}$ on the outcome were set to $γ_{1} = γ_{2} = 1$. The intercept coefficient was set to $γ_{0} = - 2$.

Binary phenotype: we considered three settings to generate $Y$ in order to conduct type I error and power comparisons with the competing methods. In Setting 1, $Y$ was generated based on our copula model (2). In Setting 2, $Y$ was simulated based on the GLMM model of the SMMAT approach. Finally, Setting 3 aimed to assess the robustness of the proposed copula approach with respect to misspecification of the true copula describing the dependence structure between subjects of the same family. The three settings are described in detail as follows:

Setting 1 (data generation from our model): For each family $i \in \{1, \dots, I\}$, we simulated the response variable of the $n_{i}$ subjects following our copula model (2). The data generation steps are given as follows:

1. generate the $n_{i} \times r$ matrix of genotypes $G_{i}$ from Simulate3;

2. set the first column of the $n_{i} \times 3$ matrix $X_{i}$ to be the vector of ones, then generate the entries of its two additional columns from the uniform distribution over $[0, 1]$ and a Bernoulli distribution with probability of success equal to $0.2$, respectively;

3. calculate the $n_{i} \times 1$ vector $\operatorname{logit} (μ_{i}) = X_{i} γ + G_{i} β$;

4. generate $Z_{i} \sim N (0_{n_{i}}, Γ_{i})$, with $Γ_{i} = h^{2} Ψ_{i} + (1 - h^{2}) I_{n_{i}}$;

5. generate $Y_{i j} = \begin{cases} 1 & \text{if } Z_{i j} \leq Φ^{- 1} (μ_{i j}), \\ 0 & \text{otherwise,} \end{cases}$ for $j = 1, \dots, n_{i}$.
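The five steps of Setting 1 can be sketched as follows. This is a simplified illustration, not the authors' pipeline: step 1 is replaced by independent binomial genotype draws, which ignore Mendelian transmission and linkage disequilibrium, and the function name, the trio-only design and the seed are assumptions.

```python
import numpy as np
from statistics import NormalDist

def simulate_family_copula(Psi, gamma, beta, maf, h2, rng):
    """One family from the Gaussian copula model (steps 1 to 5)."""
    n, r = Psi.shape[0], len(maf)
    # Step 1 (stand-in only): independent genotypes, no pedigree transmission.
    G = rng.binomial(2, maf, size=(n, r)).astype(float)
    # Step 2: intercept, Uniform[0, 1] and Bernoulli(0.2) covariates.
    X = np.column_stack([np.ones(n), rng.uniform(size=n),
                         rng.binomial(1, 0.2, size=n)])
    # Step 3: marginal probabilities from the logistic model (1).
    mu = 1.0 / (1.0 + np.exp(-(X @ gamma + G @ beta)))
    # Step 4: latent vector Z_i ~ N(0, Gamma_i).
    Gamma = h2 * Psi + (1.0 - h2) * np.eye(n)
    Z = np.linalg.cholesky(Gamma) @ rng.standard_normal(n)
    # Step 5: threshold at Phi^{-1}(mu_ij).
    cut = np.array([NormalDist().inv_cdf(m) for m in mu])
    Y = (Z <= cut).astype(int)
    return Y, X, G

rng = np.random.default_rng(7)
Psi_trio = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.5, 0.5, 1.0]])
gamma = np.array([-2.0, 1.0, 1.0])
maf = rng.uniform(0.003, 0.01, size=20)
beta = np.zeros(20)                      # null hypothesis: no genetic effect

families = [simulate_family_copula(Psi_trio, gamma, beta, maf, 0.5, rng)
            for _ in range(120)]
Y_all = np.concatenate([fam[0] for fam in families])
prevalence = Y_all.mean()
```

Replacing `beta` by a draw with non-zero entries for a few causal variants would give data under the alternative.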

Setting 2 (data generation from a GLMM model): We generated the response variable based on the GLMM model for family $i \in \{1, \dots, I\}$, as follows:

$$\operatorname{logit} (μ_{i}) = X_{i} γ + G_{i} β + b_{i},$$

where $b_{i} \sim N (0_{n_{i}}, h^{2} Ψ_{i})$, and $G_{i}$ and $X_{i}$ are defined as in Setting 1. $Ψ_{i}$ is twice the $i$th family theoretical kinship matrix, and $h^{2}$ is the polygenic trait heritability.
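In the GLMM of Setting 2 the regression coefficients are conditional on the random effect $b_{i}$, so the marginal prevalence differs from the one implied by a marginal logistic model with the same coefficients. The sketch below illustrates this with an intercept-only model for trios; the design, seed and sample size are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(5)
Psi = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.5, 0.5, 1.0]])
h2, gamma0 = 0.5, -2.0
n_fam = 100_000

# b_i ~ N(0, h2 * Psi), drawn through the Cholesky factor of Psi.
L = np.linalg.cholesky(Psi)
b = np.sqrt(h2) * (rng.standard_normal((n_fam, 3)) @ L.T)

mu_cond = 1.0 / (1.0 + np.exp(-(gamma0 + b)))   # conditional on b_i
Y = rng.binomial(1, mu_cond)

marginal_prev = Y.mean()
prev_at_b_zero = 1.0 / (1.0 + np.exp(-gamma0))  # expit(gamma0)
implied_marginal_intercept = np.log(marginal_prev / (1.0 - marginal_prev))
corr_parent_child = np.corrcoef(Y[:, 0], Y[:, 2])[0, 1]
```

The implied marginal intercept is larger than $γ_{0} = - 2$, which illustrates that a marginal logistic fit to GLMM-generated data targets different coefficients from the conditional ones.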

Setting 3 (copula misspecification): To investigate the robustness of the derived test to misspecification of the underlying (true) copula, we considered two scenarios in which the simulated data were generated from either the Student $t$ or the chi-square copula model, while our proposed Gaussian copula model given in (3) was fitted to the simulated data in order to derive $p$-values. More precisely, in both scenarios, for family $i$, the response variable was generated following the same steps as in Setting 1, except for Step 4. In Scenario 1, $Z_{i}$ was generated from a multivariate Student $t$ distribution with 3 degrees of freedom (df) and correlation matrix $Γ_{i}$. In Scenario 2, $Z_{i}$ was simulated from a chi-square copula distribution, as follows: we first generated ${\tilde{Z}}_{i} \sim N (0_{n_{i}}, Γ_{i})$, then we set $U_{i j} = \operatorname{sign} ({\tilde{Z}}_{i j} + a) \{Φ ({\tilde{Z}}_{i j} + 2 a) + Φ ({\tilde{Z}}_{i j}) - 1\}$, and finally we calculated the vector $Z_{i}$ such that $Z_{i j} = Φ^{- 1} (U_{i j})$, where $Φ (\cdot)$ is the standard normal cumulative distribution function. Following,²⁹ $Z_{i}$ has a multivariate chi-square copula distribution with a non-centrality parameter $a \geq 0$, and normal marginal distributions. In our simulations, we fixed $a = 1$.
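The transformation used in Scenario 2 can be checked numerically: $U_{i j}$ is the distribution function of $({\tilde{Z}}_{i j} + a)^{2}$ evaluated at its own value, so it should be uniform on $[0, 1]$ and $Z_{i j}$ standard normal. The sketch below verifies this for a single margin; the function name and the clipping constant are assumptions introduced for numerical safety.

```python
import numpy as np
from statistics import NormalDist

_nd = NormalDist()
Phi = np.vectorize(_nd.cdf)
Phi_inv = np.vectorize(_nd.inv_cdf)

def chi_square_copula_latent(Z_tilde, a=1.0):
    """Map Gaussian draws to latent values with a chi-square copula."""
    U = np.sign(Z_tilde + a) * (Phi(Z_tilde + 2.0 * a) + Phi(Z_tilde) - 1.0)
    U = np.clip(U, 1e-12, 1.0 - 1e-12)   # keep Phi^{-1} finite at the boundaries
    return Phi_inv(U), U

rng = np.random.default_rng(11)
Z_tilde = rng.standard_normal(20_000)
Z, U = chi_square_copula_latent(Z_tilde, a=1.0)
```

Applying the same map element-wise to a correlated vector ${\tilde{Z}}_{i} \sim N (0, Γ_{i})$ keeps these normal margins but changes the dependence structure, which is the point of the misspecification scenario.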

In all three settings, we considered $h^{2} \in \{0, 0.2, 0.5\}$. Table 1 describes the parameter combinations used in our simulation studies.

Table 1.
Parameter combinations for the simulation studies under all the settings and scenarios. $I$: number of families; $N$: total sample size; $h^{2}$: polygenic heritability; $τ$: variance component (VC); $γ_{0}$: intercept; $γ_{1}$ and $γ_{2}$: covariate effects; $β$: genotype effects; $X_{1}$ and $X_{2}$: covariates.

Parameters	Values
$I$	$120$
$N$	$600$
$h^{2}$	$\{0, 0.2, 0.5\}$
$τ$	$\{0.05, 0.2\}$
$γ_{0}$	$- 2$
$γ_{1}$	$1$
$γ_{2}$	$1$
$β$	$N_{20} (0, 0)$ (under null); $N_{20} (0, τ W_{20})$ (under alternative)
$X_{1}$	$U [0, 1]$
$X_{2}$	discrete 0/1 with probability of success $0.2$

Under the null model, the parameter of interest is $β = 0_{r}$; $B = 10{,}000$ random samples were generated for each parameter combination to assess the type I error rate of all the methods. We again refer the reader to Table 1 for the specific parameter combinations considered.

To evaluate the methods’ performance in terms of power, five SNPs were randomly chosen among the 20 studied SNPs as causal variants (i.e. $25\%$ of all studied SNPs). More precisely, for the five causal SNPs, we assumed their effect $β_{\mathrm{causal}} \sim N_{5} (0, τ W_{5})$, with $τ \in \{0.05, 0.2\}$, where $W_{5} = \operatorname{diag} (w_{1}, \dots, w_{5})$ is a $5 \times 5$ diagonal matrix of a priori weights to be used for the $5$ variants. The effects of the remaining SNPs were set to zero. The power comparison was based on $B = 1000$ replications.

3.2. Simulation results

This section summarizes the simulation results for the Type I error rate for the scenario with $h^{2} = 0.5,$ and the power levels for all settings described above.

3.2.1. Type I error results

Results of Setting 1: Figure 1 and Table S1 (Supplemental material) show quantile–quantile (QQ)-plots of the $p$-values and the empirical type I error rate of all the considered methods, respectively, where data are generated under the Gaussian copula model. The empirical type I error rates of NRVAT, SMMAT, AFC (Xc) and gSKAT (perturbed) are close to the nominal level ($α = 0.01$). By contrast, the empirical type I error of gSKAT (asymptotic) is lower than the nominal level, whereas AFC (QLS) has an inflated type I error rate. Our proposed approach shows a well-controlled type I error rate for all the proposed kernel-based tests. Figures S1 and S2 (Supplemental material) show similar results under $h^{2} \in \{0, 0.2\}$, respectively.

Results of Setting 2: Figure 2 and Table S2 (Supplemental material) show the results of all the methods when the data are generated from the GLMM model. As expected, the empirical type I error rate of SMMAT is well-controlled since it is a GLMM-based association test. The empirical type I error of NRVAT is controlled for $h^{2} \in \{0, 0.2\}$ (Figures S3 and S4 of Supplemental material, respectively); however, the method seems to be conservative for $h^{2} = 0.5$. This might be due to the bias introduced in the NRVAT estimation of the marginal model parameters using the GLM model (1). In fact, from Table S6 of Supplemental material, one can see that the bias in the estimation of $γ_{0}$, $γ_{1}$, and $γ_{2}$ increases as $h^{2}$ increases. On the other hand, AFC (Xc) and gSKAT (perturbed) show no type I error rate inflation for the three values of the polygenic heritability ($h^{2}$). The gSKAT (asymptotic) test is still conservative, whereas the AFC test (QLS) remains inflated. Figures S3 and S4 of Supplemental material show similar results under $h^{2} \in \{0, 0.2\}$, respectively.

Results of Setting 3: Figure 3 and Table S3 (Supplemental material), and Figure 4 and Table S4 (Supplemental material) describe the empirical type I error rate results for data generated under the Student $t$ and the chi-square copula models, respectively. Under the Student $t$ copula model (Figure 3 and Supplemental Table S3), NRVAT seems to be sensitive to misspecification of the true copula, showing a slightly higher empirical type I error rate than the nominal level $α = 0.01$. However, under the chi-square copula model (Figure 4 and Supplemental Table S4), NRVAT shows a well-controlled type I error. By contrast, in both scenarios of Setting 3, all the competitors behave in a similar way as in Setting 1. That is, SMMAT, AFC (Xc) and gSKAT (perturbed) have valid type I error rates, whereas the AFC test (QLS) remains inflated and gSKAT (asymptotic) remains conservative. Figures S5 to S8 show similar results under $h^{2} \in \{0, 0.2\}$, for the two scenarios of Setting 3, respectively.

Supplemental Tables S7 and S8 report the empirical bias of our method's estimates of the nuisance parameters and the polygenic heritability under $H_{0}$ for the two scenarios of Setting 3.

Figure 1.

QQ-plot under the null hypothesis of no SNPs/phenotype association $(τ = 0),$ with the heritability parameter $h^{2} = 0.5,$ where the data are generated under the Gaussian copula. Results are computed from 10,000 data sets generated under Setting 1. The compared methods are NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with $X_{c}^{2}$ (Xc), and $W_{QLS}$ (QLS); and gSKAT model with the asymptotic and perturbed. QQ: quantile–quantile; SNP: single nucleotide polymorphism; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.

Figure 2.

QQ-plot under the null hypothesis of no SNPs/phenotype association $(τ = 0),$ with the heritability parameter $h^{2} = 0.5,$ where the data are generated under the GLMM. Results are computed from 10,000 data sets generated under Setting 2. The compared methods are NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with $X_{c}^{2}$ (Xc), and $W_{QLS}$ (QLS); and GSKAT model with the asymptotic and perturbed. QQ: quantile–quantile; SNP: single nucleotide polymorphism; GLMM: generalized linear mixed model; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.

Figure 3.

QQ-plot under the null hypothesis of no SNPs/phenotype association $(τ = 0),$ with the heritability parameter $h^{2} = 0.5,$ where the data are generated under the student $t$ copula model ( $d f = 3$ ). Results are computed from 10,000 data sets generated under Scenario 1 of Setting 3. The compared methods are NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with $X_{c}^{2}$ (Xc), and $W_{QLS}$ (QLS); and GSKAT model with the asymptotic and perturbed. QQ: quantile–quantile; SNP: single nucleotide polymorphism; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.

Figure 4.

QQ-plot under the null hypothesis of no SNPs/phenotype association $(τ = 0),$ with the heritability parameter $h^{2} = 0.5,$ where the data are generated under the chi-square copula model, with a non-centrality parameter $a = 1$ . Results are computed from 10,000 data sets generated under Scenario 2 of Setting 3. The compared methods are NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with $X_{c}^{2}$ (Xc), and $W_{QLS}$ (QLS); and GSKAT model with the asymptotic and perturbed. QQ: quantile–quantile; SNP: single nucleotide polymorphism; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.

Of note, the proposed method, like its competitors, might be sensitive to selection bias and it does not handle missing values; that is, subjects with missing values are removed from the data analysis. Although this is beyond the scope of this work, real familial data are subject to selection bias and/or missing genotypes/phenotypes, so we conducted a sensitivity analysis to assess the impact of selection bias and missing data on our method. The Supplemental materials report additional simulations under the null hypothesis for Setting 1, with data generated under three scenarios: (a) selection bias, (b) missing at random, and (c) missing completely at random.

To assess the impact of linkage disequilibrium between SNPs (i.e. $d^{2}$ values) on the performance of the proposed method and its competitors, we carried out additional simulation scenarios with different values of $d^{2}$. More precisely, the Supplemental material presents additional simulations under the null hypothesis of Setting 1, where the simulated data were generated under two scenarios: (1) $d^{2} = 0.25$, and (2) $d^{2} = 0.36$.

3.2.2. Power results

Figures 5 to 8 outline the empirical power levels for all the methods where the data are generated from the Gaussian copula model (Setting 1), the GLMM (Setting 2), and the Student $t$ and chi-square copula models (Setting 3). The AFC (QLS) method was omitted from the power comparison as it demonstrates severe type I error rate inflation (Figures 1 to 4 and Figures S1 to S8 of Supplemental material). Interestingly, in all settings, one can notice a substantial gain in the power of NRVAT with both the IBS and the Gaussian kernel matrices. Figure 9 and Figures S9 and S10 of Supplemental material show the power levels as a function of a grid of values of the variance component $τ,$ respectively, for $h^{2} = 0.2,$ $h^{2} = 0$ and $h^{2} = 0.5$, under the Gaussian copula model (Setting 1). Again, these figures illustrate the important gain in power achieved by NRVAT with the IBS and the Gaussian kernel similarity matrices.
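The empirical power (and, under the null hypothesis, the empirical type I error rate) reported in such simulations is the proportion of replicated data sets in which a test rejects at the chosen level. The following is a minimal sketch of that calculation; the function names and the simulated $p$-values are hypothetical and are not taken from the NRVAT software.

```python
import numpy as np

def empirical_rejection_rate(p_values, alpha=0.05):
    """Proportion of replicates whose p-value falls below alpha.

    Under the alternative this estimates power; under the null it
    estimates the type I error rate.
    """
    p = np.asarray(p_values, dtype=float)
    p = p[~np.isnan(p)]  # ignore replicates where no p-value was produced
    if p.size == 0:
        raise ValueError("no valid p-values")
    return float(np.mean(p < alpha))

def binomial_standard_error(rate, n_replicates):
    """Monte Carlo standard error of an estimated rejection rate."""
    return float(np.sqrt(rate * (1.0 - rate) / n_replicates))

# Hypothetical stand-in for the p-values of one method over 1000 replicates
rng = np.random.default_rng(0)
simulated_p = rng.uniform(size=1000)
rate = empirical_rejection_rate(simulated_p, alpha=0.05)
se = binomial_standard_error(rate, simulated_p.size)
```

With 1000 replicates, the Monte Carlo standard error of a rate near 0.5 is about 0.016, which gives a sense of how large a difference between two power curves needs to be before it is meaningful.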

4. Application to Real Data

4.1. Schizophrenia and bipolar disorder family study

The data in this analysis consist of 640 subjects belonging to 17 extended families from Eastern Quebec, with some family members known to have schizophrenia (SZ) or bipolar disorder (BP).³⁰ We considered gene-based association analyses of the SZ ( $61$ affected, $435$ non-affected, and $144$ unknown) and BP ( $91$ affected, $405$ non-affected, and $144$ unknown) binary phenotypes; we also analysed a ‘common locus’ (CL) binary phenotype, for which diseased individuals are defined as subjects with SZ and/or BP ( $166$ affected, $330$ non-affected, and $144$ unknown). We considered genomic regions with significant linkage findings for SZ and BP in the Eastern Quebec family data.³⁰ That is, for the gene-based analysis of the SZ trait, $10{,}088$ SNPs clustered within $291$ genes were considered. The analysis of the BP trait considered $5979$ SNPs falling within $163$ genes. The CL phenotype analysis consisted of $9919$ SNPs and $281$ genes. The whole-genome SNP genotyping was performed on the Illumina OmniExpress24 platform, and the genotype data were prepared by Chagnon et al.³⁰ After removing both subjects and SNPs with missing values, a total of 433 subjects were available for analysis, with 57 affected (AF) and 376 non-affected relatives (NAR) for SZ, 83 AF versus 350 NAR for BP, and 153 AF against 280 NAR for CL. Finally, $288$ genes were retained for the SZ analysis, for a total of $10{,}085$ SNPs, with gene size varying between $2$ and $239$ SNPs and $13\%$ of SNPs having MAF $< 5\%$. The BP analysis considered $161$ genes, for a total of $5977$ SNPs, with gene size varying from $2$ to $285$ SNPs and $12\%$ of SNPs having MAF $< 5\%$. The genotypes of the SNPs corresponding to the CL genomic regions were all available, so there was no reduction in the number of SNPs; the gene size in this analysis ranges between $2$ and $285$ SNPs, with $13\%$ of SNPs having MAF $< 5\%$. For the three traits, we fitted the NRVAT, SMMAT, AFC, and GSKAT set-based association tests. In all analyses, we adjusted for sex ( $255$ females vs. $178$ males) as a covariate.
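The proportion of rare variants quoted above (SNPs with MAF $< 5\%$) can be obtained directly from a genotype matrix. The sketch below assumes genotypes coded as 0, 1, or 2 copies of a counted allele with no missing entries (missing subjects and SNPs having already been removed); the function names and the toy matrix are illustrative only.

```python
import numpy as np

def minor_allele_frequency(genotypes):
    """Per-SNP minor allele frequency.

    genotypes: array of shape (n_subjects, n_snps), coded 0/1/2 copies
    of the counted allele, with no missing values.
    """
    g = np.asarray(genotypes, dtype=float)
    freq = g.mean(axis=0) / 2.0          # frequency of the counted allele
    return np.minimum(freq, 1.0 - freq)  # fold to the minor allele

def rare_fraction(genotypes, threshold=0.05):
    """Fraction of SNPs whose MAF is strictly below the threshold."""
    maf = minor_allele_frequency(genotypes)
    return float(np.mean(maf < threshold))

# Toy genotype matrix: 4 subjects, 3 SNPs (illustrative values only)
toy_genotypes = np.array([
    [0, 2, 0],
    [1, 2, 0],
    [0, 2, 0],
    [0, 1, 0],
])
toy_maf = minor_allele_frequency(toy_genotypes)
```

Note that allele frequencies computed this way ignore relatedness between family members; they are a descriptive summary rather than a founder-based estimate.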

Figure 5.

Empirical power under the alternative hypothesis of SNPs/phenotype association $τ = 0.05$ and $τ = 0.2$ (respectively, for the first line and the second line) where the data are generated under the Gaussian copula. Results are computed from 1000 data sets generated with 25% of causal variants taken randomly from the regions’ size (20) under Setting 1. The compared methods are NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with $X_{c}^{2}$ (Xc); and GSKAT model with the asymptotic and perturbed. SNP: single nucleotide polymorphism; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.

Figure 6.

Empirical power under the alternative hypothesis of SNPs/phenotype association $τ = 0.05$ and $τ = 0.2$ (respectively, for the first line and the second line) where the data are generated under the GLMM. Results are computed from 1000 data sets generated with 25% of causal variants taken randomly from the regions size (20) under Setting 2. The compared methods are NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with $X_{c}^{2}$ (Xc); and GSKAT model with the asymptotic and perturbed. SNP: single nucleotide polymorphism; GLMM: generalized linear mixed model; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.

Figure 7.

Empirical power under the alternative hypothesis of SNPs/phenotype association $τ = 0.05$ and $τ = 0.2$ (respectively, for the first line and the second line) where the data are generated under the Student $t$ copula ( $df = 3$ ). Results are computed from 1000 data sets generated with 25% of causal variants taken randomly from the region size (20) under Scenario 1 of Setting 3. The compared methods are NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with $X_{c}^{2}$ (Xc); and GSKAT model with the asymptotic and perturbed tests. SNP: single nucleotide polymorphism; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.

Figure 8.

Empirical power under the alternative hypothesis of SNPs/phenotype association $τ = 0.05$ and $τ = 0.2$ (respectively, for the first line and the second line) where the data are generated under the chi-square copula with a non-centrality parameter $a = 1$ . Results are computed from 1000 data sets generated with 25% of causal variants taken randomly from the regions’ size (20) under Scenario 2 of Setting 3. The compared methods are NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with $X_{c}^{2}$ (Xc); and GSKAT model with the asymptotic and perturbed. SNP: single nucleotide polymorphism; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.

Figure 9.

Power function under the alternative hypothesis of SNPs/phenotype association over the grid $τ \in \{0, 0.01, 0.05, 0.2\}$ for the polygenic heritability parameter $h^{2} = 0.2$ where the data are generated under the Gaussian copula. Results are computed from 1000 data sets generated with 25% of causal variants taken randomly from the region size (20) under Setting 1. The compared methods are NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with $X_{c}^{2}$ (Xc); and GSKAT model with the asymptotic and perturbed tests. SNP: single nucleotide polymorphism; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.

4.2. Results of SZ and BP analysis

Figures 10 to 12 show the QQ plots of the $p$-values obtained from the gene-based analyses of SZ, BP, and CL. Figure 13 shows the overall QQ-plot, in which the $p$-values from the analyses of the three traits are pooled into a single set. From these figures, one can see that all the methods have valid type I error rates in the analyses of both the SZ and CL traits (Figures 10 and 12), except AFC (QLS), which has a severely inflated type I error rate, as was also observed in the simulation results. In contrast, in the BP analysis, AFC (Xc), SMMAT, and GSKAT (perturbed) present inflated type I error rates, whereas NRVAT still yields valid results.
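A QQ plot of this kind compares the ordered observed $p$-values with the quantiles expected under a uniform distribution, both on the $-\log_{10}$ scale. The sketch below computes the plotted coordinates only; the plotting positions $(k - 0.5)/n$ are one common convention and are an assumption here, not necessarily the one used for Figures 10 to 13.

```python
import numpy as np

def qq_points(p_values):
    """Return (expected, observed) -log10 p-values for a QQ plot.

    Both arrays are ordered from the most to the least significant value.
    """
    p = np.sort(np.asarray(p_values, dtype=float))
    p = np.clip(p, np.finfo(float).tiny, 1.0)   # guard against log10(0)
    n = p.size
    expected = (np.arange(1, n + 1) - 0.5) / n  # uniform plotting positions
    return -np.log10(expected), -np.log10(p)

# Hypothetical gene-level p-values for one method and one trait
example_p = [0.5, 0.05, 0.005]
expected_q, observed_q = qq_points(example_p)
```

Points lying systematically above the diagonal (observed larger than expected across the whole range) indicate an inflated type I error rate, whereas departures confined to the upper tail are what one hopes to see for true signals.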

Table 2 reports the genes/regions with $p$-values that are significant at the nominal significance level of $0.05$, after correcting for multiple testing using the Bonferroni correction, for all the methods. In the SZ analysis, the strongest association is detected for the gene SMYD3 by our method with the quadratic kernel matrix (NRVAT-Quadratic) and by the SMMAT model with the hybrid test (O) ( $p$-value $\leq 0.05 / 288 = 1.74 \times 10^{-4}$ ). Significant signals are also detected in the BP analysis for the genes C1ORF77 (NRVAT-Quadratic), CGI-96 (NRVAT-Gaussian and NRVAT-Polynomial), and TRIM24 (AFC (Xc)) ( $p$-value $\leq 0.05 / 161 = 3.1 \times 10^{-4}$ ). No significant signals were declared from the CL analysis.

Table 2.
Significant genes after Bonferroni correction of NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with $X_{corrected}^{2}$ (Xc), and $W_{QLS}$ (QLS); and GSKAT model with the asymptotic and perturbed for SZ, BP disorder, and their CL family study data at nominal level $α = 0.05$ and the overall Bonferroni correction, which corrects for multiple testing of the total number of tests conducted in the three analyses, for each method.

		NRVAT					SMMAT		AFC		GSKAT
$α$	Traits	L	Q	IBS	G	P	O	E	Xc	QLS	Asymp	Pert
$0.05$	SZ	–	SMYD3 ( $16 \times 10^{-5}$ )	–	–	–	SMYD3 ( $5.82 \times 10^{-5}$ )	–	–	*	–	–
	BP	–	C1ORF77 ( $28 \times 10^{-5}$ )	–	CGI-96 ( $8 \times 10^{-5}$ )	CGI-96 ( $3.64 \times 10^{-6}$ )	–	–	TRIM24 ( $19 \times 10^{-5}$ )	*	–	–
	CL	–	–	–	–	–	–	–	–	*	–	–
	Overall	–	–	–	–	CGI-96	SMYD3	–	–	*	–	–

NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; CL: common locus; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; GSKAT: burden and kernel-based gene set association tests for binary traits; BP: bipolar disorder; SZ: schizophrenia; *: method with inflated type I error rate; –: no significant gene.

Figure 10.

QQ-plot for NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with $X_{c}^{2}$ (Xc), and $W_{QLS}$ (QLS); and GSKAT model with the asymptotic and perturbed for schizophrenia family study data. QQ: quantile–quantile; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.

Figure 11.

QQ-plot for NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with $X_{c}^{2}$ (Xc), and $W_{QLS}$ (QLS); and GSKAT model with the asymptotic and perturbed for bipolar disorder family study data. QQ: quantile–quantile; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison; GSKAT: burden and kernel-based gene set association tests for binary traits.

Figure 12.

QQ-plot for NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with $X_{c}^{2}$ (Xc), and $W_{QLS}$ (QLS); and GSKAT model with the asymptotic and perturbed tests for the CL family study data, for which diseased individuals are defined as subjects with SZ and/or BP. QQ: quantile–quantile; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; GSKAT: burden and kernel-based gene set association tests for binary traits; CL: common locus; SZ: schizophrenia; BP: bipolar disorder.

Figure 13.

QQ-plot for NRVAT model with the L, Q, IBS, G, and P kernel matrices; SMMAT model with the hybrid test (O), and the efficient hybrid test (E); AFC model with $X_{c}^{2}$ (Xc), and $W_{QLS}$ (QLS); and GSKAT model with the asymptotic and perturbed tests for the overall analysis of the family study data (combining SZ, BP, and CL). QQ: quantile–quantile; NRVAT: novel rare variant association test; L: linear; Q: quadratic; IBS: identity-by-state; G: Gaussian; P: polynomial; SMMAT: variant-set mixed model association tests; AFC: allele frequency comparison tests; GSKAT: burden and kernel-based gene set association tests for binary traits; CL: common locus; SZ: schizophrenia; BP: bipolar disorder.

Table 2 also highlights the significant results from the overall Bonferroni correction, which corrects for the total number of tests conducted in the three analyses, for each method; that is, the Bonferroni-corrected threshold equals $0.05 / (288 + 161 + 281) = 6.85 \times 10^{-5}$. We note that the genes CGI-96 and SMYD3 remain significant for the NRVAT-Polynomial and SMMAT-O methods, respectively. Indeed, SMYD3 is one of the histone methyltransferases that catalyse methylation of histone H3 at K4 in mammalian cells,³¹ and one of its rare SNVs, namely rs6426297, has been reported to be associated with the SZ trait. This rare SNV is associated with suicide attempts in people with SZ and BP (for more details, see https://www.ebi.ac.uk/gwas/variants/rs6426297).
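The per-trait and overall Bonferroni thresholds used in this section follow from the numbers of genes retained in each analysis. The short sketch below reproduces that arithmetic; the gene counts and the two $p$-values checked at the end are taken from the text and Table 2, while the variable and function names are illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests

# Number of genes tested per trait, as reported in the text
genes_tested = {"SZ": 288, "BP": 161, "CL": 281}

per_trait = {trait: bonferroni_threshold(0.05, m) for trait, m in genes_tested.items()}
overall = bonferroni_threshold(0.05, sum(genes_tested.values()))

# p-values reported in Table 2
smyd3_smmat_o = 5.82e-5      # SMYD3, SMMAT hybrid test (O), SZ analysis
cgi96_nrvat_g = 8e-5         # CGI-96, NRVAT with Gaussian kernel, BP analysis

smyd3_survives_overall = smyd3_smmat_o <= overall
cgi96_g_survives_overall = cgi96_nrvat_g <= overall
```

This makes explicit why CGI-96 with the Gaussian kernel is significant at the BP-specific threshold but no longer appears in the overall row of Table 2.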

5. Discussion

In this work, we have developed NRVAT, a flexible set-based association test for rare and common variants and binary phenotypes in family-based designs. NRVAT uses marginal generalized linear mixed models to relate the outcome to both the covariates and a set of SNPs, and uses copulas to account for possible dependence between subjects of the same family/pedigree through the kinship matrix. An advantage of the NRVAT joint modelling is that the regression parameters linking both covariates and genotypes to the phenotype are marginally meaningful. Moreover, trait polygenic heritability is ‘margin-free’, in the sense that it is characterized by the copula alone. The NRVAT framework includes five different kernel matrices to capture genotype–phenotype relationships, namely, the linear (L), quadratic (Q), identity-by-state (IBS), Gaussian (G), and polynomial (P) kernel matrices. Through simulations, we have shown that NRVAT has a valid type I error rate and achieves more power than existing models in family-based designs, even when the copula is misspecified. We have also evaluated the performance of the NRVAT model using schizophrenia and bipolar disorder data, and we have found an association signal with the SZ binary trait for one gene, SMYD3, and with the BP binary trait for two genes, C1ORF77 and CGI-96.

One of the main advantages of NRVAT is that it models the dependence structure between subjects regardless of their marginal distributions, which allows the use of different margins to link the binary response to the covariates and the genotypes. For instance, one can use latent Probit or Robit²² marginal models instead of the latent logistic model in (1). Our method is also valid in the presence of selection bias or missing genotypes.
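To illustrate the separation between margins and dependence, the sketch below draws correlated binary traits for one family from a Gaussian copula and lets the marginal link be swapped between logistic and probit. It is a simplified illustration under stated assumptions, not the NRVAT implementation: the copula correlation matrix is taken to be $h^{2} \cdot 2Φ + (1 - h^{2}) I$ with $Φ$ the kinship matrix, and all function and argument names are hypothetical.

```python
import math
import numpy as np

def std_normal_cdf(z):
    """Standard normal CDF, vectorised through math.erf."""
    z = np.asarray(z, dtype=float)
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def logistic_margin(eta):
    """P(Y = 1) under a logistic marginal model with linear predictor eta."""
    return 1.0 / (1.0 + np.exp(-np.asarray(eta, dtype=float)))

def probit_margin(eta):
    """P(Y = 1) under a probit marginal model with linear predictor eta."""
    return std_normal_cdf(eta)

def simulate_family_traits(eta, kinship, h2, margin=logistic_margin, rng=None):
    """Draw binary traits for one family from a Gaussian copula.

    Assumed correlation: h2 * 2 * kinship + (1 - h2) * identity, which has
    unit diagonal when members are not inbred (self-kinship 0.5).
    """
    rng = np.random.default_rng() if rng is None else rng
    eta = np.asarray(eta, dtype=float)
    n = eta.size
    corr = h2 * 2.0 * np.asarray(kinship, dtype=float) + (1.0 - h2) * np.eye(n)
    z = rng.multivariate_normal(np.zeros(n), corr)  # latent Gaussian scores
    u = std_normal_cdf(z)                           # uniform margins
    mu = margin(eta)                                # marginal P(Y = 1)
    return (u > 1.0 - mu).astype(int)               # P(u > 1 - mu) = mu

# Two full siblings: kinship 0.25 between them, 0.5 with themselves
sib_kinship = np.array([[0.5, 0.25], [0.25, 0.5]])
y_logit = simulate_family_traits([0.0, 0.0], sib_kinship, h2=0.5,
                                 rng=np.random.default_rng(1))
y_probit = simulate_family_traits([0.0, 0.0], sib_kinship, h2=0.5,
                                  margin=probit_margin,
                                  rng=np.random.default_rng(1))
```

Changing `margin` alters only each subject's marginal probability; the dependence between relatives is governed by the copula correlation alone, which is the property the paragraph above refers to.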

It is necessary to note that the NRVAT model has some limitations. In fact, the derivation of $p$-values and the asymptotic null distribution of the NRVAT score test statistic assume that the families are independent. This is true when one uses an a priori kinship matrix to describe subjects’ relatedness, which is a block-diagonal matrix. However, in the absence of pedigree information and/or in the presence of admixture population effects, subjects’ relatedness is more suitably estimated using an empirical genetic similarity matrix, such as the GRM.^34,32,33 Such matrices, by construction, are not block diagonal, and thus their use within NRVAT might not be suitable. This means that NRVAT might not handle a population structure other than the well-defined familial structure. Also, users should be aware that the NRVAT model could give incorrect results when the number of families is small. Several extensions to our methodological proposal could also be investigated. Our approach could be extended to develop a functional association test for dichotomous traits in the presence of family data. To parallel what is done in Jiang et al.,³⁵ the genotype–phenotype functional relationship can be modelled in the marginal distributions based on generalized functional linear mixed models while, again, a copula model can be used to characterize the trait dependence between subjects of the same family. So far, our approach only handles a single binary trait for each subject, but an extension to a mixture of binary and continuous traits for familial data could be possible. In fact, an additional source of dependency, stemming from within-subject correlated phenotypic values, could be accounted for by choosing an appropriate copula model.
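Whether a given relatedness matrix is compatible with the independent-families assumption can be checked before running the test. The sketch below verifies that all entries linking subjects from different families are (numerically) zero; the function name, the tolerance, and the toy matrices are assumptions made for illustration.

```python
import numpy as np

def is_block_diagonal(relatedness, family_ids, tol=1e-8):
    """True if every between-family entry of the matrix is zero (up to tol)."""
    k = np.asarray(relatedness, dtype=float)
    fam = np.asarray(family_ids)
    if k.shape != (fam.size, fam.size):
        raise ValueError("matrix dimensions must match the number of subjects")
    different_family = fam[:, None] != fam[None, :]
    return bool(np.all(np.abs(k[different_family]) <= tol))

# Two siblings in family F1 and one unrelated subject in family F2
family_ids = ["F1", "F1", "F2"]
pedigree_kinship = np.array([
    [0.50, 0.25, 0.00],
    [0.25, 0.50, 0.00],
    [0.00, 0.00, 0.50],
])
# An empirical similarity matrix typically has small non-zero entries everywhere
empirical_similarity = pedigree_kinship + 0.01

pedigree_ok = is_block_diagonal(pedigree_kinship, family_ids)
empirical_ok = is_block_diagonal(empirical_similarity, family_ids)
```

A pedigree-based kinship matrix passes this check by construction, whereas an empirical matrix such as a GRM generally does not, which is the situation the limitation above describes.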

NRVAT integrates five kernel matrices within the association test to better capture the phenotype–genotype relationship. Although the optimal choice of the kernel matrix for real data analysis is a daunting question when the underlying genotype–phenotype relationship is unknown, some guidelines can be given. One may choose a linear kernel if there is a priori knowledge that the relationships are linear and there are no interactions. In the presence of interactions, the quadratic kernel can be a good alternative, as it implicitly assumes that the underlying relationship depends on the main and second-order effects.⁴ When there is a priori knowledge of more complex relationships, the Gaussian and IBS kernels may be the better choice.
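For orientation, the sketch below computes the five kernel types from a genotype matrix using their usual unweighted textbook forms. These definitions are assumptions: the exact forms, variant weights, and tuning parameters used inside NRVAT may differ, and the default `rho` and `degree` values are arbitrary illustrative choices.

```python
import numpy as np

def linear_kernel(genotypes):
    """K = G G^T: additive main effects only."""
    g = np.asarray(genotypes, dtype=float)
    return g @ g.T

def quadratic_kernel(genotypes):
    """K = (1 + G G^T)^2 elementwise: main and second-order effects."""
    return (1.0 + linear_kernel(genotypes)) ** 2

def polynomial_kernel(genotypes, degree=3, offset=1.0):
    """K = (offset + G G^T)^degree elementwise (degree is a free choice)."""
    return (offset + linear_kernel(genotypes)) ** degree

def ibs_kernel(genotypes):
    """Proportion of alleles shared identical-by-state over the SNP set."""
    g = np.asarray(genotypes, dtype=float)
    n_snps = g.shape[1]
    abs_diff = np.abs(g[:, None, :] - g[None, :, :]).sum(axis=2)
    return (2.0 * n_snps - abs_diff) / (2.0 * n_snps)

def gaussian_kernel(genotypes, rho=None):
    """K_ij = exp(-||g_i - g_j||^2 / rho); rho defaults to the number of SNPs."""
    g = np.asarray(genotypes, dtype=float)
    rho = float(g.shape[1]) if rho is None else float(rho)
    sq_dist = ((g[:, None, :] - g[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / rho)

# Toy genotypes: 3 subjects, 3 SNPs coded 0/1/2 (illustrative only)
toy = np.array([[0, 1, 2], [0, 1, 2], [2, 1, 0]])
kernels = {
    "L": linear_kernel(toy),
    "Q": quadratic_kernel(toy),
    "IBS": ibs_kernel(toy),
    "G": gaussian_kernel(toy),
    "P": polynomial_kernel(toy),
}
```

The IBS and Gaussian kernels depend on genotype differences rather than products, which is one reason they can pick up relationships that are not well described by main and pairwise interaction effects.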

Software availability

The NRVAT method is implemented in an R package, PedGFLMM, which is freely available at https://github.com/houssoudossa/NRVATGFLMM.

Code for reproducing the simulation scenarios of Setting 1 is also publicly available on GitHub at https://github.com/houssoudossa/NRVAT-simulation-code.

Supplemental Material

Supplemental material, sj-pdf-1-smm-10.1177_09622802231197977 for A novel rare variants association test for binary traits in family-based designs via copulas by Houssou R. G. Dossa, Alexandre Bureau, Michel Maziade, Lajmi Lakhal-Chaieb and Karim Oualkacha in Statistical Methods in Medical Research

Footnotes

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

This work received financial support from:

the Canadian Statistical Sciences Institute (CANSSI) through a collaborative research team (CRT) grant.

the Fonds de recherche du Québec through the grant FRQS-CB-J1.

The Eastern Quebec Schizophrenia and Bipolar Disorder study was funded by the Canadian Institutes of Health Research (CIHR, grants MOP-74430, MOP-119408, MOP-114988 and PCG-155471) and its data management system was supported by the Canada Foundation for Innovation Leadership Opportunity Fund (grant 27592).

Supplemental material

Supplemental material for this article is available online.

Appendix

Appendix B

Let

γ = \begin{bmatrix} γ_{0} \\ γ_{1} \\ γ_{2} \end{bmatrix}, \quad Y_{i} = \begin{bmatrix} y_{i1} \\ ⋮ \\ y_{i n_{i}} \end{bmatrix}, \quad μ_{i} = g_{i}(γ) = \begin{bmatrix} μ_{i1} \\ ⋮ \\ μ_{i n_{i}} \end{bmatrix},

\sqrt{I} (\hat{γ} - γ) = A^{-1} \frac{1}{\sqrt{I}} \sum_{k=1}^{I} S_{k}(γ) + o_{p}(1), \qquad (7)

S_{i}(γ) = X_{i}^{⊤} (Y_{i} - μ_{i}) = X_{i}^{⊤} [Y_{i} - g_{i}(γ)]. \qquad (8)

By a first-order Taylor expansion of $g_{i}(\hat{γ})$ around $γ$, we have

g_{i}(\hat{γ}) \approx g_{i}(γ) + \frac{\partial g_{i}(γ)}{\partial γ} (\hat{γ} - γ). \qquad (9)

Let $D_{i} = \partial g_{i}(γ) / \partial γ$. The variance of the residuals becomes

\begin{aligned}
Var(Y_{i} - \hat{μ}_{i}) &= Var[Y_{i} - g_{i}(\hat{γ})] \\
&= E[(Y_{i} - g_{i}(\hat{γ})) (Y_{i} - g_{i}(\hat{γ}))^{⊤}] \\
&\approx E\{[(Y_{i} - g_{i}(γ)) - D_{i}(\hat{γ} - γ)] [(Y_{i} - g_{i}(γ)) - D_{i}(\hat{γ} - γ)]^{⊤}\} \\
&= E[(Y_{i} - g_{i}(γ)) (Y_{i} - g_{i}(γ))^{⊤}] - D_{i} E[(\hat{γ} - γ) (Y_{i} - g_{i}(γ))^{⊤}] - E[(Y_{i} - g_{i}(γ)) (\hat{γ} - γ)^{⊤}] D_{i}^{⊤} \\
&\quad + D_{i} E[(\hat{γ} - γ) (\hat{γ} - γ)^{⊤}] D_{i}^{⊤}.
\end{aligned}

From (7) and (8),

\begin{aligned}
E[(\hat{γ} - γ) (Y_{i} - g_{i}(γ))^{⊤}] &\approx E\Big[\Big(A^{-1} \frac{1}{I} \sum_{k=1}^{I} S_{k}(γ)\Big) (Y_{i} - g_{i}(γ))^{⊤}\Big] \\
&= A^{-1} \frac{1}{I} \sum_{k=1}^{I} X_{k}^{⊤} E[(Y_{k} - g_{k}(γ)) (Y_{i} - g_{i}(γ))^{⊤}] \\
&= A^{-1} \frac{1}{I} X_{i}^{⊤} E[(Y_{i} - g_{i}(γ)) (Y_{i} - g_{i}(γ))^{⊤}],
\end{aligned}

with

E[(Y_{k} - g_{k}(γ)) (Y_{i} - g_{i}(γ))^{⊤}] = \begin{cases} 0, & \text{if } i \neq k, \\ Var[Y_{i} - g_{i}(γ)], & \text{otherwise}. \end{cases}

So

\begin{aligned}
Var(Y_{i} - \hat{μ}_{i}) &= Var[Y_{i} - g_{i}(\hat{γ})] \\
&\approx Var[Y_{i} - g_{i}(γ)] - \frac{1}{I} D_{i} A^{-1} X_{i}^{⊤} Var[Y_{i} - g_{i}(γ)] - \frac{1}{I} Var[Y_{i} - g_{i}(γ)] X_{i} [A^{-1}]^{⊤} D_{i}^{⊤} + D_{i} E[(\hat{γ} - γ) (\hat{γ} - γ)^{⊤}] D_{i}^{⊤} \\
&= Var[Y_{i} - g_{i}(γ)] - \frac{1}{I} D_{i} A^{-1} X_{i}^{⊤} Var[Y_{i} - g_{i}(γ)] - \frac{1}{I} Var[Y_{i} - g_{i}(γ)] X_{i} [A^{-1}]^{⊤} D_{i}^{⊤} + D_{i} Var(\hat{γ} - γ) D_{i}^{⊤} \\
&= Var(Y_{i}) - \frac{1}{I} D_{i} A^{-1} X_{i}^{⊤} Var(Y_{i}) - \frac{1}{I} Var(Y_{i}) X_{i} [A^{-1}]^{⊤} D_{i}^{⊤} + \frac{1}{I} D_{i} A^{-1} B [A^{-1}]^{⊤} D_{i}^{⊤} \quad (\text{under } H_{0}) \\
&= Σ_{i} - \frac{1}{I} D_{i} A^{-1} X_{i}^{⊤} Σ_{i} - \frac{1}{I} Σ_{i} X_{i} [A^{-1}]^{⊤} D_{i}^{⊤} + \frac{1}{I} D_{i} A^{-1} B [A^{-1}]^{⊤} D_{i}^{⊤}.
\end{aligned}

Similarly, for $i \neq k$,

\begin{aligned}
Cov[(Y_{i} - \hat{μ}_{i}), (Y_{k} - \hat{μ}_{k})] &= E[(Y_{i} - \hat{μ}_{i}) (Y_{k} - \hat{μ}_{k})^{⊤}] - E[(Y_{i} - \hat{μ}_{i})] E[(Y_{k} - \hat{μ}_{k})^{⊤}] \\
&= E[(Y_{i} - \hat{μ}_{i}) (Y_{k} - \hat{μ}_{k})^{⊤}] \\
&= E[(Y_{i} - g_{i}(\hat{γ})) (Y_{k} - g_{k}(\hat{γ}))^{⊤}] \\
&\approx E\{[(Y_{i} - g_{i}(γ)) - D_{i}(\hat{γ} - γ)] [(Y_{k} - g_{k}(γ)) - D_{k}(\hat{γ} - γ)]^{⊤}\} \\
&= E[(Y_{i} - g_{i}(γ)) (Y_{k} - g_{k}(γ))^{⊤}] - D_{i} E[(\hat{γ} - γ) (Y_{k} - g_{k}(γ))^{⊤}] - E[(Y_{i} - g_{i}(γ)) (\hat{γ} - γ)^{⊤}] D_{k}^{⊤} \\
&\quad + D_{i} E[(\hat{γ} - γ) (\hat{γ} - γ)^{⊤}] D_{k}^{⊤} \\
&= - D_{i} E[(\hat{γ} - γ) (Y_{k} - g_{k}(γ))^{⊤}] - E[(Y_{i} - g_{i}(γ)) (\hat{γ} - γ)^{⊤}] D_{k}^{⊤} + D_{i} E[(\hat{γ} - γ) (\hat{γ} - γ)^{⊤}] D_{k}^{⊤} \\
&= - \frac{1}{I} D_{i} A^{-1} X_{k}^{⊤} Var[Y_{k} - g_{k}(γ)] - \frac{1}{I} Var[Y_{i} - g_{i}(γ)] X_{i} [A^{-1}]^{⊤} D_{k}^{⊤} + D_{i} Var(\hat{γ} - γ) D_{k}^{⊤} \\
&= - \frac{1}{I} D_{i} A^{-1} X_{k}^{⊤} Var(Y_{k}) - \frac{1}{I} Var(Y_{i}) X_{i} [A^{-1}]^{⊤} D_{k}^{⊤} + \frac{1}{I} D_{i} A^{-1} B [A^{-1}]^{⊤} D_{k}^{⊤}.
\end{aligned}

Hence, one has

Cov[(Y_{i} - \hat{μ}_{i}), (Y_{k} - \hat{μ}_{k})] = - \frac{1}{I} D_{i} A^{-1} X_{k}^{⊤} Σ_{k} - \frac{1}{I} Σ_{i} X_{i} [A^{-1}]^{⊤} D_{k}^{⊤} + \frac{1}{I} D_{i} A^{-1} B [A^{-1}]^{⊤} D_{k}^{⊤}.

Since the families $(i, k)$ are independent of each other, we get

References

1. Peng, Shen, Zhao, et al. Genetic association analysis using sibship data: A multilevel model approach, 2012.

2. Leal. Methods for detecting associations with rare variants for common diseases: Application to analysis of sequence data. Am J Human Genet 2008; 83: 311–321.

3. Madsen, Browning. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet 2009; 5: e1000384.

4. Lee, Cai, et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Human Genet 2011; 89: 82–93.

5. Chen, Huffman, Brody, et al. Efficient variant set mixed model association tests for continuous and binary traits in large-scale whole-genome sequencing studies. Am J Human Genet 2019; 104: 260–274.

6. Jiang, McPeek. Robust rare variant association testing for quantitative traits in samples with related individuals. Genet Epidemiol 2014; 38: 10–20.

7. Lakhal-Chaieb, Oualkacha, Richards, et al. A rare variant association test in family-based designs and non-normal quantitative traits. Stat Med 2016; 35: 905–921.

8. Ott, Kamatani, Lathrop. Family-based designs for genome-wide association studies. Nat Rev Genet 2011; 12: 465–474.

9. Roach, Glusman, Smit, et al. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science 2010; 328: 636–639.

10. Zhou, Whittemore. Improving sequence-based genotype calls with linkage disequilibrium and pedigree information. Ann Appl Stat 2012; 457–475.

11. Jun, Manning, Almeida, et al. Evaluating the contribution of rare variants to type 2 diabetes and related traits using pedigrees. Proc Natl Acad Sci USA 2018; 115: 379–384.

12. Xue, Zhu, et al. Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. Nat Commun 2018; 9: 1–14.

13. Wang, Lee, Zhu, et al. GEE-based SNP set association test for continuous and discrete traits in family-based association studies. Genet Epidemiol 2013; 37: 778–786.

14. Wang, Zhang, Morris, et al. Rare variant association test in family-based sequencing studies. Brief Bioinformatics 2016; 18: 954–961.

15. Chen, Meigs, Dupuis. Sequence kernel association test for quantitative traits in family samples. Genet Epidemiol 2013; 37: 196–204.

16. Oualkacha, Dastani, et al. Adjusted sequence kernel association test for rare variants controlling for cryptic and family relatedness. Genet Epidemiol 2013; 37: 366–376.

17. Schifano, Epstein, Bielak, et al. SNP set association analysis for familial data. Genet Epidemiol 2012; 36: 797–810.

18. Chen, Wang, Conomos, et al. Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. Am J Human Genet 2016; 98: 653–666.

19. Saad, Wijsman. Association score testing for rare variants and binary traits in family data with shared controls. Brief Bioinformatics 2019; 20: 245–253.

20. Bourgain, Hoffjan, Nicolae, et al. Novel case-control test in a founder population identifies P-selectin as an atopy-susceptibility locus. Am J Human Genet 2003; 73: 612–626.

21. Choi, Wijsman, Weir. Case-control association testing in the presence of unknown relationships. Genet Epidemiol 2009; 33: 668–678.

22. de Leon. Copula-based regression models for a bivariate mixed discrete and continuous outcome. Stat Med 2011; 30: 175–185.

23. Joe. Dependence modeling with copulas. Chapman and Hall/CRC, 2014.

24. Tounkara, Lefebvre, Greenwood, et al. A flexible copula-based approach for the analysis of secondary phenotypes in ascertained samples. Stat Med 2020; 39 (5): 517–543.

25. Breslow, Clayton. Approximate inference in generalized linear mixed models. J Am Stat Assoc 1993; 88: 9–25.

26. Gilmour, Anderson, Rae. The analysis of binomial data by a generalized linear mixed model. Biometrika 1985; 72: 593–599.

27. Schramm, Jacquemont, Oualkacha, et al. KSPM: an R package for kernel semi-parametric models. R Journal 2020; 12 (2): 82–106.

28. Terwilliger, Speer, Ott. Chromosome-based method for rapid computer simulation in human genetic linkage analysis. Genet Epidemiol 1993; 10: 217–224.

29. Quessy J-F, Rivest L-P, Toupin M-H. On the family of multivariate chi-square copulas. J Multivar Anal 2016; 152: 40–60.

30. Chagnon, Maziade, Paccalet, et al. A multimodal attempt to follow-up linkage regions using RNA expression, SNPs and CpG methylation in schizophrenia and bipolar disorder kindreds. Eur J Hum Genet 2020; 28: 499–507.

31. Thomas. Epigenetic mechanisms in Huntington’s disease. Chromatin Signal Neurol Disorders 2019; 73–95.

32. Gianola, Fariello, Naya, et al. Genome-wide association studies with a genomic relationship matrix: a case study with wheat and Arabidopsis. G3: Genes, Genomes, Genet 2016; 6: 3241–3256.

33. Wang. Pedigrees or markers: Which are better in estimating relatedness and inbreeding coefficient? Theor Popul Biol 2016; 107: 4–13.

34. Yang, Benyamin, McEvoy, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet 2010; 42: 565–569.

35. Jiang, Chiu C-Y, Yan, et al. Gene-based association testing of dichotomous traits with generalized functional linear mixed models using extended pedigrees: Applications to age-related macular degeneration. J Am Stat Assoc 2020; 1–15.

36. Lindsay, Sun. Issues and strategies in the selection of composite likelihoods. Stat Sin 2011; 71–105.

37. Genest, Nikoloulopoulos, Rivest L-P, et al. Predicting dependent binary outcomes through logistic regressions and meta-elliptical copulas. Braz J Probab Stat 2013; 265–284.
