Abstract
Gene copy number changes are common characteristics of many genetic disorders. A new technology, array comparative genomic hybridization (a-CGH), is widely used today to screen for gains and losses in cancers and other genetic diseases with high resolution at the genome level or for specific chromosomal region. Statistical methods for analyzing such a-CGH data have been developed. However, most of the existing methods are for unrelated individual data and the results from them provide explanation for horizontal variations in copy number changes. It is potentially meaningful to develop a statistical method that will allow for the analysis of family data to investigate the vertical kinship effects as well. Here we consider a semiparametric model based on clustering method in which the marginal distributions are estimated nonparametrically, and the familial dependence structure is modeled by copula. The model is illustrated and evaluated using simulated data. Our results show that the proposed method is more robust than the commonly used multivariate normal model. Finally, we demonstrated the utility of our method using a real dataset.
Introduction
The gene copy number (also called “copy number variants”—CNV) is the number of copies of a particular gene in the genome of an organism. Recent evidences show that gene copy number (GCN) amplifications and deletions are common characteristics of many genetic diseases. For example, GCN can be elevated in cancer cells as demonstrated in the epidermal growth factor receptor (EGFR) gene in patients with non-small cell lung cancer (Cappuzzo et al. 2005) and also higher copy number of CCL3L1 has been associated with susceptibility to human HIV infection (Gonzalez et al. 2005). Thus identifying these genetic gains and losses provides useful information about specific disease susceptibility or resistance. GCN analysis among normal people within the human genome is also of interest. However, these genetic characteristics are usually not directly observable. Recent technological development in array comparative genomic hybridization (a-CGH) provides scientists with an efficient tool to conduct whole genome and high-density region specific investigation of GCN (Solinas et al. 1997; Pinkel et al. 1998; Snijders et al. 2001).
Briefly, a-CGH technique involves the labeling of genomic DNA from disease tissues (e.g. cancer) and normal control tissue (reference) with different colors (fluorochrome). These samples are then co-hybridize to a metaphase spread from a normal reference cell. After hybridization, emission from each of the two fluorescent dyes is measured, and the signal intensity ratios are indicative of the relative copy number of the two samples. The ratio of the two fluorochrome intensities is then calculated and regions where the disease DNA are amplified or deleted are readily detected on the metaphase spread. The resulting data are in the form of microarrays. This technique not only gives us information about copy number gains and losses in the disease genomic DNA but also allows the identification of the specific chromosomes and the regions of the chromosomes where these changes occurred.
However, the a-CGH data does not provide direct measurements of the GCN changes. Hence, several statistical approaches for analyzing and describing results from these experiments have been developed. Differences exist in these approaches and newer approaches addressing some of the limitations of existing method are needed. For example, some of these methods do not take into account the spatial dependence within the chromosome (Hodgson et al. 2001; Pollack et al. 2002; Cheng et al. 2003; Wang et al. 2004) while others have implemented such dependence structure into their models to enhance the inference (Jong et al. 2004; Picard et al. 2004; Fridlyand et al. 2004; Eilers et al. 2004). With the exception of a few that are Bayesian (Barash and Friedman, 2002; Daruwala et al. 2004; Broet and Richardson, 2005), most of the existing methods are frequentist's. All of the existing methods are designed for analyzing data from unrelated persons and are therefore effective in explaining horizontal changes in GCN. However and when available, family data present wonderful opportunity to investigate the vertical kinship effects of GCN as well as the horizontal changes. For this type of data, the main challenge is to model the high dimensional familial dependence structure, and no such approach was found following a careful review of the literature. In this paper, we present such a method in which we used the nonparametric approach to model the marginal distributions and then linked the joint distribution by a copula structure.
Typically, GCN changes observed from a-CGH experiments are classified into three groups corresponding to the three statuses of copy number changes—amplification, deletion and normal; Thus, allowing the microarray responses to have similar features. The practical challenge in the problem that we describe here is that of high dimensionality due to familial dependence among pedigree members. As we indicated above, several of the statistical tools for microarry data clustering deal with low dimensional data (usually one dimensional) and do not take into account the familial dependence among the pedigree members. Such methods can be divided into two main groups, the model based and non-model based (semiparametric). The former assumes specific probability models for the sub-distributions of response in each cluster and is efficient when the specifications are correct but may be seriously biased if implemented specifications deviate from the true unknown models. The semiparametric does not make any assumption about the models except that of a mixing structure, in which the unknown sub-distributions are estimated nonparametrically from the data themselves, and the inference is robust. This method is adequate when the data size is large so that the sub-distributions can be estimated accurately. Yuan and He (2008) proposed such a method for low dimensional microarray clustering for independent data generated from unrelated persons. For data with high dimensionality, the commonly used multivariate normal model rarely fits the actual data, and the nonparametric method is not directly applicable in cluster analysis, so neither of the above models based or non-model based methods are suitable for analyzing the dependence and high dimensionality of family data.
In statistics, the copula is a widely used tool for modeling the dependence structure of high dimensional data (Sklar, 1959; Joe, 1997), and is particularly suitable for pedigree data modeling. Here we propose and implement a semiparametric copula model to address this problem. Specifically, the marginal distributions are estimated nonparametrically, the within pedigree dependence structure is modeled by copula with parameters specified by the kinship coefficients. A penalty term is used against non-unimodality of the sub-distributions. The optimal partition is performed using a classification-estimation (of densities)-maximization-estimation (of parameters) algorithm. The algorithm shares the property of ascending the penalized semiparametric likelihood, just like the well known EM algorithm for ascending the parametric likelihood, and thus, under fair conditions, converges to the optimal partition of the microarray.
The Method
In a-CGH data, the fluorescence ratios between two samples, case and control, are measured across a genomic region. For loci with different copy number changes, the corresponding log-ratio measurement tend to be different. Thus in a-CGH data analysis, often a three-state mixture model is used: deletion state, normal state and amplification state, and we arbitrarily lable them as state 1, 2 and 3. Genes with copy number deletion tend to have smaller log-ratio measurements, those with normal status tend to have moderate measurements, and those with amplification tend to have larger measurements.
We focus on the case of a given chromosome. When there are more than one chromosome under consideration, the method is similar by modeling the chromosomes one by one. Suppose there are n loci of interest and r pedigrees of individuals. The measurement at each locus for each individual is observed. The j-th pedigree has sj individuals (j = 1, …, r), at locus k, the l-th individual of the j-th pedigree has microarray measurement yjkl. Denote
Generally this question is formulated as a cluster problem, in which each of the gene locus in classified into one of the clusters B1, B2 and B3 represent the three states deletion, normal and amplification. Let y be a general random vector of the observation y jk 's, a mixture model on y is specified as
Where f(·|Bi) is the sub-density of the responses in the i-th cluster, and the αi's are the mixing proportions satisfying 0 ≤ αi ≤ 1,
In practice, such high dimensional dependent data hardly conforms to a multivariate normal distribution. A commonly used statistical tool to deal with high dimensional dependence structure is the copula. In this method, it is only necessary to specify each of the marginal densities, and then use a link (copula) to compose all the marginal densities into a joint multivariate density with desired dependence structure. There are large number of copulas to be used, and some optimality criteria to select the best copula for a given problem and data. When the copula is selected, we can incorporate the kinship coefficients among the pedigree members into its dependence structure. Also, there is the question of how to specify the marginal densities. There are various parametric densities to choose from, but if the wrong one is used the results may be seriously biased. On the other hand, for data with large sample size, the nonparametric density can adapt to any distributional feature. Since we do not know the true sub-densities we model each of the marginal densities by nonparametric method for robustness. Finally, a modified BIC criterion is used to select the optimal number of clusters. We describe each of the above steps in different sub-sections below.
The marginal distributions
Since commonly available pedigree data usually consist of three generations and to account for the age and gender difference, the distributions of the measurements are divided into six groups in the following order: first generation male, first generation female, second generation male, second generation female, third generation male and third generation female. We use Gs to denote the s-th group. For example if an individual with observation yjkl is a second generation female in any given pedigree, she is in group 4, we simply denote ykl ∈ G4, and so on. Denote fs(·|Bi) be the sub-density of array cluster i of group s.
Since the fs(·|Bi)s are unknown, they can be estimated by the well known nonparametric estimator (Rosenblatt, 1969)
where nis is the sample size (number of individuals) of group s in cluster i, K(·) is arbitrary given kernel density, and
In the density estimation literature, the choice of kernel is not of particular importance (Diggle, 1983; Silverman, 1986). Studies suggest that most unimodal densities perform about the same as the other when used as a kernel, and the choice between kernels can be made on other grounds such as computational efficiency. However, there are some popular options in practice for different reasons. For some general introduction for the choice of kernels, we refer to Silverman (1986) and Scott and Wand (1991). The normal kernel (i.e. K(·) is the density function of the standard normal distribution) is widely used in practice for convenience and other nice features.
In contrast, the choice of bandwidth is crucial in density estimation (Silverman, 1986). Interesting proposals which address this problem can be found in Fan and Gijbels (1992). There is literature on automatic methods that attempt to minimize a lack-of-fit criterion, such as integrated squared error. From Silverman (1986), we choose to use the bandwidth
where
In the copula formulation we also need the corresponding marginal distribution functions. Let Fs(·|Bi) denote the marginal distribution functions for cluster i and group s,
where χ(·) is the indicator function.
The joint distribution
The copula is a commonly used statistical tool to model multivariate joint distribution, it appeared in the early work of Hoeffding, Fréchet and others and formally introduced by Sklar (1959). We first give a very brief account of it and we refer to Hunchinson and Lai (1990); Joe (1997) and Nelson (1999) for detailed review.
A function C defined on [0, 1] d is a d-variate copula if C(F1(x1), …, Fd(xd)) is a joint distribution function for any marginal distribution functions F1(x1), …, Fd(xd). The marginal distributions of C(F1(x1), …, Fd(xd)) itself are just F1(x1), …, Fd(xd). This property provides a convenient way to construct a joint distribution with given marginal ones. On the other hand, given a set of marginal distribution functions F1(x1), …, Fd(xd), there is a unique copula C such that C(F1(x1), …, Fd(xd)) is the true joint distribution of them (Sklar, 1959). Also, for any joint d-dimensional distribution function F(…), let Fi–1(·) be the quantile functions of the i-th margin, then the function C(x1, …, xd) = F(F1–1(x1), …, Fd–1(xd)) is a d-variate copula. Let c (…) be the density function (the total derivative) of C(…) when exists. Let fi(·) be the density function of Fi(·), the density function f(x1, …, xd) of the copula distribution function C(F1(x1), …, Fd(xd)) is given by
Given a copula, the dependence structure can be characterized in several ways, including Pearson's correlation, Kendall's tau, Spearman's rho, tail dependence, etc. Kendall's tau is generally easier to compute for copulas, so we use this dependence measure. For a two-dimensional copula, Kendall's tau is given by
where (X1, Y1) and (X, Y) are independent with the same distribution. –1 ≤ τ ≤ 1, τ = 0 for independence, –1 and 1 for perfect negative and positive dependence. Genest et al. (1995) suggested a pseudo-likelihood approach to estimate the dependence parameters, in which the observed data is transformed via the empirical marginal distributions to obtain pseudo-data that are used in the estimation. Using the special relationships among relative pairs, we can implement the dependence parameters in the copula via the relationships among kinship coefficients, Kendall's tau and the copula dependence parameters without estimation.
For pedigree data, the dependence relationships among familial members (i, j) are best described by the kinship coeffcients, γij = Δ7ij/2 + Δ8ij/4, where Δ7ij, Δ8ij, Δ9ij are the condensed kinship coefficient (Jacquard, 1974) between relative pair i and j. The Δ kij s (k = 1, …, 9) are the probabilities for the nine possible condensed identical by descent (IBD) status as in Jacquard (1974), in which Δ7ij, Δ8ij and Δ 9ij are commonly used in practice. They are the population probabilities of sharing 2, 1 and 0 genes IBD for relative pair (i, j), without regard to their particular genotypes, but only (i, j)'s kinship relationships, under the Mendelian inheritance. Also, 2γij is the expected proportion of gene IBD for relative pair (i,j) at this locus. For convenience we list the values of these coefficients for some common relative pairs (Lange, 1997), and we compute corresponding Kendall's tau in the last column after the computations below.
For trait underlined by single locus or multiple loci, Pearson's correlation for relative pair (i,j) is 2γij (Lange, 1997). Assume that gene copy number change statuses are determined only by the underlying genetic sources, and that the amounts of dependence among them are additive with respect to their shared genetic sources. Then at any fixed locus, Kendall's tau between a fixed type of relative pair (i, j) is (Appendix)
As is true for Pearson's correlation, we postulate that Kendall's tau remain the same, or approximately so, when the trait is influenced by multiple loci. As the kinship coefficients are already known as in Table 1, we get Kendall's tau by the above relationships and in turn, the dependence parameters in the copula model is obtained from the relationship among the dependence parameters and kendal's tau for specified copula. Thus we can easily implement the dependence kinship coefficients in the copula in terms of Kendall's tau without estimating them. For this, we first need to review several commonly used copulas. Note, for family data the dependence are not constant among different relative types, hence copulas with constant dependence parameters, such as Clayton's copula or Frank's copula can not be used here.
Kinship coeffcient for selected relative pairs.
Multivariate normal copula
Let Φ
d
(·, Θ) be the d-variate normal distribution function with mean vector
with density
Thus for given marginal distribution functions F1(·), …, Fd(·) and their densities f1(·), …, fd(·), the joint distribution function for the multivariate normal copula with these given margins is
with density
For the distribution in (6), any lower dimensional joint distributions have the same form. For example the (i, j)-th joint distribution function is F(xi, xj) = Φ2(Φ–1(Fi(xi), Φ–1(Fi(xj)); Θ ij ), where Θij is the (i, j) sub-block of the matrix Θ. From Table 1 and in this case, Spearman's and Kendall's tau are the same. For this copula, Spearman's rho (Kendall's tau) and the dependence parameters θij's in normal copula are related by (Lindskog, 2000)
By relationships (6) and (7), the dependence parameters θ ij 's in the multivariate normal copula are easily obtained given the τ ij 's, which are computed via the known kinship coefficients Δ kij 's, as long as we know the kin type of relative pair (i, j).
Multivariate T-copula
Let Θ be the correlation matrix given in the multivariate normal distribution, x = (x1, …, xd)’ The density function d-dimensional T-distribution with r degrees of freedom is
The corresponding copula density is
where Qr(·) is the distribution function of the T-distribution with r degrees of freedom, and qr(·) is its density function. Given marginal distribution functions F1(·), …,Fd(·) and their densities f1(·), …, fd(·), the joint density with the copula defined by this multivariate T-distribution is
For this copula, the relationships between the θ ij 's and the τij's are the same as for the multivariate normal copula.
Selection of copula
Given several candidate copulas C1, …, Ch with densities c1, …, ch, a natural question is how to select the optimal copula for the data. Let
where ns is the number of observations in group s.
When there are parameters to be estimated in the copula, the optimal copula can be selected by AIC criteria (Oakes, 1989; Dias and Embrechts, 2003). Here, our copula has no parameters to be estimated, by the likelihood principle and (5), an intuitive way is to select the C with
or equivalently, to avoid computation overflow or underflow,
This is equivalent to choosing the copula with the largest likelihood.
Now for given copula, the joint density for the data y = {yjkl} is modeled by
where
We point out that although we used the same notation c, for different families, the number of individuals may differ and so are the dimensionalities of the c's.
However, under the semi-parametric mixture model assumption, the sub-distributions can take any shape, even the shape of the entire distribution, and as a result any cluster partition will give about the same likelihood value via (9). So optimizing (9) over all possible cluster partitions will not be able to identify the desired clusters. Thus we put some constraints on the selection of clusters such that the sub-distribution is approximately unimodal and optimizing model (9) will give the desired clusters, as in Yuan and He (2008). The reference Yuan and He will be refered to as YH in subsequent citations. However there are two major differences between the method we are proposing and that of YH. Our method can handle high-dimensional data and the link among the marginal densities in copula.
Specifically, let g(·|Bi) be the multivariate normal density with mean given by the sample mean for data in Bi, and covariance matrix Θ, for observations in Bi(i = 1, …, 3), and denote g = (g1, g2, g3), where gi = g(·|Bi). g is used as shape constraints for the
and we set
Let
for some 0 ≤ λ ≤ 1 to be specified. This model can be viewed as an extension of the traditional mixture model. When λ = 0, it corresponds to a nonparametric specification of sub-distributions, when λ = 1 it is a full parametric model given by the g(·|Bi)s, and when 0 < λ < 1 it corresponds to an intermediate model. By doing this, we are forcing the distributions to be close to normal, more than what is needed for unimodal. The tunning parameter λ is chosen according to simulation for the given type of data. The choice of a multi-variate normal here is for convenience as other choices could be made but may result in additional complication.
The CEME algorithm
However, directly optimizing the mixture model (10) is usually not easy. A common practice of estimating the cluster membership of each observation in the data while evaluating the maximum likelihood estimate
For fixed k, let uij = 1 if the i-th locus belongs to the j-th cluster, ui = (ui1, ui2, ui3) be its membership vector, and u = {uij}. Treating u as missing data, (y, u) is referred to as the “complete” data. Then the likelihood for the “complete” data is
Although we used the same notation
By the same reason as (10), we optimize the penalized “complete data” log-likelihood
where g(yjk|Bs) is the analogue of
Starting Values
Set
Given the current t-th iteration estimates
Classification-step: Each response locus k, is classified into a candidate cluster
Expectation-step: Let Uks's be the associated random variables of the uks's, and g(t)(·|B(t)) be the multi-dimensional normal density with mean and covariance matrix empirically estimated from the data in Bs(t)(s = 1,2,3).
As in YH, for k = 1, …, n; s = 1, 2, 3, we have
where the expectation is taken with respect to the constrained log-likelihood L. Denote
Maximization-step: Compute the MLE α(t + 1) of a given u(t + 1) as in YH
Estimation-step: To update the estimation of the density f(t)(·) of current iteration t to f(t+1)(·) for the next iteration t + 1, we first compute candidate sub-marginal density
with
where
Then use (5) to get the candidate sub-joint density for the j-th pedigree at locus k, as follows
and that for all the pedigree at locus k is
Let
We update the quadruple (B(t + 1), f(t + 1), F(t + 1), g(t + 1)) as
The estimate of f(·) at the (t + 1)-th iteration is then
Note that at each iteration t, α(t), B(t), and the uij(t)s are updated, but not necessarily so for f(t) and g(t).
The above four steps are iterated until convergence of (α(t), f(t), B(t)). (Note by the following Proposition, we may check the stability of the α(t) as a simple criterion for the convergence of the triple). We may use the relative error criterion for the convergence of the (α(t)'s, that is, for some pre-specified δ > 0, we stop the iteration when
As in YH, we have
When (a, f *, B*) is the unique stationary point, we have, as t → ∞,
Application
Simulation Study
We simulate r = 10 pedigrees, each has four individuals, father, mother and two sibs, and we assume there are n = 200 loci of interest, which are divided into 3 clusters as B1 = (1, 80), B2 = (81, 150) and B3 = (151, 200), with cluster means μ1 = (4.9, 4.2), μ2 = (9.9, 9.2) and μ3 = (14.9, 14.2) for male and female individuals. We generated two datasets by simulation using the normal copula and multi-normal models. Each of the datasets were analyzed using both normal copula and multi-normal models.
To simulate data from the Multivariate normal copula model, let A be the Cholesky decomposition of Θ. To sample from this copula distribution: for k = 1, …, n and i = 1, …, 5
generate r independent samples Z1k, …, Zrk from N (0, I4).
Let ulk = AZlk (l = 1, …, r).
For k Bi, if j is for male, set xljk = Φ(ul1k) + μk1; if j is for female, set xljk = Φ(ul1k) + μk2, and xlk = (xl1k, …, xl4k), where Φ(·) is the distribution function of the standard normal. Then x1k, …, xrk is a sample from the 4-variate normal copula model with correlation matrix Θ. The results are displayed in Table 2 below, with tuning parameter λ of values 0.25, 0.5, 0.75 and 1.
In the above, λ = 1 corresponds to a normal model, and 0 < λ < 1 correspond to a mixed model. For this type of data, the model has difficulty in parameter convergence for small values of λ, reflecting the fact that the multivariate data distribution is too noisy for nonparametric part of the model to work alone; thus a parametric unimodal component is needed to help cluster the data. The normal copula model has larger likelihood value in all these cases. This means the normal copula model is more robust than the multivariate normal model. For the data from multi-normal model, when λ = 0.5, 38 loci from cluster three are classified to cluster two. Over all, λ = 0.75 performs well for all the data set, and so we recommend this value of λ in this analysis.
To assess the robustness of the method, we simulated larger data sets with family sizes of 4 and 5, each with 100 families and 200 candidate loci. The simulated clusters and means for male and female are: cluster one, 1-70, (4.9, 4.2); cluster two, 81-150, (8.9,8.2) and cluster three, 171-200, (12.9,12.2) respectively. To reflect some complexity we added minor clusters to some of the clusters. The means for male and female for the minor clusters are 71-80, (5.4, 4.7) and 151-170, (12.4,11.7). The results are summarized in Table 3.
Overall, the results are consistent: the smaller the value of λ, the better the model fitness, as indicated by larger likelihood value. This means that the nonparametric model component capture the data distribution in fine details. But in many cases, the computation breaks down for λ = 0 as pointed out earlier. It is seen that for either the data generated from multi-normal or normal copula distributions, the overall performances of the semiparametric model is robust for a range of the tuning parameter λ.
Real data analysis
We use the proposed method to analyze the Genetics Analysis Workshop 15 (GAW15) data set with 14 pedigrees of CEPH Utah families, each with three generations and about a dozen normal individuals. Expression level of genes in lymphoblastoid cells of the above subjects were obtained using the Affymetrix Human Focus Arrays that contain probes for 8,500 transcripts. Gene copy number variations in normal people within human genome has been the subject of study (Freeman et al. 2006; Pugh et al. 2008). For 3,554 of the 8,500 SNPs tested, Morley et al. (2004) found greater variation among individuals than between replicate determinations on the same individual. These 3,554 expression phenotypes (expressed genes) were chosen for copy number change analysis. The first step is to find out the best copula model for the data. We considered three different models, the multi-normal model, the semi-parametric multivariate normal-copula model, and the semi-parametric multivariate T-copula model. Then the criterion in (8) is used to select the optimal model. The average copula likelihood values for the three models are -3217389.15, -2094272.97, -2296408.96 respectively. Thus the semi-parametric multivariate normal-copula model is the best of the three and was used for clustering. The outcome of the analysis of the GAW15 data is displayed in the figure below. The horizontal axis represents the sequential numbering of genes from 1 to 3550, and the vertical axis indicates the classified states of the genes with 1, 2 and 3 representing deletion, normal and amplification.
Cluster results for normal copula and multi-normal models for 10 pedigrees and 200 loci.
Summary cluster results from normal copula and multi-normal data sets for 100 pedigrees with 4 or 5 Family Members.
As shown in the figure, most of the SNPs are in clusters 1 and 3, this observation is consistent with the large variation of the expression levels. The SNPs with deletion status are more likely to be contained in cluster 1, and those with amplification status are more likely to be in cluster 3.
Concluding Remarks
We proposed, studied and demonstrated a semiparametric copula method for microarray-SNP genomewide association analysis using pedigree data. We successfully implemented the kinship relationship into the model for more robust analysis of family data than the commonly used multivariate normal model.
Disclosure
The authors report no conflicts of interest.
Footnotes
Appendix
Proof of (6): Let (X1, Y1) be the traits of relative pair (i, j); (X, Y) be an independent copy of (X1, Y1); and Al be the event that a relative pair share l alleles IBD (l = 0, 1, 2). Given a relative pair (i, j), by definition P(A2) = Δ7ij, P(A1) = Δ8ij, and P(A0) = Δ9ij. By the assumption that GCN change is determined by the underlying genetic source and given A2, the pair (i, j) share the same genetic source at the locus and the same copy number change status; thus X1 = Y1, X = Y. Note that the random variables X1 and X are of continuous type and P((X1 – X) (Y1 – Y). 0|A2) = P((X1 – X)
2
. 0|A2) = 1. Also, X1 – X and Y1 – Y are random variables symmetric around 0, thus given A0, X1 – X and Y1 – Y are independent, so the events (X1 – X)(Y1 – Y). 0 and (X1 – X)(Y1 – Y), 0 are completely random, each with probability 1/2, i.e. P((X1 – X)(Y1 – Y)|A0) = 1/2. By the additivity assumption, given A1, the probability of the event (X1 – X)(Y1 – Y). 0 is the average of those for cases of given A2 and A0, i.e. P((X1 – X)(Y1 – Y). 0|A1) = 3/4. Then at any fixed locus, Kendall's tau between a fixed type of relative pair (i, j) is
Acknowledgement
This work is supported in part by the National Center for Research Resources at NIH grant 2G12RR003048, and by the Center for Research on Genomics and Global Health (CRGGH) at NHGRI/NIH.
