
Abstract

DNA copy number variations (CNVs) have been shown to be associated with cancer development and progression. The detection of these CNVs has the potential to impact the basic knowledge and treatment of many types of cancers, and can play a role in the discovery and development of molecular-based personalized cancer therapies. One of the most common types of high-resolution chromosomal microarrays is array-based comparative genomic hybridization (aCGH), which assays DNA CNVs across the whole genomic landscape in a single experiment. In this article we propose methods to use aCGH profiles to predict disease states. We employ a Bayesian classification model and treat disease states as the outcome and aCGH profiles as covariates in order to identify significant regions of the genome associated with disease subclasses. We propose a principled two-stage method where we first make inferences on the underlying copy number states associated with the aCGH emissions based on hidden Markov model (HMM) formulations to account for serial dependencies in neighboring probes. Subsequently, we infer associations with disease outcomes, conditional on the copy number states, using Bayesian linear variable selection procedures. The selected probes and their effects are parameters that are useful for predicting the disease categories of any additional individuals on the basis of their aCGH profiles. Using simulated datasets, we investigate the method's accuracy in detecting disease category. Our methodology is motivated by and applied to a breast cancer dataset consisting of aCGH profiles assayed on patients from multiple disease subtypes.

Keywords

breast cancer, classification, Bayesian network, hidden Markov model

Introduction

DNA copy number variations (CNVs) have been shown to be associated with cancer development and progression.¹ Somatic CNVs can lead to tumorigenesis. For example, loss of copy numbers for tumor suppressor genes or amplification for oncogenes both lead to cancer. The detection of these CNVs has the potential to impact the basic knowledge and treatment of many types of cancers, and can play a role in the discovery and development of molecular-based personalized cancer therapies.²

In the early years, cytogeneticists were limited to visually examining whole genomes with a microscope, a technique known as karyotyping or chromosome analysis. From the mid-1970s through the 1980s, the development and application of molecular diagnostic methods such as Southern blots, polymerase chain reaction (PCR), and fluorescence in situ hybridization (FISH) allowed clinical researchers to make many important advances in genetics, including clinical cytogenetics. However, these techniques have several limitations. First, they are very time consuming and labor intensive, and only a limited number of chromosomal regions can be tested simultaneously. Second, because the probes are targeted to specific chromosome regions, the analysis requires prior knowledge of an abnormality and is of limited use for screening complex karyotypes. More recently, scientists have developed techniques, called chromosomal microarrays, that integrate aspects of both traditional and molecular cytogenetic techniques.³ These high-throughput, high-resolution microarrays have allowed researchers to diagnose numerous subtle genome-wide chromosomal abnormalities that were previously undetectable and to find many cytogenetic abnormalities in part or all of a single gene. Such information helps biologists detect new genetic disorders and also provides a better understanding of the pathogenetic mechanisms of many chromosomal aberrations.

One of the most common types of high-resolution chromosomal microarrays is array-based comparative genomic hybridization (aCGH), which assays DNA CNVs across the whole genomic landscape in a single experiment.⁴ With aCGH, differentially labeled genomic DNAs from test and reference samples are cohybridized to normal chromosomes, and fluorescence intensities/ratios along the length of chromosomes provide a cytogenetic representation of the relative DNA CNV across the whole genome. Whereas early aCGH arrays were mainly used in research settings, recent improvements in algorithms for aCGH data analysis as well as rapidly falling costs now enable clinical applications of aCGH arrays, particularly as a diagnostic tool in the study of cancer genomics.²

In this article, we propose methods to use aCGH profiles to predict disease states. We employ a Bayesian classification model, and treat disease states as the outcome and aCGH profiles as covariates, to identify significant regions of the genome associated with disease subclasses. Statistical challenges for aCGH classification include not only high dimensionality, ie, a large number (tens of thousands) of probes combined with a relatively small number of samples, but also, more importantly, the presence of serial correlation among the features: nearby probes (by genomic location) tend to be highly correlated. Classical methods usually used for multivariate classification of high-dimensional genomic data, eg, penalized approaches (Zhu and Hastie⁵ and the references therein), do not account for the specific structure of aCGH data, as they ignore the serial dependence in the probes. To exploit the serial genomic information, typical approaches first segment the data⁶ and then conduct downstream classification. Alternative methods are based on kernel-based techniques such as the support vector machine (SVM)⁷ and its variants that exploit genomic continuity.⁸ While offering excellent prediction capabilities, these methods do not explicitly utilize the inherent discrete nature of the latent copy number states (gain/loss/normal) in their variable selection procedures; doing so is one of the primary aims of this article.

In the Bayesian framework, several innovative variable selection strategies have been developed in various contexts, with reasonable degrees of success. Some of these approaches can be regarded as linear variable selection methods. These include stepwise selection,⁹ penalized regression approaches such as lasso (and its variants),¹⁰ and non-concave penalized likelihood approaches.¹¹ The technique applied in this paper is based on Bayesian linear variable selection approaches, including spike and slab mixture priors,¹² stochastic search variable selection,¹³ Gibbs-based variable selection,¹⁴ Bayesian model averaging,¹⁵,¹⁶ and indicator priors.¹⁷ The stochastic search variable selection approach of George and McCulloch¹³ has been extended to multivariate settings by Brown et al.¹⁸ and to generalized linear mixed models by Cai and Dunson.¹⁹ Effective variable selection methods have also been developed for multinomial probit models by Sha et al.²⁰ and for microarray data with censored outcomes by Lee and Mallick²¹ and Sha et al.²² However, none of these approaches accounts for natural spatial/serial dependency in the covariates (as in our case), and ignoring that dependency might lead to biased estimates.

In this article we propose a principled two-stage method for disease classification using covariates exhibiting serial dependence. In general, the technique is applicable to datasets having the following structure. For individuals i = 1, …, n, we have (i) two disease categories coded as the binary response y_i and (ii) aCGH emissions e_i1, …, e_ip corresponding to p probes, with p typically being much larger than n. The analysis broadly consists of two stages. In Stage 1, we make inferences on the underlying copy number states associated with the aCGH emissions based on hidden Markov model (HMM) formulations²³ to account for serial dependencies. Subsequently in Stage 2, we analyze the model parameters associated with the binary responses, conditional on the parameters discovered in Stage 1, using Bayesian linear variable selection procedures. In particular, we select the aCGH probes having a linear regression relationship with the disease categories. The selected probes and their effects are parameters that are useful for predicting the disease categories of any additional individuals on the basis of their aCGH emissions. Our methodology is motivated by and applied to a dataset of 111 breast cancer patients²⁴ falling into two disease subgroups, ER+ and triple negative (TN). There are 56 TN patients and 55 ER+ patients. For each patient, DNA copy number data were generated using Agilent 4x44K CGH arrays (available at ArrayExpress under accession number E-TABM-484).

The remainder of the paper is organized as follows. Section 2 provides details of the model for the two-stage analysis. Section 3 develops the posterior inference and prediction technique based on Markov chain Monte Carlo (MCMC) methods. In Section 4, using simulated datasets, we investigate the method's accuracy in detecting disease category. Finally, Section 5 analyzes the motivating breast cancer dataset and makes test case predictions.

Model

Our modeling framework consists of two stages. In Stage 1, we model the aCGH emissions, relying on HMMs to account for the serial correlations among the emissions. Then, in Stage 2, the relationship between the HMM parameters and the subject-specific binary responses is specified using a probit regression model with latent indicator variables, following the approaches proposed by George and McCulloch,¹³ Kuo and Mallick,¹⁷ and Brown et al.¹⁸ We expound on each of these below.

Stage 1: Relationship between aCGH Emissions and Latent Copy Number States

For subjects i = 1, …, n and probes j = 1, …, p, we have the binary responses y₁, …, y_n representing the two disease subcategories and the set of real-valued aCGH emissions {e_ij}. Let s_ij ∊ {-1, 0, +1} be a latent variable called the copy number state, representing a loss, no change, or gain in copy number, respectively, for individual i at probe j. The copy number state is inferred using a Bayesian HMM that accounts for the serial correlations of the aCGH emissions.

Similarly to Guha et al.,²³ conditional on s_ij, the aCGH emissions are assumed to be normally distributed:

$$ e_{ij} \mid s_{ij} \overset{\text{indep}}{\sim} N(μ_{s_{ij}}, σ_{s_{ij}}^{2}), $$

where, because of the specific biological interpretations associated with the HMM states, we assume that μ₋₁ < μ₀ < μ₊₁. This assumption also prevents label switching, a well-known problem with mixture models, thereby making inferences more efficient. The latent states s_i1, …, s_ip are assumed to follow a three-state HMM with stationary transition probability matrix A = ((a_ut))_{3×3} having row sums Σ_{t = 1,2,3} a_ut = 1 for u = 1, 2, 3. That is, P[s_i,j+1 = t | s_ij = u] = a_ut for j = 1, …, (p - 1). To further facilitate inferences of the state-specific parameters, informative conjugate priors are assigned to the parameters of the normal distribution, ie, μ_s and σ_s for s ∊ {-1, 0, +1}. Refer to Guha et al.²³ for further details about MCMC inference of the underlying copy number states of the probes for the individuals. The technique developed in that paper is applied to infer the latent copy number states (gain/loss/normal) s_i1, …, s_ip for subjects i = 1, …, n, which are subsequently used in Stage 2 below.
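To make the Stage 1 generative model concrete, the following minimal sketch simulates copy number states and emissions for one individual from a three-state HMM with state-specific normal emissions. It is an illustration only, not the inference procedure of Guha et al.²³: the function name `simulate_profile`, the copy-neutral starting state, and all numerical values of the transition matrix, means, and standard deviations are hypothetical choices.

```python
import random

STATES = (-1, 0, 1)  # loss, copy-neutral, gain


def simulate_profile(p, trans, means, sds, rng):
    """Simulate p copy number states and aCGH emissions for one individual."""
    # Assumption: start in the copy-neutral state (the initial law is not given in the text).
    state = 0
    states, emissions = [], []
    for _ in range(p):
        states.append(state)
        # Emission: e_ij | s_ij ~ N(mu_s, sigma_s^2).
        emissions.append(rng.gauss(means[state], sds[state]))
        # Next state: drawn from the transition-matrix row of the current state.
        weights = [trans[state][t] for t in STATES]
        state = rng.choices(STATES, weights=weights)[0]
    return states, emissions


# Hypothetical parameter values, chosen only for illustration.
TRANS = {
    -1: {-1: 0.90, 0: 0.09, 1: 0.01},
    0: {-1: 0.02, 0: 0.96, 1: 0.02},
    1: {-1: 0.01, 0: 0.09, 1: 0.90},
}
MEANS = {-1: -0.6, 0: 0.0, 1: 0.6}  # respects mu_{-1} < mu_0 < mu_{+1}
SDS = {-1: 0.20, 0: 0.15, 1: 0.20}

states, emissions = simulate_profile(200, TRANS, MEANS, SDS, random.Random(1))
```

Large diagonal entries in the transition matrix produce long runs of identical states, which is the serial dependence among neighboring probes that the HMM is meant to capture.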

Stage 2: Relationship between Disease Classification and Latent Copy Number States

In the second stage of the analysis, we model the relationship between the disease category and latent copy number states of the genomic probes for each individual. These values are copy number states inferred from analysis in Section 2.1.

Let $u_{ij}^{(-)} = I(s_{ij} = -1)$ and $u_{ij}^{(+)} = I(s_{ij} = +1)$ be indicator functions of loss and gain. To simplify the notation, for subjects i = 1, …, n, we collectively represent the vector of 2p covariates as $w_{i} = (u_{i1}^{(-)}, u_{i1}^{(+)}, …, u_{ip}^{(-)}, u_{ip}^{(+)})'$. For covariate j = 1, …, 2p, averaging over the individuals, let $w_{\cdot j} = \sum_{i=1}^{n} w_{ij} / n$. Centering and scaling over the n individuals, we transform the covariates as follows:

$$ v_{ij} = \begin{cases} (w_{ij} - w_{\cdot j}) / \sqrt{\sum_{i=1}^{n} (w_{ij} - w_{\cdot j})^{2}} & \text{if } \sum_{i=1}^{n} (w_{ij} - w_{\cdot j})^{2} > 0, \\ 0 & \text{otherwise.} \end{cases} $$

Let Q be the set of covariates j for which $\{w_{ij}\}_{i=1}^{n}$ assumes at least two distinct values. That is, $Q = \{j \mid \sum_{i=1}^{n} (w_{ij} - w_{\cdot j})^{2} > 0\}$. Because the variables v_ij are centered, $j \notin Q$ if and only if v_1j = … = v_nj = 0.

A key assumption of our model is that covariates that do not belong to Q, ie, those for which $\{w_{ij}\}_{i=1}^{n}$ does not assume at least two distinct values, are not predictive of disease subcategory, although they could possibly be predictive of the disease itself. For this reason, we identify Q as the set of potential predictors of disease subcategory and write q = |Q| ≤ 2p. We discard all covariates $j \notin Q$, relabeling the variables {v_ij: j ∊ Q} as {x_ij: j = 1, …, q}.
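The construction of the Stage 2 covariates can be sketched in a few lines. The code below is an illustrative sketch, not the authors' implementation: it builds the loss and gain indicators from a matrix of copy number states, centers and scales each column, and keeps only the columns in Q. The function name `build_covariates` and the 0-based column indexing are assumptions made for the example.

```python
import math


def build_covariates(states):
    """Build the scaled covariates x_ij from copy number states.

    states: list of n rows, each a list of p values in {-1, 0, +1}.
    Returns (x, kept): x[i] holds the retained covariates of subject i and
    kept lists the retained (0-based) column indices of w, ie, the set Q.
    """
    n = len(states)
    # w_i = (u_i1^(-), u_i1^(+), ..., u_ip^(-), u_ip^(+)): loss and gain indicators.
    w = [[float(s == sign) for s in row for sign in (-1, 1)] for row in states]
    ncol = len(w[0])
    means = [sum(w[i][j] for i in range(n)) / n for j in range(ncol)]
    sumsq = [sum((w[i][j] - means[j]) ** 2 for i in range(n)) for j in range(ncol)]
    # Q: columns taking at least two distinct values across individuals.
    kept = [j for j in range(ncol) if sumsq[j] > 0]
    x = [[(w[i][j] - means[j]) / math.sqrt(sumsq[j]) for j in kept] for i in range(n)]
    return x, kept


# Three individuals, three probes: only "loss at probe 1" and "gain at probe 3" vary.
x, kept = build_covariates([[-1, 0, 1], [0, 0, 1], [-1, 0, 0]])
```

Each retained column has mean zero and unit sum of squares, so that x_j^T x_j = 1; this normalization is what Theorem 7.1 of the Appendix relies on.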

For individuals i = 1, …, n, we assume the probit regression model proposed by Albert and Chib²⁵:

$$ \begin{aligned} y_{i} &= \begin{cases} 1 & \text{if } z_{i} \geq 0, \\ 0 & \text{if } z_{i} < 0, \end{cases} \\ z_{i} &= β_{0} + \sum_{j=1}^{q} β_{j} x_{ij} + \epsilon_{i}, \\ \epsilon_{i} &\overset{\text{indep}}{\sim} N(0, 1). \end{aligned} \tag{1} $$

For the intercept β₀, we assume the prior $N(0, τ_{0}^{2})$. Let γ = (γ₁, …, γ_q)' be i.i.d. Bernoulli variables with P[γ_j = 1] = ω, where ω is expected to be relatively small and is assigned the uniform prior on (0, 0.1). The remaining coefficients in (1) are independently distributed as

$$ β_{j} \mid γ_{j} \overset{\text{indep}}{\sim} \begin{cases} δ_{0} & \text{if } γ_{j} = 0, \\ N(0, τ^{2}) & \text{if } γ_{j} = 1, \end{cases} $$

where δ₀ denotes the point mass at 0. In other words, each probe is predictive of disease classification with probability ω. We assume independent exponential priors with mean 1 for $τ_{0}^{- 2}$ and τ^-2.
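As a rough illustration of what the prior implies, the sketch below draws the hyperparameters, the inclusion indicators, and the coefficients from the prior described above, and then generates a response from the probit model (1). The function names `draw_from_prior` and `simulate_response` are hypothetical, and the code samples from the prior only; it is not a posterior sampler.

```python
import math
import random


def draw_from_prior(q, rng):
    """Draw (omega, beta0, gamma, beta) from the spike-and-slab prior."""
    omega = rng.uniform(0.0, 0.1)            # omega ~ Uniform(0, 0.1)
    tau0_sq = 1.0 / rng.expovariate(1.0)     # tau_0^{-2} ~ Exponential with mean 1
    tau_sq = 1.0 / rng.expovariate(1.0)      # tau^{-2} ~ Exponential with mean 1
    beta0 = rng.gauss(0.0, math.sqrt(tau0_sq))
    gamma = [1 if rng.random() < omega else 0 for _ in range(q)]
    # Point mass at 0 when gamma_j = 0, N(0, tau^2) slab when gamma_j = 1.
    beta = [rng.gauss(0.0, math.sqrt(tau_sq)) if g else 0.0 for g in gamma]
    return omega, beta0, gamma, beta


def simulate_response(x_row, beta0, beta, rng):
    """Generate y_i from the probit model: y_i = 1 if z_i >= 0, else 0."""
    z = beta0 + sum(b * x for b, x in zip(beta, x_row)) + rng.gauss(0.0, 1.0)
    return 1 if z >= 0 else 0


rng = random.Random(7)
omega, beta0, gamma, beta = draw_from_prior(500, rng)
y = simulate_response([0.1] * 500, beta0, beta, rng)
```

Because ω is at most 0.1, a prior draw typically selects only a small fraction of the q covariates, which reflects the sparsity assumption behind the variable selection.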

Gibbs Sampling Procedure

Let $ρ = 1 + \sum_{j = 1}^{q} γ_{j}$ be the random number of variables (including the intercept β₀) that participate in the disease classification. Let $r_{i j} = z_{i} - \sum_{k \neq j} x_{i k} β_{k}$ for i = 1, …, n. For a set of numbers {θ_ij: i = 1, …, n, j = 1, …, q}, let θ_j represent the vector (θ_1j, …, θ_nj)’ for probe j = 1, …, q.

Although the Gibbs sampler is conceptually straightforward, updating γ can be computationally intensive for large q. The step is described as follows. For probe j = 1, …, q, let β_-j represent the set of regression coefficients excluding β_j. With I_n denoting the identity matrix of order n and $B_{j} = I_{n} + τ^{2} x_{j} x_{j}^{T}$, the posterior probability P[γ_j | β_-j, ω, r_j] is proportional to (1 - ω) · N_n(r_j | 0, I_n) when γ_j = 0 and to ω · N_n(r_j | 0, B_j) when γ_j = 1. The density N_n(r_j | 0, I_n) can be quickly computed even in large problems. However, the density N_n(r_j | 0, B_j) involves the inversion and determinant calculation for the non-diagonal matrix B_j. Because this calculation must be repeated for every probe j at every iteration, it can be computationally expensive, or at least memory intensive, when q is large. Theorem 7.1 of the Appendix exploits the structure of B_j to drastically simplify the computation. For probe j = 1, …, q, let

$$ \begin{aligned} ϕ_{j} &= \left( \frac{1}{\sqrt{1 + τ^{2}}} - 1 \right) x_{j}^{T} r_{j}, \\ h_{j} &= ϕ_{j} x_{j} + r_{j}, \text{ and} \\ L_{j1} &= \sum_{i=1}^{n} h_{ij}^{2}. \end{aligned} \tag{2} $$

Applying Theorem 7.1, we have det(B_j) = 1 + τ², and N_n(r_j | 0, B_j) is proportional to $\exp(-0.5 L_{j1}) / \sqrt{1 + τ^{2}}$. The calculation is feasible even for large q.
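The shortcut in (2) can be checked numerically. The sketch below computes L_j1 using only inner products and compares it with the quadratic form r^T B^{-1} r obtained from the Sherman-Morrison identity B^{-1} = I_n - τ²/(1 + τ²) x x^T, which holds when x^T x = 1. The function names are hypothetical, and the comparison is a sanity check rather than part of the sampler.

```python
import math


def shortcut_terms(x, r, tau_sq):
    """Return (L, log_kernel) for N_n(r | 0, I + tau_sq * x x^T), assuming x^T x = 1."""
    s = sum(xi * ri for xi, ri in zip(x, r))             # x^T r
    phi = (1.0 / math.sqrt(1.0 + tau_sq) - 1.0) * s
    h = [phi * xi + ri for xi, ri in zip(x, r)]          # h = phi * x + r
    big_l = sum(hi * hi for hi in h)                     # L = sum of h_i^2
    # log of exp(-0.5 * L) / sqrt(1 + tau_sq), ie, the density up to a constant.
    log_kernel = -0.5 * big_l - 0.5 * math.log(1.0 + tau_sq)
    return big_l, log_kernel


def direct_quadratic_form(x, r, tau_sq):
    """r^T B^{-1} r using B^{-1} = I - tau_sq / (1 + tau_sq) * x x^T (Sherman-Morrison)."""
    s = sum(xi * ri for xi, ri in zip(x, r))
    return sum(ri * ri for ri in r) - tau_sq / (1.0 + tau_sq) * s * s


x_demo, r_demo, tau_sq_demo = [0.6, 0.8, 0.0], [1.0, 2.0, 3.0], 2.5
L_demo, log_kernel_demo = shortcut_terms(x_demo, r_demo, tau_sq_demo)
```

Both routes cost O(n) per probe, whereas forming and inverting B_j explicitly would cost at least O(n²) memory per probe.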

Outline of procedure

Let F·I(c, d) denote the distribution F restricted to the interval (c, d). The Gibbs sampler consists of the following steps:

• Applying Theorem 7.1, the binary indicators for probes j = 1, …, q are updated as follows:

$$ P[γ_{j} \mid β_{-j}, ω, z] \propto \begin{cases} (1 - ω) \cdot \exp(-0.5 L_{j0}) & \text{if } γ_{j} = 0, \\ ω / \sqrt{1 + τ^{2}} \cdot \exp(-0.5 L_{j1}) & \text{if } γ_{j} = 1, \end{cases} $$

where $L_{j0} = \sum_{i=1}^{n} r_{ij}^{2}$ and L_j1 is as defined in (2).

• Writing x_i = (1, x_i1, …, x_iq)^T for individuals i = 1, …, n, the subject-specific latent variables z_i are independently distributed, with the side of the truncation determined by the observed response y_i, as

$$ z_{i} \mid β, y_{i} \sim \begin{cases} N(x_{i}^{T} β, 1) \cdot I(-\infty, 0) & \text{if } y_{i} = 0, \\ N(x_{i}^{T} β, 1) \cdot I(0, \infty) & \text{if } y_{i} = 1. \end{cases} $$

• Let β_I be the elements of β corresponding to the intercept and to the set of probes j for which γ_j = 1, and let β_{I^c} denote the remaining elements. Then β_{I^c} = 0. Vector β_I is jointly updated as

$$ β_{I} \mid z, γ \sim N_{ρ}(Σ_{I} U_{I}^{T} z, Σ_{I}), $$

where U_I is an n × ρ matrix with the first column equal to a vector of n 1's and the remaining columns equal to the vectors x_j for which γ_j = 1. The variance matrix is $Σ_{I} = (U_{I}^{T} U_{I} + τ^{-2} I_{ρ})^{-1}$.

• $τ_{0}^{-2} \mid β_{0}$ is distributed as gamma$(3/2, (1 + β_{0}^{2})/2)$.

• $τ^{-2} \mid β_{-0}$ is distributed as gamma$((1 + ρ)/2, (1 + \sum_{j=1}^{q} β_{j}^{2})/2)$.

• ω | γ is distributed as beta(ρ, q - ρ + 1) · I(0, 0.1).
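Two of the steps above lend themselves to a short sketch: the inclusion probability for γ_j and the truncated normal draw for z_i. The code below is a minimal illustration under stated assumptions, not the authors' implementation. The weights are normalized in log space to avoid underflow when L_j0 and L_j1 are large, and z_i is drawn by inverse-CDF sampling with the truncation side set by the observed response y_i, as in the data augmentation of Albert and Chib.²⁵ The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def inclusion_probability(omega, tau_sq, l0, l1):
    """P[gamma_j = 1 | rest] from the two unnormalized weights, via log space."""
    log_w0 = math.log(1.0 - omega) - 0.5 * l0
    log_w1 = math.log(omega) - 0.5 * math.log(1.0 + tau_sq) - 0.5 * l1
    m = max(log_w0, log_w1)                  # subtract the maximum before exponentiating
    w0, w1 = math.exp(log_w0 - m), math.exp(log_w1 - m)
    return w1 / (w0 + w1)


def draw_latent_z(mean, y, rng):
    """Draw z ~ N(mean, 1) restricted to (0, inf) if y = 1, or (-inf, 0) if y = 0."""
    std_normal = NormalDist()
    p_neg = std_normal.cdf(-mean)            # P[z < 0] before truncation
    u = rng.uniform(p_neg, 1.0) if y == 1 else rng.uniform(0.0, p_neg)
    # Keep u strictly inside (0, 1), where the inverse CDF is defined.
    # Caveat: inverse-CDF sampling loses accuracy when |mean| is very large.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return mean + std_normal.inv_cdf(u)


prob_demo = inclusion_probability(0.05, 1.0, 10.0, 10.0)
```

When L_j0 equals L_j1 the data do not distinguish the two cases, and the inclusion probability falls below ω because of the 1/√(1 + τ²) factor, which acts as a penalty on adding a covariate.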

Test case predictions

Suppose we have the aCGH profiles of n^* additional test case individuals from the same hypothetical disease population. Using the within-variable means and variances of the training sample, we transform the aCGH profiles to obtain the covariates $x_{i^*} = (1, x_{i^*1}, …, x_{i^*q})^{T}$ for individuals i^* = 1, …, n^* belonging to the test sample. Let D represent the training set data. The posterior probability that individual i^* belongs to disease category 1 is

\begin{array}{l} P [y_{i *} = 1 | D] = \int P [y_{i *} = 1 | β] [β | D] d β \\ = 1 - \int Φ [0 | x_{i *}^{T} β, 1] [β | D] d β . \end{array}

A consistent (in simulation size) estimate of this probability is then

\hat{P} [y_{i *} = 1 | D] = 1 - \sum_{t = 1}^{M} Φ (0 | x_{i *}^{T} β^{(t)}, 1) / M

where β^(t) is the value of β generated at the t-th MCMC iterate, t = 1, …, M. We declare the disease category of the test case individual labeled i^* as

$$ \hat{y}_{i^*} = \begin{cases} 0 & \text{if } \hat{P}[y_{i^*} = 1 \mid D] < 0.5, \\ 1 & \text{otherwise.} \end{cases} \tag{3} $$
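The Monte Carlo estimate and the decision rule (3) translate directly into code. The sketch below assumes the stored draws are available as a list of coefficient vectors with the intercept first and with unselected coefficients stored as zeros; the function names are hypothetical.

```python
from statistics import NormalDist


def predict_disease_probability(x_star, beta_draws):
    """Estimate P[y* = 1 | D] as 1 minus the average of Phi(0 | x*^T beta^(t), 1)."""
    std_normal_cdf = NormalDist().cdf
    total = 0.0
    for beta in beta_draws:
        eta = sum(b * x for b, x in zip(beta, x_star))   # x*^T beta^(t)
        total += std_normal_cdf(-eta)                    # Phi(0 | eta, 1)
    return 1.0 - total / len(beta_draws)


def classify(prob):
    """Decision rule (3): category 0 if the estimate is below 0.5, else 1."""
    return 0 if prob < 0.5 else 1


# x_star includes the leading 1 for the intercept.
prob_demo = predict_disease_probability([1.0, 0.5], [[1.0, 2.0], [1.0, 2.0]])
label_demo = classify(prob_demo)
```

Averaging the probabilities over draws, rather than plugging in a single point estimate of β, carries the posterior uncertainty about both the selected probes and their effects into the prediction.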

Simulation Study

We generated a training sample consisting of aCGH profiles on p = 2000 probes for n = 100 individuals. The individuals were regarded as random draws from a disease population where 100 × (1 - p^*) = 25% of the individuals had “disease 0” and the remaining 100 × p^* = 75% of the individuals had “disease 1,” so that p^* = 0.75 represented the prior probability of disease 1 in the population.

Disease 0 was assumed to be characterized by losses (s = -1) from probes 201 to 400 and gains (s = 1) from probes 1401 to 1800. Disease 1 was characterized by losses from probes 301 to 500 and also from probes 1601 to 1800. The remaining probes were assigned a copy number state of 0. For each disease subcategory, we randomly selected 10% of the probes that were associated with the disease and randomly set their copy number states to be copy neutral, gains, or losses with equal probability. Additionally, random noise at the probe level was added to the profiles by selecting 2% (ie, 4000) of the remaining probes and randomly changing their copy number states. These values constituted the variables s_ij in Stage 2 of the Section 2 model, and were assumed to be known in the simulation.
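One possible reading of this simulation design is sketched below. Several details are assumptions made for the sketch and are not stated in the text: the 10% perturbation and the 2% noise are applied separately for each individual, the noise step always moves a probe to a different state, and disease labels are drawn independently with probability p^*. The names `DISEASE_REGIONS`, `simulate_states`, and `simulate_training_sample` are hypothetical.

```python
import random

P = 2000
# (first probe, last probe, state), with probes numbered from 1 as in the text.
DISEASE_REGIONS = {
    0: [(201, 400, -1), (1401, 1800, 1)],
    1: [(301, 500, -1), (1601, 1800, -1)],
}


def simulate_states(disease, rng, perturb_frac=0.10, noise_frac=0.02):
    """Simulate the copy number states of one individual with the given disease."""
    s = [0] * P
    signature = []
    for first, last, state in DISEASE_REGIONS[disease]:
        for probe in range(first, last + 1):
            s[probe - 1] = state
            signature.append(probe - 1)
    # Redraw a fraction of the disease-associated probes uniformly over {-1, 0, +1}.
    for j in rng.sample(signature, round(perturb_frac * len(signature))):
        s[j] = rng.choice((-1, 0, 1))
    # Probe-level noise on the remaining probes (assumed to change the state).
    in_signature = set(signature)
    remaining = [j for j in range(P) if j not in in_signature]
    for j in rng.sample(remaining, round(noise_frac * len(remaining))):
        s[j] = rng.choice([t for t in (-1, 0, 1) if t != s[j]])
    return s


def simulate_training_sample(n, p_star, rng):
    """Draw disease labels with P[disease 1] = p_star, then the state profiles."""
    labels = [1 if rng.random() < p_star else 0 for _ in range(n)]
    return labels, [simulate_states(y, rng) for y in labels]


labels, profiles = simulate_training_sample(100, 0.75, random.Random(0))
```

The two diseases overlap on probes 301 to 400 (loss in both) and differ in sign on probes 1601 to 1800, so only part of each signature region is informative for classification.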

As described in Section 2, the variables were then transformed to obtain the covariates w_ij and v_ij for i = 1, …, n and j = 1, …, 2p. The set $Q = \{j \mid \sum_{i=1}^{n} (w_{ij} - w_{\cdot j})^{2} > 0\}$ was evaluated to identify q = 2571 covariates for which the individuals had at least two distinct values. These variables were relabeled as {x_ij: j = 1, …, q}, and the remaining variables were discarded. The model was fit using the Gibbs sampler of Section 3. An initial set of 10,000 samples was run to allow the MCMC chain to forget its starting values. A 1-in-10 subsample of M = 100,000 additional draws was stored for posterior inferences. Figure 1 presents histograms for the marginal posteriors of the intercept β₀, standard deviations τ₀ and τ, and Bernoulli probability ω, which are used in the sequel to make predictions for the disease categories of the test case individuals.

Figure 1

Histogram of selected model parameters for the simulation study.

We evaluated the predictive ability of our approach by drawing 50 independent test samples of n^* = 200 individuals from the same hypothetical disease population and generating their aCGH profiles based on their disease categories. Exactly 50 of these 200 test case individuals had disease 0, and the remaining 150 individuals had disease 1. Using the within-variable means and variances of the training sample, we transformed the aCGH profiles to obtain the covariates $x_{i^*} = (1, x_{i^*1}, …, x_{i^*q})^{T}$ for individuals i^* = 1, …, n^* belonging to the test sample of each of the 50 datasets.

For each dataset, using the stored MCMC sample of size M = 100,000 and as described in Section 3, we computed the posterior probability of disease 1, $\hat{P}[y_{i^*} = 1 \mid D]$, for the n^* = 200 individuals. The estimated $\hat{y}_{i^*}$ for the n^* = 200 individuals were computed as in (3). These values versus the true disease categories $y_{i^*}$ are summarized in Table 1. The table reveals the remarkable accuracy of the proposed methodology in detecting disease category. Specifically, for all 50 datasets, the technique resulted in perfect disease prediction with no false classification.

Table 1

For the 200 individuals belonging to each of the 50 test samples of the simulation study, the estimated disease category versus the true category, averaged over the 50 test samples. Perfect classification was obtained for each dataset. As a result, the standard errors shown in parentheses are all zero.

Truth	Estimated ŷ_i* = 0	Estimated ŷ_i* = 1
y_i* = 0	50 (0)	0 (0)
y_i* = 1	0 (0)	150 (0)

Breast Cancer Data Analysis

We analyzed the breast cancer dataset from Andre et al.²⁴ which consists of n = 111 individuals with either disease subcategory ER+ (label “1”) or TN (label “0”). There are 56 TN and 55 ER+ patients. aCGH emissions for these individuals were available on the same set of p = 42,416 probes along with the probes’ locations. Specifically, the chromosome and the distance in megabases (MB) from a telomere are available for every probe.

As described in Section 2.1, we used this information to first infer the latent copy number states s_ij of the probes using a Bayesian HMM, where i = 1, …, 111 and j = 1, …, 42,416. Then, as described in Section 2.2, we obtained the indicator functions, $u_{ij}^{(+)}$ and $u_{ij}^{(-)}$, of gain and loss. These indicator variables were transformed to obtain the covariates w_ij and v_ij for i = 1, …, n and j = 1, …, 84,832. The set $Q = \{j \mid \sum_{i=1}^{n} (w_{ij} - w_{\cdot j})^{2} > 0\}$ was evaluated to identify q = 5,543 covariates having at least two distinct values for the 111 individuals. These variables were relabeled as {x_ij: j = 1, …, 5,543} and retained as potential regressors. The remaining variables were discarded because they were unlikely to be associated with the subcategory classification.

To investigate the reliability of the proposed method on this dataset, we performed 50 independent replications of the following steps. (i) We randomly split the data into training and test sets in a 4:1 ratio. (ii) We analyzed the disease subcategories and the q = 5,543 covariates of the 89 training set individuals using the Bayesian probit regression model with likelihood function (1). The model was fit using the Gibbs sampler of Section 3. An initial set of 10,000 samples was run to allow the MCMC chain to forget its initial values. A 1-in-10 subsample of M = 100,000 additional draws was stored for posterior inferences. (iii) As described in Section 3, we used the q = 5,543 covariates of the 22 test case individuals to predict their disease subcategories. These predictions were compared with the actual disease subcategories of these 22 individuals to compute the classification error rate for the specific training-test random split. An average of the 50 independent estimates in Step (iii) yielded a simulation-based estimate of the classification error rate for the proposed method. This was estimated to be 22.55% with a standard error of 1.16%.
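The replication scheme in steps (i) to (iii) can be sketched as follows. The classifier is passed in as a callback, here called `fit_and_predict`, which is a hypothetical stand-in for fitting the Gibbs sampler on the training individuals and predicting the test individuals; the remaining names are also illustrative. The standard error is assumed to be the sample standard deviation of the 50 error rates divided by √50.

```python
import math
import random


def split_indices(n, test_size, rng):
    """Randomly split indices 0..n-1 into (train, test)."""
    idx = list(range(n))
    rng.shuffle(idx)
    return idx[test_size:], idx[:test_size]


def repeated_split_error(labels, fit_and_predict, n_reps=50, test_size=22, seed=0):
    """Average test error over random splits, with its standard error."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_reps):
        train, test = split_indices(len(labels), test_size, rng)
        predicted = fit_and_predict(train, test)     # one prediction per test index
        wrong = sum(predicted[k] != labels[i] for k, i in enumerate(test))
        errors.append(wrong / len(test))
    mean = sum(errors) / n_reps
    sd = math.sqrt(sum((e - mean) ** 2 for e in errors) / (n_reps - 1))
    return mean, sd / math.sqrt(n_reps)


# 56 TN (label 0) and 55 ER+ (label 1) patients, as in the dataset description.
labels = [0] * 56 + [1] * 55
# Trivial baseline that always predicts TN, used only to exercise the code.
baseline_error, baseline_se = repeated_split_error(labels, lambda tr, te: [0] * len(te))
```

With 111 individuals and a test size of 22, each split leaves 89 training individuals, matching the 4:1 ratio in the text.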

The significant probes (covariates) that were found to be predictive of disease subtype are plotted in Figures 2–4. We assumed a posterior probability threshold of δ = 0.15, which yielded 500 markers along the entire genome predictive of the disease classification. Figure 2 plots a bar graph of the chromosomal breakdown of these markers. As can be seen, most of the significant markers are located on chromosomes 5, 12, 16, and 17. The corresponding karyograms in Figures 3 and 4 show the breakdown of the markers by chromosomal location for negative (red) and positive (green) associations with the disease states, respectively.
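Selecting markers by thresholding can be sketched from the stored draws of γ. The code below assumes that the posterior probability in question is the marginal inclusion probability of each covariate, estimated by the fraction of stored iterations with γ_j = 1, and that a marker is selected when this fraction exceeds δ; both points are assumptions, as are the function names.

```python
def posterior_inclusion_probabilities(gamma_draws):
    """Fraction of stored MCMC draws in which each covariate was included."""
    m = len(gamma_draws)
    q = len(gamma_draws[0])
    return [sum(draw[j] for draw in gamma_draws) / m for j in range(q)]


def select_markers(gamma_draws, delta=0.15):
    """Indices of covariates whose estimated inclusion probability exceeds delta."""
    probs = posterior_inclusion_probabilities(gamma_draws)
    return [j for j, pr in enumerate(probs) if pr > delta]


# Four stored draws of gamma for three covariates (toy values).
toy_draws = [[1, 0, 0], [1, 0, 1], [0, 0, 0], [1, 0, 0]]
selected = select_markers(toy_draws)
```

Each selected index refers to a retained covariate, ie, a loss or gain indicator at a probe, so the sign of the posterior mean of β_j determines whether the marker is drawn in red or green in the karyograms.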

Figure 2

Number of significant markers broken down for each chromosome.

Figure 3

Human karyogram with significant locations. This figure is a karyogram that depicts the significant probes identified using our approach. The red color corresponds to negative regression coefficients.

Figure 4

Human karyogram with significant locations. This figure is a karyogram that depicts the significant probes identified using our approach. The green color corresponds to positive regression coefficients.

Our results are promising based on the locations of selected markers. As noted, most markers are on chromosomes 5, 12, 16, and 17. It has been shown that chromosome 5q deletions are the most frequent aberration in breast tumors from BRCA1 mutation carriers. The deletions in 5q occur at high frequencies on putative tumor suppressor genes such as XRCC4, RAD50, RASA1, APC, and PPP2R2B.²⁶ Chromosome 16q has been a target region for the detection of biomarkers for breast cancer.²⁴ We identified a high concentration of biomarkers in 16q as well. In addition, our flagged biomarkers on chromosome 17 are also convincing, since chromosome 17 is the host for the most famous breast cancer gene BRCA1 as well as ER. Interestingly, little is known about the association of CNVs on chromosome 12 with subgroups of breast cancer. Our findings on chromosome 12 could be potentially new discoveries that might warrant further functional validation.

Conclusions and Discussion

The detection of CNVs in aCGH methods is important for the treatment of many types of cancers, especially in the development of molecular-based personalized cancer therapies. We propose a framework for the prediction of disease types using aCGH profiles. We employ a Bayesian classification model and treat disease states as outcome and aCGH profiles as covariates in order to identify significant regions of the genome associated with disease subclasses. Specifically, we propose a principled two-stage method using the covariates exhibiting serial dependence. Stage 1 makes inferences on the underlying copy number states associated with the aCGH emissions based on HMM formulation. Using Bayesian linear variable selection procedures, Stage 2 detects the model parameters associated with the binary responses, conditional on the parameters of Stage 1.

The selected probes and their effects are parameters that are useful for predicting the disease categories of any additional individuals on the basis of their copy number profiles. A simulation study demonstrates the method's remarkable accuracy in detecting disease category. The methodology is applied to a breast cancer dataset, and we find several markers that are associated with disease subtype using the copy number profiles. Some of these discoveries confirm existing literature, and novel associations could be potential targets for future validation studies.

Our methods are general and could potentially be applied to other platforms that yield copy number profiles, such as SNP arrays. A natural generalization of the method would be to incorporate genotype information (eg, allelic frequencies) in the models (especially Stage 1), which could lead to more refined estimation of the latent copy number states. Furthermore, current technologies enable the collection of multiplatform data, such as mRNA expression, on matched patient samples (eg, The Cancer Genome Atlas (TCGA)); such data can be leveraged to provide a more detailed understanding of the biological mechanisms involved in cancer development and progression. We leave these tasks for future consideration.

Author Contributions

Conceived and designed the experiments: SG, YJ, VB. Analyzed the data: SG, VB. Wrote the first draft of the manuscript: SG, YJ, VB. Contributed to the writing of the manuscript: SG, YJ, VB. Agree with manuscript results and conclusions: SG, YJ, VB. Jointly developed the structure and arguments for the paper: SG, YJ, VB. Made critical revisions and approved final version: SG, YJ, VB. All authors reviewed and approved of the final manuscript.

Appendix

Theorem 7.1: Let x = (x₁, …, x_n)' be a vector such that x^T x = 1. Define the matrices A = xx^T and B = I_n + τ²A. Then the determinant of matrix B is 1 + τ². Given r ∊ Rⁿ, define the scalar $ϕ = (1/\sqrt{1 + τ^{2}} - 1) x^{T} r$ and the vector $h = (h_{1}, …, h_{n})^{T} = ϕ x + r$. Let $L = \sum_{i=1}^{n} h_{i}^{2}$. Then the n-variate normal density N_n(r | 0, B) is proportional to $\exp(-0.5 L) / \sqrt{1 + τ^{2}}$.

Proof. Since A = xx^T has rank 1 and x^T x = 1, the eigenvalues of A consist of a single 1 and (n - 1) number of 0's. Furthermore, the eigenvector corresponding to eigenvalue 1 must be x. Let Λ_A be the diagonal matrix of the eigenvalues, and P be the matrix of eigenvectors of A. Then A = PΛ_A P^T.

Since PP^T = I_n and B = I_n + τ²A, B has the same eigenvectors as A and its eigenvalues are 1 + τ² and (n - 1) number of 1's. The product of these eigenvalues is

$$ \det(B) = 1 + τ^{2}. \tag{4} $$

Matrix B^{-1/2} has the same eigenvectors as B and its eigenvalues are $1/\sqrt{1 + τ^{2}}$ and (n - 1) number of 1's. Thus, $Λ_{B^{-1/2}} = (1/\sqrt{1 + τ^{2}} - 1) Λ_{A} + I_{n}$ and

$$ B^{-1/2} = P \left( \left( \frac{1}{\sqrt{1 + τ^{2}}} - 1 \right) Λ_{A} + I_{n} \right) P^{T} = \left( \frac{1}{\sqrt{1 + τ^{2}}} - 1 \right) P Λ_{A} P^{T} + P P^{T} = \left( \frac{1}{\sqrt{1 + τ^{2}}} - 1 \right) A + I_{n}. $$

Given r ∊ Rⁿ, we have

$$ B^{-1/2} r = \left( \frac{1}{\sqrt{1 + τ^{2}}} - 1 \right) A r + r. \tag{5} $$

We obtain the result on substituting (4) and (5) in the n-variate normal density.
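The final substitution can be written out explicitly. The following worked step is a sketch that uses only the quantities defined in the theorem, together with the facts that B^{-1/2} is symmetric and that Ar = (x^T r)x:

```latex
\begin{aligned}
B^{-1/2} r &= \left( \frac{1}{\sqrt{1 + \tau^{2}}} - 1 \right) (x^{T} r)\, x + r = \phi x + r = h, \\
r^{T} B^{-1} r &= (B^{-1/2} r)^{T} (B^{-1/2} r) = h^{T} h = \sum_{i=1}^{n} h_{i}^{2} = L, \\
N_{n}(r \mid 0, B) &= (2\pi)^{-n/2} \det(B)^{-1/2} \exp\left( -\tfrac{1}{2} r^{T} B^{-1} r \right)
 = (2\pi)^{-n/2} \, \frac{\exp(-0.5 L)}{\sqrt{1 + \tau^{2}}} .
\end{aligned}
```

The factor (2π)^{-n/2} does not depend on γ_j, so it cancels when the two weights in the update of γ_j are normalized.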

References

1. Pinkel, Albertson D.G. Array comparative genomic hybridization and its applications in cancer. Nat Genet. 2005; 37(suppl): S11–7.

2. Chin, Hahn W.C., Getz, Meyerson. Making sense of cancer genomic data. Genes Dev. 2011; 25: 534–55.

3. Vissers L.E., de Vries B.B., Veltman J.A. Genomic microarrays in mental retardation: from copy number variation to gene, from research to diagnosis. J Med Genet. 2010; 47: 289–97.

4. Kallioniemi, Kallioniemi O.P., Sudar. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science. 1992; 258: 818–21.

5. Zhu, Hastie. Classification of gene microarrays by penalized logistic regression. Biostatistics. 2004; 5: 427–43.

6. Willenbrock, Fridlyand. A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics. 2005; 21: 4084–91.

7. Liu, Ranka, Kahveci. Classification and feature selection algorithms for multi-class CGH data. Bioinformatics. 2008; 24: 86–95.

8. Rapaport, Barillot, Vert J.P. Classification of arrayCGH data using fused SVM. Bioinformatics. 2008; 24: i375–82.

9. Peduzzi P.N., Hardy R.J., Holford T.R. A stepwise variable selection procedure for nonlinear regression models. Biometrics. 1980; 36: 511–6.

10. Tibshirani. The lasso method for variable selection in the Cox model. Stat Med. 1997; 16: 385–95.

11. Fan, Li. Variable selection for Cox's proportional hazards model and frailty model. Ann Stat. 2002; 30: 74–99.

12. Mitchell T.J., Beauchamp J.J. Bayesian variable selection in linear regression. J Am Stat Assoc. 1988; 83: 1023–36.

13. George, McCulloch. Variable selection via Gibbs sampling. J Am Stat Assoc. 1993; 88: 881–9.

14. Dellaportas, Forster J.J., Ntzoufras. Bayesian Variable Selection using the Gibbs Sampling. New York: Marcel Dekker, Inc; 1982: 273–86.

15. Madigan, Raftery. Model selection and accounting for model uncertainty in graphical models using Occam's window. J Am Stat Assoc. 1994; 89: 1535–46.

16. Volinsky, Madigan, Raftery A.E., Kronmal R.A. Bayesian model averaging in proportional hazard models: assessing the risk of stroke. Appl Stat. 1997; 46: 433–48.

17. Kuo, Mallick. Bayesian semiparametric inference for the accelerated failure time model. Can J Stat. 1997; 25: 457–72.

18. Brown P.J., Vannucci, Fearn. Multivariate Bayesian variable selection and prediction. J R Stat Soc Series B Stat Methodol. 1998; 60: 627–41.

19. Cai, Dunson. Bayesian covariance selection in generalized linear mixed models. Biometrics. 2006; 62: 446–57.

20. Sha, Vannucci, Tadesse M.G. Bayesian variable selection in multinomial probit models to identify molecular signatures of disease stage. Biometrics. 2004; 60: 812–19.

21. Lee, Mallick. Bayesian methods for variable selection in survival models with application to DNA microarray data. Sankhya. 2004; 66: 756–78.

22. Sha, Tadesse M.G., Vannucci. Bayesian variable selection for the analysis of microarray data with censored outcome. Bioinformatics. 2006; 22: 2262–8.

23. Guha, Li, Neuberg. Bayesian Hidden Markov Modeling of Array CGH Data. J Am Stat Assoc. 2008; 103: 485–97.

24. Andre, Job, Dessen. Molecular characterization of breast cancer with high-resolution oligonucleotide comparative genomic hybridization array. Clin Cancer Res. 2009; 15: 441–51.

25. Albert J.H., Chib. Bayesian analysis of binary and polychotomous response data. J Am Stat Assoc. 1993; 88: 669–79.

26. Johannsdottir, Jonsson, Johannesdottir. Chromosome 5 imbalance mapping in breast tumors from BRCA1 and BRCA2 mutation carriers and sporadic breast tumors. Int J Cancer. 2006; 119: 1052–60.

Bayesian Disease Classification Using Copy Number Data
