Abstract
Somatic alterations in DNA copy number have been well studied in numerous malignancies, yet the role of germline DNA copy number variation in cancer is still emerging. Genotyping microarrays generate allele-specific signal intensities to determine genotype, but may also be used to infer DNA copy number using additional computational approaches. Numerous tools have been developed to analyze Illumina genotype microarray data for copy number variant (CNV) discovery, although commonly utilized algorithms freely available to the public employ approaches based upon the use of hidden Markov models (HMMs). QuantiSNP, PennCNV, and GenoCN utilize HMMs with six copy number states but vary in how transition and emission probabilities are calculated. Performance of these CNV detection algorithms has been shown to be variable between both genotyping platforms and data sets, although HMM approaches generally outperform other current methods. Low sensitivity is prevalent with HMM-based algorithms, suggesting the need for continued improvement in CNV detection methodologies.
Introduction
Somatic alterations in DNA copy number (CNAs, also commonly referred to as somatic copy number variants), ranging in size from kilobases to entire chromosomes, are a hallmark of cancers, and identification of recurrent gains and deletions is vital to improving the biological understanding of these complex diseases. Examples of recurrent CNAs include deletion of the
Comparative genomic hybridization DNA microarrays (aCGH) based upon bacterial artificial chromosome (BAC) library sequences or synthesized oligonucleotides were developed specifically for genome-wide interrogation of DNA copy number, and currently have a resolution in the kilobase range.24,25 Statistical issues associated with aCGH, including removal of technical variation and biological interpretation of array measurements, have led to the development of numerous algorithms for normalization, chromosome segmentation, and copy number calling.26–30 More recently, genotyping arrays that were initially designed for genome-wide evaluation of single nucleotide polymorphisms (SNPs) have been adapted, through the use of additional processing of probe intensity data, to identify DNA copy number. In addition to the ability to capture both SNPs and copy number changes, genotype arrays can also provide information on copy number neutral loss of heterozygosity (LOH). 31 Although dependent on the platform used, genotype arrays typically have a higher overall resolution compared to comparative genomic hybridization (CGH) arrays, but variability in probe density in genotyping arrays may lead to areas of higher and lower coverage within the genome.31,32 As with aCGH, computational approaches to handle raw data in the form of allele-specific probe signal intensity derived from the hybridization of a single DNA sample have been developed to process and interpret DNA copy number in genotype arrays. While DNA copy number calling is possible on both Illumina and Affymetrix genotyping platforms, this review provides a brief summary of algorithms freely available to academic users for Illumina genotyping array data, focusing on commonly utilized hidden Markov model (HMM)-based approaches for CNV detection.
Hmm-Based Copy Number Detection and Algorithms for Illumina Genotype Data
HMMs are full probabilistic models that function to determine an unknown sequence of states based upon a sequence of observations. Markov models model stochastic processes in which known sequences are produced from a finite number of discrete states, where each new state of a sequence is only dependent upon the previous state. In the Markov model, a change from any one state to another is described by a matrix of transition probabilities. In contrast to the basic Markov model, the sequence of states is hidden in the HMM and can only be inferred through a sequence of observed random variables. In addition to a state transition probability distribution used in the Markov model, each state in an HMM has an emission probability distribution modeling the observed variable as a function of a particular hidden state. To identify the hidden sequence with the highest likelihood based upon the model, HMMs first optimize model parameters (including the emission and transition probability distributions) to best describe the observed sequence of variables. With the model parameters optimized, HMMs can then determine the most probable sequence of hidden states using a dynamic programming approach.
CNV detection from Illumina genotyping data involves utilizing observed signal intensity data from the microarray to determine the hidden copy number at each locus surveyed in the genome. The Illumina GenomeStudio software uses scanned genotype microarray files and platform information files to output genotype calls and other metrics, including those necessary for most copy number calling methodologies: the log
Three HMM-based approaches to CNV detection using LRR and BAF data from Illumina genotyping microarrays are QuantiSNP, 34 PennCNV, 35 and GenoCN. 36 The statistical models underlying these software packages share similarities and differences with regard to the core elements of HMMs, including the number of states, emission probabilities of LRRs and BAFs, state transition probabilities, parameter estimation and optimization, and selection of the optimal sequence of hidden copy number states that can be used for biological interpretation.
Hidden copy number states
All three tools employ a total of six discrete copy number states. Two states exist for deletion: full deletion (a null genotype and a copy number of 0) and single deletion (a genotype of allele A or B and a copy number of 1). The normal states have a genotype composition of heterozygotes (genotype AB) or homozygotes (genotype AA or BB) and a copy number of 2. While QuantiSNP has two separate normal states for heterozygotes and homozygotes, PennCNV and GenoCN combine normal heterozygotes and homozygotes into a single state and have an additional normal state representing copy number neutral LOH. Lastly, two states exist for gains, single copy gains (genotype AAA, AAB, ABB, or BBB and a copy number of 3), and double copy gains (genotype AAAA, AAAB, AABB, ABBB, or BBB and a copy number of 4).
Transition probabilities
In QuantiSNP, an exponential function involving the distance between two adjacent SNPs and a length inferred from the data is used to determine an a priori probability that copy number state change occurs between two adjacent loci. In the transition matrix, any state change has equal probability, whereas the probability of remaining in a copy number state differs between non-normal copy number regions, normal heterozygous regions, and normal homozygous regions. PennCNV determines transition probabilities by incorporating parameters estimated by the Baum–Welch algorithm 37 into an exponential function that uses the distance between two adjacent SNP loci and a constant distance, set to either 100 Mb for the copy number neutral LOH state or 100 kb for all other states. The transition matrix uses a common calculation for all state changes and another common calculation for remaining in a state. Transition probabilities in GenoCN, a continuous-time HMM, are determined using an intensity matrix to model the instantaneous transition rate as well as using the distance between two adjacent SNPs. These parameters are used in two exponential functions of the transition matrix, one to calculate copy number state change probabilities and another to calculate the probability of remaining in a copy number state.
Emission probabilities
Emission probabilities for the LRR are calculated in a similar fashion for QuantiSNP, PennCNV, and GenoCN, with the model using a mixture of a uniform distribution to model technical noise in the microarray and a normal distribution to model the LRR observations from a given state. BAF emission probabilities in QuantiSNP are also modeled using a mixture of uniform and normal distributions, but also include half-normal distributions for homozygous genotypes. In PennCNV, BAF emission probabilities are modeled as either a mixture of uniform and normal distributions (BAF between 0 and 1) or a mixture of point mass and truncated normal distributions (BAF is 0 or 1). GenoCN models BAF emission probabilities using a uniform distribution component and several truncated normal distribution components.
Parameter optimization
In QuantiSNP, an objective Bayes-based HMM, normal-gamma conjugate priors are used for emission model parameters, and hyperparameters are estimated from a reference data set using maximum marginal likelihood techniques. The model hyperparameters are then optimized using an expectation maximization algorithm. 38 In contrast, PennCNV estimates initial model parameters empirically using data from several large CNV regions in a large set of training samples. The Baum–Welch algorithm is then used to optimize the parameters by training the model to maximize the likelihood of LRR and BAF observations. Initial parameters of the model are estimated from the LRR and BAF data in GenoCN, and parameter optimization is performed using the Baum–Welch algorithm.
Optimal hidden state selection
Both QuantiSNP and PennCNV utilize the Viterbi algorithm to infer the most likely sequence of copy number states once all parameters within the model have been optimized. In contrast, GenoCN uses the optimized parameters to estimate the posterior probabilities for each SNP of a particular copy number state.
Performance Comparison and Conclusions
Performance reviews of CNV detection using Illumina genotyping data have focused predominantly on HMM-based approaches and commercially available software packages using different statistical models.39–42 When applied to both real and simulated data, the number of CNVs detected, as well as the identification of the genomic boundaries of the CNVs, varies between algorithms.39,40,43 Moreover, the ability to accurately detect CNVs may be dependent on microarray platform, copy number of the CNV, CNV size, and the number of array SNPs within a CNV region.40,43 HMMs have shown a relatively low false positive rate but high false negative rate when calling CNVs, although individual algorithm performance can vary between data sets.40,42,43 Other commercial CNV calling software packages that employ differing computational approaches, including cnvPartition within Illumina GenomeStudio and Nexus Copy Number (BioDiscovery), have typically demonstrated decreased specificity and sensitivity relative to HMMs in comparative studies.40,42,43
To provide a performance evaluation between the three HMM-based CNV detection algorithms, data for three HapMap individuals of European ancestry (NA06985, NA06991, and NA06993) genotyped on the Illumina Human610-Quad BeadChip v1.0 were obtained from the NCBI GEO database 44 (GEO accession: GSE17205). CNV detection for these samples was performed using the default settings in Quanti SNP, PennCNV, and GenoCN. Total CNV regions were identified, along with their associated copy number state, and region overlap between methods was defined as regions having at least one shared probe and copy numbers of the same type (gain or deletion). QuantiSNP and GenoCN identify more CNV regions in these samples when compared to PennCNV (Table 1). Moreover, the majority of CNV regions identified by PennCNV overlap with both QuantiSNP and GenoCN, and only a limited amount of unique CNV regions are identified by PennCNV. In contrast, the majority of CNV regions identified by QuantiSNP and GenoCN in the HapMap samples are unique to the algorithm. The distribution of CNV size for each sample, in terms of kilobases and also with regard to the number of probes on the genotyping array, was compared between detection methods (Fig. 1). For all three samples, PennCNV identified regions with the largest size (both in length and number of probes), whereas the CNV regions identified by GenoCN were predominantly the smallest (both in length and number of probes). Interestingly, instances were observed in which QuantiSNP and GenoCN identified multiple CNV regions within a single CNV region identified by another algorithm.
CNVs detected in three HapMap samples using QuantiSNP, PennCNV, and GenoCN. The HMM-based CNV detection tools were applied to Illumina Human610-Quad BeadChip v1.0 data from three HapMap samples of European ancestry (NA06985, NA06991, and NA06993). For each sample, the total number of CNVs detected using each algorithm is listed along with the percentage of regions unique to each algorithm and the percentage of regions that overlap results from the other CNV detection algorithms.

Size of CNVs detected in the HapMap samples using QuantiSNP, PennCNV, and GenoCN. The HMM-based CNV detection tools were applied to Illumina Human610-Quad BeadChip v1.0 data from three HapMap samples of European ancestry (NA06985, NA06991, and NA06993). For each sample, boxplots were generated for CNV sizes from each HMM-based detection method. Boxplots on the left are CNV sizes measured in genomic length (base pairs), and boxplots on the right are CNV sizes measured by the number of genotype microarray probes in the detected region.
Performance of these three CNV detection approaches was further assessed by comparison with previously identified CNVs within these samples. Conrad and others 45 examined 40 HapMap individuals of European and African ancestry using an oligo-based CGH array with a median probe spacing of 53 bp to identify a catalog of putative CNV regions within the sample set. This catalog was subsequently used to generate a CGH-based CNV-typing array that was applied to a set of 450 HapMap individuals of varying ancestry. PCR validation of a subset of CNVs on the CNV-typing CGH array suggested that the false discovery rate in these data was ~15%; therefore, these data were selected as the gold standard for validation of the HMM-based CNV detection in the three aforementioned HapMap samples (NA06985, NA06991, and NA06993). A set of 2419 CNVs from the CNV-typing array contained at least one probe from the Illumina Human610-Quad BeadChip v1.0 and was used for comparison. To identify true positive and false positive CNVs, CNV calls from the HMM-based approaches were considered validated if at least one probe overlapped with the regions identified by Conrad et al and the CNV regions had a similar copy number type (gain or deletion). In addition, CNV regions from Conrad et al that had normal copy number were examined in the same manner to identify true negatives and false negatives. Results from these comparisons show that all three HMM-based detection approaches have a high specificity (≫98%) when examining CNV overlap with the Conrad et al regions (Table 2). In contrast, the sensitivity of these tools was low (≪15%), with PennCNV demonstrating the worst performance for all samples and GenoCN demonstrating the best performance in two of the three HapMap samples (Table 2). PennCNV had the lowest false discovery rate in two of the three HapMap samples and QuantiSNP the largest false discovery rate, although the converse was true in the third sample. To examine how performance changed relative to genotype microarray probe coverage in CNV regions, the list of gold standard CNVs was filtered to remove CNV regions that encompassed less than a minimum number of probes on genotype microarray. This filtering was performed in a stepwise fashion from a minimum of two probes to five probes. These data showed that as minimum probe coverage within CNV regions was increased, sensitivity also increased because of a decrease in false negatives, although specificity decreased because of a decrease in true negatives (Table 2).
Performance of the Hmm-based CNV detection tools in three HapMap samples. The copy number status of 2419 CNV regions, containing at least one probe on the Illumina Human610-Quad BeadChip v1.0 genotype array, was determined for three HapMap samples (NA06985, NA06991, NA06993) using results from QuantisNP, PennCNV, and GenoCN. Comparison of the results from the HMM-based approaches to gold standard copy number data from the Conrad et al study allowed for the calculation of sensitivity, specificity, and the false discovery rate for each HMM-based method. These metrics were recalculated as the list of gold standard CNVs was filtered to remove CNV regions that encompassed less than a minimum number of probes on genotype microarray (stepwise from a minimum of two probes to five probes).
Although all three of the described CNV detection packages for Illumina genotyping microarrays are based upon HMMs, outcomes of performance comparisons presented here and found in other studies highlight how modification of core HMM elements can impact the results of these models. Specifically, variations in the computation of LRR and BAF emission probabilities, transition probabilities, parameter optimization, and hidden copy number state identification impact both the number and size of CNVs detected from genotyping microarray data. Yet these differences in core HMM elements do not lead to one methodology globally outperforming the other methodologies. Instead, variations of the HMMs produce improvements in certain aspects of performance. PennCNV generally has a lower false discovery rate when compared to the other approaches, yet it also has a lower sensitivity and the regions identified predominantly overlap with regions detected using the other two methodologies. QuantiSNP and GenoCN appear to have higher sensitivity and higher false discovery rates than PennCNV, likely because of the increased number of CNV regions detected by these methodologies. Owing to the low sensitivity observed here for all three HMM-based approaches, these tools must be used with the understanding that they will not provide a comprehensive catalog of CNVs within a genome. Furthermore, the observed false discovery rates and sensitivities emphasize the need for additional experimental validation of CNVs identified from Illumina genotyping microarrays when conducting genome-wide association studies of disease risk and outcome. Given the mixed nature of the performance of the currently available HMM-based CNV detection algorithms for Illumina genotyping data, the need exists for continued development of approaches to model the core HMM elements so that sensitivity can be increased and the false discovery rate decreased. Alternatively, utilization of CNV calls from multiple HMM-based algorithms may provide a means to obtain an optimal balance between sensitivity and false discovery rate using currently available HMM-based CNV detection tools.
Author Contributions
Conceived and designed the experiments: ELS. Analyzed the data: ELS. Wrote the first draft of the manuscript: ELS. Contributed to the writing of the manuscript: FI. Agree with manuscript results and conclusions: ELS, FI. Jointly developed the structure and arguments for the paper: ELS. Made critical revisions and approved final version: ELS. Both authors reviewed and approved of the final manuscript.
