Abstract

Introduction
Structural variation includes rearrangements (inversions and translocations) and copy number variations (CNVs), consisting of duplications, deletions, insertions, and multi-allelic variations ranging from one kilobase (kb) to several megabases of DNA, leading to possible dosage imbalances. In other words, genes that were originally understood to be present in two copies have now been revealed to sometimes exist in one, three, more than three copies, or even missing altogether. We define a CNV as a DNA segment that is 1 kb or larger and exists at variable copy numbers when referred to a reference genome. 1
So far, among the numerous insights gained from the completion of the Human Genome Project, single nucleotide polymorphisms (SNPs) have been recognized as a major source of genetic variation, which led to the speculation that the majority of phenotypic variations in humans are due to SNPs. 1 Consequently, immense amount of research has focused on developing genotyping assays and methodologies for SNP analysis, making it the most frequently assayed type of intraspecific genetic variation and the most centered gene-mapping studies with regard to the relationship between SNPs and human diseases. However, growing number of studies have confirmed that more nucleotide bases are affected by CNVs as opposed to SNPs between any two individuals, 2 and along with the advances in high-throughput technologies to detect the magnitude and location of genomic alterations within a single human genome, it is clear that large fragments of our genome have been deleted or duplicated. These structural variations can change the genes copy number that encompass the affected regions and alter gene regulation. In particular, one group of scientist conducted a pilot study in mapping the CNVs in the complete human genome, 2 which demonstrated the remarkable extent of large structural variation in the human genome, both within closely related people and between the global populations due to sizeable duplications and deletions of genomic segments. Likewise, similar studies have shown that despite the existence of powerful repair mechanisms in the human genome, CNV occurrence frequency is 100–10,000 times higher than point mutations. 3
Studies have shown that some types of genes are more prone to be copy number variables than others, such as the genes involved in immunological and neurological development, possibly due to the rapid human evolution in these two functions.4,5 On the other hand, genes that play a role in early development and mitosis, which are fundamental to life, are likely to be spared. 5 CNVs have been reported to play a role in cancer susceptibility, formation, and progression.6–8 These CNVs may be germline variants or somatic mutations. A few association studies have already elucidated the significance of CNVs as disease-susceptibility variants.9–12 In one study, CNV detected in 1,400 regions have overlaps with 14.5% of human disease-causal genes. 9 Recent studies concluded that CNV is frequently found in susceptible individuals who were predisposed to diseases such as color blindness in Mendelian disease, as well as autism, Parkinson's, HIV, cancer, and lupus in complex traits diseases.10–16
This intriguingly frequent and dynamic type of genomic variation has challenged the concept of analyzing the genome through the breakdown of a single diploid human reference genome and has also triggered more research on the techniques of detecting CNVs. Array comparative genome hybridization (aCGH) is a popular experimental technique for detecting copy number variants in genomes.17–19 aCGH requires the hybridization of fluorescently tagged differential DNA fragments from both a test genome and a reference genome to a set of probes originated from the reference genome sequence. The proportion of the test versus reference fluorescence intensity at each probe will identify the positions in the test genome that have fewer, more, or similar copy as the reference genome, which can generate a copy number profile of the test genome.17–19 These CNV profiles are generally compared across individuals in a group of interest to identify common CNVs that are shared among a portion of the group.
Another method that is popular for CNV detection and analysis is by using high-throughput array technologies for SNP genotyping from commercial companies such as Affymetrix and Illumina, due to their ability to perform a dual role for both SNP-based and CNV-based association studies. One widely used array is the Affymetrix genome-wide human SNP array 6.0 (Affy6), which is an array platform that aims to perform both high-density SNP genotyping and high-resolution CNV discovery simultaneously. Aside from the 906,600 SNP probe sets, the Affy6 array also contains 946,000 copy number probe sets that can be used to assess chromosomal copy number changes in regions of the genome that are not well covered by SNPs.
A number of CNV calling methods for individual patients have been implemented, which either follows the circular binary segmentation (CBS) method or the hidden Markov model (HMM) method. CBS is a segmentation-based method that scans for change points in an ordered sequence of values to delineate segments with different distributions of values (measured by having different means). In other words, it will recursively divide up the genome until segments that have probe distribution different than neighbors have been identified. 20 For the HMM method, the aim is to uncover the hidden copy number states (0, 1, 2, 3 copies, etc) by searching the data point by point to determine the most probable copy number states based on observation and transitions between states correspond to changes in copy number.21,22
These methods share a common feature in which the CNV regions are segmented by individual-specific breakpoints, and detection is carried out sample by sample. However, it is much more likely for shared/common CNV regions (ie, recurrent CNV) to occur at the same genomic positions across different individuals in a homogeneous group of people. As a result, recurrent CNV regions are more likely to harbor disease-causal genes, as it is more probable to encompass “driver” alterations (functionally significant for disease initiation or progression), while individual-sample CNVs are subject specific and would be more likely to contain “passenger” alterations (random somatic events irrelevant to pathological events) than disease-relevant alterations. 23 Several methodologies have been proposed for recurrent CNV detection.24–29 These methods mainly differ in the type of input data and the algorithm models being implemented. For the input, most of the recurrent CNV detection methods fall under two categories: continuous (log 2 ratio)25,27,28 and discrete (gains/losses).24,26,29 For the algorithms, they can be categorized into three main models: permutation,24,27 probabilistic null model,25,28–30 or none. 26
In this study, we aim to develop a graph-based algorithm to identify recurrent CNV regions. The algorithm will be applied to analyze the breast cancer data retrieved from Molecular Taxonomy of Breast Cancer International Consortium (METABRIC). 31
Materials and Methods
Data Source
We retrieved the individual patient-level CNV data from METABRIC, 31 which consists of whole-gene expression profiles, SNPs, and individual patient-level DNA CNV data. All samples were derived from ~2,000 clinically annotated primary fresh-frozen breast cancer specimens from tumor banks in the UK and Canada, which were divided into two subsets by METABRIC: Discovery (997 samples) and Validation (995 samples).
Experimental Assay and Genotype Calling
DNA was extracted from each tumor specimen and subject to copy number analysis on the Affymetrix Human SNP 6.0 platform. Data from Affymetrix SNP 6.0 arrays were preprocessed and genotyped using the SNP-RMA (Robust Multi-Array Average) algorithm, available in the crlmm Bioconductor R package. This quantifies raw intensity values into proportional amount of DNA in the target sample associated with each of the alleles, A and B, for each SNP. Feature intensities were corrected for fragment length and sequence effects, followed by quantile normalization to a predefined reference distribution. Intensities were then summarized by median polish, with a single value for each allele. A mixture model was then used to adjust for remaining fragment length and intensity-dependent biases on the log ratio of the summarized intensities. Samples with a signal-to-noise ratio <5 were flagged in downstream analyses.
Individual CNV Calling
Affymetrix SNP 6.0 arrays were preprocessed for copy number segmentation using aroma. affymetrix. Both tumor and normal samples were independently normalized using the single-array method referred to as copy number estimation using robust multichip analysis 32 (CRMAv2), along with a publicly available SNP 6.0 data set consisting of 270 HapMap individuals. For each sample, allelic cross-talk calibration, probe sequence effects normalization, probe-level summarization, and polymerase chain reaction fragment length normalization were performed in order to obtain log 2 intensity values for total copy number estimation. Afterward, probes were sorted by their genomic position, replicate probes were summarized by their median value, and missing values (generated by negative intensities in the normalization) were imputed using the loess procedure included in the snapCGH Bioconductor R package.
Two pooled references were generated, one using the median intensities across the HapMap individuals and another for the normals and tumors, using the median intensity values from a set of 473 normals. Next, log 2 ratios were generated for the HapMap samples by subtracting the pooled value from the log 2 intensities. Similarly, log 2 ratios were obtained for the 473 normals using the corresponding pool. For the 997 tumor samples, two data sets were produced: one using the normal pool as the reference for all the tumors and another using the matched normal for each tumor when available, and the normal pool for the remainder. A similar approach was taken for the validation set.
The HapMap and normal data sets were used to estimate the frequency of germline CNVs in the cohort, while the tumor samples were used for estimating somatic CNVs. After computing the log 2 ratios for each probe, samples were segmented using the CBS algorithm implemented in the DNAcopy R Bioconductor package and individual patient-level CNVs were called. For the tumor samples, any segmented mean that fell within a region included in the HapMap+Normals CNV list was labeled as an inherited CNV. In order to remove all possible germline CNVs, the frequencies of somatic CNVs in the tumor samples were obtained after removing the germline CNVs from the normalized pool reference. For the Discovery data set, a total of 13,391 individual patient-level CNV gains and 20,540 individual patient-level CNV losses were detected. For the Validation data set, a total of 13,963 individual patient-level CNV gains and 20,308 individual patient-level CNV losses were detected.
Identification of Recurrent CNVs
Representing CNVs as an Interval Graph.
We denote a CNV segment as
Figure 1A shows the examples of six individual patient-level CNV segments (A, B, C, D, E, F) on the same chromosome. Each of the six CNVs contains chromosomal-specific start (left) and end (right) positions. To identify the common regions of individual patient-level CNVs on the same chromosome, the intersection among the individual patient-level CNVs can be represented as an interval graph, treating each called individual patient-level CNV as a vertex of the graph and connecting two vertices only if the corresponding intervals have an intersecting region. Thus, the constructed interval graph

Representing CNVs as an interval graph. (
Finding Maximal Cliques from an Interval Graph.
A clique is a set of vertices in which any two vertices are connected by an edge in the interval graph. A maximal clique is a clique that cannot be a subclique of a larger clique. In the context of a CNV set
To find maximal cliques in an interval graph constructed from individual patient-level CNVs, we applied the algorithm of Gentleman and Vandal. 33 The main idea of the algorithm is to sort the vertices based on their chromosomal end positions. The ordering is important because it allows the algorithm to discard vertices in each iteration without losing the triangulation property. The input of the algorithm is the individual patient-level CNVs on a specific chromosome, which include two parameters for each CNV segment: start and end positions (base pair). It should be noted that we need to analyze CNV gains and losses separately. The algorithm adapted for processing individual patient-level CNV data is summarized as follows:
Finding the maximal cliques
Sort all the vertices in terms of their chromosomal end positions
Initialize
For each vertex
For each vertex
return
The output of the algorithm will be a list of maximal cliques. We implemented the algorithm using R package Icens,
33
which implemented the algorithm to find maximal cliques of a triangulated graph based on the fact that all interval graphs are triangulated.
34
The method is efficient and the time complexity of the maximal clique detection algorithm is
Analyzing Recurrent CNVs from the Maximal Cliques.
Each of the identified maximal cliques is a recurrent CNV, which is common in multiple patients. The shared region of the recurrent CNV across multiple patients is the minimal common region (MCR) of the CNV, which has the potential to harbor cancer-causing genes. In practice, the size of the maximal cliques should be at least 2 and the size of the MCRs should be at least 1 kb.
Unlike the algorithm of Gentleman and Vandal to identify maximal cliques, Wu et al.
35
also proposed an algorithm to identify maximal cliques for detecting recurrent CNVs. However, this algorithm is based on a scoring scheme where blocks of consecutive maximal cliques were scored, defining a
Results
Figure 2 shows our analysis flowchart using the maximal clique-based recurrent CNV detection. The individual patient-level CNV data in Discovery data set containing 997 patient samples was separated into two CNV types: gain and loss. Filtering criteria include retaining CNV data that were generated by ≥10 probes and having a CNV size of at least 5 kb. Among the total 997 patients, there are 13,391 individual patient-level CNV gain regions and 20,540 individual patient-level CNV loss regions. The recurrent CNV calling algorithm was run separately for the CNV gains and CNV losses, and analysis was done chromosome by chromosome. Further filtering at the recurrent CNV level includes retaining those that have a minimal region of at least 1 kb, and the number of patients per recurrent CNV region to be at least 5. In total, there are 351 recurrent CNV gain regions (99/351 gain regions encompassing protein-encoding genes) and 475 recurrent CNV loss regions (111/475 loss regions encompassing protein-encoding genes).

Flowchart analysis for implicating proposed algorithm.
Validation testing was then performed using the Validation data set, which contains 995 patient samples. All filtering criteria and algorithm implementations followed the same procedure as the Discovery data set analysis. For recurrent CNV gain regions, a total of 252 regions have been validated (found in both the Discovery and Validation data sets), of which 67/252 regions have encompassed 57 unique protein-encoding genes (Supplementary Tables 1 and 2). For recurrent CNV loss regions, a total of 350 regions have been validated, of which 77/350 regions have encompassed 70 unique protein-encoding genes (Supplementary Tables 3 and 4). In total, 144 validated recurrent CNV regions with protein-encoding genes have been identified, along with 458 validated recurrent CNV regions that did not encompass any protein-encoding genes. There is no significant difference of the validated number of CNV gain and loss regions in both Discovery and Validation data sets (

Comparison of the number patients in each recurrent CNV in Discovery set and Validation set. The number of patients in validated recurrent gain CNV regions with genes (
Gene ontology (Go) analysis via ConcensusPathDB.
Gene set pathway overrepresentation analysis was performed separately for the 67 validated recurrent CNV gain regions (Supplementary Table 1) and 77 validated recurrent CNV loss regions (Supplementary Table 3) via Consensus-PathDB with default settings (http://consensuspathdb.org/). Enrichment map analysis was then performed using the software Cytoscape (http://www.cytoscape.org/). For the recurrent CNV gain regions, an enrichment in the glutathione metabolism pathway involving cytochrome 450 is detected (Fig. 4A), with genes GSTM2 and GSTT1 playing a major role in this pathway. For the recurrent CNV loss regions, an enrichment in the metabolism pathway involving starch and sucrose digestion is detected (Fig. 4B), with genes AMY1A and MGAM playing a major role in this pathway.

Pathway enrichment map generated by Cytoscape. (
Gene ontology analysis was performed to determine the functions of the genes encompassed within the recurrent CNV gain (Table 1A) and loss regions (Table 1B). Overall, majority of the genes in both the recurrent CNV gain and loss regions play a role in sensory perception, especially in G-protein coupled receptor (GPCRs) events and chemical stimulus detection. GPCRs constitute a large family of proteins that sense extracellular stimuli and activate intracellular signal transduction and have been shown to be crucial players in tumor growth and metastasis.
36
This finding came from the observation that in order for tumor cells to survive and proliferate, they often seize control of the normal physiological functions of GPCRs, including evasion of the immune system, increase in their blood supply, invasion of surrounding tissues, and metastasis to other organs.
36
Chemical carcinogenesis is another major pathway found involving the role of chemical stimulus detection.
Conclusion
In this study, we first propose how to build an interval graph based on individual patient-level CNVs. A maximal clique-based graph algorithm is then applied to call recurrent CNVs in breast cancer patients. The algorithm has an optimal solution, which means all maximal cliques can be identified. Additionally, it guarantees that the identified CNV regions are the most frequent and that the minimal regions have been delineated. Our algorithm is based on the sample-specific CNVs called from other published CNV calling algorithms. Therefore, the potential uncertainties or errors in our identified recurrent CNVs will depend on the sample-specific CNVs identified by other algorithms. From application to the METABRIC breast cancer data set, we have identified 252 validated recurrent CN gain regions and 350 validated recurrent CN loss regions and have located the corresponding candidate genes that were encompassed in these regions. It should be noted that the algorithm has also been successfully used to identify and validate recurrent CNV-based genomic signatures in circulating tumor cells from breast cancer,38,39 but the algorithm details have not been described in those applications.
It is becoming progressively clear that genetic studies of complex diseases must heed to the involvement of recurrent CNVs. Therefore, investigation into recurrent CNVs could provide significant contributions to the understanding of the basis of genetic variations in biological functions and disease predisposition. At present, CNV studies with regard to cancer are still in its infancy, but it is an area that is growing rapidly due to denser microarrays and next-generation sequencing technologies. As we lean toward personalized structural genomic analysis and diagnostics, the conventional genomic definition of what is “normal” versus “diseased” will start to blur. There is much to take in from previous studies on genomic disorders and by incorporating the knowledge of the vast amount of CNVs present in our genome. It can give new insights into the role of CNVs in cancer predisposition and development and contribute to a more accurate and complete human genome sequence reference.
Author Contributions
Conceived and designed the experiments: PH. Prepared data: QK. Analyzed the data: CC. Wrote the first draft of the manuscript: CC, RA, PH. Contributed to the writing of the manuscript: CC, RA, PH, QK. Agreed with manuscript results and conclusions: All authors. Made critical revisions and approved final version: CC, RA, PH. All authors reviewed and approved the final manuscript.
Footnotes
Acknowledgment
The authors thank the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) for providing the data sets used in this study.
