Abstract
An algorithm to reduce multi-sample array CGH data from thousands of clones to tens or hundreds of clone regions is introduced. This reduction of the data is performed such that little information is lost, which is possible due to the high dependencies between neighboring clones. The algorithm is explained using a small example. The potential beneficial effects of the algorithm for downstream analysis are illustrated by re-analysis of previously published colorectal cancer data. Using multiple testing corrections suitable for these data, we provide statistical evidence for genomic differences on several clone regions between MSI+ and CIN+ tumors. The algorithm, named CGHregions, is available as an easy-to-use script in R.
Introduction
Array Comparative Genomic Hybridization (array CGH) is an increasingly popular high-resolution technique to discover at which chromosomal locations the DNA is aberrated. We refer to such a location by the array element covering it, a ‘clone,’ which technically could also be a synthetic oligonucleotide.
An aberration may be a ‘loss’ of at least one chromosomal copy or a ‘gain’ of at least one copy. In some cases, one distinguishes one extra state: ‘amplification’ which is a multiple copy number gain. Non-aberrated clones are referred to as ‘normal.’ Among others, cancer is an important application area of this technique. It has been successfully applied to discover genomic disorders in tumors. Also, due to the stability of the DNA molecule (as opposed to mRNA) it has potential to be used in a practical diagnostic context (Pinkel and Albertson, 2005). Integration of these data with mRNA gene expression data is currently actively studied (van Wieringen et al. 2006a), since these two data types measure the activity of genes at different levels: genome and transcriptome.
As with gene expression, array CGH data requires pre-processing and normalisation. The nature of the data is such that sudden state changes, which we refer to as ‘transitions’, occur along the DNA. Moreover, the measurements, which are log2-ratios with respect to a normal ‘two-copy everywhere’ sample, reflect hidden underlying discrete states (loss, normal, gain). Therefore, before starting with downstream analysis, one usually divides the measurements into ‘segments’ with similar log2-ratios and discretizes the log2-ratios back to these three states. These two processes are termed segmentation and calling. Many have proposed methods to automate this process. We refer to Willenbrock and Fridlyand (2005) and Lai et al. (2005) for comparisons of several methods. After calling, the data, which is ordered along the DNA, may be represented as sequences containing –1, 0, 1, which code for loss, normal and gain.
For further downstream analysis (testing, clustering, classification) the starting point would be a large matrix of, say,
The benefits of such a dimension reduction are several. Obviously, decreasing
Methods
Construction of Regions
The input for our method is a
The regions are constructed by restricting the maximum distance between any two clones in a region by threshold
where
After application of (2) the majority of regions usually satisfies (1). However, some regions may not satisfy (2) because of more gradual changes. In order to satisfy (1) we insert a break before or after the clone for which the ‘gradient’ is largest. For this discrete setting, we define the gradient for clone
where
The gradient rule is iteratively applied until condition (1) is satisfied. The entire algorithm is symmetric, which means that the same regions are formed when starting the break insertion from the left or the right.
Robustification of the Algorithm
The algorithm may pick up so-called mono-regions, which are regions consisting of one clone only. While in principle such a region could be real, it is very likely that the clone is wrongly positioned. This happens regularly due to mis-mapping of the genomic sequence to the position of the clone on the DNA. Therefore, such a clone may appear to be an outlier between its ‘neighbors’. We prefer to have an algorithm which is robust against such mis-mapping and have therefore included the (recommended) option to delete such mono-regions.
When mono-regions are deleted, the algorithm needs an extra step to decide whether adjacent regions need to be concatenated. If (2) is satisfied for the two clones adjacent to the deleted monoregion, a break is inserted and the rest of the two adjacent regions remains unchanged. If not, the two adjacent regions are connected and condition (1) is checked. If it is satisfied, we keep the new region, otherwise the gradient rule is iteratively applied to break the region in pieces until (1) is satisfied.
Representative Signature
Once the regions are defined, we have to choose a representative signature, after which the data for a region will collapse into the boundaries of the region and one signature only. We propose the medoid of all signatures in the region as the representative signature, which is that signature with least average distance to the other signatures. Conventional computation of the medoid would require quadratic time, because for each clone the distances to all the other clones need to be computed. We developed a more efficient computation of this medoid, which we discuss later.
Example
We illustrate the algorithm for a toy example with 10 samples and 18 clones for 2 chromosomes. The data are displayed in Table 1, where –1, 0, 1 indicate loss, normal and gain, respectively.
Signatures and distances (d) to previous clone for 18 clones.
We illustrate the algorithm for
Regions created using
Choice of c
For imperfect data, wrong calls of aberrations may cause nonsense and random transitions in each sample. Even for perfect data one may argue that transitions that occur locally for only a small number of individuals may not be relevant. However, if such transitions are of a more global (across genome) character consistently for a small number of individuals, one may want to pick these up. The automatic choice of threshold
where
where
Using (4) we quantify how much information is maintained when summarizing the clones and their signatures by the region boundaries and their representative signatures. Formulating (4) in terms of proportions rather than absolute numbers avoids sample size bias: the number of regions depends on the number of transitions that are persistent over a reasonably large proportion of the samples. If there exists a small subset of samples (say, for example, 10%) with many transitions located elsewhere than for the 90% majority, a too large
In (4),
The reduction effect of the algorithm depends on the platform (e.g. BAC or oligonucleotide), the heterogeneity of the transition locations among the tumor samples and the value of
Example, Continued
Suppose we accept maximally 1% information loss, so we set
Computational Issues
We increased computational efficiency of CGHregions in two ways. Firstly,
Implementation
CGHregions is implemented using the statistical software environment R (R Development Core Team (2006)). The scripts and example data are available from the author's web site: http://www.few.vu.nl/~mavdwiel/CGHregions.html. The scripts contain instructions for running the algorithm, which requires almost no knowledge of R.
Results: Application to Colorectal Cancer Data
We show the positive effect of using regions rather than individual clones in a two group testing context. Douglas et al. (2004) discuss an array CGH data set containing cell line and colorectal tumor samples, respectively. Here, we focus on partly re-analyzing the tumor data. These consist of two groups: 7 microsatellite instable (MSI+) and 30 chromosomal instable (CIN+) samples. Among many other questions, the authors are interested in genomic differences between the two groups and the chromosomal locations of these differences. When comparing the proportions of aberrations between the MSI+ and CIN+ group, Douglas et al. (2004) (see p. 4820) report mainly about entire chromosomal arms (8p, 17p, 18q) or chromosomes (20), although exceptions are made for one individual clone and a small region on 18q. The authors do not mention statistical measures (such as
The data, available at the supplementary website of Douglas et al. (2004), were first filtered (clones with less than 30% missings were kept), segmented using DNAcopy (Olshen et al. 2004) and then segment-wise aberrations were called using a Normal mixture model. From the resulting clone data set, the clone region data set was produced using the above algorithm with

Visualization of 68 regions created by CGHregions for 37 colorectal tumor samples (Douglas et al. 2004). Y-axis: chromosome, X-axis: base pair position. A new region is displayed by a slight jump with respect to the previous region. The number of regions per chromosome ranges from 1 (several) to 9 (chromosome 8), indicating that the resolution of the results adapts to the heterogeneity of the transition locations. Each region is displayed as a bi-colored segment, the lower and upper part of which correspond to the proportions pl and pg of samples with a loss (red) or gain (green), respectively. The color coding is displayed as well: ‘1’:
For the clone data set, the smallest uncorrected
Regions significantly different for MSI+ and CIN+ colorectal cancers.
A positive side-effect of the reduction of the number of items to be tested is the saving in computing time, because this is approximately linear with the number of items. Here, computing time decreased from 72 sec. to 1.2 sec. using a PC with 1.73 Ghz processor and 1 Gb internal memory; for larger group sizes, computing time may reduce from a couple of hours to minutes. Also for oligonucleotide data, which contain many more clones, the ratio (#clones)/(#regions) is even larger and so is the saving in computing time. The major advantage, however, is the far less conservative FDR-control when reducing the clone data to regions, which results in more detection power and thus more statistically confident findings.
For a one-group approach, CGHregions may be of use to find focal regions that contain (many) more aberrations than the neighboring regions. Figure 2 shows the aberration frequency plots for the regions found using

Region-wise frequency plots for 37 colorectal tumor samples (Douglas et al. 2004). Regions were created using settings
The positive effect of using regions rather than individual clones was also observed for other data. In Smeets et al. (2006), we studied chromosomal aberrations in 24 head-and-neck squamous cell carcinomas (HNSCC), 12 of which were positive for human papillomavirus type 16 (HPV) and 12 were HPV-negative. We used a preliminary version of CGHregions to define regions and performed two-group testing for 12 HPV-positive versus 12 HPV-negative HNSCCs. In that study, at least four significant (FDR ≤ 0.1) regions were determined (see Table 2 in Smeets et al. 2006). Repeating the analysis on 4699 individual clones revealed one clone corresponding to unadjusted
Discussion
A new algorithm was introduced to reduce the dimension of array CGH data while assuring the loss of information is marginal. The high resolution of the experimental platform implies that one wants to be able to find small genomic regions of interest. The ability to do so is not lost with our algorithm: small regions are created in genomic neighborhoods with many transitions at different nearby locations, and large regions are formed in genomic neighborhoods with very few transitions for the vast majority of the samples (see Figure 1). It was shown that using the regions resulting from CGHregions instead of separate clones may lead to statistically rigorous findings, even for relatively small sample sizes.
Currently, many calling methods do not automatically detect amplifications. Some guidelines for automatic detection are provided by Fridlyand et al. (2004). When amplifications are present in some profiles one has to consider the discrete level assigned to these amplifica tions. We simply set this to ‘2,’ hence twice the level of a gain. The algorithm itself remains unchanged for such four-class data.
In this paper, we performed hypothesis testing on the clones and the regions using the aberration calls rather than the continuous (segmented) log2-ratios. The reason for this is simple: interpretation. In the first case, one can conclude that the aberration levels differ when rejecting the null hypothesis, which has a clear interpretation. In the second case, since one has not translated the log2-ratios to copy number levels, one can only conclude that the mean log2-ratios differ between the groups, which lacks a clear interpretation.
We illustrated the potential benefit of the reduction algorithm in the context of statistical testing. More types of downstream analysis could benefit from the reduction. Integration with gene expression data may become easier, for example by focusing on (usually relatively) few highly active DNA regions. CGHregions is also advantageous for cluster analysis of the called array CGH data (van Wieringen et al. 2006b). The use of regions, rather than clones, as input of the cluster method introduces a natural and data-driven weighting in the clustering. Long ‘dull’ chromosomal areas with normal DNA copy number and small amplifications are weighted equally. One thus clusters on the relevant features of the data, without letting dull areas dominate the cluster output. We performed simulations which show that the use of regions improves the performance of clustering methods. Moreover, the number of regions, being only a small percentage of the number of clones, simplifies a manual screening for regions with differences in the copy number frequency tables between clusters.
Construction of regions from individual clone data in a multi-sample context has been addressed before by Diskin et al. (2006) and Rouveirol et al. (2006). However, the main objective of these studies is different from ours. Their main aim is to detect (preferably small) chromosomal regions with relatively many aberrations, after which a list of potentially interesting genes is generated by locating the ones in or near those regions. Rouveirol et al. (2006) mention the potential dimension reduction effect of their method for further downstream analysis, but no control of information loss is provided with their method. While we showed that our method can be useful to find focal regions (Figure 2), this was not formalized here; instead, we emphasize dimension reduction with control of information loss.
In summary, we demonstrated that, with almost no information loss, CGHregions reduces array CGH clone data to region data, which leads to much more powerful downstream data analysis of array CGH data.
Footnotes
Acknowledgement
This work was partly supported by the Center for Medical Systems Biology (CMSB) es tablished by the Netherlands Genomics Initiative/Netherlands Organization for Scientific Research (NGI/NWO).
