Abstract
We present a pipeline to perform integrative analysis of mate-pair (MP) and paired-end (PE) genomic DNA sequencing data. Our pipeline detects structural variations (SVs) by taking aligned sequencing read pairs as input and classifying these reads into properly paired and discordantly paired categories based on their orientation and inferred insert sizes. Recurrent SV was identified from the discordant read pairs. Our pipeline takes into account genomic annotation and genome repetitive element information to increase detection specificity. Application of our pipeline to whole-genome MP and PE sequencing data from three multiple myeloma cell lines (KMS11, MM.1S, and RPMI8226) recovered known SVs, such as heterozygous TRAF3 deletion, as well as a novel experimentally validated SPI1 - ZNF287 inter-chromosomal rearrangement in the RPMI8226 cell line.
Introduction
Structural changes to cancer genomes have important effects such as oncogene amplification, tumor suppressor gene disruption, and chimeric fusion gene formation.
1
Structural changes from each of these categories contribute important neoplastic events such as amplification of
Recent array CGH and whole-genome sequencing studies have shown that cancer genomes are often highly rearranged. For example, array CGH copy number profiles often contain tens to hundreds of discrete copy number changes.5–7 Such complexity has been difficult to define using conventional cytogenetics, and many clinical and research laboratories now rely on array CGH as a first-line assay for structural and numerical changes to chromosomes.
However, array CGH only detects copy number changes and no structural information is implicit in this methodology. Nevertheless, cytogeneticists and researchers now face a new challenge: to make clinical sense of a complex array CGH profile. To do this, they must assign each separate copy number imbalance to one of the two categories: pathogenic or benign. Although some copy number changes such as amplification of
When structural information is available in conjunction with copy number data, variants of uncertain significance can often be classified as pathogenic or benign. For example, a 500-kb duplication containing only one gene would likely be classed as uncertain significance so long as the gene had no known role in cancer. If, however, we knew that this 500-kb region had inserted itself into the
The necessary structural information can come from whole-genome paired-end (PE) or mate-pair (MP) sequencing. These next-generation sequencing methodologies provide information about the genes disrupted at chromosome breakpoints. Although many tools are available to detect structural changes and their genetic consequences from whole genome and transcriptome,7–18 all are stand alone tools that are relatively difficult for a non-specialist to integrate into their clinical analysis workflow.
Here, we describe structural variation (SV) finder a fast, lightweight, and easy to use tool that identifies structural rearrangements in cancer genomes and outputs data that can be integrated into downstream analysis or viewed in a genome browser with other type data. We show the utility of this approach using integrated genomic data from three highly rearranged multiple myeloma cell lines.
Results
Whole genome PE and MP sequencing data
From Illumina PE and MP sequencing of three multiple myeloma cell lines (KMS11, MM.1S, and RPMI8226), we obtained around 15x PE and 5x MP sequence-level coverage (Table 1). The MP reads differ from PE reads, by having a larger insert size (approximately 3 kb) and an outward facing (reverse-forward) read pairs orientation due to a circularization procedure used in MP preparation. The average sequencing quality of MP and PE reads are satisfactory (over 30) as shown in Table 1. Therefore, read trimming is not carried out prior to mapping. We reverse-complemented all MP reads and aligned the PE and preprocessed MP reads with the Burrows-Wheeler Aligner (BWA) algorithm. 19 Over 90% of PE and 50% of MP reads were mapped to human reference genome GRCh37 (hg19).
Summary of sequencing data.
SV identification with SVfinder
To detect chromosomal rearrangements, we developed the SVfinder pipeline (Fig. 1). The first step of the algorithm involves classifying mapped read pairs into two groups: concordant and discordant pairs based on the bitwise flag component of the sequence alignment/map (SAM) file. Concordant pairs are defined as read pairs that mapped to the reference genome with the expected orientation and insert size. For PE reads, the SAM file bitwise flag 0x2 indicates that the reads are mapped properly, meaning that the reads are correctly oriented with respect to one another, ie, that one of the MPs maps to the forward strand and the other maps to the reverse strand and both the ends were mapped within a reasonable distance given the expected distance (and standard deviation) that the aligner inferred. In this study, BWA is used for mapping, but the SV detection pipeline can accept SAM format output files from other aligners as well, for example Novoalign (http://www.novocraft.com).

An overview of the SVfinder pipeline. SVfinder takes a SAM file as input and divides the mapped reads into concordant and discordant group. The normal insert size range is estimated from reads in concordant group. The discordant reads are clustered and classified into different types of variation subgroup according to their insert size and read pairs orientation. Identified SVs are annotated and output in BED format, allowing for easy downstream analysis or viewing in a genome browser.
Next, SVfinder estimates the normal insert size range by calculating the mean and standard deviation of insert sizes of concordant reads. In parallel, the discordant pairs are grouped into interconnected clusters, which are hypothesized to originate from the same SV. The clustering procedure is performed by extending each seed pair 1 kb from each end and merging any read pair with overlapping of the extended genomic interval. In the third step, six types of SV are determined, including deletion, insertion, inversion, tandem duplication, inter-chromosomal translocations, and complex events, based on the insert size and read-pair orientation as shown in Figure 1. A novel feature of the SVfinder pipeline is to add the genomic annotation information in the identified SV list at the final step. In this way, possible gene fusions caused by genomic structural variation (SV) are predicted if both sides of the SV breakpoint reside in coding regions. Moreover, SVfinder provides BED format output for identified SVs, which can be easily imported into a genome browser for visualization, such as Integrative Genome Viewer (IGV). 20
Application of SVfinder to whole-genome sequencing data in multiple myeloma
We applied SVfinder to whole-genome PE and MP sequencing data for the three multiple myeloma cell lines. These cell lines have several well-characterized mutations and SVs, allowing us to validate SVfinder. In our identified SVs list, three known variations were recovered from PE sequencing data (Table 2): the t(4;14) translocation that fuses the IGH locus with MMSET identified in KMS11 cells, the t(14;16) translocation involving the IgH and MAF genes, 21 and CDKN2A deletion 22 found in MM.1S. Additionally, the known TRAF3 deletion 23 was found in both MP and PE data in RPMI8226.
Known and novel SVs associated with genes identified by SVfinder.
We next searched for high-confidence, novel SVs – those that were predicted in both MP and PE data. Second, we filtered out possible artifactual SV predictions that resided within repetitive regions. We identified four novel SV candidates, a SPI1-ZNF287 fusion caused by an inter-chromosomal rearrangement (Fig. 2A), and deletions of KDM6A, KCTD8, and ABHD17B (Table 2). All these variations were identified in RPMI8226. We selected the SPI1-ZNF287 rearrangement for further experimental validation. PCR primers designed to span the rearrangement were used to amplify genomic DNA from RPMI8226 cells and the product subjected to Sanger sequencing. The results confirmed that the exon 4 of SPI1 is rearranged with ZNF287 intron region (Fig. 2B).

Discovery of novel SPI1 and ZNF287 t(11;17) inter-chromosomal rearrangement in RPMI8226 multiple myeloma cell line. (A) Mapped discordant read pairs in mate pair (upper panel) and paired end (lower panel) sequencing shown in integrative genome viewer (IGV). (B) Identification of t(11;17) translocation and breakpoint using Sanger sequencing.
Conclusions
We present a tool – SVfinder to detect SV from whole-genome PE or MP sequencing data. Compared with other existing SV detection methods, SVfinder is easy to use and integrates the genomic annotation and repetitive region information to filter false positives as well as predicting gene fusions. By applying our method to multiple myeloma whole-genome sequencing data, we were able to recover known recurrent translocations and deletions in multiple myeloma as well as identify several novel SVs. SVs predicted by SVfinder could be experimentally validated. Our pipeline outputs an SV list in BED format, which provides a convenient gateway for downstream analysis by integrating with other software workflows and allowing direct visualization of the results in any genome browser.
Methods
Sample preparation and data processing
MP and PE libraries were constructed according to standard protocols using Mate Pair Library Prep Kit v2 Genomic DNA sample prep kits. Sequencing for each sample and library was over a single lane of an Illumina HiSeq 2000 instrument. Raw data in Fastq format were aligned to the hg19 human reference genome using the BWA version 0.5.8a. The raw sequencing data have been achieved at the Sequence Read Archive (SRA) with accession number SRP039529.
SV annotation, filtering, and validation
Each SV is annotated based on the positions of breakpoints and their overlap with gene regions. Gene annotation files were downloaded from UCSC genome browser with the track “UCSC genes” from human reference genome GRCh37 (hg19). The genomic zone of SVs included 5’ distal, 5’ proximal, 5’ UTR, coding, 3’ UTR, 3’ proximal, 3’ distal, intergenic. Identified SVs were excluded if they reside in repetitive regions. Genomic repeat information was obtained from UCSC genome browser RepeatMasker track. Of the novel SV candidates, the SPI1 and ZNF297 rearrangement was PCR amplified from genomic DNA isolated from RPMI8226 and the product was subject to Sanger sequencing. The primers used for PCR amplification and sequencing were 5'-CTCGCCCTCCTC-CTCATCTGAGCT (SPI1) and 3'-AAGGCCATGCAT-TCTGTCAT (ZNF287).
Software availability
SVfinder is written in Python, and the source code and manual are available from: https://github.com/cauyrd/SVfinder
Author Contributions
Conceived and designed the experiments: RY, MR, LHB, SL, JK, ZSQ Analyzed the data: RY, LC, SN, KG, GD. Wrote the first draft of the manuscript: RY, SN. Contributed to the writing of the manuscript: PMV, LHB, SL. Agree with manuscript results and conclusions: CSM, LBM. Jointly developed the structure and arguments for the paper: RY, SN. Made critical revisions and approved final version: RY, SN, LBM, SL, LHB, MY, JK, ZSQ. All authors reviewed and approved the final manuscript.
