Abstract
Background
Next-generation sequencing (NGS) methods pose computational challenges of handling large volumes of data. Although cloud computing offers a potential solution to these challenges, transferring a large data set across the internet is the biggest obstacle, which may be overcome by efficient encoding methods. When encoding is used to facilitate data transfer to the cloud, the time factor is equally as important as the encoding efficiency. Moreover, to take advantage of parallel processing in cloud computing, a parallel technique to decode and split compressed data in the cloud is essential. Hence in this review, we present SOLiDzipper, a new encoding method for NGS data.
Methods
The basic strategy of SOLiDzipper is to divide and encode. NGS data files contain both the sequence and non-sequence information whose encoding efficiencies are different. In SOLiDzipper, encoded data are stored in binary data block that does not contain the characteristic information of a specific sequence platform, which means that data can be decoded according to a desired platform even in cases of Illumina, Solexa or Roche 454 data.
Results
The main calculation time using Crossbow was 173 minutes when 40 EC2 nodes were involved. In that case, an analysis preparation time of 464 minutes is required to encode data in the latest DNA compression method like G-SQZ and transmit it on a 183 Mbit/s bandwidth. However, it takes 194 minutes to encode and transmit data with SOLiDzipper under the same bandwidth conditions. These results indicate that the entire processing time can be reduced according to the encoding methods used, under the same network bandwidth conditions. Considering the limited network bandwidth, high-speed, high-efficiency encoding methods such as SOLiDzipper can make a significant contribution to higher productivity in labs seeking to take advantage of the cloud as an alternative to local computing.
Availability
http://szipper.dinfree.com. Academic/non-profit: Binary available for direct download at no cost. For-profit: Submit request for for-profit license from the web-site.
Introduction
Next-generation sequencing (NGS) methods, which are revolutionizing genomics research by reducing sequencing cost and increasing its efficiency, 1 pose various computational challenges of handling large volumes of short read data. For example, human genome re-sequencing at ∼30X sequencing depth requires a level of computational power achievable only via large-scale parallelization. 2
One potential solution to these computational challenges is the use of cloud computing.
Langmead and colleagues 3 genotyped data comprising 38-fold coverage of the human genome in ∼4 h on the Amazon cloud (Amazon EC2) using the Crossbow genotyping program. In a recent study, Parul Kudtarkar and colleagues 4 computed orthologous relationships for 245,323 genome-to-genome comparisons on the Amazon cloud using the genomic tool, Roundup, at a lesser cost. Applied Biosystems provides a cloud computing service as an alternative to maintaining an in-house computing infrastructure for NGS data analysis (ie, SAMtools 5 ) to the SOLiD system users (ABI SOLiD system). Despite the promises and potential of cloud computing, the biggest obstacle to moving to the cloud may be network bandwidth, since it may take at least a week to transfer a 100 gigabyte NGS data file across the internet in a typical research environment. 6
The more dramatically the advantages of NGS data sequencing or analysis in a cloud environment are revealed, the more apparent will the limitations of access to a cloud environment become. For example, we are currently working on the Amazon cloud wherein we can control the number of nodes needed for the analysis at our chosen time and predict the cost required for the analysis operation. However, we cannot say that our chosen transmission time will guarantee optimal bandwidth when transmitting large volumes of NGS data to a cloud environment. In addition, the usable bandwidth in a lab is a limited resource. Thus, we can decrease the rate of offsetting time benefit, one of the many benefits during an experiment on a cloud, by applying a proper encoding method and using a transmission bandwidth efficiently to transmit NGS data. Furthermore, the important points that we should take into consideration are the encoding/decoding time and the possibility of parallel/selective decompression as well as the compression rates when adopting an encoding method aimed at cloud transmission unlike the traditional DNA compression method aimed at efficient storage.
Efficient encoding methods may enable to overcome the problems of transferring such a large dataset. Recently, Tembe and colleagues 7 showed that NGS data can be reduced by 70%—80% in size by using their algorithm. However, when a large dataset is encoded for transferring, the time required for encoding and decoding is equally as important as the encoding efficiency. Accordingly, an ideal compression algorithm that can be used in combination with cloud computing for sequence data analysis needs to have the following features: 1) high encoding/decoding rate; 2) high encoding efficiency; 3) a parallel technique to decode and split compressed data in the cloud.
Here, we present SOLiDzipper, a new encoding method, by which we can encode NGS data with high speed and high efficiency. SOLiDZipper is best optimized to encode csfasta and QV files from ABI SOLiD system.
Methods
The basic strategy of SOLiDzipper is to divide and encode. NGS data files contain both the sequence and non-sequence information whose encoding efficiencies are different. In SOLiDzipper, the non-sequence information including the sequence IDs and number in plain text format is encoded by a general purpose compression algorithm (ie, gzip, bzip2, lzma(LZMA SDK)), whereas the sequence information consisting of ‘0123’ in csfasta format, which has random patterns and thus a low encoding efficiency, is encoded by bitwise and shift operations. Figures 2 and 3 summarize the encoding process and the method of SOLiDzipper, respectively.
Decoding of SOLiDzipper is basically a reverse of encoding, except for non-calls. In SOLiDzipper, non-calls (‘.’ in csfasta files) are converted into temporary binary data when encoded. When decoded, QV values are used in order to recover temporary binary data to previous non-calls.
Unlike other general encoding methods, SOLiDzipper does not use compression dictionary scheme or statistical pattern matching (ie, palindromes, string comparisons, repeat detection, data permutation),7–11 thereby minimizing computing resource requirements and dictionary exploring time. For example, G-SQZ 7 utilizes Huffman coding 12 method, which generates a Huffman tree in the process of highly efficient encoding. And the DNA Compress 10 program shows fast and effective encoding using detection of repeats.
Such a dictionary exploring time or statistical pattern matching time used in the computational method can require a significant amount of time for the encoding process that should not be ignored when processing huge volumes of NGS data. Thus, it will be more effective to complete encoding as fast as possible even by lowering the encoding rate a little when transferring data to cloud computing for high-performance sequence analysis. SOLiDzipper performs high-speed, high-efficiency encoding on the bitwise level by taking advantage of the characteristic features of NGS data.
In SOLiDzipper, encoded data are stored in binary data blocks that do not contain the characteristic information of a specific sequence platform, which means that data can be decoded according to a desired platform even in cases of Illumina, Solexa or Roche 454 data.
Implementation
SOLiDzipper is implemented in Java 1.6 command-line mode at 64 bit Linux machine (Linux: 2.6.29.4–167.fc11.x86_64 Fedora 11 64 bit, Intel(R) Core (TM) 2 Duo CPU E8400 3.00GHz, 4 GByte memory).
Table 1 shows the comparison experiments for zipping using high speed zipping option (—fast) of general purpose compression tool gzip (version 1.3.12), highest zipping efficiency option (mx = 9) of LZMA (version 4.65) (LZMA SDK) and G-SQZ (version 0.6). 7 133 gigabytes of mate-paired data from the ABI SOLiD 3.5 system were used as the test data set.
Comparison of encoding efficiencies and time between the different encoding methods (decoding unit count = 1).
Results and Discussion
Encoding efficiency is usually regarded as the most important criterion to determine the performance of encoding algorithms, especially when it is used to reduce the long-term storage cost. However, when encoding is used in combination with cloud computing, NGS data need to be encoded in the local servers and then decoded in the cloud as quickly as possible (ie, in this case, encoding is used to facilitate transfer, not for long-term storage).
When cloud computing is used for NGS data analysis, ready to job time (Rt) represents the sum of time required for compression in the local servers, transferring the compressed data to the cloud and decompression in the cloud. Rt increases in proportion with the increase in time required for compression and decompression, thereby offsetting the advantages of using cloud computing for higher efficiency.
The Ready to job time (Rt) of cloud computing for NGS data analysis can be calculated using equation below (1).
When time factor is considered, the advantages of using encoding methods are offset when the data transfer speed exceeds a certain threshold (Fig. 1). However, within the current limitations of data transfer, SOLiDzipper is more time-efficient than both gzip (low compression rate and high operation speed) and G-SQZ (high compression rate and low operation speed).

Rt changes according to the data transfer speed and encoding methods.

Encoding process of SOLiDzipper.

Encoding methods of SOLiDzipper.
In addition, SOLiDzipper does not use compression scheme, generating data blocks of the same length. Since there is no link between the compressed data blocks, encoded data can be distributed for parallel decoding, thereby drastically enhancing the decoding rate in the cloud.
The contributory point of SOLiDzipper to bio-informatics was to address the issue of high-speed transmission infrastructure that could not expand easily at a lower cost, which can be achieved with a DNA analysis environment in a cloud environment like Amazon EC2.
The objective of SOLiDzipper is to minimize the percentage of preparation time in the entire DNA analysis process time so that the analysis environment can smoothly move to the cloud, rather than to merely increase the encoding speed or compression efficiency.
We divided the entire processing time into two parts; the first part is the preparation time for analysis and represents the time required to compress DNA data produced on a sequence platform, transmit them over a network, and decode them on a cloud; and the other is the main computation time that represents the time required to carry out computation analysis on a cloud in parallel.
Table 2 presents the calculations of the required time until the final analysis results considering the communication bandwidth and data compression method based on the whole-genome computation time of Crossbow.
Comparison of the total processing time in the cloud-based NGS dataset computation.
According to the main computation time of Crossbow, 10 workers took less than 7 hours to compute the whole genome and 40 workers achieved the same within 3 hours in the Amazon cloud environment. In addition, it took more than an hour to transmit (183 Megabit/second transfer speed) the compressed data set (103 GigaByte) to Amazon s3. In a situation where the transmission bandwidth is limited the data compression time should be considered, which raises an important issue since more time is required to prepare the analysis than the actual analysis time in a cloud environment. For example, a data set of about 300 GB compressed using G-SQZ with high compression rate, is transmitted through 45 Megabits bandwidth, and is decompressed in parallel at 40 nodes to conduct the Crossbow analysis. In such a case, it takes roughly three times to prepare the operation than the actual computation time in the cloud. Thus, compression time should be considered important in addition to compression efficiency when considering transmission to a cloud environment.
These findings indicate that the entire processing time can be reduced according to the encoding methods used, if the same communication bandwidth is adopted. Considering the limited network bandwidth, high-speed, high-efficiency encoding methods such as SOLiDzipper can make a significant contribution to higher productivity in labs seeking to take advantage of the cloud as an alternative to the local computing cluster.
Conclusions
The unique features of SOLiDzipper are: 1) It divides information in csfasta files for high encoding efficiency and speed; 2) It combines two different compression methods (ie, bitwise/shift and general purpose compression), allowing for the optimal preparation time (Rt) for Cloud Computing; 3) Data can be decoded selectively without unzipping the whole encoded file because encoded data are stored as data blocks of the fixed size; 4) In the cloud, encoded data can be distributed for parallel decoding; 5) It requires minimal computing resources.
In summary, SOLiDzipper is a fast encoding method that can efficiently encode and decode NGS data. This method can be especially more suited to typical research environments where the data transfer speed across the internet is limited.
Disclosure
This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.
