Abstract
Abstract
Haplotype assembly is to directly construct the haplotypes of an individual from sequence fragments (reads) of the individual. Although a number of programs have been designed for computing optimal or heuristic solutions to the haplotype assembly problem, computing an optimal solution may take days or even months while computing a heuristic solution usually requires a trade-off between speed and accuracy. This article refines a previously known integer linear programming-based (ILP-based) approach to the haplotype assembly problem in twofolds. First, the read-matrices of some datasets (such as NA12878) come with a quality for each base in the reads. We here propose to utilize the qualities in the ILP-based approach. Secondly, we propose to use the ILP-based approach to improve the output of any heuristic program for the problem. Experiments with both real and simulated datasets show that the qualities of read-matrices help us find more accurate solutions without significant loss of speed. Moreover, our experimental results show that the proposed hybrid approach improves the output of ReFHap (the current leading heuristic) significantly (say, by almost 25% of the QAN50 score) without significant loss of speed, and can even find optimal solutions in much shorter time than the original ILP-based approach. Our program is available upon request to the authors.
Get full access to this article
View all access options for this article.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
