Abstract
The Minimum Error Correction (MEC) is an important model for haplotype reconstruction from SNP fragments. However, this model is effective only when the error rate of SNP fragments is low. In this paper, we propose a new computational model called Minimum Conflict Individual Haplotyping (MCIH) as an extension to MEC. In contrast to the conventional approaches, the new model employs SNP fragment information and also related genotype information, thereby a high accurate inference can be expected. We first prove the MCIH problem to be NP-hard. To evaluate the practicality of the new model we design an exact algorithm (a dynamic programming procedure) to implement MCIH on a special data structure. The numerical experience indicates that it is fairly effective to use MCIH at the cost of related genotype information, especially in the case of SNP fragments with a high error rate. Moreover, we present a feed-forward neural network algorithm to solve MCIH for general data structure and large size instances. Numerical results on real biological data and simulation data show that the algorithm works well and MCIH is a potential alternative in individual haplotyping.
Keywords
Introduction
The availability of complete genome sequence for human beings [Vent et al. 2001] makes it possible to investigate genetic differences and to associate genetic variations with complex diseases [Hoehe et al. 2000]. Single nucleotide polymorphism (SNP)—a single DNA base varying from one individual to another, is believed to be the most frequent form to address genetic differences [Chakravarti, 1998; Li et al. 2005]. SNPs are found approximately every 1000 base pairs in the human genome and turn to be promising tools for doing disease association study. Many research works have been carried out for determining SNP sites or designing a detailed SNP map for human genome [Altshuler et al. 2000; Helmuth, 2001].
The nucleotides in a SNP position are called alleles. Almost all SNPs have two different alleles which we denote the wild type as 1 and the mutant type as −1. The SNP sequence information on each copy of a pair of chromosomes in a diploid genome is called a haplotype which is a string over {–1, 1}. A genotype is the conflated information of a pair of haplotypes on homologous chromosomes. For a genotype, if a pair of alleles at a SNP site is made of two identical values, this SNP site is called homozygous, otherwise it is called heterozygous.
Haplotypes generally have more information content than that individual SNPs have in disease as sociation studies [Stephens et al. 2001], but it is substantially more difficult to determine haplotypes than to determine genotypes or individual SNPs through experiments. Hence, computational methods that can reduce the cost of determining haplotypes become attractive alternatives. There are generally two classes of computational methods for determining haplotype. One class concerns with infering haplotypes from the genotype samples in a population. There are several models based on different assumptions on the biological system under consideration [Clark, 1990; Gusfield, 2002; Halperin and Eskin, 2004; Li et al. 2005; Wang and Xu, 2003]. The second class, called single individual haplotyping or haplotype assembly, is based on the data and methodology of shotgun sequence assembly [Lancia et al. 2001; Lippert et al. 2002]. The input data consists of aligned short genome fragments with SNPs coming from DNA shotgun sequencing or data generated by a resequencing effort for the purpose of large-scale haplotyping. When we focus on SNP positions, these short genome fragments are actually the aligned SNP fragments.
Computationally, the individual haplotyping problem is to determine the “best” pair of haplotypes from data (SNP fragments) which is possibly inconsistent and contradictory. It focuses on partitioning SNP fragments into two sets according to SNP states, with each set determining a haplotype [Lancia et al. 2001; Lippert et al. 2002]. For such a problem, there are several models based on different error assumptions [Lancia et al. 2001; Li, L.M. et al. 2004; Lippert et al. 2002; Rizzi et al. 2002; Wang et al. 2005], of which the Minimum Error Correction model (MEC) is widely adopted. MEC assumes that the inconsistence of the data comes from realistic sequence errors and these errors can be corrected. However, MEC is only effective in the case that SNP fragments have a low error rate. When the error rate of SNP fragments is relatively high, we can not reconstruct original haplotypes with a high accuracy by error correction only [Wang et al. 2005]. To improve the haplotyping quality, we need to either reduce the errors in SNP fragments which will call for improvement of the shortgun experiment, or add extra information to the given SNP fragment set. Since genotype data can be much more easily (and also economically) obtained, a computational model combining both SNP fragments and genotype information will be a realistic strategy. The idea is motivated by the method in [Eskin et al. 2004]. In their paper they phased long genotypes using local haplotype information. In this paper, we propose a new computational model (Minimum Conflict Individual Haplotyping: MCIH) for individual haplotyping by using this strategy, which is also proved to be NP-hard.
There are two ways to show that a new established model is practical or more effective than an existing model: one is to theoretically prove that the solution of the new model is superior to that of the existing model, the other is to numerically solve the problem and compare the solutions obtained by the two models. It is obvious that we can only do it in the second way. For this reason we try to design an exact algorithm for the MCIH problem. A special data structure suited for a dynamic programming algorithm is displayed and used to evaluate the model. Computational results by this exact algorithm confirm the significance of MCIH. Moreover, a feed-forward neural network (FNN) is designed for computing approximate solutions to the problem (MCIH) in general case and of large size. Extensive computational results show the effectiveness of the proposed FNN algorithm and MCIH.
The paper is organized as follows: In Section I, we give the problem definition together with the complexity analysis of MCIH. The model evaluation is shown in Section II. In Section III, a feed-forward neural network algorithm is designed for implementing MCIH. Experiment results are given in Section IV. Section V concludes the paper.
I. Formulation and Problem
For a genotype
Suppose that there are m SNP fragments from a pair of chromosomes and the length of each corresponding haplotype is n. Define an m × n SNP matrix
Let x, y ∊ {-1, 1, 0} and define
The MEC problem [Lippert et al. 2002; Wang et al. 2005] is defined as: Given a set of SNP fragments, correct a minimum number of SNP states (–1 into 1 and vice versa), such that the modified SNP fragments can be divided into two disjoint sets of pairwise compatible fragments, and each set determines a haplotype. In order to add the information of genotype into MEC, we propose the following combinatorial optimization model for individual haplotyping:
MCIH (Minimum Conflict Individual Haplotyping): Given a set of SNP fragments (a SNP matrix
MCIH is in fact to correct a minimum number of SNP states under the guidance of the genotype information so that the modified SNP fragments can be divided into two disjoint sets of pairwise compatible fragments which determine a pair of haplotypes compatible with the genotype. The computational complexity of the MCIH problem (similar spirit with [Eskin et al. 2004]) is discussed in Appendix A where we prove it to be NP-hard. It indicates that the MCIH problem may have no efficient algorithm for exact solutions.
II. Model Evaluation
The purpose of presenting the new model is to get a high-quality solution of the haplotype assembly problem. To show that MCIH is a potential alternative, we evaluate the model by studying its exact solutions. A special data structure with Markov property suited for a dynamic programming algorithm is considered as follows.
Let Ii and ri be the beginning and ending positions of the ith SNP fragments fi on the SNP matrix respectively, 1 ≤ i ≤ m. For any two fragments fi and fj, we assume
To solve MCIH with this special structure, we give a dynamic programming (DP) algorithm. The input data of the DP algorithm are a genotype
Suppose that
Let j = 1. N(1,
Define N(j,
When all N(j,
For all j, f(j,
Assume that L is the maximum length of SNP fragments. It can be shown that the dynamic programming algorithm solving the special MCIH problem has the complexity O(22Lm). Hence, the algorithm is exponential with the maximum length of SNP fragments. However, when the maximum length of SNP fragments is fixed, this algorithm is linear to m, i.e., the number of fragments. In real applications, the value of L is generally between 3 and 8.
As expected, numerical results (see Appendix B for details) of DP on the special data structure display that MCIH at the cost of genotype information improves the reconstruction rate greatly at various error rates of SNP fragments. An extreme case is considered where every SNP site is heterozygous. It means that the model has no homozygous site information available from the given genotype. The numerical results show that in this case MCIH still has a higher reconstruction rate than MEC. This indicates that MCIH is not trivial, i.e., a higher reconstruction rate does not only depend on homozygous site information. Heterozygous site information has much contribution to the accuracy of the reconstructed haplotypes.
As discussed in Introduction, the MCIH problem is actually a classification problem. That is, given a set of SNP fragments, we want to classify it into two fragment subsets such that each subset determines a haplotype to solve the problem with minimum conflicts. It is well known that feedforward neural network (FNN) is a powerful tool for classification. In general, an FNN maps a set of objects into classes C1,···,CS characterized by the attributes of the objects through repeatedly learning the objects and adjusting its parameters (neuron connection weights). In our problem, a set of SNP fragments are to be divided into two classes such that each class can be assembled as one of the haplotypes which will solve the problem with as few conflicts as possible.
The proposed FNN (see Figure 1) consists of three layers. The m input neurons represent m SNP fragments. Each input neuron accepts an n-dimensional vector on {1, −1, 0}. Two hidden neurons represent two subsets of fragments corresponding to one pair of haplotypes. The input dimension of the hidden neuron is also n and the outputs of the two neurons are a pair of tentative haplotypes. There is only one neuron in the last layer, which simply conflates the tentative haplotypes and outputs a tentative genotype. Comparing this tentative genotype with the given genotype will provide us the information to adjust the neuron connection weights by the popular back-propagation algorithm (see Appendix B and related literatures for feed-forward neural networks [Rumelhart et al. 1986; Zhang, 2000]). The main characteristic of the designed neural network is trying to achieve two objectives, i.e., compatibility and minimum number of conflicts simultaneously.
A three layer forward neural network.
Forwarding process of FNN
The forwarding process of FNN can be stated as follows.
1) The inputs and outputs of the first layer are m rows of the SNP matrix,
2) There are 2m parameters, i.e., the weights from the first layer to the second layer,
3) The inputs to the second layer are
5) The inputs to the third layer are
Learning rules
The MCIH problem is to find a pair of haplotypes (
Objective-1: minimize
Objective-2: satisfy
1)
The distance between
where d is defined by the formula (1) and
Classify all SNP fragments into two disjoint sets according to their distances to
For the neuron corresponding to
2)
The objective that the third layer adjusts to is to minimize the difference between the tentative genotype
A Back-Propagation-like procedure
The main steps of the algorithm is as follows:
Set parameter values L1, L2, ρ, λ and ε. Randomly set the initial values of the weight matrix
Input a SNP matrix. Do computation of (3), (5) and (6) to get a pair of tentative haplotyes (
Substep 2.1: Obtain a fragment partition (F1, F2) using (
Substep 2.2: Calculate the gradients of the error functions
Substep 2.3: Update the current weight matrix
where
ρ is the step length along the negative gradient. L1 and L2 are parameters to be selected.
Substep 2.4: If ‖
Record a pair of haplotypes
The formulae to compute the derivatives in the algorithm are given in Appendix B. In contrast to DP algorithm that gives an exact solution, the algorithm based on the neural network can not ensure an exact optimal solution but is proved very efficient for large-scale problems by the numerical results in next section.
IV. Simulation and Results
In this section, we will use both real data and simulation data to test MCIH and the FNN algorithm for haplotype reconstruction. The computation is implemented on a 2.26G Hz Pentium 4 processor PC using Microsoft Visual C++ compiler 6.
In our experiments, we use reconstruction rate (RR), the similarity degree between the original haplotypes and the reconstructed haplotypes, to measure performance of an algorithm or a model. Assume that
where
For a partition P = (F1, F2) and a pair of haplotypes (
In this paper, we set step length ρ = 0.05, ε = 0.01 and λ,= 0.1 (in fact, the algorithm is robust with these parameters) in the algorithm. In (15) and (16), we set L1 = 0.2 and L2 = 0.8.
Experiment on angiotensin-converting enzyme (ACE)
Angiotensin-converting enzyme catalyses the conversion of angiotensin I to the physiologically active peptide angiotensin II, which controls fluid-electrolyte balance and systemic blood pressure. Because it has a key function in the renin-angiotensin system, many association studies have been performed with DCP1 (encode angiotensin-converting enzyme). Literature [Frieder et al. 1999] completed the genomic sequencing of the DCP1 gene from 11 individuals and reported 78 SNP sites in 22 chromosomes. Out of the 78 varying sites, 52 are non-unique polymorphic sites.
Among these 11 individuals, there are two identical genotypes. We omit one of them. In addition, we omit the genotypes with no more than one heterozygous site whose haplotypes can be infered immediately. Now each of the 8 pairs of haplotypes is used to generate 15 instances in which SNP fragments are randomly generated according to different parameter settings: the number of SNP fragments m = 20; other parameters, the hole rate of fragments hr: 0.25, 0.5, 0.75; the error rate of fragments e: 0.05, 0.1, 0.15, 0.2 and 0.25.
The results of MCIH and MEC (solved by an exact algorithm in [Wang et al. 2005]) averaged on eight individuals is illustrated in Figure 2. All the results are obtained by running the algorithms only once. When the error rate of SNP fragments is low, the neural network is robust and efficient. When the error rate of SNP fragment is high, the network may get into a plight of local minima. The genotype compatibility rate returned by the algorithm for most instances is 100% or at least larger than 98%. Only several instances with error rate 0.25 and hole rate 0.75 have genotype compatibility rate between 94% and 96%. In addition, the neural network algorithm solves each of these instances in several seconds.
The results of MCIH and MEC on ACE. From left to right, hr = 0.25, hr = 0.5, hr = 0.75.
Figure 2 shows that MCIH has a much higher reconstruction rate at various parameter settings, which indicates that with additive genotype information MCIH is effective. Even if an approximate algorithm is employed, it is more effective for haplotype reconstruction than MEC.
Experiment on data from chromosome 5q31
Now we performed simulations using the data from public Daly set [Daly et al. 2001]. They reported a high-resolution analysis of a haplotype structure across 500kb on chromosome 5q31 using 103 SNPs in a European derived population which consists of 129 trios. The haplotypes of 129 children from the trios can be infered from the genotypes of their parents through pedigree information and the nontransmitted chromosomes as an extra 129 (pseudo) haplotype pairs. Markers for which both alleles could not be inferred are marked as missing. Among the resulting set of 258 haplotype pairs, the ones with more than 20% missing alleles are removed, which leaves us 147 haplotype pairs. Among these pairs, 18 genotypes with no more than one heterozygous site are omitted, then 129 pairs of haplotypes are left as the test set.
Each of the 129 pairs of haplotypes is used to generate 15 instances in which SNP fragments are randomly generated according to different parameter settings: m = 30; other parameters, hr: 0.25, 0.5, 0.75; e: 0.05, 0.1, 0.15, 0.2 and 0.25. The results of MCIH and MEC averaged on 129 pairs of haplotypes are illustrated in Figure 3.
The results of MCIH and MEC on Daly set. From left to right, hr = 0.25, hr = 0.5, hr = 0.75.
The picture again shows that MCIH is quite effective and the designed back-propagation-like algorithm has a very good performance. The genotype compatibility rate returned by the algorithm for most instances is 100% or greater than 98%. Only a few instances with error rate 0.25 and hole rate 0.75 have genotype compatibility rate between 92% and 96%. In addition, the neural network algorithm solves instances with a low error rate and hole rate in several seconds. For a few instances with a high error rate and hole rate, however, it takes the algorithm several minutes to stop.
Experiments on Hudson's simulation data
We use a well-known program ms [Hudson, 2002] that uses coalescent theory to generate a simulated population of haplotypes. ms has a parameter r as the recombination rate of population haplotypes. Firstly, let r = 0, then 20 haplotypes with 80 SNP sites are generated using ms and form a haplotype set. Then we randomly combine two halotypes in the haplotype set into a pair of individual haplotypes. 15 pairs of haplotypes are obtained by this way. Each of the 15 pairs of haplotypes is used to generate 15 instances according to the parameters as those in the last subsection. All of these instances form a dataset. The results of the two models averaged on 15 pairs of haplotypes are shown in Figure 4. To further evaluate MCIH, let r = 100 and other parameters be the same. The results of two models on this dataset are summarized in Figure 5. The genotype compatibility rate for almost all instances is 100% or larger than 98%. From Figure 4 and Figure 5 we can see that MCIH is effective on haplotypes not only without recombination but also with a high recombination rate.
The results of MCIH and MEC on Hudson's data with r = 0. From left to right, hr = 0.25, hr = 0.5, hr = 0.75. The results of MCIH and MEC on Hudson's data with r = 100. From left to right, hr = 0.25, hr = 0.5, hr = 0.75.

V. Conclusion
Individual haplotyping is an important problem in computational biology. In this paper, we proposed a new computational model for individual haplotyping—MCIH as an improvement of MEC and as an alternative way for biologists to solve the haplotyping problem more efficiently. The model is proved to be NP-hard. To evaluate the new model, we displayed a special SNP matrix structure for which a dynamic programming algorithm can be used to solve MCIH exactly. Comparing the exact solutions of MCIH and MEC, we argue that the proposed MCIH is worth further studying. Due to the computational intractability of the MCIH problem, a feed-forward neural network is designed and a back-propagation-like procedure is formed as an efficient approximate algorithm. Computational results on multiple data sets show that the designed algorithm performs well and MCIH at the cost of additive genotype information has a higher accuracy of haplotype reconstruction than MEC, especially in the case of SNP fragments with a high error rate. Since genotype information can be obtained much easily and economically, the new model is practical as a supporting tool for reconstruction of haplotypes.
Footnotes
Appendix
Acknowledgement
This work is supported by the National Natural Science Foundation of China under grant No. 10471141 and No. 60503004, and the Important Research Direction Project of CAS “Some Important Problems in Bioinformatics,” CAS, China.
The first author's work is partially supported by the Informatics Research Center for Development of Knowledge Society Infrastructure, Graduate School of Informatics, Kyoto University, Japan. The second author is also supported by the Research Fund of School of Information, Renmin University of China.
