Minimum Conflict Individual Haplotyping from SNP Fragments and Related Genotype

Abstract

The Minimum Error Correction (MEC) is an important model for haplotype reconstruction from SNP fragments. However, this model is effective only when the error rate of SNP fragments is low. In this paper, we propose a new computational model called Minimum Conflict Individual Haplotyping (MCIH) as an extension to MEC. In contrast to the conventional approaches, the new model employs SNP fragment information and also related genotype information, thereby a high accurate inference can be expected. We first prove the MCIH problem to be NP-hard. To evaluate the practicality of the new model we design an exact algorithm (a dynamic programming procedure) to implement MCIH on a special data structure. The numerical experience indicates that it is fairly effective to use MCIH at the cost of related genotype information, especially in the case of SNP fragments with a high error rate. Moreover, we present a feed-forward neural network algorithm to solve MCIH for general data structure and large size instances. Numerical results on real biological data and simulation data show that the algorithm works well and MCIH is a potential alternative in individual haplotyping.

Keywords

individual haplotyping minimum conflict individual haplotyping NP-hard dynamic programming feedforward neural network reconstruction rate

Introduction

The availability of complete genome sequence for human beings [Vent et al. 2001] makes it possible to investigate genetic differences and to associate genetic variations with complex diseases [Hoehe et al. 2000]. Single nucleotide polymorphism (SNP)—a single DNA base varying from one individual to another, is believed to be the most frequent form to address genetic differences [Chakravarti, 1998; Li et al. 2005]. SNPs are found approximately every 1000 base pairs in the human genome and turn to be promising tools for doing disease association study. Many research works have been carried out for determining SNP sites or designing a detailed SNP map for human genome [Altshuler et al. 2000; Helmuth, 2001].

The nucleotides in a SNP position are called alleles. Almost all SNPs have two different alleles which we denote the wild type as 1 and the mutant type as −1. The SNP sequence information on each copy of a pair of chromosomes in a diploid genome is called a haplotype which is a string over {–1, 1}. A genotype is the conflated information of a pair of haplotypes on homologous chromosomes. For a genotype, if a pair of alleles at a SNP site is made of two identical values, this SNP site is called homozygous, otherwise it is called heterozygous.

Haplotypes generally have more information content than that individual SNPs have in disease as sociation studies [Stephens et al. 2001], but it is substantially more difficult to determine haplotypes than to determine genotypes or individual SNPs through experiments. Hence, computational methods that can reduce the cost of determining haplotypes become attractive alternatives. There are generally two classes of computational methods for determining haplotype. One class concerns with infering haplotypes from the genotype samples in a population. There are several models based on different assumptions on the biological system under consideration [Clark, 1990; Gusfield, 2002; Halperin and Eskin, 2004; Li et al. 2005; Wang and Xu, 2003]. The second class, called single individual haplotyping or haplotype assembly, is based on the data and methodology of shotgun sequence assembly [Lancia et al. 2001; Lippert et al. 2002]. The input data consists of aligned short genome fragments with SNPs coming from DNA shotgun sequencing or data generated by a resequencing effort for the purpose of large-scale haplotyping. When we focus on SNP positions, these short genome fragments are actually the aligned SNP fragments.

Computationally, the individual haplotyping problem is to determine the “best” pair of haplotypes from data (SNP fragments) which is possibly inconsistent and contradictory. It focuses on partitioning SNP fragments into two sets according to SNP states, with each set determining a haplotype [Lancia et al. 2001; Lippert et al. 2002]. For such a problem, there are several models based on different error assumptions [Lancia et al. 2001; Li, L.M. et al. 2004; Lippert et al. 2002; Rizzi et al. 2002; Wang et al. 2005], of which the Minimum Error Correction model (MEC) is widely adopted. MEC assumes that the inconsistence of the data comes from realistic sequence errors and these errors can be corrected. However, MEC is only effective in the case that SNP fragments have a low error rate. When the error rate of SNP fragments is relatively high, we can not reconstruct original haplotypes with a high accuracy by error correction only [Wang et al. 2005]. To improve the haplotyping quality, we need to either reduce the errors in SNP fragments which will call for improvement of the shortgun experiment, or add extra information to the given SNP fragment set. Since genotype data can be much more easily (and also economically) obtained, a computational model combining both SNP fragments and genotype information will be a realistic strategy. The idea is motivated by the method in [Eskin et al. 2004]. In their paper they phased long genotypes using local haplotype information. In this paper, we propose a new computational model (Minimum Conflict Individual Haplotyping: MCIH) for individual haplotyping by using this strategy, which is also proved to be NP-hard.

There are two ways to show that a new established model is practical or more effective than an existing model: one is to theoretically prove that the solution of the new model is superior to that of the existing model, the other is to numerically solve the problem and compare the solutions obtained by the two models. It is obvious that we can only do it in the second way. For this reason we try to design an exact algorithm for the MCIH problem. A special data structure suited for a dynamic programming algorithm is displayed and used to evaluate the model. Computational results by this exact algorithm confirm the significance of MCIH. Moreover, a feed-forward neural network (FNN) is designed for computing approximate solutions to the problem (MCIH) in general case and of large size. Extensive computational results show the effectiveness of the proposed FNN algorithm and MCIH.

The paper is organized as follows: In Section I, we give the problem definition together with the complexity analysis of MCIH. The model evaluation is shown in Section II. In Section III, a feed-forward neural network algorithm is designed for implementing MCIH. Experiment results are given in Section IV. Section V concludes the paper.

I. Formulation and Problem

For a genotype g = (g₁, g₂, ···, g_n), if the 7th SNP site is wild type homozygous, g_j = 2; if it is mutant type homozygous, g_j = −2; when it is heterozygous, g_j = 0. A pair of haplotypes h ₁ = (h₁₁, h₁₂, ···, h_1n) and h ₂ = (h₂₁, h₂₂, ···, h_2n) is called compatible with a genotype g if the following conditions hold: for each SNP site j where $g_{j} = - 2, h_{1 j} = h_{2 j} = - 1$ for each SNP site j where $g_{j} = 2, h_{1 j} = h_{2 j} = 1$ for each SNP site j where $g_{j} = 0, h_{1 j} = - h_{2 j} = - 1$ or $h_{1 j} = - h_{2 j} = 1$ .

Suppose that there are m SNP fragments from a pair of chromosomes and the length of each corresponding haplotype is n. Define an m × n SNP matrix M = (m_ij), whose entry m_ij has the value −1,1 or 0 (for a missing or skipped base, we call it a hole). Each row m _i corresponds to a SNP fragment f_i and each column corresponds to a SNP site. Since the given SNP fragments may have different lengths, but are generally less than n, we also assign value 0 to the uncovered elements in a row.

Let x, y ∊ {-1, 1, 0} and define

d (x, y) = {\begin{cases} 1 if x \neq 0, y \neq 0 and x \neq y, \\ 0 otherwish, \end{cases}

(1)

then the distance between two SNP fragments

f_{i} = (m_{i 1}, \dots, m_{i n})

and

f_{k} = (m_{k 1}, \dots, m_{k n})

is defined as

H D (f_{i}, f_{k}) = \sum_{j = 1}^{n} d (m_{i j}, m_{k j}) . If H D (f_{i}, f_{k}) > 0

we say two fragments f_i and f_k are in conflict, otherwise we call them compatible. HD(f_i, f_k) > 0 indicates that either f_i, f_k are not from the same chromosome copy or there are errors in the data. HD(f₁, f_k) is similar to the Hamming distance, i.e., the number of mismatches (conflicts) between two fragments. The distance between a fragment and a haplotype is defined in the similar way.

The MEC problem [Lippert et al. 2002; Wang et al. 2005] is defined as: Given a set of SNP fragments, correct a minimum number of SNP states (–1 into 1 and vice versa), such that the modified SNP fragments can be divided into two disjoint sets of pairwise compatible fragments, and each set determines a haplotype. In order to add the information of genotype into MEC, we propose the following combinatorial optimization model for individual haplotyping:

MCIH (Minimum Conflict Individual Haplotyping): Given a set of SNP fragments (a SNP matrix M) from an individual's DNA and the related genotype g, reconstruct a pair of haplotypes compatible with g and involving a minimum number of conflicts with the given SNP fragments.

MCIH is in fact to correct a minimum number of SNP states under the guidance of the genotype information so that the modified SNP fragments can be divided into two disjoint sets of pairwise compatible fragments which determine a pair of haplotypes compatible with the genotype. The computational complexity of the MCIH problem (similar spirit with [Eskin et al. 2004]) is discussed in Appendix A where we prove it to be NP-hard. It indicates that the MCIH problem may have no efficient algorithm for exact solutions.

II. Model Evaluation

The purpose of presenting the new model is to get a high-quality solution of the haplotype assembly problem. To show that MCIH is a potential alternative, we evaluate the model by studying its exact solutions. A special data structure with Markov property suited for a dynamic programming algorithm is considered as follows.

Let I_i and r_i be the beginning and ending positions of the ith SNP fragments f_i on the SNP matrix respectively, 1 ≤ i ≤ m. For any two fragments f_i and f_j, we assume

if l_{i} \leq l_{j}, then r_{i} \leq r_{j} .

(2)

The rows of the SNP matrix is reordered according to the beginning positions and the ending positions of SNP fragments. For two rows m_i and m_j, i < j if and only if l_i < l_j. If l_i = l_j and r_i < r_j, the ith row will also be put before the j'th row. From the proof of the computational complexity in Appendix A we know that the MCIH with this special structure remains NP-hard.

To solve MCIH with this special structure, we give a dynamic programming (DP) algorithm. The input data of the DP algorithm are a genotype g with length n and a SNP matrix of the type (2). The outputs of the DP algorithm are a partition of the SNP fragments and a pair of halotypes ( h ₁, h ₂) generated from the partition. In fact, the DP algorithm is used to reconstruct one haplotype, say h ₁. Clearly, h ₂ can be obtained immediately from g and h ₁.

Suppose that $x (j) = (x_{l_{j}}, \dots, x_{r_{j}})$ is an assignment to the positions l_j, ···', r_j of h ₁, and $\bar{x} (j)$ is the corresponding assignment to h ₂ at the same positions. If g_k ≠ 0, x_k = $\bar{x}$ _k = g_k/2, otherwise x_k + $\bar{x}$ _k = 0 for k = l_j, ···, r_j. Let f_l(j, x(j)) denote the number of conflicts between the SNP fragment f_j and h ₁ with assignment x(j), f₂(j, $\bar{x}$ (j)) denote the number of conflicts between the SNP fragment f_j and h ₂ with assignment $\bar{x}$ (j). Define f(j, x(j)) = min{f_l(j, x(j)), f₂(j, $\bar{x}$ (j))} which implies the haplotype that the jth fragment belongs to. The main steps of the dynamic programming algorithm are as follows:

Step 1 Initialization.

Let j = 1. N(1, x (1)) = f(1, x (1)) = min{f₁(1, x (1)),f₂(1, $\bar{x}$ (1))} for all possible x (1).

Step 2 Follow the recurrence formula.

Define N(j, x(j)) as the minimum associated number of conflicts between the first j fragments and haplotypes ( h ₁, h ₂) at positions 1, 2, ···, r_j, with the assignment x (j).

\begin{array}{l} N (j + 1, x (j + 1)) = min_{(b_{l j} \dots, b_{l j} + 1^{- 1})} \\ {h_{1} (j, (b_{l j}, \dots, b_{l j + 1^{- 1}}, x_{l j + 1}, \dots, x_{r j})) \\ + f (j + 1, x (j + 1))}, \end{array}

where b_k = g_k/2 for g_k ≠ 0, otherwise b_k ∊ {1,-1}, k = l_j, ··· l_j+1 −1.

Step 3 Trace the solution to obtain a pair of haplotypes ( h ₁, h ₂) and a partition of the SNP fragments.

When all N(j, x(j)) for 1 ≤ j ≤ m are computed by the recurrence formula, we can find N(m, x(m)) for all $x (m) = (x_{l_{m}}, \dots, x_{r_{m}})$ . Thus, the solution of the MCIH problem can be obtained by tracing the solution forwardly which leads to a minimal value at each j from the following formula:

min_{x l_{m}, \dots, x r_{m}} N (m, x (m)) .

For all j, f(j,x(j)) = min {f₁(j, x(j)), f₂(j, $\bar{x}$ (j))} determines a partition of the SNP fragments.

Assume that L is the maximum length of SNP fragments. It can be shown that the dynamic programming algorithm solving the special MCIH problem has the complexity O(2^2Lm). Hence, the algorithm is exponential with the maximum length of SNP fragments. However, when the maximum length of SNP fragments is fixed, this algorithm is linear to m, i.e., the number of fragments. In real applications, the value of L is generally between 3 and 8.

As expected, numerical results (see Appendix B for details) of DP on the special data structure display that MCIH at the cost of genotype information improves the reconstruction rate greatly at various error rates of SNP fragments. An extreme case is considered where every SNP site is heterozygous. It means that the model has no homozygous site information available from the given genotype. The numerical results show that in this case MCIH still has a higher reconstruction rate than MEC. This indicates that MCIH is not trivial, i.e., a higher reconstruction rate does not only depend on homozygous site information. Heterozygous site information has much contribution to the accuracy of the reconstructed haplotypes.

III. A Feed-Forward Neural Network Algorithm

As discussed in Introduction, the MCIH problem is actually a classification problem. That is, given a set of SNP fragments, we want to classify it into two fragment subsets such that each subset determines a haplotype to solve the problem with minimum conflicts. It is well known that feedforward neural network (FNN) is a powerful tool for classification. In general, an FNN maps a set of objects into classes C₁,···,C_S characterized by the attributes of the objects through repeatedly learning the objects and adjusting its parameters (neuron connection weights). In our problem, a set of SNP fragments are to be divided into two classes such that each class can be assembled as one of the haplotypes which will solve the problem with as few conflicts as possible.

The proposed FNN (see Figure 1) consists of three layers. The m input neurons represent m SNP fragments. Each input neuron accepts an n-dimensional vector on {1, −1, 0}. Two hidden neurons represent two subsets of fragments corresponding to one pair of haplotypes. The input dimension of the hidden neuron is also n and the outputs of the two neurons are a pair of tentative haplotypes. There is only one neuron in the last layer, which simply conflates the tentative haplotypes and outputs a tentative genotype. Comparing this tentative genotype with the given genotype will provide us the information to adjust the neuron connection weights by the popular back-propagation algorithm (see Appendix B and related literatures for feed-forward neural networks [Rumelhart et al. 1986; Zhang, 2000]). The main characteristic of the designed neural network is trying to achieve two objectives, i.e., compatibility and minimum number of conflicts simultaneously.

Figure 1

A three layer forward neural network.

Forwarding process of FNN

The forwarding process of FNN can be stated as follows.

1) The inputs and outputs of the first layer are m rows of the SNP matrix, m ₁,···, m _m , i.e., m SNP fragments f_i, i = 1, ···, m. In fact, the neurons in the first layer are trivial identity maps I.

2) There are 2m parameters, i.e., the weights from the first layer to the second layer,

W_{m \times 2} = (\begin{matrix} w_{11} \\ w_{21} \\ \dots \\ w_{m 1} \end{matrix} \begin{matrix} w_{12} \\ w_{22} \\ \dots \\ w_{m 2} \end{matrix}) .

3) The inputs to the second layer are y ₁ = (y₁₁, ···, y_1n) and y = (y₂₁, ···, y_2n), where

y_{l k} = \sum_{i = 1}^{m} m_{i k} w_{i l}, l = 1, 2, k = 1, \dots, n .

(3)

Two neurons in the hidden layer have the same sigmoidal neuron function:

ϕ (λ x) = tanh (λ x) = \frac{1 - e^{- 2 λ x}}{1 + e^{- 2 λ x}} .

(4)

4) The outputs of the second layer neurons are

h_{1} = (h_{11}, \dots, h_{1 n})

and

h_{2} = (h_{21}, \dots, h_{2 n})

where

h_{l k} = ϕ (λ y_{l k}), l = 1, 2, k = 1, \dots, n .

(5)

5) The inputs to the third layer are h ₁ and h ₂. The neuron function of the third layer is a linear summation, i.e., the output of the third layer is z = (z₁,···, z_n), where

z_{k} = h_{1 k} + h_{2 k}, k = 1, \dots, n .

(6)

Learning rules

The MCIH problem is to find a pair of haplotypes ( h ₁, h ₂) to

Objective-1: minimize

\sum_{f i \in F_{1}} H D (h_{1}, f_{i}) + \sum_{f i \in F_{2}} H D (h_{2}, f_{i})

(7)

for all partitions (F₁, F₂) of f₁, ···, f_m and all possible pairs of haplotypes ( h ₁, h ₂), and

Objective-2: satisfy

h_{1 k} + h_{2 k} = g_{k}, k = 1, \dots, n,

(8)

or minimize

\sum_{k = 1}^{n} (h_{1 k} + h_{2 k} - g_{k}), k = 1, \dots, n,

(9)

for all possible pairs of haplotypes ( h ₁, h ₂), where g = (g₁, g₂, ···, g_n) is the given genotype.

1) Learning to satisfy objective-1

The distance between h ₁ and m _i (i.e., f_i) is defined as

H D (h_{l}, m_{i}) = \sum_{k = 1}^{n} d (sgn (h_{l k}), m_{i k}),

where d is defined by the formula (1) and

sgn (x) = {\begin{cases} 1, x \geq 0, \\ - 1, x < 0. \end{cases}

Classify all SNP fragments into two disjoint sets according to their distances to h ₁ and h ₂, i.e.,

F_{1} = {f_{i} : H D (h_{1}, m_{i}) < H D (h_{2}, m_{i}), i = 1, \dots m},

(10)

F_{2} = {f_{i} : H D (h_{2}, m_{i}) \leq H D (h_{1}, m_{i}), i = 1, \dots m} .

(11)

For the neuron corresponding to h ₁ in the second layer, the network learns to minimize the following error function between h ₁ and the SNP fragments in F₁:

R_{11} = \sum_{f_{i} \in F_{1}} \sum_{k = 1}^{n} {(h_{1 k} - m_{i k})}^{2} | m_{i k} | .

(12)

For the neuron corresponding to h ₂, the network learns to minimize the error function between h ₂ and the SNP fragments in F₂:

R_{12} = \sum_{f_{i} \in F_{2}} \sum_{k = 1}^{n} {(h_{2 k} - m_{i k})}^{2} | m_{i k} | .

(13)

2) Learning to satisfy objective-2

The objective that the third layer adjusts to is to minimize the difference between the tentative genotype z and the original genotype g :

R_{2} = \sum_{k = 1}^{n} {(z_{k} - g_{k})}^{2} .

(14)

A Back-Propagation-like procedure

The main steps of the algorithm is as follows:

Initialization

Set parameter values L₁, L₂, ρ, λ and ε. Randomly set the initial values of the weight matrix W (0) with entries w_il ∊ [0, 1], i = 1, ···, m, l = 1,2. t = 0.

Step 1—Feed-forwarding

Input a SNP matrix. Do computation of (3), (5) and (6) to get a pair of tentative haplotyes ( h ₁, h ₂).

Step 2—Learning

Substep 2.1: Obtain a fragment partition (F₁, F₂) using ( h ₁, h ₂) according to the formulae (10) and (11).

Substep 2.2: Calculate the gradients of the error functions $\nabla_{w_{1}} R_{11}, \nabla_{w_{2}}, R_{12}, \nabla_{w_{1}} R_{2}, \nabla_{w_{2}} R_{2}$ according to (17), (18), (19), and (20).

Substep 2.3: Update the current weight matrix W(t) by

\begin{array}{l} w_{1} (t + 1) = w_{1} (t) - ρ Δ w_{1}, \\ w_{2} (t + 1) = w_{2} (t) - ρ Δ w_{2}, \end{array}

where

Δ w_{1} = L_{1} \nabla_{w_{1}} R_{11} + L_{2} \nabla_{w_{1}} R_{2},

(15)

Δ w_{2} = L_{1} \nabla_{w_{2}} R_{12} + L_{2} \nabla_{w_{2}} R_{2},

(16)

ρ is the step length along the negative gradient. L₁ and L₂ are parameters to be selected.

Substep 2.4: If ‖ W (t + 1) – W (t)‖ < ε, go to Step 3, otherwise return to Step 1.

Step 3—Output

Record a pair of haplotypes $({\hat{h}}_{1}, {\hat{h}}_{2})$ and a genotype $\hat{g}$ as the output.

The formulae to compute the derivatives in the algorithm are given in Appendix B. In contrast to DP algorithm that gives an exact solution, the algorithm based on the neural network can not ensure an exact optimal solution but is proved very efficient for large-scale problems by the numerical results in next section.

IV. Simulation and Results

In this section, we will use both real data and simulation data to test MCIH and the FNN algorithm for haplotype reconstruction. The computation is implemented on a 2.26G Hz Pentium 4 processor PC using Microsoft Visual C++ compiler 6.

In our experiments, we use reconstruction rate (RR), the similarity degree between the original haplotypes and the reconstructed haplotypes, to measure performance of an algorithm or a model. Assume that h = ( h ₁, h ₂) is the original pair of haplotypes, and $\hat{h} = ({\hat{h}}_{1}, {\hat{h}}_{2})$ is the reconstructed pair of haplotypes. We define RR as:

R R (h, \hat{h}) = 1 - \frac{min {r_{11} + r_{22}, r_{12} + r_{21}}}{2 n},

where $r_{i j} = H D (h_{i}, {\hat{h}}_{j}), i = 1,2, j = 1,2$ . We use compatibility rate (CR) to measure the similarity degree between the original genotype g and the reconstructed genotype $\hat{g}$ :

C R (g, \hat{g}) = \frac{H D (g, \hat{g})}{n} .

For a partition P = (F₁, F₂) and a pair of haplotypes ( h ₁, h ₂), the corresponding conflict number is defined by (7).

In this paper, we set step length ρ = 0.05, ε = 0.01 and λ,= 0.1 (in fact, the algorithm is robust with these parameters) in the algorithm. In (15) and (16), we set L₁ = 0.2 and L₂ = 0.8.

Experiment on angiotensin-converting enzyme (ACE)

Angiotensin-converting enzyme catalyses the conversion of angiotensin I to the physiologically active peptide angiotensin II, which controls fluid-electrolyte balance and systemic blood pressure. Because it has a key function in the renin-angiotensin system, many association studies have been performed with DCP1 (encode angiotensin-converting enzyme). Literature [Frieder et al. 1999] completed the genomic sequencing of the DCP1 gene from 11 individuals and reported 78 SNP sites in 22 chromosomes. Out of the 78 varying sites, 52 are non-unique polymorphic sites.

Among these 11 individuals, there are two identical genotypes. We omit one of them. In addition, we omit the genotypes with no more than one heterozygous site whose haplotypes can be infered immediately. Now each of the 8 pairs of haplotypes is used to generate 15 instances in which SNP fragments are randomly generated according to different parameter settings: the number of SNP fragments m = 20; other parameters, the hole rate of fragments hr: 0.25, 0.5, 0.75; the error rate of fragments e: 0.05, 0.1, 0.15, 0.2 and 0.25.

The results of MCIH and MEC (solved by an exact algorithm in [Wang et al. 2005]) averaged on eight individuals is illustrated in Figure 2. All the results are obtained by running the algorithms only once. When the error rate of SNP fragments is low, the neural network is robust and efficient. When the error rate of SNP fragment is high, the network may get into a plight of local minima. The genotype compatibility rate returned by the algorithm for most instances is 100% or at least larger than 98%. Only several instances with error rate 0.25 and hole rate 0.75 have genotype compatibility rate between 94% and 96%. In addition, the neural network algorithm solves each of these instances in several seconds.

Figure 2

The results of MCIH and MEC on ACE. From left to right, hr = 0.25, hr = 0.5, hr = 0.75.

Figure 2 shows that MCIH has a much higher reconstruction rate at various parameter settings, which indicates that with additive genotype information MCIH is effective. Even if an approximate algorithm is employed, it is more effective for haplotype reconstruction than MEC.

Experiment on data from chromosome 5q31

Now we performed simulations using the data from public Daly set [Daly et al. 2001]. They reported a high-resolution analysis of a haplotype structure across 500kb on chromosome 5q31 using 103 SNPs in a European derived population which consists of 129 trios. The haplotypes of 129 children from the trios can be infered from the genotypes of their parents through pedigree information and the nontransmitted chromosomes as an extra 129 (pseudo) haplotype pairs. Markers for which both alleles could not be inferred are marked as missing. Among the resulting set of 258 haplotype pairs, the ones with more than 20% missing alleles are removed, which leaves us 147 haplotype pairs. Among these pairs, 18 genotypes with no more than one heterozygous site are omitted, then 129 pairs of haplotypes are left as the test set.

Each of the 129 pairs of haplotypes is used to generate 15 instances in which SNP fragments are randomly generated according to different parameter settings: m = 30; other parameters, hr: 0.25, 0.5, 0.75; e: 0.05, 0.1, 0.15, 0.2 and 0.25. The results of MCIH and MEC averaged on 129 pairs of haplotypes are illustrated in Figure 3.

Figure 3

The results of MCIH and MEC on Daly set. From left to right, hr = 0.25, hr = 0.5, hr = 0.75.

The picture again shows that MCIH is quite effective and the designed back-propagation-like algorithm has a very good performance. The genotype compatibility rate returned by the algorithm for most instances is 100% or greater than 98%. Only a few instances with error rate 0.25 and hole rate 0.75 have genotype compatibility rate between 92% and 96%. In addition, the neural network algorithm solves instances with a low error rate and hole rate in several seconds. For a few instances with a high error rate and hole rate, however, it takes the algorithm several minutes to stop.

Experiments on Hudson's simulation data

We use a well-known program ms [Hudson, 2002] that uses coalescent theory to generate a simulated population of haplotypes. ms has a parameter r as the recombination rate of population haplotypes. Firstly, let r = 0, then 20 haplotypes with 80 SNP sites are generated using ms and form a haplotype set. Then we randomly combine two halotypes in the haplotype set into a pair of individual haplotypes. 15 pairs of haplotypes are obtained by this way. Each of the 15 pairs of haplotypes is used to generate 15 instances according to the parameters as those in the last subsection. All of these instances form a dataset. The results of the two models averaged on 15 pairs of haplotypes are shown in Figure 4. To further evaluate MCIH, let r = 100 and other parameters be the same. The results of two models on this dataset are summarized in Figure 5. The genotype compatibility rate for almost all instances is 100% or larger than 98%. From Figure 4 and Figure 5 we can see that MCIH is effective on haplotypes not only without recombination but also with a high recombination rate.

Figure 4

The results of MCIH and MEC on Hudson's data with r = 0. From left to right, hr = 0.25, hr = 0.5, hr = 0.75.

Figure 5

The results of MCIH and MEC on Hudson's data with r = 100. From left to right, hr = 0.25, hr = 0.5, hr = 0.75.

V. Conclusion

Individual haplotyping is an important problem in computational biology. In this paper, we proposed a new computational model for individual haplotyping—MCIH as an improvement of MEC and as an alternative way for biologists to solve the haplotyping problem more efficiently. The model is proved to be NP-hard. To evaluate the new model, we displayed a special SNP matrix structure for which a dynamic programming algorithm can be used to solve MCIH exactly. Comparing the exact solutions of MCIH and MEC, we argue that the proposed MCIH is worth further studying. Due to the computational intractability of the MCIH problem, a feed-forward neural network is designed and a back-propagation-like procedure is formed as an efficient approximate algorithm. Computational results on multiple data sets show that the designed algorithm performs well and MCIH at the cost of additive genotype information has a higher accuracy of haplotype reconstruction than MEC, especially in the case of SNP fragments with a high error rate. Since genotype information can be obtained much easily and economically, the new model is practical as a supporting tool for reconstruction of haplotypes.

Footnotes

Appendix

Acknowledgement

This work is supported by the National Natural Science Foundation of China under grant No. 10471141 and No. 60503004, and the Important Research Direction Project of CAS “Some Important Problems in Bioinformatics,” CAS, China.

The first author's work is partially supported by the Informatics Research Center for Development of Knowledge Society Infrastructure, Graduate School of Informatics, Kyoto University, Japan. The second author is also supported by the Research Fund of School of Information, Renmin University of China.

References

Altshuler

, and Pollarn

V.J.

. 2000. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature, 407: 513–519.

Chakravarti

1998. It's raining, hallelujah? Nature Genetics , 19: 216–217.

Clark

A.G.

1990. Inference of haplotypes from PCR-amplified samples of dipoid populations. Molecular Biology and Evolution, 7(2): 111–122.

Daly

, Rioux

, Hudson

, and Lander

2001 High-resolution haplotype structure in human genome. Nature Genetics, 29: 229–232.

Eskin

, Halperin

, and Sharan

2004. Optimally phasing long genomic regions using local haplotype prediction. In Proceedings of the 2nd RECOMB Satellite Workshop on Computational Methods for SNPs and Haplotypes, Pittsburg, PA.

Gusfield

2002. Haplotyping as perfect phylogeny: conceptual framework and efficient solutions. In Proceedings of the 6th International Conference on Research in Computational Molecular Biology (RECOMB), pp. 166–175.

Halperin

, and Eskin

2004. Haplotype reconstruction from genotype data using imperfect phylogeny. Bioinformatics, 20(12): 1842–1849.

Helmuth

2001. Genome research: Map of the human genome 3.0. Science, 293(5530): 583–585.

Hoehe

, Kopke

, Wendel

, Rohde

, Flachmeier

, Kidd

, Berrettini

, and Church

2000. Sequence variability and candidate gene analysis in complex disease: association of μ opoid receptor gene variation with substance dependence. Human Molecular Genetic, 9: 2895–2908.

10.

Hudson

R.R.

2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18(2): 337–338.

11.

Lancia

, Bafna

, Istrail

, Lippert

, and Schwartz

2001. SNPs problems, complexity, and algorithms. In Proceedings of the 9th Annual European Symposium on Algorithms (ESA), pp. 182–193.

12.

L.M.

, Kim

J.H.

, and Waterman

M.S.

2004. Haplotype reconstructon from SNP alignment. Journal of Computational Biology, 11(2-3): 507–518.

13.

, Zhou

, Zhang

, and Chen

2005. A parsimonious tree-grow method for haplotype inference. Bioinformatics, 21(17): 3475–3481.

14.

Lippert

, Schwartz

, Lancia

, and Istrail

2002. Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem. Briefings in Bioinformatics, 3: 23–31.

15.

Rieder

, Taylor

, Clark

, and Nickerson

1999. Sequence variation in the human angiotensim converting enzyme. Nature genetics, 22: 59–62.

16.

Rizzi

, Bafna

, Istrail

, and Lancia

2002. Practical algorithms and fixed-parameter tractability for the single individual SNP haplotyping problem. In Proceedings of the 2nd International Workshop on Algorithms in Bioinformatics, (WABI), pp. 29–453.

17.

Rumelhart

D.E.

, Hinton

G.E.

, and Williams

R.J.

1986. Learning internal representations by error propagarion, in Parallel Distributed Processing, McClelland, J.L. and Rumelhart, D.E. (eds.), 1, pp. 318.

18.

Stephens

J.C.

, and Schneider

J.A.

. 2001. Haplotype variation and linkage disequilibrium in 313 human genes. Science, 293: 489–493.

19.

Venter

J.C.

, and Adams

M.D.

. 2001. The sequence of the human genome. Science, 291(5507): 1304–1351.

20.

Wang

, and Xu

2003. Haploptye inference by maximum parsimony. Bioinformatics, 19(14): 1773–1780.

21.

Wang

, Wu

, Li

, and Zhang

2005. Haplotype reconstruction from SNP fragment by Minimum Error Correction. Bioinformatics, 21(10): 2456–2462.

22.

Zhang

2000. Neural Networks in Optimization. Kluwer Academic Publishers.