The problem of inferring haplotype phase from a population of genotypes has received a lot
of attention recently. This is partly due to the observation that there are many regions on
human genomic DNA where genetic recombination is rare (Helmuth, 2001; Daly et al., 2001;
Stephens et al., 2001; Friss et al., 2001). A Haplotype Map project has been announced by
NIH to identify and characterize populations in terms of these haplotypes. Recently, Gusfield
introduced the perfect phylogeny haplotyping problem, as an algorithmic implication of the
no-recombination in long blocks observation, together with the standard population-genetic
assumption of infinite sites. Gusfield's solution based on matroid theory was followed by
direct θ(nm2
) solutions that use simpler techniques (Bafna et al., 2003; Eskin et al., 2003),
and also bound the number of solutions to the PPH problem. In this short note, we address
two questions that were left open. First, can the algorithms of Bafna et al. (2003) and Eskin
et al. (2003) be sped-up to O(nm + m2
) time, which would imply an O(nm) time-bound
for the PPH problem? Second, if there are multiple solutions, can we find one that is most
parsimonious in terms of the number of distinct haplotypes.We give reductions that suggests
that the answer to both questions is "no." For the first problem, we show that computing
the output of the first step (in either method) is equivalent to Boolean matrix multiplication.
Therefore, the best bound we can presently achieve is O(nmω–1), where ω ≤ 2.52 is the
exponent of matrix multiplication. Thus, any linear time solution to the PPH problem likely
requires a different approach. For the second problem of computing a PPH solution that
minimizes the number of distinct haplotypes, we show that the problem is NP-hard using a
reduction from Vertex Cover (Garey and Johnson, 1979).