Abstract
We characterized the variation in the reconstructed ancestor of 118 HIV-1 envelope gene sequences arising from the methods used for (a) estimating and (b) rooting the phylogenetic tree, and (c) reconstructing the ancestor on that tree, from (d) the sequence format, and from (e) the number of input sequences. The method of rooting the tree was responsible for most of the sequence variation both among the reconstructed ancestral sequences and between the ancestral and observed sequences. Variation in predicted 3-D structural properties of the ancestors mirrored their sequence variation. The observed sequence consensus and ancestral sequences from center-rooted trees were most similar in all predicted attributes. Only for the predicted number of N-glycosylation sites was there a difference between MP and ML methods of reconstruction. Taxon sampling effects were observed only for outgroup-rooted trees, not center-rooted, reflecting the occurrence of several divergent basal sequences. Thus, for sequences exhibiting a radial phylogenetic tree, as does HIV-1, most of the variation in the estimated ancestor arises from the method of rooting the phylogenetic tree. Those investigating the ancestors of genes exhibiting such a radial tree should pay particular attention to alternate rooting methods in order to obtain a representative sample of ancestors.
Introduction
Reconstructing proteins from ancestral genomic sequences (Cai et al 2004; Cunningham et al 1998) allows us to investigate the evolution of protein function (Chang and Donoghue 2000), or to construct biologically significant molecules. Many factors may affect the inference of ancestral sequences but assessment has tended to focus on the method of character state reconstruction on a given phylogenetic tree. Maximum likelihood (ML) had been shown to be more accurate than maximum parsimony (MP) (Koshi and Goldstein 1996; Thornton 2004; Zhang and Nei 1997) but there is recent evidence that optimality methods induce deterministic bias (Krishnan et al 2004).
There are several potential sources of error which could contribute to variation in the reconstructed ancestor. Generally we have no prior knowledge of which reconstructed ancestral sequence is correct, and so when investigating the properties of an ancestor, one might use several reconstruction methods to generate a population of potential ancestors which we think will capture method-induced variation (for example, Jermann et al 1995).
Deciding which methods to include in an experiment requires knowledge of the relative contributions of the methodological options to the variation in the reconstructed ancestor. Here we assess these sources of variation in ancestral sequence reconstruction, to assist in the choice of relevant methods to investigate in any particular application. Our focus is on software and methods which are readily available to investigators.
Ancestral sequence reconstruction is based on a phylogeny. We assess the variation arising from standard methods of estimating the phylogenetic tree. That is, do the different tree building methods create trees sufficiently different so as to have an impact on the reconstructed ancestor? We assess how the phylogenetic tree is rooted to define the ancestral node of the tree, whether by an outgroup or using a computed “center of tree”, an enhanced version of mid-point rooting that reduces the weight given to the most divergent taxa (Nickle et al 2003). We apply three computational approaches to inferring ancestral character states: MP (Edwards and Cavalli-Sforza 1963; Farris 1970; Swofford and Maddison 1987), ML (Yang et al 1995; Zhang and Nei 1997) and empirical Bayes' (Huelsenbeck and Bollback 2001; Schultz et al 1996). Lastly, we assess the variation due to the informational level (nucleotide, codon or amino acid) at which the reconstruction is performed.
The biological properties of predicted ancestors are best screened first by computational methods. Only later would it be economic to synthesize and analyze some of the postulated proteins from the reconstructed sequences (Chang and Donoghue 2000). By using structural modeling and expert systems, we compare the variation in predicted characteristics of the hypothesized ancestral proteins with the variation in predictions obtained for the observed sequences.
Little is known about the effects of taxon sampling, ie the number and selection of ingroup sequences, on ancestral sequence reconstruction. A logarithmic improvement in ancestral state reconstruction using MP has been demonstrated with increasing taxon number (Salisbury and Kim 2001) and a relatively modest number of taxa are required to have reasonable confidence that a phylogenetic tree includes the true root of the clade (Sanderson 1996). Here we used a random deletion method to determine empirically the effect of taxon sampling.
This assessment was conducted in the context of Human Immunodeficiency Virus type 1 (HIV-1) vaccine design, an area of vigorous study (Mullins et al 2004), but our results will be of interest to investigators of other biomolecules. Between-host variants of HIV-1 usually form phylogenetic trees with a star topology, with long terminal branches and short branches near the root, as a result of rapid divergence (Figure 1A). It has been hypothesized that a suitable vaccine might be designed using a reconstructed ancestral protein sequence, because it will represent a sequence most similar to all circulating strains and may therefore elicit a stronger immunologic response than other candidate vaccines (Gaschen et al 2002; Learn and Mullins 2000; Mullins et al 2004; Nickle et al 2003). We identify the placement of the root and selection of the outgroup as potentially significant sources of variation in the ancestral sequence as it relates to HIV-1 vaccine design (Gaschen et al 2002; Nickle et al 2003; Novitsky et al 2002). Our purpose here is not to present a vaccine candidate but to identify those methodological stages that contribute to the variation in reconstructed ancestral sequences, and hence potential vaccines, These results will also assist investigators of other such systems in designing their computational experiments and in deciding where to direct expensive laboratory resources.
An NJ phylogenetic tree of the HIV-1 env sequences used in this study: (A) unrooted and (B) rooted using an outgroup. See the text for a discussion of nodes 1, 2, 3 and 4.
Methods and Materials
Observed Sequences
The primary data comprised 118 B-subtype DNA sequences from the C2-V5 region of the HIV-1 envelope glycoprotein (env) gene (GenBank accession numbers AY139268 - AY139370, AY139372, AY139374 - AY139381, AY139383 - AY139388), each obtained, one per subject, by the U.S. Centers for Disease Control and Prevention in five different cities in the United States (Anderson et al 2001). As outgroups, we selected four D-subtype sequences at random (GenBank accession numbers K03454, M27323, U88822, U88824), and the consensus sequences for the M group and the D subtype, all from the LANL HIV-1 sequence database (Korber et al 2003). The D subtype is the sister group of the B subtype within the M group of HIV-1 (Korber et al 2000).
Experimental Design and Coding
The ancestral sequence for the observed sequences was derived using combinations of different methods of tree estimation and sequence reconstruction (Figure 2 and below), except where they have not been implemented in software or are meaningless. Some combinations were abandoned when computation time exceeded 2 months.
Each ancestral sequence reconstruction was given a 4-letter code in which each position indicates sequentially (a) how the ancestral node was defined, (b) the method of tree building, (c) the method of ancestor reconstruction and (d) the format in which the ancestor was reconstructed. All potential reconstructions for a single outgroup are shown, while a similar number were considered for the other outgroups and methods of rooting the tree.
Alignments
A DNA alignment of the observed sequences was obtained with ClustalX v1.81 (Thompson et al 1997), and then adjusted manually to restore the reading frame and the translated amino acid sequence reported in GenBank. The consensus of the aligned DNA sequences was then recorded (
Rooting the tree
Phylogenetic trees were estimated for each of the four alignments. Three sets of trees were rooted using as outgroup the four D-subtype sequences (
Tree Estimation
The phylogenetic tree was estimated by: (1) maximum parsimony (MP) using PAUP* v4.0b10 (Swofford 2002) and a heuristic search with the starting tree obtained by stepwise addition and branch swapping using the TBR algorithm (
Ancestral Sequence Reconstruction
The ancestral sequence was reconstructed at the node basal to the clade of observed sequences as: amino acids (
(1) The
(2) ML reconstructions (
(3) MP reconstructions (
Comparisons among reconstructed ancestral sequences
When comparing reconstructed sequences we deleted unreliable sites where gaps in the observed sequences had a frequency of 0.50 or greater. These gaps arose through alignment with outgroup sequences or observed sequences with rare insertions.
Structural Prediction
The variation in the tertiary structure of ancestral sequences was assessed using (a) threading (reported in the online Supplementary Material) and (b) the variation in the energy of the structure predicted by the spatial constraints method (Sali and Blundell 1993). A few protein sequences were first submitted to the Swiss-Model comparative modeling protein server (http://www.expasy.org/swissmod/) to identify suitable structural templates for subsequent modeling. The three templates identified (PDB IDs: 1g9mG, 1g9nG, and 1gc1G) are each derived from HIV-1 B-subtype sequences with the V3 loop mutationally excised. The observed and ancestral sequences were aligned with the template sequences, again pruned both of unreliable sites and of sites that did not occur in the templates, and then modeled using Modeller (Sali and Blundell 1993) to obtain the energy of the inferred structure. Again, an energy with a larger negative value indicates a better fit to the templates.
Immunological Prediction
All of the original and ancestral sequences were searched for potential epitopes. Again, sites where gaps had a frequency of 0.50 or greater were deleted. Then ungapped sequences were compared with all of the antibody (Ab), helper T-cell (HTL) and cytotoxic T-cell (CTL) epitopes recorded in the LANL HIV-1 immunology database (Korber et al 2003).
Further, the number of CTL epitopes was predicted using MHCPred (Guan et al 2003), which predicts the 9-mer sites recognized by a MHC Class I allele, for all of the alleles supported by MHCPred in late 2003 (A*0101, A*0201, A*0202, A*0203, A*0206, A*0301, A*1101, A*6801, A*6802, and B*3501). The sites were ranked by their predicted affinity threshold (IC50), the concentration (nM) for which 50% binding is predicted. Sites were categorized as having high (IC50 ≤ 50 nM), intermediate (IC50 in range 50–500 nM) or low (IC50 in range 500–5000 nM) affinity (Sette et al 1994). Analysis was limited for practical reasons to ancestral sequences reconstructed from
N-glycosylation Prediction
The number of acceptor sites for N-linked glycosylation on each sequence was estimated using NetNGlyc (Gupta et al 2002) with the default prediction threshold.
Taxon Sampling
The full set of B-subtype sequences was subsampled by removing a proportion of sequences selected at random. The proportions of sequence removed were 10% (i.e., leaving 90% intact), 25%, 50%, 75% and 90%, with 100 replicates of each. For each replicate sample, the phylogenetic tree was estimated using NJ (
Results
Sequence variation among reconstructed ancestral sequences
Ancestral sequences reconstructed under many combinations of methods (Figure 2) were compared as amino acid sequences. On average the pairwise p-distance between ancestral sequences (mean: 0.11, range: 0.005 - 0.23) was only half that among the observed sequences (mean: 0.26, range: 0.06 - 0.39) (Figure 3). When ancestral sequences were grouped by the method of rooting the phylogenetic tree, mean p-distance was least among sequences reconstructed at the center of the tree, increased through sequences reconstructed using D-subtype sequences or consensus as the outgroup, and was greatest when the M-group consensus was used as the outgroup. Clustering the sequences by UPGMA (Swofford 2002) identified major groups based on the method used to root the phylogenetic tree (Figure 4). The first cluster (group 1) contained all of the ancestral sequences from the center of the tree and the consensus of the observed sequences. Otherwise this cluster had little structure with respect to the other reconstruction techniques used here. The next cluster (group 2) comprised ancestral sequences derived using the M-group consensus as the outgroup, within which the sequences clustered according to the method used in phylogenetic tree construction. The second half of the tree included reconstructions which arose when the D-subtype sequences or consensus were the outgroup, in three subgroups: one (group 3) contained cases where the ancestor was reconstructed as amino acid sequences, another (group 4) contained ancestral sequences reconstructed as codon or nucleotide sequences, and the third (group 5) contained a miscellany of reconstructions. Four outlier sequences, including the three sequences reconstructed using Bayesian methods, fell distant from other sequences reconstructed with the same rooting method. Similar results were obtained from multidimensional scaling, and are available online as Supplementary Material.
Genetic distances between observed and ancestral sequences. The mean (error bars indicate range) pairwise p-distance between each ancestral sequence and the observed sequences (white bars), within each group of ancestral sequences (grey bars), and among the observed sequences (hatched bar), for ancestral sequences from phylogenetic trees rooted with an outgroup: D-subtype sequences (D), D-subtype consensus (E), M-group consensus (M), or at the Center of Tree (T), and the consensus of the observed sequences (Con). UPGMA tree constructed from pairwise p-distances among ancestral sequences expressed as amino acids. Each ancestral sequence is coded by the method by which it was derived (Figure 2). Con is the consensus of the observed sequences. Vertical bars delimit groups based on method of rooting the phylogenetic tree. The sequences labeled 2′, 3′, 4′ and 5′ indicate outliers related to groups 2 to 5 respectively.

Sequence differences between reconstructed ancestral and observed sequences
While there was considerable variation in the extent to which the ancestral sequences differed from observed sequences, from being identical to being different at one-third of amino acid sites (Figure 3), the major differences were primarily among the methods of rooting the phylogenetic tree, and only secondarily between the methods of reconstructing the ancestral sequence on those trees. The distribution of mean p-distance between an ancestral sequence and the original sequences was multimodal (Figure 5). The greatest similarity with observed sequences was found for the set comprising the consensus of the observed sequences, the ancestral sequences reconstructed at the center of the tree (group 1 in Figure 4) and an ancestral sequence reconstructed using the Bayesian method (outlier 2′ in Figure 4). When the M-group consensus was used as the outgroup, the reconstructed ancestral sequences (group 2′ in Figure 4) had a higher mean and greater range in distance values. Among this group of sequences, those reconstructed by MP tended to have greater mean distances to the observed sequences than did those reconstructed using ML. The highest mean distances arose when the D-subtype sequences or consensus were used as the outgroup (groups 3, 4 and 5 in Figure 4). Among this set of ancestral sequences, those reconstructed by MP were significantly more distant from the observed sequences than were those reconstructed by ML (Mann-Whitney U test, P < 0.001).
Frequency distribution of the mean p-distance between an ancestral sequence and each of the observed sequences separated by the method of rooting the phylogenetic tree.
Effect of the outgroup on the reconstructed ancestral sequence
The outgroup has two potential effects on variation in the reconstructed ancestors, indirectly in the placement of the root node on the tree, and directly through its influence on the reconstruction of character states at interior nodes in the tree. For some experimental situations, the ancestral sequence was reconstructed a second time on a rooted tree while excluding the actual outgroup sequence from the alignment. The relative magnitudes of the outgroup effects were assessed by considering the difference between paired sequences, reconstructed with or without the outgroup sequence, or with different positions of rooting. The presence of the outgroup sequence (D-subtype consensus vs M-group consensus) was responsible for a 10% difference in the ancestral sequence while the position of the root node (as set by the outgroups or at the center of the tree) was responsible for a 14% difference. By comparison, the method of inferring the ancestral sequence on the tree (e.g. ML vs MP) generated a difference of 9%, regardless of whether the outgroup sequence was present. More details are given in the Supplementary Materials.
Variation in predicted structural properties
The distribution of model energies predicted by Modeller for the observed sequences was symmetrical, but over-dispersed, while for the ancestral sequences the distribution was bimodal (Figure 6). A lower negative value for model energy indicates a poorer fit of the predicted structure on the template structures. Lower values were observed for ancestral sequences reconstructed on trees rooted with an outgroup, among which significantly lower energies were observed when MP, rather than ML, was used to reconstruct the ancestor (Mann-Whitney U test, P < 0.001). In contrast, higher model energies were found for the consensus of the observed sequences and for ancestral sequences reconstructed on trees rooted at the center. However there was no discernible pattern of variation due to the method of reconstruction among these sequences. Similar differences between ancestral sequences reconstructed on outgroup-rooted trees and on center-rooted trees were observed when threading was used to identify the protein fold (see online Supplementary Materials).
Energy of the 3-D structure for each observed and ancestral sequence predicted using the Modeller software and 3 structural templates. The frequency distribution of model energy is plotted for the observed sequences (line), and the consensus and ancestral sequences (bars) separated by the method of rooting the phylogenetic tree. Energies with larger negative value indicate better model fit.
Variation in predicted number of epitopes
Potential Ab, HTL and CTL epitopes, identified by sequence comparison with epitope databases, were found in most of the observed and ancestral sequences (Figure 7). In general, the ancestral sequences had very similar epitope profiles, containing all but the rarest epitopes in the observed sequences. Of the epitopes with frequency > 5% in the observed sequences, the majority were common (frequency > 50%) in the ancestral sequences (7 of 8 Ab, 2 of 4 HTL, and 7 of 8 CTL epitopes). No rare epitopes (frequency < 5% in the observed sequences), nor additional epitopes, were observed in the ancestral sequences. Single uncommon HTL and CTL epitopes (frequency 5–6% among observed sequences) were reconstructed only under Bayesian analysis, in the case when the D-subtype sequences were used to constrain the ingroup, but not when either the D-subtype or M-group consensus sequences formed the outgroup. The average number of epitopes observed in each of the ancestral sequences was greater than in the observed sequences (Table 1). Ancestral sequences reconstructed on trees rooted at the center had the greatest mean and smallest range in number of epitopes, while the lowest mean and greatest range in number of epitopes were found when the M-group consensus formed the outgroup. The consensus of the observed sequences had the greatest number of each type of epitope.
Frequency distributions of the number of (A) Ab, (B) HTL and (C) CTL epitopes identified by database searching for the observed sequences (line), and the consensus and the ancestral sequences (bars) separated by the method of rooting the phylogenetic tree. Immunologic epitopes predicted for reconstructed sequences. The number of neutralizing antibody (Ab), helper T-cell (HTL) and cytotoxic T-cell (CTL) epitopes identified in each observed sequence, their consensus, and in each ancestral sequence, are reported. The first three columns are the mean number of each type of epitope per sequence, while the last two columns are the mean and standard deviation (SD) of the total number of all of these epitopes per sequence.
Variation in predicted MHC binding sites
The number and binding affinity of T-cell epitopes or recognition sites, predicted by MHCPred to occur on each sequence, did not show clear differences on the basis of reconstruction method. In general, the majority of predicted binding sites had low predicted binding affinity, while sites with medium binding affinity were intermediate in number, and high binding affinity sites least common. The ancestral and observed sequences were compared by jointly ranking them on the number of predicted binding sites. The higher ranks among all sequences were dominated by those for ancestral sequences reconstructed using likelihood, either as codons or nucleotides, or from ancestral sequences reconstructed on trees rooted using the D-subtype sequences as the outgroup, depending on the way in which the average ranks were computed. The consensus of the observed sequences had an intermediate score in all aspects of this analysis. However, the number of predicted binding sites was not supported by the number of epitopes inferred by comparing sequences against databases of known epitopes (Table 1). The results of the analysis are given in greater detail in the online Supplementary Materials.
Variation in predicted number of N-glycosylation binding sites
The number of N-glycosylation binding sites predicted by NetNGlyc to occur in each sequence varied among both observed and ancestral sequences (Figure 8). The number of binding sites in the ancestral sequences varied in a similar manner to the number in the observed sequences, and there was no pattern discernible with respect to the way in which the phylogenetic tree had been rooted (Figure 8A). However, sequences reconstructed by ML had a significantly greater number of binding sites than did sequences reconstructed by MP (Mann-Whitney U test, P < 0.001).
Frequency distribution of the number of predicted N-glycosylation sites in each observed or ancestral sequence: (A) the frequency distribution plotted for the observed sequences (line), and the consensus or the ancestral sequences (bars) separated by the method of rooting the phylogenetic tree; (B) the same ancestral sequences, excluding the consensus, separated by the method of reconstructing the ancestral sequence.
Taxon Sampling
Overall, the ancestral sequence reconstructed from a subsample became progressively more divergent from the reference sequence as the size of the subsample decreased (Figure 9). The pattern of divergence differed qualitatively according to how the tree was rooted, and quantitatively according to the method used to reconstruct the ancestral sequence.
(A) Mean and (B) variance (N = 100) of the divergence of ancestral sequences reconstructed from a subset of sequences relative to that reconstructed for the entire sample of 118 sequences. circles - rooted using outgroup (D-subtype sequences); squares - rooted at center of tree; dotted line - MP reconstruction; solid line - ML reconstruction; triangle/dashed line - consensus of subsample.
When the tree was rooted using D-subtype sequences as the outgroup, the mean divergence of ancestral sequences increased relatively linearly with increasingly smaller subsamples (Figure 9A) whereas the variance in divergence was lower when either a large or small number of sequences was used, and was greater at intermediate numbers of sequences (Figure 9B). At low percentages of sequence removal, sequences reconstructed via MP or ML were equally similar to the reference ancestral sequences, but as more sequences were removed, sequences reconstructed using MP became more divergent than sequences reconstructed using ML, and showed greater variance.
In sharp contrast to ancestral sequences reconstructed on trees rooted with an outgroup, the mean and variance in divergence of sequences reconstructed at the center of the tree remained low across most of the range of sample sizes (Figure 9A). Only when 90% of the sample was removed did the mean divergence increase. The mean divergence of sequences reconstructed by MP was slightly greater than when ML was used. The change in the divergence of the consensus of the sequences in the subsample was similar in form, but of lower magnitude, to that for sequences reconstructed at the center of the tree.
The pattern of divergence observed for ancestral sequences reconstructed on trees rooted with the outgroup might have arisen because the placement of the root changed on the phylogenetic tree having fewer leaves. To test the effect of altered rooting, the original phylogenetic tree of 118 taxa was pruned of those taxa chosen for deletion, thereby leaving the root as close to its original position as possible, and the ancestral sequence was reconstructed. In all 10 cases involving the outgroup (5 proportions deleted X 2 methods), the mean divergence of the ancestral sequences on the pruned tree was less than on the re-estimated tree (results not shown) whereas for trees rooted at the center, differences in mean divergence were observed only when small proportions of sequences were deleted. These observations suggested that the position of the root was changing significantly on outgroup-rooted trees, but not on center-rooted trees, as the number of sequences was reduced. However, by examining the individual pairs of trees (re-estimated and pruned), we found that the position of the root for outgroup-rooted trees had changed only at the higher percentages removed (10%: 0; 25%: 2; 50%: 13; 75%: 63; 90%: 82, each of 100 cases). In contrast, for center-rooted trees the position of the root had changed in every one of 500 cases. Clearly a confounding factor was involved.
Closer inspection of the whole tree rooted using an outgroup (Figure 1B) revealed four sequences on relatively long branches near the base of the ingroup. It was hypothesized that the presence of these sequences in the sample had a significant effect on the placement of the root, and consequently on the ancestral sequence. When some or all of these sequences were present, the ancestral sequence was reconstructed at nodes 1, 2 or 3, but when all were absent then node 4 became the ancestral node. The replicate subsamples were separated on the basis of the number of these basal sequences present and the mean and variance of divergence were recalculated. For sequences reconstructed on outgroup-rooted trees there was an inverse relationship between the mean divergence and the number of these basal sequences present, for both ML (Figure 10A) and MP (similar results, not shown). Furthermore, the variance in divergence for each size of subsample was positively correlated with the variance in the number of these basal sequences present in the subsample (Spearman rS = 0.9), for either method. In contrast, for sequences reconstructed at the center of the tree we found that these basal sequence had no direct effect on the reconstructed ancestral sequence (Figure 10B) for either method of reconstruction. Divergence was constant with respect to the number of the basal sequences present in the subsample, and changed only on the basis of the size of that subsample.
Mean divergence of ancestral sequences reconstructed using ML on subsets of sequences from that reconstructed for the entire sample of 118 sequences in relation to the number of basal sequences present (i.e., those joining the rooted tree at nodes 1, 2, and 3 in Figure 1) for ancestors estimated on phylogenetic trees (A) rooted using an outgroup and (B) rooted at the center. Symbols: circle - 10% removed; diamond - 25% removed; triangle - 50% removed; inverted triangle - 75% removed; square - 90% removed; filled - very small sample sizes (N ≤ 5).
Discussion
The variation in the sequence and structural characteristics of the ancestral gene sequences reconstructed by different methods was dominated by the effect of the method used to root the phylogenetic tree on which the reconstructions were based. The method used to reconstruct the ancestral sequence, or the type of the reconstructed sequence, had only secondary relevance to the variation observed. Generally the technique used to estimate the phylogenetic tree was irrelevant to the outcome. For only one of the immunological characters estimated, the number of N-glycosylation sites, were ancestral gene sequences strongly influenced by the analytic method used to reconstruct the ancestral sequence.
Previous assessments of ancestral state reconstructions have focused on the accuracy of the method, rather than the variation among methods. Although parsimony has been shown to reconstruct true ancestors with a high degree of accuracy in laboratory settings involving the serial propagation of bacteriophage (Bull et al 1993; Hillis et al 1992), likelihood methods have been considered to be more accurate (Huelsenbeck and Crandall 1997; Koshi and Goldstein 1996; Thornton 2004; Yang et al 1995; Zhang and Nei 1997) because of more realistic model specification. Empirical Bayes' methods (Huelsenbeck and Bollback 2001; Schultz et al 1996) extended these optimality methods by integrating across tree topologies to choose the most likely character state. Krishnan and colleagues (2004) have recently demonstrated however that each of these optimality methods introduces a deterministic bias in nucleotide frequencies and that the entire posterior probability distribution of ancestral states should be considered when inferring ancestral function. However, they take the phylogenetic tree as a given, including the taxa represented, the tree's topology and its root, and do not investigate the variation arising from these factors. Pagel and colleagues (2004) assessed confidence in the state reconstructed at internal nodes of the tree, including the root, by assessing variation across trees sampled by MCMC and in particular by considering the occurrence of the node. When a node exists in only some of the potential phylogenetic trees, but the reconstruction is constrained to those trees which contain that node (for example, Huelsenbeck and Bollback 2001), Pagel and colleagues showed that confidence in the reconstruction is over-estimated.
The study reported here was initiated prior to the publication of Krishnan's and Pagel's results and used methods and software tools which are commonly available to investigators. While the optimality methods used in all reconstructions may introduce bias, we may ask how important that bias is in the presence of the other factors investigated. Because the set of input sequences was constant, each reconstruction was conceptually of the same ancestor, and so we expected the variation in the ancestor to be less than that of the input sequences. However, the range of values of the structural and immunological attributes predicted computationally for the inferred ancestral sequences matched that of the input sequences (Figures 6, 7, 8). Both when the variation was due primarily to the method of rooting the tree, as for the predicted structure, or when it was due to the method of reconstruction, as for the number of predicted N-glycosylation sites, the range of values associated with each technique (eg, MP or ML) was a significant proportion of the range of values predicted for the input sequences. While optimality methods of sequence reconstruction may induce deterministic biases in inferred properties (Krishnan et al 2004), it appears that the other factors in the current study either compensate for, or overwhelm, the magnitude of these biases.
It is perhaps perplexing that the rooting method should have such significance when previous studies have revealed major differences between MP and ML. This may be due in part to the strongly radial form of the phylogenetic tree (Figure 1A). Rooting with one outgroup or another will often cause the ancestral node to be placed differently on the tree, in either case somewhere other than at the center. However the actual sequence of the outgroup also introduces variation through its influence on the reconstruction of character states at interior nodes of the tree. This outgroup effect may simply be a common source of variation which is routinely overlooked in assessments of reconstruction methods. The magnitude of this outgroup effect should relate to the divergence of the outgroup sequence from the most recent common ancestor with the ingroup. With tree topologies having a less radial topology, as when branch lengths are sampled by a Yule process, the placement of the root should be less sensitive to the choice of outgroup. However, as we show here, the outgroup affects both the placement of the root and the reconstruction of character states at internal nodes, so that an appreciable outgroup effect is still to be expected.
Whereas ML and MP are well-established techniques for ancestral reconstruction, Bayesian methods have not had wide use. Here, the empirical Bayes' method reconstructed ancestors very different from other ancestral sequences reconstructed on a rooted tree, and from one another, in three cases using different sets of sequences as the constraining outgroup (Figure 4), and in other analyses reported in the online Supplementary Material. Because the phylogenetic tree for HIV-1 has short internal branches and long terminal branches (Figure 1), our data may lack sufficient phylogenetic information for reliable determination of character states at nodes near the root. Under the empirical Bayes' method, the ancestral state for each site is the nucleotide having the greatest posterior probability given the outgroup constraints and the phylogenetic trees visited during MCMC sampling. So at highly variable sites the nucleotide selected may have only a modest posterior probability (P), approaching one-quarter in the limit. While most sites had nucleotide assignments with very high posterior probability (P ≫ 0.99), 5–9% of sites had P < 0.9 and 2–3% of sites had P < 0.7. Our results suggest that empirical Bayes' methods may in some situations, especially those with larger data sets, contribute to variation in the reconstructed ancestor. Methodological improvements may help to reduce that variation (Krishnan et al 2004).
The center-of-tree approach to mid-point rooting of the phylogenetic tree minimizes the effect of highly divergent taxa on the tree (Nickle et al 2003). Given the relatively symmetric and star-like phylogenetic tree for HIV-1 (Figure 1A), it is unsurprising that the observed sequence consensus shared many attributes with ancestral sequences estimated using center-of-tree rooting. Ancestors estimated using outgroups must be more divergent than the consensus or those obtained from a center-of-tree root because these outgroups are more distant (Korber et al 2000, their Fig. 1).
While we have studied the variation in ancestral sequences reconstructed on star-like phylogenetic trees (Figure 1), we recognize that for other tree topologies the factors associated with tree and sequence estimation may make different relative contributions to the variation in the reconstructed ancestor. Nevertheless, we have restricted ourselves to this one topology because, in our experience, highly infectious viruses such as HIV-1 and influenza exhibit such topologies and because this study was initiated in the context of identifying potential vaccine candidates. We wished to understand the variation in the ancestor which might arise from the methods used. An examination of ancestral reconstruction on other topologies, while desirable, is beyond the scope of this study.
Taxon sampling, that is whether the sample is representative of natural sequence variation, may be a source of variation in the reconstructed ancestor. Our data set is based on viral extracts from infected individuals sampled in a short time in many widespread US cities (Anderson et al 2001), and thus reduces the potential issue of having some clades over- or under-represented by recent clonal expansion within a host or host population. Here the contribution of taxon sampling to the variation in the reconstructed ancestor lay not so much in the number, as in the nature, of the input sequences, especially when the phylogenetic tree was rooted with an outgroup (Figure 9). A small number of relatively divergent sequences falling near the root played a dominant role in specifying the ancestor (Figures 1B, 10). Although sequences reconstructed on outgroup-rooted trees became progressively more different, and highly variable, at smaller taxon samples, this trend was due to the frequency of occurrence of these divergent sequences. In contrast, ancestors reconstructed on trees rooted at the center were largely unaffected by taxon sampling. These divergent sequences did not fall near the root when it was placed at the center of the tree, and consequently had less significance.
We might conjecture that the variation in ancestral sequences reconstructed on trees rooted at the center will always be lower. Because the tree center is defined as the point having the least squared distance to the leaves, the proportionate effect of a few long branches at the edge of the tree will be much less than when an outgroup is used to root the tree. Also, the standard method for midpoint rooting, using the midpoint of the longest path on the tree, should be more sensitive to the inclusion of highly-divergent taxa in the sample.
The impact of these basal taxa on outgroup-rooted trees will have significance for the sampling strategy of studies designed to characterize ancestral sequences. When speciation and extinction occur at constant rates over time, Sanderson (1996) has shown that only a relatively modest sample size is needed for one to be 95% confident that the phylogenetic tree includes the basal node for a large clade. Our results suggest that recombination or introgression may be an unexpected and greater source of variation in the reconstructed ancestor. For example, in viruses such as HIV-1 (McCutchan et al 1999) or FIV (Bachmann et al 1997; Casado et al 2001), putative recombinant sequences usually fall on a phylogenetic tree between the hypothesized parental strains or subtypes. In such cases the inclusion of recombinant sequences in the sample may have a significant impact on the ancestor reconstructed. Similarly systematic bias in the sampling of natural sequence variation may result in greater variance in the estimated basal node, and ancestral sequence.
The HIV-1 envelope glycoprotein elicits a strong humoral response from the host, and is consequently under strong selection for sequence change in order to evade epitope recognition by neutralizing antibodies, cytotoxic T lymphocytes and helper T lymphocytes (McMichael and Phillips 1997; Phillips et al 1991; Ross and Rodrigo 2002) and to acquire surface-bound carbohydrates to mask the protein from host surveillance (Reitter et al 1998; Wei et al 2003). In the context of HIV-1 vaccine design, our results suggest that the method of rooting the phylogenetic tree and the method of reconstructing the ancestor contribute to variation in predicted immunological properties, and so should be investigated when reconstructing ancestral sequences in order to obtain promising candidate vaccine sequences. On the other hand, the methods of tree construction, and the format in which the sequence was reconstructed, made only slight contributions to variation in the ancestral sequence. So, computationally efficient methods, such as using neighbor-joining to estimate the phylogenetic tree and using nucleotides as the information level, may be adequate to sample suitable candidate vaccines.
Studies of the biochemical evolution of ancient biomolecules or their application to biotechnology must cope with uncertainty in the estimation of the ancestral sequence. One approach to assessing this uncertainty is to use Bayesian methods to estimate the posterior probability distribution of the ancestor. Another approach, used here, is to use different methods to obtain a sample of potential ancestors. Our results clearly show that, in the case of a radial phylogeny, variation in the sequence and structural properties of the ancestors arose from variation in the method of rooting the phylogenetic tree. When pragmatic decisions must be made regarding resource allocation in the design of an experiment to investigate possible ancestors more computationally intensive methods of tree or sequence reconstruction may not be necessary when simpler methods will give an adequate representation of the ancestor-space.
Footnotes
Aknowledgements
Shaun Lott and Vic Arcus gave important advice on protein structure modeling. Alan Russell assisted with the computation. Greg Ewing assisted with the Bayesian analysis. Comments by David Posada and Eddie Holmes significantly improved this paper. This work was supported by a grant to J. I. M. from the Boeing Corporation. Partial financial support for this study came from the Allan Wilson Center for Molecular Ecology and Evolution and from the Bioinformatics Institute.
