
Abstract

We propose a novel method for the task of protein subfamily identification; that is, finding subgroups of functionally closely related sequences within a protein family. In line with phylogenomic analysis, the method first builds a hierarchical tree using as input a multiple alignment of the protein sequences, then uses a post-pruning procedure to extract clusters from the tree. Differently from existing methods, it constructs the hierarchical tree top-down, rather than bottom-up and associates particular mutations with each division into subclusters. The motivating hypothesis for this method is that it may yield a better tree topology with more accurate subfamily identification as a result and additionally indicates functionally important sites and allows for easy classification of new proteins. A thorough experimental evaluation confirms the hypothesis. The novel method yields more accurate clusters and a better tree topology than the state-of-the-art method SCI-PHY, identifies known functional sites, and identifies mutations that alone allow for classifying new sequences with an accuracy approaching that of hidden Markov models.

Keywords

clustering trees, top-down clustering, decision trees, protein subfamily identification, phylogenomics

Introduction

We consider the task of protein subfamily identification: given a set of sequences that belong to one protein family, the goal is to identify subsets of functionally closely related sequences (called subfamilies). This is in essence a clustering task. Most current methods for subfamily identification use a bottom-up clustering method to construct a cluster hierarchy, then cut the hierarchy at the most appropriate locations to obtain a single partitioning. Such approaches rely on the assumption that functionally similar proteins have sequences with a high overall similarity but do not exploit the fact that these sequences are likely to be highly conserved at particular positions. This raises the question to what extent clustering procedures can be improved by making them exploit this property.

In this article, we propose and evaluate an alternative clustering procedure that does exactly this. The procedure uses the “top-down induction of clustering trees” approach proposed by Blockeel et al.¹ This approach differs from bottom-up clustering methods in that it forms clusters whose elements do not only have high overall similarity but also have particular properties in common. In the case of subfamily identification, these properties can be the amino acids found at particular positions.

Apart from possibly yielding higher quality clusterings, this approach has the advantage that it automatically identifies functionally important positions and that new sequences can be classified into subfamilies by just checking those positions.

We evaluate the proposed approach on eleven publicly available datasets using a wide range of evaluation measures. We evaluate the predicted clustering as well as the underlying tree topology for which we propose two new measures. Our results show that splits based on polymorphic positions (ie, positions that have more than one amino acid residue) are highly discriminative between protein subfamilies, that using such splits to guide a clustering procedure improves protein subfamily identification, that the identified positions yield accurate classification of new sequences, and that the resulting clustering tree identifies functionally important sites.

Methods

We first describe our novel method for protein subfamily identification. Next, we briefly describe SCI-PHY, the state-of-the-art approach that we use as a reference point. Finally, we review the evaluation measures used in this paper.

Proposed method

Sequences within a protein subfamily are not only similar to each other, they are also characterized by a small set of conserved amino acids at particular locations, which distinguish them from sequences in other subfamilies. The method we propose exploits this property. It creates clusters in which sequences are not only globally similar, but additionally, identical in particular locations. These locations are discovered by the clustering process as it goes.

The method works top-down. It starts with a set of sequences, which is given as a multiple sequence alignment, and tries to split it into subsets such that (1) sequences within a subset are similar and (2) the split is defined by a test of the form p = a, or more generally p ∈ S, with p a location, a an amino acid, and S a set of amino acids.
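A split test of the form p ∈ S can be illustrated with a minimal Python sketch. The alignment below is a hypothetical toy example (it does not reproduce the sequences of Figure 1), and the function name `split_by_test` is our own choice for illustration, not part of Clus.

```python
def split_by_test(alignment, position, residues):
    """Partition aligned sequences by the test: sequence[position] in residues.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths)
    position:  0-based column index (an assumption; the text counts from 1)
    residues:  set of amino acids S in the test p in S
    """
    inside, outside = [], []
    for name, seq in alignment.items():
        (inside if seq[position] in residues else outside).append(name)
    return inside, outside


# Hypothetical toy alignment with six columns.
toy_alignment = {
    "S1": "MKTAYR",
    "S2": "MKTAYK",
    "S3": "MKSAYR",
    "S4": "MRTAYK",
    "S5": "MKTGYR",
}

# The test "p6 = R" in 1-based notation corresponds to column index 5 here.
yes_branch, no_branch = split_by_test(toy_alignment, 5, {"R"})
print(yes_branch, no_branch)
```

A single-residue test p = a is the special case in which S contains one amino acid.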

Figure 1 illustrates the effect of only allowing splits that can be defined by tests based on polymorphic positions. It shows how a set of sequences (S1 to S5) is split into two clusters, (S1, S3, S5) and (S2, S4), based on the test p6 = R that returns true for one cluster and false for the other. Looking only at overall sequence similarity, the clusters (S1, S3, S4, S5) and (S2) would be equally good, but from the biological point of view, the clustering with preserved amino acids within subclusters is preferred.

Figure 1.

Illustration of a split based on a polymorphic position.

After dividing a set into two subsets, the same principle can be used to further subdivide the subsets up to the level of singletons (subsets with only one sequence). This yields a hierarchical tree. For the purpose of subfamily identification, the tree is cut at particular locations, and the resulting clusters are assumed to form subfamilies.

Our method is implemented on top of Clus (http://dtai.cs.kuleuven.be/clus/), a general-purpose system for top-down clustering¹ that follows exactly this procedure. Pseudocode is shown in Figure 2. In a first phase (GrowTree), the method starts by splitting the whole dataset, then recursively splits the subsets up to the level of single sequences. To split a set, the algorithm tries all possible tests of the form p ∈ S, with p a location and S a set of amino acids^a. It tentatively splits the set according to each test, evaluates this split (according to a certain heuristic), and remembers the best one. It finally splits the set according to the best test encountered. In a second phase (PruneTree), the tree is pruned. In a single pruning step, a pair of sibling leaves is pruned, turning their parent into a leaf. Which pair is pruned is determined by a pruning heuristic. This step is repeated until the whole tree is reduced to a single node. Each tree encountered in the process defines a clustering, the leaves of the tree being the clusters. Among all the trees thus found, the one with the highest-quality clustering is returned as the final result.

Figure 2.

Pseudocode for the Clus-based approach.
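The growing phase can be approximated by the following Python sketch. It is not the Clus implementation: it assumes a normalized Hamming distance as a stand-in for Jones-Taylor-Thornton distances, restricts tests to the single-residue form p = a, uses the maximum average intercluster distance as the heuristic, and omits the pruning phase. All function names are hypothetical.

```python
from itertools import product


def hamming(s, t):
    """Fraction of mismatched columns (stand-in for a model-based distance)."""
    return sum(a != b for a, b in zip(s, t)) / len(s)


def avg_intercluster_distance(left, right, alignment):
    """Average pairwise distance between members of two clusters."""
    pairs = list(product(left, right))
    return sum(hamming(alignment[x], alignment[y]) for x, y in pairs) / len(pairs)


def candidate_tests(names, alignment):
    """Yield (position, residue) tests on polymorphic positions only."""
    for pos in range(len(alignment[names[0]])):
        residues = sorted({alignment[n][pos] for n in names})
        if len(residues) > 1:
            for res in residues:
                yield pos, res


def grow_tree(names, alignment):
    """Split recursively until no test separates the remaining sequences."""
    best = None
    for pos, res in candidate_tests(names, alignment):
        left = [n for n in names if alignment[n][pos] == res]
        right = [n for n in names if alignment[n][pos] != res]
        score = avg_intercluster_distance(left, right, alignment)
        if best is None or score > best[0]:  # ties keep the first test found
            best = (score, pos, res, left, right)
    if best is None:
        return {"leaf": names}
    _, pos, res, left, right = best
    return {
        "test": (pos, res),
        "yes": grow_tree(left, alignment),
        "no": grow_tree(right, alignment),
    }


# Hypothetical toy alignment.
toy = {"s1": "AAAA", "s2": "AAAT", "s3": "GGGA", "s4": "GGGT"}
tree = grow_tree(list(toy), toy)
print(tree["test"])
```

In this sketch, equivalent tests (tests that induce the same split) are not collected; only the first best-scoring test is stored in each node.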

The resulting clusters, which correspond to the predicted protein subfamilies, are then output along with the underlying tree, which explicates how the clusters were split and which tests were used. We show an example of such a tree in Figure 3. Note that the internal nodes typically contain multiple tests. This indicates that there are equivalent tests for that stage of the clustering process; tests are equivalent when they yield the same outcome for all the sequences.

Figure 3.

Example of a tree output by our method.

Apart from identifying subfamilies, the tree has two additional advantages. First, it allows for easy classification of new sequences into a subfamily. Starting at the root node, a new sequence is moved down the tree according to the outcome of the tests until it is classified into one of the predicted subfamilies. When not all tests in a node agree (which is impossible for the sequences used to build the tree, but may happen for other sequences), the majority decides. Second, the identified tests result in a candidate list of functionally important sites, that is, positions that are likely to play a role in the subfamily-specific functions. Protein functional site prediction is an important step in the functional analysis of new proteins (eg, Bickel et al,² Cheng et al,³ and Bray et al⁴). As biological validation is costly, providing a first selection of potential sites is an important advantage of our method.
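The classification procedure described above can be sketched as follows. The tree encoding (dictionaries with `tests`, `yes`, `no`, and `subfamily` keys) and the toy tree are hypothetical and chosen for illustration only. The text does not specify how ties between equivalent tests are resolved; this sketch sends ties to the "no" branch, which is an assumption.

```python
def classify(node, sequence):
    """Move a sequence down the tree; equivalent tests vote by majority."""
    while "subfamily" not in node:
        votes = sum(sequence[pos] in residues for pos, residues in node["tests"])
        # Strict majority goes to "yes"; ties go to "no" (an assumption).
        branch = "yes" if 2 * votes > len(node["tests"]) else "no"
        node = node[branch]
    return node["subfamily"]


# Hypothetical tree: the root holds three equivalent tests (0-based positions).
toy_tree = {
    "tests": [(0, {"A"}), (2, {"K", "R"}), (4, {"D"})],
    "yes": {"subfamily": "SF1"},
    "no": {
        "tests": [(1, {"G"})],
        "yes": {"subfamily": "SF2"},
        "no": {"subfamily": "SF3"},
    },
}

print(classify(toy_tree, "ACLTD"))  # two of the three root tests hold
```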

Important parameters of the method are the heuristics used during tree growing (to select the best test to split the data at each step) and pruning (to evaluate the quality of a tree). We now discuss these in detail. We will end this section with a note on the computational complexity of the method.

Test selection heuristics

In the experimental section, we explore three test selection heuristics. Two of them are standard for hierarchical tree learners: maximization of the average intercluster distance and maximization of the minimum intercluster distance. We call the versions of Clus that use these heuristics Clus-MaxAvgDist and Clus-MaxMinDist, respectively.

These heuristics do not take into account the particular requirements of the phylogenetic context. Using the average distance heuristic, for example, one essentially gets the top-down counterpart of the UPGMA algorithm,⁵ which is known to have some undesirable behavior.⁶ Therefore, we include a third heuristic, which was designed specifically for the phylogenetic context.⁷ This heuristic is based on the principle of minimum evolution, and it can be seen as the top-down counterpart of the criterion used by the well-known phylogenetic method Neighbor Joining.⁸ More specifically, it estimates the total branch length of the final tree that will be obtained if a particular test is chosen at this point and prefers the test that minimizes this estimate. We call the version of Clus with this heuristic Clus-MinLength. For the exact formula and more details about this heuristic, see Vens et al.⁷

The proposed selection heuristics make use of distances between pairs of amino acid sequences. Our method computes distances based on the Jones-Taylor-Thornton matrix,⁹ which is a model of amino acid substitution widely used for phylogenetic inference. Alternatively, we allow the user to give a pairwise distance matrix as input.

Extracting clusters from the hierarchical tree

The quality measure used during pruning is encoding cost,^10,11 which can be interpreted as the cost to encode a clustering given the homogeneity of the clusters and the number of clusters. Ideally, one wants to achieve two goals: to have as few clusters as possible and to have maximally homogeneous clusters. There is a trade-off between these two goals, as having fewer clusters implies larger clusters, which are less likely to be homogeneous. The encoding cost (Equation 1) combines these two goals.

$$\text{Encoding cost} = N \log k - \sum_{l = 1}^{k} \sum_{i} \log \Pr (n_{li} \mid \alpha)$$

(1)

The first component of the equation is the cost associated to the number of subfamilies, the second component is the cost to encode each subtree alignment for a certain clustering. More specifically, N is the number of sequences, k is the number of clusters, and Pr(n_li | α) is the probability of n_li, which is the count vector of observed amino acids for subfamily l at column i, under a Dirichlet mixture density α. Dirichlet mixture densities¹² contain prior information about amino acids and, when combined with observed amino acid frequencies, provide estimates of expected amino acid probabilities.¹⁰
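To make the two components of Equation 1 concrete, the sketch below computes an encoding cost under simplifying assumptions that differ from the actual method: a single Dirichlet component replaces the Dirichlet mixture, natural logarithms are used, and gap characters are ignored. The function names and the prior are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_prob_counts(counts, alpha):
    """log Pr(n | alpha) of a count vector under a Dirichlet-multinomial."""
    n_total = sum(counts)
    a_total = sum(alpha)
    result = (math.lgamma(n_total + 1) + math.lgamma(a_total)
              - math.lgamma(n_total + a_total))
    for n, a in zip(counts, alpha):
        result += math.lgamma(n + a) - math.lgamma(n + 1) - math.lgamma(a)
    return result


def encoding_cost(clusters, alpha):
    """N log k minus the summed column log-probabilities of every cluster.

    clusters: list of clusters, each a list of equal-length aligned sequences
    alpha:    Dirichlet parameters, one per amino acid (single component)
    """
    n_sequences = sum(len(cluster) for cluster in clusters)
    cost = n_sequences * math.log(len(clusters))
    for cluster in clusters:
        for column in zip(*cluster):
            observed = Counter(column)
            counts = [observed.get(aa, 0) for aa in AMINO_ACIDS]
            cost -= log_prob_counts(counts, alpha)
    return cost


# Hypothetical symmetric prior and toy clustering.
toy_alpha = [0.1] * len(AMINO_ACIDS)
print(encoding_cost([["AAAA", "AAAA"], ["GGGG", "GGGG"]], toy_alpha))
```

During pruning, the per-cluster sums could be cached so that only the clusters affected by a merge are recomputed.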

Computational complexity

The computational complexity of the proposed method is O(aN²logN), with a the alignment length and N the number of sequences. We obtain this complexity by adding the complexity of the tree building and postpruning procedures, as described next.

The complexity to construct the tree is O(aN²logN) under the assumption that a reasonably symmetric tree is built (the depth of which is logarithmic in the number of leaves).⁷ In order to extract subfamilies from the tree, every pruning step and every merging candidate requires calculating Equation 1. In two subsequent calculations, most of the clusters remain the same, and therefore most of the Σ_i log Pr(n_li | α) terms do not change. We can avoid recomputing these values by calculating them only once for every node (cluster) and storing them. As a (complete) tree with N sequences contains 2N − 1 nodes, and the computation of Equation 1 has a complexity O(aN), the resulting complexity of the cluster extraction is O(aN²), leading to an overall complexity of O(aN²logN) for the complete method.

SCI-PHY

To identify protein subfamilies, SCI-PHY¹¹ first builds a hierarchical tree bottom-up. It then extracts clusters from the tree, which are output as the predicted subfamilies.

The tree construction process starts with each sequence being a separate cluster. Then, for each cluster, a profile is defined, which gives the expected amino acid probabilities for each position based on the observed amino acid distribution and a Dirichlet mixture density.¹² Next, using relative entropy¹³ to estimate the distance between the profiles, the two closest profiles are merged, and a profile for the new cluster is created. This merging procedure is repeated until all sequences are part of the same cluster. Finally, the resulting tree topology is given as input to a postpruning procedure. This pruning procedure returns the stage in the clustering procedure with minimal encoding cost.

Once the protein subfamilies have been predicted, SCI-PHY can classify new protein sequences into one of these subfamilies by subfamily hidden Markov model (SHMM) construction.¹⁴ A SHMM is built for each subfamily, and the best match with the query sequence is predicted.

We use SCI-PHY in our experimental comparison (Results and Discussion) because it has been extensively evaluated: it was compared to several methods, and was shown to give comparable or superior results.^11,15 To our knowledge, no other method has been shown to give better results.

Evaluation measures

In the Results and Discussion section, we evaluate the subfamilies output by Clus and SCI-PHY on a number of datasets for which the true subfamilies (reference clustering) are known. We evaluate both the tree topology from which clusters are extracted and the clusters themselves. There are three reasons for also evaluating the tree topology. First, as the authors of SCI-PHY also point out, the definition of the “right” clusters is somewhat arbitrary, since subfamilies can be defined on several levels of granularity. By evaluating the tree topology, which defines clusterings on multiple levels, we can analyze how the reference clustering is represented in the tree. Second, we can evaluate the results regardless of the quality of the pruning procedure. Third, the tree is often interesting in itself, as it can help biologists to interpret the predicted clustering and gain insight into how the clusters are related.

Tree topology evaluation

We evaluate tree topologies using three measures: tree-based classification error,¹⁶ edited tree size, and number of subfamily changes.

Edited tree size

The edited tree size indicates how compact the smallest possible pure clustering derived from the tree is. It is calculated by repeatedly merging sibling leaves that belong to the same subfamily until such merging is no longer possible.

Consider the two trees shown in Figure 4, for example. They have five sequences, three of which belong to subfamily 1 (S1), and two of which belong to subfamily 2 (S2). The edited tree for tree a would merge the S2 sequences, resulting in an edited tree size of 4. The edited tree for tree b would merge the two S1 sequences connected by branches r₇ and r₈, also resulting in an edited tree size of 4.

Figure 4.

Two trees with the same edited tree size, a smaller TBC error for tree b, and a smaller number of subfamily changes for tree a.
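The edited tree size can be computed with a short recursive procedure. The sketch below assumes that trees are encoded as nested tuples whose leaves are subfamily labels and that the size is the number of leaves of the edited tree. The two toy trees are hypothetical; their values agree with those reported for Figure 4, but whether their topologies match the figure is an assumption.

```python
def collapse(tree):
    """Merge sibling leaves of the same subfamily, bottom-up."""
    if isinstance(tree, str):
        return tree
    children = [collapse(child) for child in tree]
    if all(isinstance(c, str) for c in children) and len(set(children)) == 1:
        return children[0]  # all children are leaves of one subfamily
    return tuple(children)


def count_leaves(tree):
    """Number of leaves in a nested-tuple tree."""
    if isinstance(tree, str):
        return 1
    return sum(count_leaves(child) for child in tree)


def edited_tree_size(tree):
    """Number of leaves left after all possible merges."""
    return count_leaves(collapse(tree))


# Hypothetical trees with three S1 and two S2 sequences each.
tree_a = ("S1", ("S1", ("S1", ("S2", "S2"))))
tree_b = (("S2", ("S2", "S1")), ("S1", "S1"))
print(edited_tree_size(tree_a), edited_tree_size(tree_b))
```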

Tree-based classification error

Similarly to the edited tree size, the tree-based classification (TBC) error¹⁶ evaluates to which extent the tree places sequences from the same subfamily in the same subtree. While the former considers clusters that are pure and as large as possible, the TBC error considers clusters that minimize the number of “classification errors” in the derived clustering, as follows.

A subtree is said to be “good” for a subfamily F if more than half of its sequences belong to F and more than half of F's sequences belong to the subtree. Given a set of disjoint good subtrees, a sequence is considered correctly classified if it occurs in a good subtree for its subfamily, and incorrectly classified otherwise. The TBC error is defined as the smallest number of incorrectly classified sequences in any set of disjoint good subtrees.

Tree a in Figure 4 defines two good subtrees: a cut in branch l₅ yields a good subtree for S2 (with three classification errors), and the complete tree is a good subtree for S1 (with two classification errors). Hence, the TBC error for this tree is two. Tree b defines 3 good subtrees. A cut in r₁ (or r₆) yields two disjoint good subtrees (resulting in 1 classification error): a good subtree for S2 at the left and a good subtree for S1 at the right. The complete tree is again a good subtree for S1, with two classification errors. Hence, the TBC error for this tree is 1. Lazareva-Ulitsky et al¹⁶ provide an algorithm to compute the TBC error.
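The definition above translates into a simple bottom-up computation: for every node, either the node itself is selected as a good subtree, or the best selections of its children are combined. The sketch below follows this idea; it is our own illustration of the definition, not the algorithm of Lazareva-Ulitsky et al.¹⁶ Trees are assumed to be nested tuples whose leaves are subfamily labels, and the toy trees are hypothetical.

```python
from collections import Counter


def tbc_error(tree):
    """Smallest number of misclassified sequences over disjoint good subtrees."""

    def leaf_counts(node):
        if isinstance(node, str):
            return Counter([node])
        total = Counter()
        for child in node:
            total += leaf_counts(child)
        return total

    family_sizes = leaf_counts(tree)

    def best_correct(node):
        """Return (label counts, max number of correctly classified leaves)."""
        if isinstance(node, str):
            counts, from_children = Counter([node]), 0
        else:
            counts, from_children = Counter(), 0
            for child in node:
                child_counts, child_best = best_correct(child)
                counts += child_counts
                from_children += child_best
        size = sum(counts.values())
        as_good_subtree = 0
        for family, n in counts.items():
            # Good subtree: majority of the subtree and majority of the family.
            if 2 * n > size and 2 * n > family_sizes[family]:
                as_good_subtree = n
        return counts, max(as_good_subtree, from_children)

    _, correct = best_correct(tree)
    return sum(family_sizes.values()) - correct


# Hypothetical trees with three S1 and two S2 sequences each.
tree_a = ("S1", ("S1", ("S1", ("S2", "S2"))))
tree_b = (("S2", ("S2", "S1")), ("S1", "S1"))
print(tbc_error(tree_a), tbc_error(tree_b))
```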

The subtrees defined by the TBC error are more permissive than the ones defined by the edited tree size in the sense that clusters are not required to be pure; on the other hand, TBC error is stricter in the case where sequences from the same subfamily are spread over two or more subtrees.

Both measures depend on the place of the tree root. If the root of tree a were in branch l₅, for example, the edited tree size would be 2 instead of 4 and the TBC error would be 0 instead of 2. Although evaluating the rooted tree is important, since the root influences the possible ways in which the tree can be cut, we also propose a measure that is independent of the place of the root.

Number of subfamily changes

If we associate a subfamily (or alternatively, a molecular function) to each internal node of the tree, then we say that a subfamily change occurs for each branch connecting two nodes with different associated subfamilies. For instance, if we associate subfamily 1 to the root of tree a in Figure 4, there is one change to subfamily 2 in branch l₅. The right tree, however, requires 2 subfamily changes (branches r₂ and r₅).

Counting the minimal number of subfamily changes corresponds to counting mutations in parsimony analysis,¹⁷ where one prefers the phylogenetic tree that requires the least evolutionary change to explain some observed data. Although we consider clustering trees rather than phylogenetic trees, we can directly apply the Fitch parsimony algorithm¹⁷ to count the number of subfamily changes.
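A minimal sketch of the Fitch counting step is given below, assuming a rooted binary tree encoded as nested 2-tuples whose leaves are subfamily labels. The function name and the toy trees are hypothetical.

```python
def subfamily_changes(tree):
    """Minimal number of subfamily changes on a rooted binary tree (Fitch)."""
    changes = 0

    def state_set(node):
        nonlocal changes
        if isinstance(node, str):
            return {node}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:
            return common
        changes += 1  # no shared state: one change is needed below this node
        return left | right

    state_set(tree)
    return changes


# Hypothetical trees with three S1 and two S2 sequences each.
tree_a = ("S1", ("S1", ("S1", ("S2", "S2"))))
tree_b = (("S2", ("S2", "S1")), ("S1", "S1"))
print(subfamily_changes(tree_a), subfamily_changes(tree_b))
```

Because the parsimony count does not depend on where the tree is rooted, rerooting either toy tree leaves the result unchanged.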

Note that in contrast to the previous two measures, this measure does not penalize a tree for having a ladder-like shape. That is why tree a has a smaller number of subfamily changes than tree b, while having the same edited tree size and a higher TBC error. For an example with larger trees, consider Figures S1 and S2 in the supplemental material. The tree in Figure S1 has an edited tree size of 12, a TBC error of 11, and nine subfamily changes, while the tree in Figure S2 has an edited tree size of 28, a TBC error of 64, and 11 subfamily changes. The larger difference between the trees in their edited tree size and TBC error is due to the ladder-like shape of the tree in Figure S2. However, it is important to note that the shape of the tree does influence the possible ways to cut it. Therefore, we use the three measures, as they provide complementary information to one another.

Clustering evaluation

We evaluate clusters using three measures used earlier for SCI-PHY¹¹ (purity, VI distance, and edit distance) and two additional ones: the percentage of sequences in pure clusters and category utility. The first four measures compare the predicted clustering to the reference clustering, while the last one evaluates the quality of the predicted clustering itself, regardless of a given reference clustering.

Purity

Purity is defined as the fraction of clusters in a given clustering that contain instances of only one reference cluster. It assesses the ability of the method to cluster instances of different kinds in different clusters. However, as it does not penalize if instances of one kind are spread over many pure clusters, perfect purity can be achieved when every instance corresponds to a single cluster. Therefore, singletons are not included in the calculation.

Percentage of examples in pure clusters

To complement the information given by purity, we also report the percentage of examples in pure clusters (denoted further as PctExPureC). Again, singleton clusters are discarded.
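Both measures can be sketched in a few lines of Python. The sketch assumes that clusterings are given as dictionaries from sequence identifiers to cluster labels, that at least one non-singleton cluster exists, and that PctExPureC is taken relative to the examples in non-singleton clusters; the text does not fix this denominator, so that choice is an assumption.

```python
def purity_and_pct_pure(predicted, reference):
    """Return (purity, PctExPureC), ignoring singleton clusters.

    predicted, reference: dicts mapping sequence id -> cluster label
    """
    clusters = {}
    for seq_id, cluster in predicted.items():
        clusters.setdefault(cluster, []).append(seq_id)
    non_singletons = [m for m in clusters.values() if len(m) > 1]
    pure = [m for m in non_singletons if len({reference[s] for s in m}) == 1]
    purity = len(pure) / len(non_singletons)
    n_examples = sum(len(m) for m in non_singletons)
    pct_pure = 100 * sum(len(m) for m in pure) / n_examples
    return purity, pct_pure


# Hypothetical example: one pure cluster, one mixed cluster, one singleton.
toy_predicted = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 2, "f": 3}
toy_reference = {"a": "A", "b": "A", "c": "B", "d": "B", "e": "A", "f": "C"}
print(purity_and_pct_pure(toy_predicted, toy_reference))
```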

Edit distance

The edit distance between two clusterings is the number of merge and/or split operations required to transform one clustering into the other one. For example, if instances of three kinds (A, B, and C) are clustered in only one cluster, we need two split operations to separate the three groups of instances. The higher the edit distance is, the more different the clusterings are.

The formal definition of edit distance is given by Equation 2,¹¹ where Edit(C¹, C²) is the edit distance to transform clustering C¹ into clustering C² (or the other way around), k′ is the number of clusters in C¹, k″ is the number of clusters in C², and $r (C_{m}^{1}, C_{n}^{2})$ is equal to 1 if clusters $C_{m}^{1}$ and $C_{n}^{2}$ have items in common and equal to 0 otherwise.

$$\mathrm{Edit}(C^{1}, C^{2}) = 2 \sum_{m = 1}^{k'} \sum_{n = 1}^{k''} r (C_{m}^{1}, C_{n}^{2}) - k' - k''$$

(2)

Edit distance penalizes more strongly clusterings for which clusters are too small. For this reason, this measure can be used to counter-balance purity.
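Equation 2 can be evaluated directly by counting overlapping cluster pairs, with r equal to 1 when two clusters share at least one item. The sketch below assumes that each clustering is a list of sets over the same items; the item names are hypothetical.

```python
def edit_distance(clustering_1, clustering_2):
    """Number of merge/split operations between two clusterings (Equation 2)."""
    overlaps = sum(1 for c1 in clustering_1 for c2 in clustering_2 if c1 & c2)
    return 2 * overlaps - len(clustering_1) - len(clustering_2)


# The example from the text: kinds A, B, and C lumped into one cluster.
lumped = [{"a1", "a2", "b1", "c1"}]
separated = [{"a1", "a2"}, {"b1"}, {"c1"}]
print(edit_distance(lumped, separated))
```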

VI distance

The VI distance (variation of information distance) measures the amount of information that is not shared between two clusterings. The formula to calculate the VI distance is given by Equation 3,¹¹ where H(C¹) (Equation 4) is the entropy of clustering C¹, and I(C¹, C²) (Equation 5) is the mutual information between clusterings C¹ and C². In Equation 4, |C_l| is the number of instances in cluster C_l, |C| is the total number of instances in the clustering, and k is the number of clusters in C.

$$\mathrm{VI}(C^{1}, C^{2}) = H (C^{1}) + H (C^{2}) - 2\, I (C^{1}, C^{2})$$

(3)

$$H (C) = - \sum_{l = 1}^{k} \frac{| C_{l} |}{| C |} \log \frac{| C_{l} |}{| C |}$$

(4)

$$I (C^{1}, C^{2}) = \sum_{m = 1}^{k'} \sum_{n = 1}^{k''} \frac{| C_{m}^{1} \cap C_{n}^{2} |}{| C |} \log \frac{| C_{m}^{1} \cap C_{n}^{2} | \, | C |}{| C_{m}^{1} | \, | C_{n}^{2} |}$$

(5)

In Equation 5, $| C_{m}^{1} \cap C_{n}^{2} |$ is the number of overlapping instances between clusters $C_{m}^{1}$ and $C_{n}^{2}$, and k′ and k″ are the number of clusters in C¹ and C², respectively.
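The VI distance can be sketched as follows, using the standard definitions of entropy and mutual information with natural logarithms (the choice of base is an assumption and only rescales the result). Clusterings are assumed to be lists of sets over the same items; pairs of clusters without shared items contribute nothing to the mutual information.

```python
import math


def entropy(clustering, n_items):
    """Entropy of the cluster-size distribution."""
    return -sum(len(c) / n_items * math.log(len(c) / n_items) for c in clustering)


def mutual_information(clustering_1, clustering_2, n_items):
    """Mutual information between two clusterings of the same items."""
    total = 0.0
    for c1 in clustering_1:
        for c2 in clustering_2:
            shared = len(c1 & c2)
            if shared:
                ratio = shared * n_items / (len(c1) * len(c2))
                total += shared / n_items * math.log(ratio)
    return total


def vi_distance(clustering_1, clustering_2):
    """Variation of information: H(C1) + H(C2) - 2 I(C1, C2)."""
    n_items = sum(len(c) for c in clustering_1)
    return (entropy(clustering_1, n_items) + entropy(clustering_2, n_items)
            - 2 * mutual_information(clustering_1, clustering_2, n_items))


# Hypothetical example: one cluster of four items versus two clusters of two.
one_cluster = [{"a", "b", "c", "d"}]
two_clusters = [{"a", "b"}, {"c", "d"}]
print(vi_distance(one_cluster, two_clusters))
```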

Category utility

Category utility¹⁸ computes the improvement of the predictability of attributes given the clustering in comparison with the situation in which no clustering is defined; in the context of protein subfamily identification, the attributes are the positions in the sequence alignment. The definition of category utility is given by Equation 6, where Pred(A|C) (Equation 7) measures the predictability of the descriptive attributes A given the clustering C, Pred(A) (Equation 8) measures the predictability of A when no clustering is defined, and k is the number of clusters. Note that the division by the number of clusters is important to have a trade-off between improvement of the predictability of attributes and the number of clusters.

$$\mathrm{CU}(C) = \frac{\mathrm{Pred}(A \mid C) - \mathrm{Pred}(A)}{k}$$

(6)

$$\mathrm{Pred}(A \mid C) = \sum_{l = 1}^{k} \Pr (C_{l}) \sum_{i} \sum_{j} \Pr (A_{i} = a_{ij} \mid C_{l})^{2}$$

(7)

$$\mathrm{Pred}(A) = \sum_{i} \sum_{j} \Pr (A_{i} = a_{ij})^{2}$$

(8)

In Equation 7, Pr(C_l) is the probability that an arbitrary instance belongs to cluster C_l, i ranges over the instance attributes, j ranges over the possible values of attribute A_i, and Pr(A_i = a_ij|C_l) is the probability that attribute A_i has value a_ij, given that the instance belongs to cluster C_l. In Equation 8, Pr(A_i = a_ij) is the probability that A_i has value a_ij when no clustering is defined.
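Equations 6 to 8 can be sketched as follows, assuming that probabilities are estimated by the observed relative frequencies in each alignment column and that gap characters are treated like any other value. The function names and the toy clusters are hypothetical.

```python
from collections import Counter


def predictability(sequences):
    """Sum over columns i and values j of Pr(A_i = a_ij) squared."""
    total = 0.0
    for column in zip(*sequences):
        counts = Counter(column)
        total += sum((n / len(column)) ** 2 for n in counts.values())
    return total


def category_utility(clusters):
    """(Pred(A|C) - Pred(A)) / k for a list of clusters of aligned sequences."""
    everything = [seq for cluster in clusters for seq in cluster]
    pred_given_c = sum(
        len(cluster) / len(everything) * predictability(cluster)
        for cluster in clusters
    )
    return (pred_given_c - predictability(everything)) / len(clusters)


# Hypothetical example: two perfectly homogeneous clusters of two sequences.
toy_clusters = [["AA", "AA"], ["GG", "GG"]]
print(category_utility(toy_clusters))
```

With a single cluster, Pred(A|C) equals Pred(A), so the category utility is zero.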

Results and Discussion

We empirically evaluate, first, the soundness of the assumptions underlying our method and second, the method's capacity to respectively propose a meaningful tree topology, identify subfamilies, classify new sequences, and identify functional regions. Finally, we discuss related work.

Datasets

We use two groups of datasets. The first group consists of the five EXPERT datasets used by Brown et al¹¹ to evaluate SCI-PHY. The second group consists of six datasets extracted from NucleaRDB,¹⁹ which contains protein data for nuclear hormone receptor (NHR) families. These eleven datasets were chosen because for each of them a reliable subfamily identification is provided for every sequence, which gives us a gold standard to evaluate the results.

Each dataset consists of the multiple sequence alignment (MSA) for one protein family. The EXPERT datasets contain sequences from the families Enolase, Crotonase, Secretin, Aminergic (Amine), and NHR. The NucleaRDB datasets contain sequences from the families thyroid hormone like (Thyroid), estrogen like (Estrogen), nerve growth factor IB-like (Nerve), HNF4-like (HNF4), fushi tarazu-F1 like (Fushi), and DAX like (DAX). To construct the NucleaRDB datasets, we used MSAs for each family as provided by NucleaRDB, with replicate sequences removed.

For Amine, NHR, Thyroid, Estrogen, and HNF4, subfamilies are provided at more than one level of granularity. Thus, two sequences can be associated to the same subfamily x in one dataset but to different subfamilies x.1 and x.2 in the other dataset.

Some of the datasets are very unbalanced, complicating the subfamily identification task. For instance, Enolase contains a subfamily consisting of 60% of the sequences. In Crotonase and NHR 1, the number of sequences per subfamily ranges from 1 to 212 (58% of the total 365 sequences) and from 1 to 139 (34% of the total 412 sequences), respectively.

Table 1 shows statistics for the EXPERT and NucleaRDB datasets.

Table 1.

Statistics for the datasets.

Datasets	Nb subfam	Nb seq	Align length	Avg dist (family)	Avg dist (subfam)
Enolase	8	472	431	2.229	1.041
Crotonase	10	365	264	1.842	0.728
Secretin	15	153	263	1.885	0.485
Amine 1	7	358	344	1.467	1.075
Amine 2	31	358	344	1.467	0.442
NHR 1	8	412	183	2.124	0.945
NHR 2	27	412	183	2.124	0.547
NHR 3	77	409	183	2.116	0.263
Thyroid 1	8	799	239	1.771	0.708
Thyroid 2	24	799	239	1.771	0.375
Estrogen 1	3	482	226	1.041	0.498
Estrogen 2	10	482	226	1.041	0.301
HNF4 1	5	448	229	1.276	0.619
HNF4 2	22	448	229	1.276	0.404
Nerve	5	76	219	0.429	0.26
Fushi	4	117	227	0.756	0.369
DAX	2	40	133	0.867	0.397

Note: For each dataset we report the number of subfamilies, the number of sequences, the MSA length, the average pairwise distance between all sequences within the family, and the overall average distance within the subfamilies (we first calculate the average pairwise distance for each subfamily, and then we report the average value over all subfamilies). The sequence distances were calculated based on the Jones-Taylor-Thornton model.

Testing the usability of polymorphic positions for clustering protein subfamilies

In this experiment, we verify our assumption that splits based on polymorphic positions can indeed discriminate protein subfamilies. To that aim, we add the subfamily information to the data and build a classification tree using Clus (ie, we perform supervised learning), without pruning, that is, up to the point where all leaves are pure. Table 2 shows the number of leaves in the resulting tree for each dataset.

Table 2.

Number of leaves in the classification trees (CTs).

Datasets	Nb leaves	Datasets	Nb leaves
Enolase	8	Thyroid 1	13
Crotonase	11	Thyroid 2	38
Secretin	15	Estrogen 1	4
Amine 1	14	Estrogen 2	15
Amine 2	34	HNF4 1	7
NHR 1	11	HNF4 2	36
NHR 2	30	Nerve	5
NHR 3	79	Fushi	4
		DAX	2

Note: The CTs were built using supervised learning. All the leaf nodes in the CTs are pure.

The results show that subfamilies can be perfectly separated from one another using compact trees containing slightly more leaves than the number of subfamilies in the datasets. For five of the datasets (Enolase, Secretin, Nerve, Fushi, and DAX), the classification tree has the same number of leaves as the number of subfamilies. From this we conclude that polymorphic positions are indeed highly discriminative for protein subfamily identification.

The fact that a good clustering tree exists does not imply it will be found by our learner. The above trees are built with the subfamily information, but in a real situation, this will not be the case. In the next sections we evaluate our unsupervised learning method.

Evaluating the tree topology

A first experimental comparison between the three variants of our method (Clus-MinLength, Clus-MaxAvgDist, and Clus-MaxMinDist) on the EXPERT datasets showed better performance for Clus-MinLength, the version adapted to phylogenetic data, than for the other versions on all criteria (see Tables 3, 4, and 5). For this reason, we focus on Clus-MinLength for the remainder of the paper.

Table 3.

Edited tree size: choosing the test selection criterion.