Abstract

Prediction of key features of protein structures, such as secondary structure, solvent accessibility and number of contacts between residues, provides useful structural constraints for comparative modeling, fold recognition, ab-initio fold prediction and detection of remote relationships. In this study, we aim at characterizing the number of non-trivial close neighbors, or long-range contacts of a residue, as a function of its “topohydrophobic” index deduced from multiple sequence alignments and of the secondary structure in which it is embedded. The “topohydrophobic” index is calculated using a two-class distribution of amino acids, based on their mean atom depths. From a large set of structural alignments processed from the FSSP database, we selected 1485 structural sub-families including at least 8 members, with accurate alignments and limited redundancy. We show that residues within helices, even when deeply buried, have few non-trivial neighbors (0–2), whereas β-strand residues clearly exhibit a multimodal behavior, dominated by the local geometry of the tetrahedron (3 non-trivial close neighbors associated with one tetrahedron; 6 with two tetrahedra). This observed behavior allows the distinction, from sequence profiles, between edge and central β-strands within β-sheets. Useful topological constraints on the immediate neighborhood of an amino acid, but also on its correlated solvent accessibility, can thus be derived using this approach, from the simple knowledge of multiple sequence alignments.

Keywords

long-range contact solvent accessibility multiple alignment sequence profile hydrophobicity regular secondary structures

Introduction

Among the set of relatively simple principles that govern the three-dimensional structures of globular protein domains (Chothia, 1984), two are of obvious importance: i) the masking of a large part of the main chain polarity through the establishment of hydrogen bonds between the amide protons and carbonyl oxygens (mainly within α-helices and β-sheets) and, ii) the hydrophobic effect, underlying the formation of hydrophobic cores of globular domains. In this context, we highlighted several years ago that strong hydrophobicity has to be conserved in some key positions of a given fold, which were called “topohydrophobic” positions (Poupon and Mornon, 1998; Poupon and Mornon, 1999; Poupon and Mornon, 1999; Poupon and Mornon, 2001). Within a typical globular domain, a third of the amino acids belong to a clear hydrophobic group (VILFMYW), but only half of these strongly hydrophobic amino acids occupy “topohydrophobic” positions (Poupon and Mornon, 1998; Poupon and Mornon, 1999; Poupon and Mornon, 1999; Poupon and Mornon, 2001), which are mainly located within regular α and β secondary structures.

“Topohydrophobic” positions have noticeable features, as observed from a comprehensive analysis of structural alignments and their associated three-dimensional structures: i) the amino acids in these positions are much more buried than those occupying “non-topohydrophobic” positions (Poupon and Mornon, 1998); ii) the side chains of these amino acids are markedly less dispersed from one domain to another (though belonging to the same fold), than those located at “non-topohydrophobic” positions (Poupon and Mornon, 1998; Poupon and Mornon, 1999); iii) they constitute a continuous network of positions in close contact, matching well the inner part of the hydrophobic core (Poupon and Mornon, 1998; Poupon and Mornon, 1999); iv) they are mainly occupied by amino acids constituting the folding nuclei (Poupon and Mornon, 1999).

Identification of these “topohydrophobic” positions from the knowledge of sequence data only is possible in practice if an accurate alignment of a small number (e.g. 5 to 8) of sufficiently divergent sequences sharing the same fold (e.g. in the 15–25% sequence identity range) can be performed. From sequence data only, amino acids of crucial importance for the considered fold can be thus highlighted, thereby providing topological constraints at long distance along the sequences, which can be useful in a general way to understand topological features of the protein universe (Lindorff-Larsen et al. 2005).

In the present study, we refine and extend the concept of “topohydrophobic” positions, by introducing a generalized topohydrophobic index, which evaluates at each position of a given sequence alignment the fraction of amino acids belonging to the hydrophobic group. We then wish to characterize the number of non-trivial close neighbors of each position of a multiple alignment, depending on this generalized topohydrophobic index deduced from current evolutionary profiles and on the associated predicted secondary structure state. The non-trivial close neighborhood of a residue, which can also be defined as non-local or long range contacts, is the set of amino acids sufficiently distant in the 1D sequence but close in the tertiary structure of the considered protein domain. Residues known to be in local proximity (e.g. covalence and α or β local chain neighbors) are excluded from this set.

In order to define the foundations for predictive studies, we first perform a comprehensive analysis on the basis of accurate reference alignments, selected from structural databases. Hence, we consider a large set of structural alignments allowing good statistics, focusing only on the regular secondary structures that are the building blocks of globular protein domains. Thus, the core blocks defined in this way only include regions aligned with maximal reliability. The topohydrophobic index is based on the natural partition of amino acids in two groups, considering the mean atom depth associated with each kind of amino acid (Pintar et al. 2003; Pintar et al. 2003). This value is indeed closely related to the mean hydrophobicity, and provides a clear separation between hydrophobic residues and the other ones.

The present analysis differs significantly from previous estimations of absolute contact numbers of residues from amino acid sequence data (Fariselli and Casadio, 2000; Ishida et al. 2006; Kinjo et al. 2005; Pollastri et al. 2001; Pollastri et al. 2002; Yuan, 2005). Indeed, these studies generally consider all contacts in a large sphere (typical distance cut-off of 12 Å between Cβ atoms), whereas we focus here on the mean local non-trivial neighborhood of a position within both kinds of regular secondary structures (α-helices and β-strands) using multiple alignments and a short distance cutoff of 7 Å between Cα atoms. Consequently, the number of predicted neighbors is considerably smaller, in the range of 0 to 6, instead of typically 0–50, as described in previous works. Our study also differs from those devoted to the prediction of long-range contact maps (e.g. Punta and Rost, 2005), as these do not generally focus on the quantification of these contacts with respect to the secondary structure and to the evolutionary hydrophobicity profile of the considered residue.

We show here that an informative neighborhood of residues can be highlighted from sequence data, which differs between helices (often 0 to 2 such neighbors) and strands (mainly 3 to 6 neighbors). Moreover, a clear multimodal behavior of strands can be observed, with a first main state around three neighbors (tetrahedral arrangement), and the other one around six neighbors (two tetrahedra sharing a vertex). This multimodal behavior allows the distinction between central and edge β-strands. Given the high accuracy reached by secondary structure predictors using multiple alignments (e.g. Frishman and Argos, 1997; Jones, 1999; Pollastri and McLysaght, 2005; Rost and Sander, 1995; Thompson and Goldstein, 1997), the present study offers the possibility of acquiring good-quality information to predict tertiary structures from sequence data only, using a minimal number of parameters.

Methods

Datasets and reduction of redundancy

The structural alignments used in this study provide enough data to obtain accurate results, while still retaining structural relevance. Structural alignments performed and/or extensively corrected by human expertise, such as those used for the previous description of “topohydrophobic” positions (Poupon and Mornon, 1998), furnish particularly good data; however, due to the considerable increase of structural data, such an expert-based procedure is now inconceivable for analysis on a large scale.

Among the main available databases of structural alignments (e.g. BaliBASE (Thompson et al. 1999; Thompson et al. 1999), HOMSTRAD (Mizuguchi et al. 1998), PALI (Balaji et al. 2001), FSSP (Holm and Sander, 1994)), only FSSP (Families of Structurally Similar Proteins) offers a large number of families that include at least 8 members and display enough sequence divergence to be informative. For example, PALI, which uses the SCOP classification (Murzin et al. 1995), only included, at the time of this study, 171 families with 8 members or more. Moreover, this number dramatically decreases when adding a sequence divergence criterion (the Sequence Identity (SI) between two members of the same family must be less than 50%). FSSP is based on an automatic processing of structural alignments, using a score of structural similarity (Z-score) (Holm and Sander, 1993). The FSSP release we considered contains 2859 sub-families, 2520 being composed of at least 8 members and thus satisfying the selection criteria on work positions, as defined below (Fig. 1). The amount of data is substantial, as these 2520 alignments include 403 500 sequences, built from 26 577 different amino acid chains. Many chains are therefore present in several sub-families, particularly owing to the presence of the same globular folds within multi-domain proteins. This redundancy has to be reduced before any analysis. To that aim, we use two criteria: the level of sequence identity (SI) and the structural alignment quality (Z). One expects, as a main feature, that the structural quality is on average markedly better within regular secondary structures (α-helices and β-strands) than within coil regions. Hence, we do not consider loops and linker regions, in which alignments are known to be often of poor quality or even meaningless.

Figure 1

General features of the FSSP sub-families. A. Mean pairwise sequence identity calculated on final files within each sub-family. B. Distribution of mean structural Z-scores within each FSSP sub-family. C. Size distribution of FSSP aligned families. The peak between 30 and 40 members per family corresponds to the existence of fold superfamilies (e.g. the terpenoid synthase superfamily (1di1A)).

i) Sequence Identity (SI). Among families, a pairwise sequence identity (SI) cut-off of 90% dramatically reduces the number of amino acid chains considered from 26 577 to 5055. A more stringent SI threshold (50%) still retains 3519 different sequences. We consider this value as a good compromise between the amount of informative data and an acceptable level of redundancy. Meanwhile, the number of families with at least 8 members only slightly decreases (2520 for the initial dataset, 2431 for SI = 90% and 2406 for SI = 50%). Figure 1A shows that the mean pairwise identity on work positions within each sub-family is indeed low (8.3%), giving evidence for a low redundancy, while keeping good structural superimposition (Fig. 1B).

ii) Structural alignment quality (Z) (Holm and Sander 1994). In the same vein, a compromise has to be found between the amount of data and their structural relevance. Among several thresholds, we choose a low value of Z = 4 for the multiple alignment quality (this value is calculated with respect to the leader sequence of the family). Indeed, higher values such as Z ≥ 10 reduce the number of sub-families with at least 8 members to 549, while Z ≥ 4 retains 1721 sub-families. Figure 1B illustrates the actual distribution of Z values (the mean is 7.3), which are in the range of Z-scores between pairs of native-state structural homologues (typically >5 (Dietmann et al. 2002)).

Combining both thresholds (SI = 50% and Z ≥ 4), we obtain a database of 1721 sub-families of at least 8 members, including a total of 98 436 sequences, 2876 sequences being distinct from each other. Figure 2A summarizes this process (steps 1 to 3). Step 4 considers a composition identity (CI) threshold between families (0.5, 0.5) (see below and Fig. 2B).

Figure 2

Redundancy elimination. A. Evolution of the protein chain numbers: number of different chains (solid line), total number of chains (dotted lines). Step 1; 90 % sequence identity threshold. Step 2; 50 % sequence identity threshold. Step 3; Structural Z-score threshold ≥ 4. Step 4; Composition identity between families ≤ (0.5, 0.5). B. The three-steps CI redundancy elimination (see text), number of different chains (solid line), total number of chains (dotted lines).

iii) Composition identity between families. On average, each amino acid chain appears in 35 sub-families. Two sub-families may thus contain identical members. This redundancy also has to be reduced as much as possible. To that aim, we compute the composition identity CI_ij for each pair (F_i, F_j) of N sub-families and consider that they are related if CI_ij > D. We then build all the subgroups of related sub-families and, within each subgroup, we eliminate the most common sequences in related families in order to decrease their composition identity to new acceptable CI_ij values. This is done until all remaining sub-families in the subgroup are unrelated. Note that if the number of sequences in a given sub-family falls below 8, the sub-family is discarded. Moreover, by eliminating sequences in sub-families that belong to different subgroups, new composition similarities may appear between those sub-families. That is why we decided to perform successive cycles, decreasing the threshold D from 0.8 to the final value of 0.5. During this procedure, we only discard 200 sub-families and 100 amino acid sequences, while two thirds of redundant sequences (approximately 66 000) are eliminated. Figure 2B illustrates the convergence of this process, which leads to a dataset of 1485 sub-families (31 327 sequences and 2727 distinct amino acid chains) with at least 8 members (mean 20) and sharing no more than (0.5, 0.5) composition identity (Fig. 1C). In a given family, pairwise sequence identity is necessarily less than 50% and is generally much lower (Fig. 1A), and members have a reliable structural alignment quality (Z ≥ 4) with respect to the leader sequence of the family (mean 7.3, Fig. 1B).
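The relatedness test between two sub-families can be sketched as follows. The text does not give the formula for CI_ij; the pair notation (0.5, 0.5) suggests one value per family, so this sketch assumes CI_ij is the pair of fractions of shared members, taken relative to each family, and that two families are related only when both fractions exceed D. Both assumptions, the function names and the member identifiers in the usage note are hypothetical.

```python
def composition_identity(fam_i, fam_j):
    """Return (shared/|F_i|, shared/|F_j|) for two collections of chain identifiers.

    ASSUMPTION: this pairwise definition is inferred from the "(0.5, 0.5)"
    notation; the source does not state the formula.
    """
    members_i, members_j = set(fam_i), set(fam_j)
    shared = len(members_i & members_j)
    return shared / len(members_i), shared / len(members_j)


def are_related(fam_i, fam_j, d=0.5):
    """Two sub-families are treated as related if both CI values exceed d.

    ASSUMPTION: requiring both fractions to exceed d is one possible reading.
    """
    ci_i, ci_j = composition_identity(fam_i, fam_j)
    return ci_i > d and ci_j > d
```

For instance, with the made-up families `["a", "b", "c", "d"]` and `["a", "b", "c", "e", "f", "g"]`, three members are shared, giving fractions of 0.75 and 0.5, so the pair is not related at D = 0.5 under this reading.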

The original FSSP alignments are reformatted according to the following information: sub-family name and PDB accession number of the leader sequence, number of members (≥8), PDB accession numbers of these members, associated structural FSSP Z indexes, alignment length, corresponding aligned sequences and aligned secondary structures (assigned through DSSP (Kabsch and Sander, 1983)). In addition, 3D coordinates of α-Carbons and solvent accessibilities calculated by DSSP (Kabsch and Sander, 1983) are reported for each residue. Figure 3 shows a typical file for a family of eight members.

Figure 3

A sub-family example. A. Sequence and secondary structure alignment file. The sub-family “1mai”, whose leader sequence is the PH domain of the phospholipase C delta (pdb code 1mai), includes eight members. B. Superimposition of the PH folds of 1mai and 1bak (Z-score 8.3), according to the FSSP alignment shown in A. 53 Cα belonging to the seven strands and to the C-terminal helix have been superimposed (RMSD 1.59 Å). The superimposed segments of these two sequences share 19 % of identity (13 % on the entire domain). This superimposition is typical of this sub-family and is representative of the whole bank.

Amino acid classes

The large dataset of reliable multiple alignments constituted here remains, however, far too small to consider the twenty different amino acids in each work position. The clustering of amino acids into a limited number of classes is thus necessary. Usually, three to six classes may be rationally defined (e.g. VILFMYW for the strong hydrophobic class, mainly present within the internal sides of regular secondary structures, GPDSN as main loop-forming residues and ARCQTEKH for the intermediate class (Callebaut et al. 1997; Hennetin et al. 2003)). Here, we consider a simple partition into two classes, derived from a continuous scaling of the 20 amino acids with respect to their mean atom depth, as defined from a representative set of globular proteins (Pintar et al. 2003; Pintar et al. 2003). Mean atom depth indeed allows the sorting of the 20 amino acids in two distinct groups: IVFLWMCYA (G₁) and HTGSPNRQDEK (G₂) (Fig. 4). This classification shows good agreement with mean amino acid burying values, defined through Voronoï tessellations on representative sets of globular domains (Soyer et al. 2000). The two main groups G₁ (mainly hydrophobic amino acids) and G₂ (mainly neutral and hydrophilic amino acids) gather 44 and 56% of the total number of amino acids, respectively. The amino acids of group G₁ are similar to those that were considered hydrophobic by other studies dedicated to long-range contacts (e.g. Punta and Rost, 2005).
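The two-class partition can be written down directly as a lookup. The group memberships below are those quoted in the text; the function name and the error handling for non-standard codes are illustrative choices, not part of the original method.

```python
# Two-class partition by mean atom depth, as quoted in the text.
G1 = frozenset("IVFLWMCYA")      # mainly hydrophobic amino acids
G2 = frozenset("HTGSPNRQDEK")    # mainly neutral and hydrophilic amino acids


def residue_class(aa: str) -> str:
    """Return 'G1' or 'G2' for a one-letter amino acid code."""
    aa = aa.upper()
    if aa in G1:
        return "G1"
    if aa in G2:
        return "G2"
    # Non-standard codes (X, B, Z, gaps) are rejected here; how the original
    # study handled them is not stated.
    raise ValueError(f"unknown amino acid code: {aa!r}")
```

The two sets are disjoint and together cover the 20 standard amino acids (9 in G₁, 11 in G₂).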

Figure 4

Mean atom depth. The original data of Pintar and colleagues (Pintar et al. 2003; Pintar et al. 2003), plotted in decreasing order of mean atom depth, show two distinct groups of amino acids: on the one hand, the mainly hydrophobic ones (44 % of the total number of amino acids in the bank) and, on the other hand, the neutral and hydrophilic ones (56 % of the amino acids). Histidine, which lies at the frontier between these two groups, was also shown to be the most indifferent amino acid regarding its α, β or coil states (Callebaut et al. 1997).

Work Positions

We call “work positions” those positions of the multiple alignment at which at least 8 amino acids are aligned. Using this absolute number, rather than a relative proportion of all aligned sequences, allows the handling of representative subsets of these alignments, while ignoring positions in which gaps are predominant.

Generalized topohydrophobic index

Each work position is characterized by the fraction of its amino acids belonging to the G₁ group. We name it the generalized topohydrophobic index, or y₁, because it records the proportion of hydrophobic amino acids (G₁) occupying the position. Distributions of the y₁ parameter are plotted as histograms, using grouping intervals of 1/8 in reference to the minimal number of amino acids (8) that have to be present in a work position for it to be considered.
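A minimal sketch of the work-position test and of the y₁ computation for one alignment column is given below. It assumes a column is passed as a string of one-letter codes with "-" or "." for gaps, and that binning to intervals of 1/8 is done by rounding to the nearest multiple of 1/8; the gap characters, the rounding rule and all function names are assumptions made for illustration.

```python
G1 = frozenset("IVFLWMCYA")   # hydrophobic group quoted in the text
MIN_RESIDUES = 8              # minimum number of aligned amino acids
GAP_CHARS = "-."              # ASSUMPTION: gap symbols used in the alignment


def column_residues(column):
    """Return the non-gap residues of an alignment column, upper-cased."""
    return [c.upper() for c in column if c not in GAP_CHARS]


def is_work_position(column):
    """A work position has at least MIN_RESIDUES aligned amino acids."""
    return len(column_residues(column)) >= MIN_RESIDUES


def topohydrophobic_index(column):
    """Fraction of G1 residues in the column, or None if not a work position."""
    residues = column_residues(column)
    if len(residues) < MIN_RESIDUES:
        return None
    return sum(r in G1 for r in residues) / len(residues)


def y1_bin(y1):
    """ASSUMPTION: snap y1 to the nearest multiple of 1/8 (9 possible values)."""
    return round(y1 * 8) / 8
```

With exactly 8 residues in a column, y₁ already falls on one of the nine values 0, 0.125, …, 1; the binning step only matters for columns with more members.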

Major secondary structure

We choose to take into account only work positions in which the same secondary structure is sufficiently conserved (at more than x%). Figure 5A shows the number of work positions as a function of this threshold x. We consider that x ≥ 75% offers an acceptable compromise, ensuring that work positions are structurally relevant according to the secondary structure conservation and keeping enough data to perform a large-scale study. Figure 5B shows the distribution of work positions in the different secondary structures as a function of the generalized topohydrophobic index y₁.
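The conservation filter can be sketched as a majority vote over the secondary structure states observed at a position. The sketch assumes the DSSP assignments have already been reduced to the three states H, E and C (the reduction rule is not given in the text) and treats the threshold as inclusive, which is one reading of "x ≥ 75%"; the function name is hypothetical.

```python
from collections import Counter


def major_secondary_structure(states, threshold=0.75):
    """Return the dominant state ('H', 'E' or 'C') at a position, or None.

    states: sequence of three-state assignments, one per aligned residue.
    ASSUMPTION: the threshold is inclusive (fraction >= threshold).
    """
    if not states:
        return None
    state, count = Counter(states).most_common(1)[0]
    return state if count / len(states) >= threshold else None
```

For a position with states "HHHHHHEC", six of eight residues are helical, so the position is kept as a helix position; with "HHHHHEEC" the fraction drops to 5/8 and the position is discarded.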

Figure 5

Work positions. A. Number of work positions as a function of the percentage of major secondary structure observed at a position of FSSP-derived multiple alignments (x). For x = 75 %, there are 135197 work positions (60021 H, 38860 E, 38316 C). B. Populations of the 27 work position types (see the Results section) in the final bank with the two groups (G₁, G₂) model. H stands for Helix, E for Extended (β-strand) and C for Coil.

Mean solvent accessibility of a work position

Relative accessibilities are computed starting from the absolute accessibilities provided by DSSP (Kabsch and Sander, 1983). The standard accessible surfaces in Å² for residues are derived from canonical G-X-G configuration calculations by Shrake and Rupley (Shrake and Rupley, 1973): A: 124 / C: 94 / D: 154 / E: 187 / F: 221 / G: 89 / H: 201 / I: 194 / K: 214 / L: 198 / M: 215 / N: 161 / P: 150 / Q: 190 / R: 244 / S: 126 / T: 152 / V: 169 / W: 265 / Y: 236. Relative accessibility of a work position is the mean value of the relative accessibilities of its residues.
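The normalization described above amounts to dividing each absolute accessibility by the reference surface of the residue type, then averaging over the residues of the position. The reference values are those listed in the text; the function names and the input layout (pairs of residue code and absolute accessibility in Å²) are assumptions.

```python
# Reference accessible surfaces in Å² (G-X-G values quoted in the text).
MAX_ASA = {
    "A": 124, "C": 94, "D": 154, "E": 187, "F": 221,
    "G": 89, "H": 201, "I": 194, "K": 214, "L": 198,
    "M": 215, "N": 161, "P": 150, "Q": 190, "R": 244,
    "S": 126, "T": 152, "V": 169, "W": 265, "Y": 236,
}


def relative_accessibility(aa, absolute_asa):
    """Absolute accessibility divided by the reference surface of the residue."""
    return absolute_asa / MAX_ASA[aa.upper()]


def mean_relative_accessibility(residues):
    """Mean relative accessibility over (aa, absolute_asa) pairs of a position."""
    values = [relative_accessibility(aa, asa) for aa, asa in residues]
    return sum(values) / len(values)
```

No clipping is applied here, so a residue whose measured surface exceeds its reference value yields a relative accessibility above 1; whether the original study capped such values is not stated.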

Non-trivial neighbors

The non-trivial neighborhood of an amino acid can be described from the known atomic coordinates.

Two amino acids are defined as non-trivial neighbors if their Cα atoms are separated by less than 7 Å (Tudos et al. 1994) and if they are separated in sequence by more than 6 residues (Fig. 6). The mean number of neighbors for a work position is defined as the average number of non-trivial neighbors of the amino acids belonging to that position. An even better way to consider the amino acid neighborhood, which is independent of a cutoff threshold value, would have been to use a description through weighted Voronoï tessellations (Angelov et al. 2002; Dupuis et al. 2005; Dupuis et al. 2004; Soyer et al. 2000). However, this description is prohibitively time-consuming and thus out of scope for a large-scale study.
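The neighbor definition translates into a simple pairwise count over Cα coordinates. The sketch below assumes a single chain without breaks, so that the list index of a residue equals its position in the sequence; the function name and the brute-force double loop are illustrative choices, not the authors' implementation.

```python
import math


def count_nontrivial_neighbors(ca_coords, dist_cutoff=7.0, min_seq_sep=6):
    """Count non-trivial neighbors for each residue of one chain.

    ca_coords: list of (x, y, z) Cα coordinates in chain order.
    A pair (i, j) counts if |i - j| > min_seq_sep and the Cα-Cα
    distance is strictly below dist_cutoff (in Å).
    ASSUMPTION: list index equals sequence position (no chain breaks).
    """
    n = len(ca_coords)
    counts = [0] * n
    for i in range(n):
        # Start at i + min_seq_sep + 1 so that the separation exceeds 6.
        for j in range(i + min_seq_sep + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) < dist_cutoff:
                counts[i] += 1
                counts[j] += 1
    return counts
```

The mean number of neighbors of a work position is then the average of these counts over the residues aligned at that position, each taken from its own structure.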

Figure 6

Neighboring definition. A. Mean number of neighbors within a sphere of radius R, as a function of Cα-Cα distance R, calculated on the FSSP-derived bank. For R = 7 Å (a value generally retained to characterize close neighborhood of an amino acid), the mean coordination number is between 7 and 8. B. Evolution of the mean coordination number for R = 7 Å as a function of the sequence distance D, expressed in amino acids. For D = 1, all contacts are taken into account and the mean values are close to each other for strands (E) and helices (H). Above D = 2, behaviors of strands and helices differ, as strands assemble to form sheets with a high and constant mean number of neighbors (~4.5), while helices only show a small mean value of ~0.5 when D ≥ 4. For the E and H states, we consider that beyond D = 6, neighbors are only non-trivial ones.

Results

Dataset

A set of benchmark alignments is selected as described in the Methods section, in order to estimate the number of long-range (or non-trivial) contacts of amino acids, with respect to the generalized topohydrophobic index deduced from the multiple sequence alignment and to the associated secondary structure. The dataset considered here includes 1485 sub-families (31 327 sequences and 2727 distinct amino acid chains) with at least 8 members (mean number 20) and sharing no more than (0.5, 0.5) composition identity, a parameter that was introduced in order to avoid redundancy between sub-families. In a given family, pairwise sequence identity is necessarily less than 50% and almost always far lower (mean 8.3 %), and the members have a reliable structural alignment quality (Z) of at least 4 (mean 7.3) with respect to the leader sequence of the family. It is worth noting that not all proteins sharing the same fold and fulfilling the sequence identity and structural alignment quality criteria described above are clustered into a single family. Some sub-families described above are subsets of proteins possessing at least one domain with a given fold. This distribution into several sub-groups depends directly on the initial FSSP dataset and on the selection procedure. For example, some members of the family shown in Figure 3 (family 1mai, Pleckstrin Homology (PH) fold) are found in eight other families with a PH fold domain. However, the alignments cover the known universe of globular domains well, and are thus representative of the structural conservation and diversity within proteins.

We analyze the main features of “work positions” in multiple alignments (see definition in the Methods section), for which more than 75% of the residues share the same secondary structure. As structural superimpositions and secondary structure assignments were automatically performed, local mismatches may occur. However, these mismatches only constitute a marginal fraction within the final alignments obtained after filtering of the initial dataset. Only 8% of the 97 000 retained work positions exhibit more than one H/E discrepancy and thus only constitute a background noise, which does not appreciably modify the main results of this study. The good quality of solvent accessibility predictions, which are directly performed on our filtered database of structural alignments (see below) and are similar to results obtained with other methods (Gianese et al. 2003; Pascarella et al. 1998; Rost and Sander, 1994; Thompson and Goldstein, 1996), further supports the overall structural relevance of work positions.

The partition of amino acids into two groups G₁ (IVFLWMCYA) and G₂ (HTGSPNRQDEK), as introduced in the Methods section, and the distribution of group compositions in steps of 1/8 lead to 9 distinct topohydrophobic y₁ values (0, 0.125, …, 1), which can describe a work position. 27 classes of work positions (X, y₁) can thus exist, combining y₁ and X, the major secondary structure (X = helix, strand or coil). Most of the 27 classes are well represented in the bank. The least populated classes are the limit cases, consisting of fully hydrophilic strands (Strand, 0) and fully hydrophobic coils (Coil, 1) (462 and 296 work positions, respectively; Fig. 5B). We principally consider the 18 classes of work positions associated with regular secondary structures (X = H or E; 60 021 and 36 830 work positions, respectively).

Positions within helices

Relative solvent accessibility

Figure 7A illustrates the behavior of the mean relative solvent accessibility in helix work positions within multiple alignments, as a function of the generalized topohydrophobic index y₁, ranging from 0 to 1. As expected, the mean relative accessibility to solvent diminishes when y₁ increases. We also consider the individual behaviors of G₁- and G₂-residues. We observe that the G₁- and G₂-values depend on the y₁ value of the work positions, and both diminish when y₁ increases. The two curves are nearly parallel for the two groups, with the G₁ mean values smaller, as expected, than the G₂ ones. The distribution of mean relative accessibilities around the mean values, shown in Figure 7A, is illustrated in Figure 8A. For very low y₁ values (low hydrophobicity), the mean relative accessibilities follow a Gaussian-like distribution centered on 0.45 and, as y₁ increases, this curve collapses towards the origin, with a mean below 0.1 for 95% of the 1977 totally hydrophobic work positions (y₁ = 1). For y₁ = 0 (fully neutral or hydrophilic positions), a small peak, indicated by a star, reveals the existence of buried positions. It likely corresponds to salt bridges, and more generally to pairs of side-chains in mutual neutralizing polar contacts within globular cores. This observation moreover provides indirect biophysical support to the data quality of the FSSP-derived bank.

Figure 7

Helices. A. Mean solvent accessibility for helices, as a function of the composition of work positions. When positions have a high topohydrophobic index y₁, the G2 class adopts a similar behavior as the G1 one, constrained by fold requirements. B. Evolution of the mean number of non-trivial neighbors as a function of the composition of work positions in α regular secondary structures. The same comment as for A can be made for G2 amino acids. C. Partners of helix work positions. When topohydrophobicity is high, Helix-Helix and particularly G₁-G₁ contacts dominate in α regular secondary structures.

Figure 8

Helices. A. Distributions of work positions according to mean relative solvent accessibility and hydrophobicity for α regular secondary structures. Star indicates an exceeding value, likely resulting from salt bridges and mutually neutralizing pairs of hydrophilic amino acids within protein cores. B. Distributions of work positions according to the mean number of non-trivial neighbors and hydrophobicity in α regular secondary structures. C. Typical single inter-helix contact found in 1mai, between H120 and A21.

Number of non-trivial close neighbors

The number of non-trivial close neighbors (Fig. 8B) shows a symmetrical behavior compared to the relative accessibility (Fig. 8A). The number of non-trivial neighbors of work positions within helices increases as hydrophobicity rises from y₁ = 0 to y₁ = 1, but is rarely greater than 2, even for completely buried positions (mean accessibility < 0.1), within the internal sides of helices. This mainly results from the principal occupancy, in such configurations, of the close neighborhood by trivial neighbors, which restrains the free space for external residues, and from the convex geometry of α-helices, roughly cylindrical, with a large dispersion of side chains. G₁ and G₂ groups are both concerned by this increase of the number of non-trivial neighbors (Fig. 7B). Work positions with high hydrophobicity within helices mainly establish contacts with other helices (Fig. 7C). Moreover, these contacts mainly involve G₁ amino acids within the hydrophobic core (data not shown). Figure 8C illustrates such a situation.

Positions within Strands

A similar investigation was performed for work positions associated with β-strands (Figs. 9 and 10). The most striking result for β-strands is a strong increase in the number of non-trivial first neighbors and a clearly multimodal distribution observed for almost all y₁ values, and in particular for the less hydrophobic ones (low y₁ values). The weakly populated mode, centered on approximately one neighbor, is likely associated with highly external positions at the extremity of some strands. The two other modes (near 3 and 6 neighbors) are likely to correspond to external (edge) and internal (central) positions of strands within β-sheets, respectively. Indeed, the second mode (around 3) mainly relies on the architecture of β-strands within sheets, where side chains in positions i, i + 1, i + 2 in one strand occupy a roughly equilateral triangle. This triangle constitutes the basis of the interaction with another amino acid j of a neighboring strand, linked to the “i” strand through canonical main chain H-bonds. These four residues constitute a more or less deformed tetrahedron (distance between Cβ ~6.2 Å), which represents the basic unit of compact packing of similar sized spheres (Fig. 10C). The third mode (around 6) mainly corresponds to a geometry with two tetrahedra (one strand sandwiched by two others) sharing a vertex, which has 6 first non-trivial neighbors (Fig. 10C). Many deviations from this ideal scheme occur and tend to flatten the Gaussian distribution. As for helices, the number of non-trivial neighbors increases with hydrophobicity of a work position (Fig. 9B) and strand non-trivial neighbors are very often found within other strands (Fig. 9C). The present study quantifies this behavior and offers the opportunity to gain information on the probable participation of an amino acid in an internal or external strand position, solely from the knowledge of multiple sequence alignments.

Figure 9

Strands. A. Evolution of the mean relative solvent accessibility for β-strands, as a function of hydrophobicity of work positions. B. Evolution of the mean number of non-trivial neighbors, as a function of the composition of β work positions. C. Partners of β-strand work positions. At high topohydrophobicity, strand-strand and particularly G₁-G₁ contacts dominate in β-sheets.

Figure 10

Strands. A. Distributions of work positions according to mean relative solvent accessibility and hydrophobicity for β regular secondary structures. The presence of salt bridges and hydrophilic pairs likely accounts for the value indicated by a star, as for helix positions. B. Distributions of work positions according to mean number of non-trivial neighbors and hydrophobicity in β regular secondary structures. Using a Gaussian approximation to deconvolve the overall profile highlights the multimodal distribution of strand neighbors. Three modes (1, 2 and 3) are present: ~1.2, 3.3 to 4.5 and 5.6 to 6.5 mean neighbors, respectively. C. First two views. Typical tetrahedron found between the Cβ of residues i, i + 1, i + 2 of a strand and another residue in an adjacent strand. The example shown in two orthogonal views is from 1mai (S98, I99, V100 and V75). The mean tetrahedron edge size is 6.3 Å. Last view. Two tetrahedra sharing a vertex: i, i + 1, i + 2 of a strand; j, j + 1, j + 2 of another one, which sandwich a residue. The example shown is also taken from 1mai (V75, R76, M77/L108, D109, L110/S98; mean edge size of 5.9 Å).

Influence of Fold Classes

The dataset is large enough to estimate the putative influence of fold classes on some parameters. Four main classes, as described in the SCOP classification (Murzin et al. 1995), were considered: all-α (297 sub-families), all-β (370 sub-families), α/β (530 sub-families) and α + β (131 sub-families). One can expect differences in tertiary structure between the four fold classes to be reflected in the level of hydrophobic contacts involving residues of the G₁ group, in particular at positions with a high topohydrophobic index (y₁ = 1). Indeed, the mean number of non-trivial neighbors belonging to the G₁ group for strand work positions with a high topohydrophobic index is noticeably higher for the α/β class than for the three others (4.51 versus 4.02 (α), 3.25 (β) and 3.79 (α + β); Fig. 11). This is all the more noticeable as the total number of non-trivial neighbors of strand work positions with a topohydrophobic index of 1 is rather constant across classes (5.92 to 6.26; Table 1). One hypothesis to explain this behavior is that a larger number of fully hydrophobic work positions with a structural role exist in the α/β and even α classes, but this remains to be investigated further. Furthermore, at least two studies report better performance of long-range contact prediction programs for this same α/β class (MacCallum, 2004; Punta and Rost, 2005).

Table 1.

Contacts achieved by strand work positions with topohydrophobic index y₁ = 1

Absolute number of non-trivial neighbors, by class	Alpha	Beta	Alpha/beta	Alpha+beta
Neighbors within helices	0.22	0.04	0.26	0.28
Neighbors within strands	5.04	5.00	5.12	4.87
Neighbors within coils	0.90	0.88	0.88	1.00
Total number of neighbors	6.16	5.92	6.26	6.15
G₁ neighbors	4.02	3.25	4.51	3.79
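The arithmetic behind Table 1 can be made explicit with a short sketch. The numbers are copied from the table; the G₁ share (G₁ neighbors divided by the total) is a derived quantity computed here for illustration and is not reported in the study, and the dictionary keys and function names are hypothetical.

```python
# Values copied from Table 1 (strand work positions with y1 = 1).
TABLE_1 = {
    "Alpha":      {"helices": 0.22, "strands": 5.04, "coils": 0.90, "total": 6.16, "g1": 4.02},
    "Beta":       {"helices": 0.04, "strands": 5.00, "coils": 0.88, "total": 5.92, "g1": 3.25},
    "Alpha/beta": {"helices": 0.26, "strands": 5.12, "coils": 0.88, "total": 6.26, "g1": 4.51},
    "Alpha+beta": {"helices": 0.28, "strands": 4.87, "coils": 1.00, "total": 6.15, "g1": 3.79},
}

def parts_sum(row):
    """Sum of neighbors found within helices, strands and coils."""
    return row["helices"] + row["strands"] + row["coils"]

def g1_share(row):
    """Fraction of all non-trivial neighbors that belong to the G1 group."""
    return row["g1"] / row["total"]

for fold_class, row in TABLE_1.items():
    # The three secondary-structure rows should add up to the reported total.
    consistent = abs(parts_sum(row) - row["total"]) < 0.01
    print(f"{fold_class:<11} total={row['total']:.2f} "
          f"rows add up: {consistent} G1 share={g1_share(row):.2f}")
```

Read this way, the totals stay within a narrow range while the G₁ share varies between classes, which is the contrast the text draws between the α/β class and the others.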

Figure 11

Mean number of observed G₁ non-trivial neighbors within the main fold classes.

Discussion

The prediction of non-trivial neighborhood, or long-range contacts, from protein sequences is of particular interest for improving comparative modeling and enhancing fold recognition and ab-initio fold prediction. It can also help to detect remote relationships between protein sequences and to solve experimental structures. Contact prediction methods have received much attention during the last decade and often combine the evolutionary information available from multiple alignments with the prediction of secondary structures. They can be roughly classified into two non-exclusive categories: statistical correlated-mutation approaches (see for example Halperin et al. 2006; Kundrotas and Alexov, 2006) and machine-learning approaches (see for example Punta and Rost, 2005). While most methods aim at predicting contact maps, several other approaches have been developed to estimate the total number of contacts (Fariselli and Casadio, 2000; Ishida et al. 2006; Kinjo et al. 2005; Pollastri et al. 2001; Pollastri et al. 2002; Yuan, 2005). These, however, generally define large coordination numbers that include the trivial neighborhood, and rarely link these numbers to the topological and evolutionary features of the region containing the residue concerned.

Our analysis outlines the relationship between the mean number of non-trivial neighbors and a topohydrophobic index, which relies on the mean hydrophobicity of a position within a multiple alignment of sequences, as a function of the secondary structure. The topological data we collected here might be used for predictive purposes, as secondary structures can currently be predicted with good accuracy using multiple alignments (see for example Rost and Sander, 1993). As noted in earlier studies (Punta and Rost, 2005), the performance of the various estimations that can be made of long-range contacts depends directly on the quality of the evolutionary profiles, which have to be large and contain divergent sequences in order to furnish accurate information.

The novel result of this study is that different behaviors relative to non-trivial neighbors can be observed for helix and for strand residues and, among strands, for central and edge β-strands. Starting from these observations, the prediction of the topological nature of β-strands can be approached using classification methods such as decision trees (see Supplementary data 1). Briefly, using parameters such as the length of the strand, its mean hydrophobicity and the periodicity of G₁ and G₂ residues, combined with the topohydrophobic index, decision trees reach an accuracy of 80% for the prediction of edge/central positions within β-sheets (Supplementary data 1). Although it is difficult to compare methods that use different datasets for training and prediction, this approach appears to achieve a prediction accuracy similar to that obtained by Siepen and coworkers (Siepen et al. 2003), whose method is based on support vector machines (SVMs) and decision trees.

The use of the topohydrophobic index, combined with information on the nature of secondary structures, the group (G₁ or G₂) to which the residue belongs, and environmental parameters describing the local periodicity, also allows the prediction of the relative solvent accessibility of a residue within a work position in two- or three-state models (exposed, intermediate and buried; see Supplementary data 2). In the ideal case, when secondary structures are “known”, solvent accessibility predictions using this methodology led to a Q2 of 79% (16% threshold) versus 75% for other methods tested on the same dataset and based on neural networks (Rost and Sander 1994) or probability profiles/support vector machines (Gianese et al. 2003), and to a Q3 of 65% versus 58% for the same other methods (9% and 36% thresholds). On the one hand, the accurate prediction of solvent accessibility using generalized “topohydrophobicity” provides additional constraints on informative positions of a sequence (the work positions). On the other hand, these results further support the intrinsic quality of the dataset used for this study.
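The Q2 and Q3 figures quoted above are percentages of positions assigned to the correct accessibility state. A minimal sketch of that scoring is given below, assuming states are obtained by thresholding relative solvent accessibility (in %) at 16% for two states and at 9% and 36% for three states. The function names, the treatment of values exactly on a threshold, and the toy values are assumptions made here for illustration; they are not data or code from the study.

```python
TWO_STATE = (16,)       # buried / exposed
THREE_STATE = (9, 36)   # buried / intermediate / exposed

def to_state(rsa_percent, thresholds):
    """State index of a relative solvent accessibility value.

    0 is the most buried state; a value equal to a threshold is assigned
    to the more exposed state (a convention assumed for this sketch).
    """
    return sum(rsa_percent >= t for t in thresholds)

def q_accuracy(predicted, observed, thresholds):
    """Percentage of positions whose predicted state matches the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(
        to_state(p, thresholds) == to_state(o, thresholds)
        for p, o in zip(predicted, observed)
    )
    return 100.0 * hits / len(observed)

# Toy values for illustration only (not taken from the study).
observed = [2, 12, 20, 40, 60]
predicted = [5, 18, 25, 30, 20]

print(q_accuracy(predicted, observed, TWO_STATE))
print(q_accuracy(predicted, observed, THREE_STATE))
```

On the toy values the same predictions score higher with two states than with three, because a finer partition leaves more ways to miss the observed state; this is the usual reason Q3 values sit below Q2 values for a given method.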

The present analysis sheds light on important geometrical and topological parameters that can help to understand protein sequence-fold relationships. It is of particular interest that the dichotomy (hydrophobicity/hydrophilicity) between only two nearly equally populated classes of amino acids provides a very simple way to derive often accurate topological data that can be useful for protein fold recognition.

Footnotes

Acknowledgments

G.F. acknowledges a PhD grant from the “Direction Générale de L'Armement”.

Supplementary Data 1

Supplementary Data 2

References

1. Angelov, Sadoc J.F., Jullien et al. 2002. Nonatomic solvent-driven Voronoi tessellation of proteins: an open tool to analyze protein folds. Proteins, 49: 446–56.
2. Balaji, Sujatha, Kumar S.S. et al. 2001. PALI-a database of Phylogeny and ALIgnment of homologous protein structures. Nucleic Acids Res., 29: 61–5.
3. Callebaut, Labesse, Durand et al. 1997. Deciphering protein sequence information through hydrophobic cluster analysis (HCA): current status and perspectives. Cell. Mol. Life Sci., 53: 621–45.
4. Chothia 1984. Principles that determine the structure of proteins. Annu. Rev. Biochem., 53: 537–72.
5. Dietmann, Fernandez-Fuentes and Holm 2002. Automated detection of remote homology. Curr. Opin. Struct. Biol., 12: 362–7.
6. Dupuis, Sadoc J.F., Jullien et al. 2005. Voro3D: 3D voronoi tessellations applied to protein structures. Bioinformatics, 21: 1715–6.
7. Dupuis, Sadoc J.F. and Mornon J.P. 2004. Protein secondary structure assignment through Voronoi tessellation. Proteins, 55: 519–28.
8. Fariselli and Casadio 2000. Prediction of the number of residue contacts in proteins. Proc. Int. Conf. Intell. Syst. Mol. Biol., 8: 146–51.
9. Frishman and Argos 1997. The future of protein secondary structure prediction accuracy. Folding and Design, 2: 159–62.
10. Gianese, Bossa and Pascarella 2003. Improvement in prediction of solvent accessibility by probability profiles. Prot. Eng., 15: 987–92.
11. Halperin, Wolfson and Nussinov 2006. Correlated mutations: advances and limitations. A study on fusion proteins and on the Cohesin-Dockerin families. Proteins, 63: 832–45.
12. Hennetin, Le Tuan, Canard et al. 2003. Non-intertwined binary patterns of hydrophobic/nonhydrophobic amino acids are considerably better markers of regular secondary structures than nonconstrained patterns. Proteins, 51: 236–44.
13. Holm and Sander 1993. Protein structure comparison by alignment of distance matrices. J. Mol. Biol., 233: 123–38.
14. Holm and Sander 1994. The FSSP database of structurally aligned protein fold families. Nucleic Acids Res., 22: 3600–9.
15. Ishida, Nakamura and Shimizu 2006. Potential for assessing quality of protein structure based on contact number prediction. Proteins, 64: 940–7.
16. Jones D.T. 1999. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292: 195–202.
17. Kabsch and Sander 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22: 2577–637.
18. Kinjo A.R., Horimoto and Nishikawa 2005. Predicting absolute contact numbers of native protein structure from amino acid sequence. Proteins, 58: 158–65.
19. Kundrotas P.J. and Alexov E.G. 2006. Predicting residue contacts using pragmatic correlated mutations method: reducing the false positives. BMC Bioinformatics, 7: 503.
20. Lindorff-Larsen, Rogen, Paci et al. 2005. Protein folding and the organization of the protein topology universe. Trends Biochem. Sci., 30: 13–9.
21. MacCallum 2004. Striped sheets and protein contact prediction. Bioinformatics, 20(Suppl. 1): i224–31.
22. Mizuguchi, Deane C.M., Blundell T.L. et al. 1998. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci., 7: 2469–71.
23. Murzin A.G., Brenner S.E., Hubbard et al. 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247: 536–40.
24. Pascarella, De Persio, Bossa et al. 1998. Easy method to predict solvent accessibility from multiple protein sequence alignments. Proteins, 32: 190–9.
25. Pintar, Carugo and Pongor 2003. Atom depth as a descriptor of the protein interior. Biophys. J., 84: 2553–2561.
26. Pintar, Carugo and Pongor 2003. Atom depth in protein structure and function. Trends Biochem. Sci., 28: 593–7.
27. Pollastri, Baldi, Fariselli et al. 2001. Improved prediction of the number of residue contacts in proteins by recurrent neural networks. Bioinformatics, 17(Suppl. 1): S234–S42.
28. Pollastri, Baldi, Fariselli et al. 2002. Prediction of coordination number and relative solvent accessibility in proteins. Proteins, 47: 142–53.
29. Pollastri and McLysaght 2005. Porter: a new, accurate server for protein secondary structure prediction. Bioinformatics, 21: 1719–20.
30. Poupon and Mornon J.P. 1998. Populations of hydrophobic amino acids within protein globular domains: identification of conserved “topohydrophobic” positions. Proteins, 33: 329–42.
31. Poupon and Mornon J.P. 1999. Predicting the protein folding nucleus from a sequence. FEBS Lett., 452: 283–9.
32. Poupon and Mornon J.P. 1999. “Topohydrophobic positions” as key markers of globular protein folds. Theor. Chem. Accounts, 101: 2–8.
33. Poupon and Mornon J.P. 2001. Deciphering globular protein sequence/structure relationships: from observation to prediction. Theor. Chem. Accounts, 106: 113–20.
34. Punta and Rost 2005. PROFcon: novel prediction of long-range contacts. Bioinformatics, 21: 2960–8.
35. Rost and Sander 1993. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232: 584–99.
36. Rost and Sander 1994. Conservation and prediction of solvent accessibility in protein families. Proteins, 20: 216–26.
37. Rost and Sander 1995. Progress of 1D protein structure prediction at last. Proteins, 23: 295–300.
38. Shrake and Rupley J.A. 1973. Environment and exposure to solvent of protein atoms. J. Mol. Biol., 79: 351–71.
39. Siepen J.A., Radford S.E. and Westhead D.R. 2003. Beta edge strands in protein structure prediction and aggregation. Protein Sci., 12: 2348–59.
40. Soyer, Chomilier, Mornon J.P. et al. 2000. Voronoi tessellation reveals the condensed matter character of folded proteins. Phys. Rev. Lett., 85: 3532–5.
41. Thompson, Plewniak and Poch 1999. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res., 27: 2682–90.
42. Thompson J.D., Plewniak and Poch 1999. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics, 15: 87–8.
43. Thompson M.J. and Goldstein R.A. 1996. Predicting solvent accessibility: higher accuracy using bayesian statistics and optimized residue substitution classes. Proteins, 25: 38–47.
44. Thompson M.J. and Goldstein R.A. 1997. Predicting protein secondary structure with probabilistic scheme of evolutionarily derived information. Protein Sci., 6: 1963–75.
45. Tudos, Fiser and Simon 1994. Different sequence environments of amino acid residues involved and not involved in long-range interactions in proteins. Int. J. Pept. Protein Res., 4: 205–8.
46. Yuan 2005. Better prediction of protein contact number using a support vector regression analysis of amino acid sequence. BMC Bioinformatics, 6: 248.
47. Jones D.T. 1999. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292: 195–202.
48. Kabsch and Sander 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22: 2577–637.
49. Matthews B.W. 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 405: 442–51.
50. Siepen J.A., Radford S.E. and Westhead D.R. 2003. Beta edge strands in protein structure prediction and aggregation. Protein Sci., 12: 2348–59.
51. Bork, Hofmann, Bucher et al. 1997. A superfamily of conserved domains in DNA damage-responsive cell cycle checkpoint proteins. FASEB J., 11: 68–76.
52. Callebaut, Courvalin J.C. and Mornon J.P. 1999. The BAH (bromo-adjacent homology) domain: a link between DNA methylation, replication and transcriptional regulation. FEBS Lett., 446: 189–93.
53. Callebaut, Eudes, Mornon J.P. et al. 2004. Nucleotide-binding domains of human cystic fibrosis transmembrane conductance regulator: detailed sequence analysis and three-dimensional modeling of the heterodimer. Cell. Mol. Life Sci., 61: 230–42.
54. Callebaut and Mornon J.P. 1997. From BRCA1 to RAP1: a widespread BRCT module closely associated with DNA repair. FEBS Lett., 400: 25–30.
55. Gianese, Bossa and Pascarella 2003. Improvement in prediction of solvent accessibility by probability profiles. Prot. Eng., 15: 987–92.
56. Kabsch and Sander 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22: 2577–637.
57. Letunic, Copley R.R., Schmidt et al. 2004. SMART 4.0: towards genomic data integration. Nucleic Acids Res., 32: D142–4.
58. Rost and Sander 1994. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins, 19: 55–72.
59. Rost and Sander 1994. Conservation and prediction of solvent accessibility in protein families. Proteins, 20: 216–26.
60. Thompson M.J. and Goldstein R.A. 1996. Predicting solvent accessibility: higher accuracy using bayesian statistics and optimized residue substitution classes. Proteins, 25: 38–47.

Characterization of Non-Trivial Neighborhood Fold Constraints from Protein Sequences using Generalized Topohydrophobicity.
