Abstract
Comparative sequence analysis is widely used to infer gene function and study genome evolution and requires proper ortholog identification across different genomes. We have developed a program for the Identification of Orthologs in one-to-one relationship by Neighborhood and Similarity (IONS) between closely related species. The algorithm combines two levels of evidence to determine co-ancestrality at the genome scale: sequence similarity and shared neighborhood. The method was initially designed to provide anchor points for syntenic blocks within the Génolevures project concerning nine hemiascomycetous yeasts (about 50,000 genes) and is applicable to different input databases. Comparison based on use of a Rand index shows that the results are highly consistent with the pillars of the Yeast Gene Order Browser, a manually curated database. Compared with SYNERGY, another algorithm reporting homology relationships, our method's main advantages are its automation and the absence of dataset-dependent parameters, facilitating consistent integration of newly released genomes.
Introduction
Given the increasing number of large-scale sequencing projects (see http://www.ncbi.nlm.nih.gov/Genomes), comparative genomic approaches are now widely used.1–7 Indeed, comparison of genome sequences across species offers great potential for studying many aspects of their underlying biology, such as the prediction of gene function. Moreover, it provides insights into the processes of both genome and gene evolution. The reliable identification of orthologs in a one-to-one relationship is critical for many comparative genomics analyses, such as the construction of syntenic blocks,1,5,8 the reconstruction of accurate species or gene trees, or the automation of the functional annotation of genes.
Orthology is often interpreted as the functional equivalence of proteins across species, while in fact it defines only a particular relationship of homology in which two genes originating from a common single ancestral gene diverged following a speciation event.9,10 However, orthologs are more likely to have a functional similarity than paralogous genes. 11
In the process of identifying orthologs, the trend is to assume that if two sequences are significantly similar, they must be homologous; ie, they must share a common origin.9,10 However, similarity may be a false indication of homology, for example, in cases of convergence and events of duplication and loss that tend to blur the tracing of co-ancestrality. 1 Therefore, the identification of orthologs among the set of homologs defined by similarity requires more specific analysis.
Most approaches for the identification of orthologs are based on the following evidence: sequence similarity, reconciliation of gene and species phylogenies, and synteny conservation (see 11–16). The implementation is either manual, semi-automatic (some parameters are defined
We developed a program called IONS (Identification of Orthologs by Neighborhood and Similarity). This program relies on two types of evidence: sequence similarity at the protein level and the chromosomal neighborhood (see Algorithm). The method was initially developed for the Génolevures project, a large-scale comparative genomics project for
Comparisons required application of the different methods to a common dataset, in our case the hemiascomycete phylum. The adjusted Rand index (ARI) 21 was used as a quantitative indicator of the equivalence of partitioning for pairwise comparisons of methods.
Material and Methods
Algorithm
Input data and preparation
The input required for IONS consists of a database of genes belonging to a number of species encompassing a particular taxonomic class or phylum. This database must contain comprehensive information about the relative position of genes as well as assignment to a group of similar gene products or to any other group of putative homologs obtained by any other method (Additional file 1: Sample of input and output files).
For this study, the database (Fig. 1a) consisted of the genomes of nine species covering the Hemiascomycetes class and was based on the assignment to Génolevures families (GL Family) defined by Nikolski et al. 22 These families of similar genes were based on the calculation of pairwise similarities of sequences provided by BLAST and Smith-Waterman and a subsequent clustering. An algorithm was then applied to construct consensus families from competing clustering computations by an election method (see 22 for details).

Flowchart of the IONS algorithm. The algorithm is automatically applied to each GL family k. The first clustering by transitivity, with s neighbors taken into account on each side of the query genes, produces 1 to X subsets of the genes belonging to family k. These subsets (subsets_k^s) are classified into different categories: TSing, InPara, Ortho1, Orthox, and Undet. Subsets that do not correspond to one of these categories are uncategorized and enter the sequential procedure. In the sequential procedure, a test is made to ensure that the new subsets created with the narrow neighborhood (Y subsets) do not result in a reduction of the number of species. A comparison is made between the number of species represented in the wide-neighborhood subset, subset_k^s, and the maximum number of species represented in the 1 to Y subsets obtained with the narrow neighborhoods. If the number of species is equal, the 1 to Y subsets_k^(s-1) obtained with the narrow neighborhood enter the classification. Otherwise, the subset with the wide neighborhood, subset_k^s, is validated and labeled as “Undet.” Striped frames indicate the end of the analysis for the genes belonging to the labeled subsets. Below these frames is an example of a possible phyletic pattern (pp).
Figure 1 presents a flowchart of the algorithm we developed to infer orthology from the combination of similarity (family) and shared neighborhood, defined as the preserved co-localization of some genes on chromosomes of different species (independently of their order). First, the database was read to identify the list of genes belonging to each family (Fig. 1b).
Then all families (Fig. 1c) were sequentially analyzed as described below.
Identification of neighborhoods
For a given family
Assignment of genes to subsets through a clustering by transitivity
IONS proceeds by comparisons of the neighborhoods of the
While analyzing a given family

Clustering by transitivity. If the environment (A) possesses a homologous protein in (B) and if the environment (C) possesses a protein that is homologous to another protein of (B), then the environments (A, B, and C) are parts of the same subset of orthologs by neighborhood and similarity.
The program offers the opportunity to change the number of neighbors required to assign genes to the same subset of orthologs. In all cases, the output of the analysis is either a confirmation of the initial family on the basis of the neighborhood evidence or a splitting of the family into different clusters (Fig. 1f), which are sequentially numbered (eg, GL3R1304_10010, GL3R1304_10020, etc.).
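The clustering by transitivity can be illustrated with a minimal Python sketch. It is not the IONS Perl code: the function names, the dictionary layout of a gene record, and the toy genomes are assumptions made for illustration. The sketch links two query genes of a family when their neighborhoods share at least `min_shared` families, then chains those links transitively (connected components computed with a union-find structure). Whether the actual program applies further restrictions to the pairs it compares is not specified here, so every pair of query genes is compared.

```python
from collections import defaultdict
from itertools import combinations


def neighbor_families(genes, s):
    """For every gene, collect the families of the s genes on each side
    of it on the same chromosome (order inside the window is ignored)."""
    by_chrom = defaultdict(list)
    for gene in genes:
        by_chrom[(gene["species"], gene["chrom"])].append(gene)
    neighborhoods = {}
    for chrom_genes in by_chrom.values():
        chrom_genes.sort(key=lambda gene: gene["pos"])
        for i, gene in enumerate(chrom_genes):
            window = chrom_genes[max(0, i - s):i] + chrom_genes[i + 1:i + 1 + s]
            neighborhoods[gene["name"]] = {n["family"] for n in window}
    return neighborhoods


def cluster_by_transitivity(queries, neighborhoods, min_shared=1):
    """Link two query genes when their neighborhoods share at least
    min_shared families, then return the connected components."""
    parent = {q: q for q in queries}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(queries, 2):
        if len(neighborhoods[a] & neighborhoods[b]) >= min_shared:
            parent[find(a)] = find(b)

    components = defaultdict(set)
    for q in queries:
        components[find(q)].add(q)
    return sorted(sorted(members) for members in components.values())


def toy_chromosome(species, families, query_name):
    """Hypothetical helper: one chromosome; the gene of family F is the query."""
    return [
        {"name": query_name if fam == "F" else f"{species}_{fam}",
         "species": species, "chrom": "A", "pos": pos, "family": fam}
        for pos, fam in enumerate(families, start=1)
    ]


# Invented data: qa and qb share neighbor family Y, qb and qc share Z,
# qd shares nothing with the others.
genes = (toy_chromosome("sp1", ["X", "F", "Y"], "qa")
         + toy_chromosome("sp2", ["Y", "F", "Z"], "qb")
         + toy_chromosome("sp3", ["Z", "F", "W"], "qc")
         + toy_chromosome("sp4", ["U", "F", "V"], "qd"))
neighborhoods = neighbor_families(genes, s=1)
subsets = cluster_by_transitivity(["qa", "qb", "qc", "qd"], neighborhoods)
print(subsets)  # [['qa', 'qb', 'qc'], ['qd']]
```

In this toy case qa and qc have no neighbor family in common, yet they end up in the same subset because each is linked to qb. Raising `min_shared` to 2 leaves the four query genes in separate subsets.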
Classification of subsets (Fig. 1g and Box 1)
In a first step, the widest neighborhood size was used to calculate clusters, eg,
Sequential procedure
When a subset comprised genes from different species and if some species comprised more than one member (subsets labeled as “Uncategorized”), we progressively diminished the size of the neighborhood taken into account. Subsets in the “Uncategorized” category (Fig. 1n) were tested with a narrower neighborhood size, eg,
If at least one of the new subsets comprised the same number of species as the initial subset in the wide neighborhood, the new subsets were validated (Fig. 1r) and a supplementary suffix was added (ie, GL3R1304_10010_10, GL3R1304_10010_20). In turn, these new subsets were labeled as described in the previous section. “Uncategorized” subsets were recursively tested with a narrower neighborhood size, eg,
In the search for orthologs in a one-to-one relationship, the algorithm was designed to favor clusters of orthologs and in-paralogs over splitting orthologs apart. Thus, if all new subsets comprised fewer species than the initial subset in the wide neighborhood, the procedure was stopped. In this case, genes were assigned to the subset defined in the previous step in terms of neighborhood width (Fig. 1s) and labeled as “Undet” for undetermined (Fig. 1t).
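The acceptance test of the sequential procedure can be sketched as a short recursive function. This is an illustrative sketch, not the authors' implementation. The `classify` function is a hypothetical simplification of the subset categories (a single gene as TSing, several genes from one species as InPara, at most one gene per species as Ortho1 or Orthox). The `cluster` argument stands for any function that re-clusters a subset with a given neighborhood size. Labeling subsets that are still uncategorized at the smallest neighborhood size as "Undet" is also an assumption.

```python
def n_species(subset):
    """Number of distinct species among (gene, species) pairs."""
    return len({species for _gene, species in subset})


def classify(subset, total_species):
    """Hypothetical simplification of the subset categories."""
    counts = {}
    for _gene, species in subset:
        counts[species] = counts.get(species, 0) + 1
    if len(subset) == 1:
        return "TSing"
    if len(counts) == 1:
        return "InPara"
    if max(counts.values()) > 1:
        return "Uncategorized"
    return "Ortho1" if len(counts) == total_species else "Orthox"


def sequential_procedure(subset, s, cluster, total_species, min_s=1):
    """Return (subset, label, neighborhood size) triples for one subset
    that was obtained with s neighbors on each side."""
    label = classify(subset, total_species)
    if label != "Uncategorized":
        return [(subset, label, s)]
    if s <= min_s:
        return [(subset, "Undet", s)]  # assumed stopping rule
    narrower = cluster(subset, s - 1)
    # Every new subset lost species: keep the wide-neighborhood subset.
    if max(n_species(sub) for sub in narrower) < n_species(subset):
        return [(subset, "Undet", s)]
    results = []
    for sub in narrower:
        results.extend(
            sequential_procedure(sub, s - 1, cluster, total_species, min_s))
    return results


def toy_cluster(subset, s):
    """Hypothetical stand-in for the neighborhood clustering: gene "a2"
    stays attached to the other genes only while s >= 3."""
    if s >= 3:
        return [subset]
    parts = [[g for g in subset if g[0] != "a2"],
             [g for g in subset if g[0] == "a2"]]
    return [part for part in parts if part]


# Invented family: two genes in species A, one each in B and C.
family = [("a1", "A"), ("a2", "A"), ("b1", "B"), ("c1", "C")]
for sub, label, size in sequential_procedure(family, 3, toy_cluster, 3):
    print(label, size, [gene for gene, _species in sub])
```

With the toy clustering, narrowing the neighborhood from 3 to 2 detaches a2 while one of the new subsets still covers all three species, so the split is accepted. A clustering that separated the species instead would leave the whole subset labeled "Undet" at the wider size.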
During the sequential procedure, if the neighborhood size (
A test case on Hemiascomycetes described in the next section illustrates the value of the sequential procedure. With 15 neighbors, the IONS procedure assigned 33,258 of the 47,874 genes (see Fig. 3) to different types of subsets. Among these, 22,758 genes (47.53% of the total) formed subsets of orthologs (Ortho1 and Orthox). The sequential procedure was then applied to the remaining 14,616 genes. Additional file 3 shows the cumulative results obtained after each step of this sequential procedure, at the end of which 6,521 additional genes formed new subsets of orthologs, increasing the total percentage of genes classified into subsets of orthologs (Ortho1 and Orthox) to 61.16% of all 47,874 genes.

Improvement of the classification by the sequential procedure. (A) Distribution of genes in the different types of subsets resulting from the analysis with 15 neighbors taken on each side of the query genes and (B) improvement of the classification thanks to the sequential procedure that allows classifying genes not classified in (A).
Output
The IONS program produces two databases as well as a visual file and a neighborhoods file for each subset (Additional file 1: Sample of input and output files). The first database is ordered by gene and contains, for each gene, its family and the name and status of the subset to which it was assigned. The second database is organized by subset. Each entry shows the family, the subset, the status of the subset, the number of genes of each species constituting this subset, and the total number of genes in the subset.
The neighborhoods file reports the presence-absence of the different families in the neighborhoods of the different queries of the subset. The visual file shows the names and families of the neighbors of genes belonging to a particular subset. The raw descriptions of neighborhood relationships available in these visual files may serve as support for ad hoc discussion of gene evolution in complex situations, like the emergence of ohnologs (duplicates arising from the Whole Genome Duplication, 24 abbreviated as WGD).
Results
Test case on nine hemiascomycetes
We applied the IONS method to resolve the homology relationships in the genomes of nine hemiascomycetous yeasts:

Cladogram of the nine hemiascomycetous yeasts. The cladogram is based on the phylogenetic tree in Souciet et al. 1 The first two columns refer to the number of genes. (A) Distribution of the 48,889 genes among the different species. (B) Distribution among the different species of the 29,279 genes classified in the 4,107 subsets of orthologs (Ortho1 and Orthox) by the IONS method. The last three columns show the number of subsets obtained by the different methods. (C) Subsets of orthologs produced by the IONS method; (D) pillars of the YGOB; and (E) orthogroups produced by SYNERGY for the 12,691 genes involved in the comparison.
The results in Génolevures
Figure 3 shows the distribution of genes into the five categories of subsets (TSing, InPara, Ortho1, Orthox, and Undet). With a maximum of 15 neighbors taken into account, the neighborhood was informative, ie, there was at least one neighbor in common with at least one other query gene, for 43,101 genes (90%). The method confirmed the co-ancestrality of orthologs in a one-to-one relationship of 1,309 families containing one gene in each species as well as identifying 388 new subsets of orthologs of this type (Table 1).
Comparison of the 7,927 Génolevures families (similarity) to the subsets produced by the IONS method.
Among the 4,822 families in “Others,” there are 3,343 families containing only one gene in one species. The subset in IONS is identical.
Another 1,142 families of orthologs with a maximum of one member in each species were confirmed by shared neighborhood, while 1,268 new subsets of orthologs of this type were created (Table 1). In total, the number of genes classified into subsets of 2 to 9 orthologs (Table 2) increased from 47.71% (22,841 genes, Table 1), as inferred from the families based on similarity, to 61.16% (29,279 genes; see Tables 1 and 2 for details) with the IONS method.
Types of subsets produced by the IONS method for the nine Génolevures species.
Mean and standard deviation.
Interests
The subsets of orthologs (Ortho1 and Orthox) produced by the IONS method have already proven useful as anchor points for the construction of syntenic blocks 1 and for the identification of orthologs. 17 Small differences between the results presented in 1 and in this paper result from an improvement of the IONS method: the sequential procedure now decreases the neighborhood by one neighbor at a time, rather than using the coarse iteration over 15, 10, 5, and 1 neighbors of the previous version (see additional file 4 for the correspondence between the two classifications). This modification allows identification of more subsets of the Ortho1 type.
The method, especially the Undetermined subsets, is also useful as a starting point for manual dissection of a functional family. For example, SONS was used to suggest a model for the evolution of the hexose transporters and glucose sensors. 20 The IONS method was also used for analysis of the evolution of the ATP-binding cassette transporters conferring multiple drug resistance in hemiascomycetous yeasts 19 and of the drug:H+ antiporter 1 family. 25
Comparison to other datasets of orthologs
The pillars of the YGOB and the orthogroups produced by SYNERGY
To assess the quality of the subsets of orthologs (Ortho1 and Orthox) produced by the IONS method, we compared them with the results of two different studies. The first comparison was done with a manually curated database that we assume to be the ‘gold standard’ for yeast genomes: the pillars of YGOB (v5, January 2011). 26 The second comparison concerned the orthogroups produced by SYNERGY, 27 an algorithm that reports orthology relationships using sequence similarity, synteny and a given species phylogeny to reconstruct the underlying evolutionary history of genes. This method is automated but, in contrast to IONS, requires an
Both the methods (manual vs automated, types of evidence) and results (subsets of different types of homologs) differ. Indeed, the pillars of the YGOB contain groups of orthologous genes that are allowed to contain one ohnolog in each post-WGD species. In contrast, the orthogroups produced by SYNERGY consist of sets of genes from extant species that are descended from a single gene in the species’ last common ancestor, 27 which means that they contain orthologs as well as all in-paralogs produced since the most ancestral speciation event of the species studied.
Among the 34,709 genes of the six species studied in Wapinski et al, 27 17,210 (50%) were classified into groups of orthologs with at most one gene per species (Additional file 5: Comparison of gene classification in different types of subsets by IONS, YGOB, and SYNERGY). In comparison, the manual curation of the YGOB database allowed classification of 42,046 (69%) of the 60,876 genes of the 11 species of the YGOB v5. Of the 47,874 genes of the 9 species studied in our test case, our automatic method IONS classified 29,279 (61%) into groups of orthologs (Ortho1 and Orthox) that can serve as anchor points for syntenic studies.
The difference in percentages between YGOB and IONS can be mainly explained by the evolutionary distance between the species included in our study. Indeed, if we removed the two most ancestral species that are not present in the YGOB database (
Comparison of a common dataset
A more precise comparison requires a focus on the same species and on the same subset of genes within these species. This comparison was restricted to genes that belong to the four species common to the three studies (Table 3), for which the gene names could be matched without conflict, and that were classified into subsets of orthologs according to IONS (Ortho1 and Orthox).
Comparison of the classification of the 12,691 genes in subsets of orthologs by the IONS method, in the YGOB database, and by SYNERGY.
The comparisons were done on 12,691 genes using the Rand index. 28 This index determines the similarity between two partitions as a function of positive and negative agreements based on the contingency table of the pairwise assignments of data items. The Rand index ranges from 0 to 1. The adjusted Rand index (ARI) 21 introduces a statistical normalization to yield values close to zero for random partitions. 29 A value of 1 indicates a perfect identity between the partitions. The ARI (Table 4) was very close to one (0.977–0.996) for the comparison with the pillars of the YGOB, indicating that the orthology assignments were almost identical. The results of the IONS method differed a bit more from the SYNERGY orthogroups (ARI ranging from 0.895 to 0.913). The comparisons of the pillars of the YGOB to the orthogroups produced by SYNERGY also showed more divergent results (the ARI varied from 0.916 to 0.922, see additional file 6: ARI for the comparison between YGOB pillars and SYNERGY orthogroups). These small discrepancies with the SYNERGY orthogroups may be explained by the fact that the YGOB and IONS methods are essentially based on synteny, in contrast to SYNERGY for which synteny is only one of three parameters weighted
Adjusted Rand index of the SONS to the YGOB pillars and SYNERGY orthogroups.
A subset by subset comparison with YGOB is supplied in additional file 7: Discrepancies between the IONS subsets and the YGOB.
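For readers who wish to reproduce this kind of comparison, the Rand index and its adjusted form can be computed from pair counts as in the following Python sketch. It uses the standard pair-counting formulas and assumes that each partition is given as a list of group labels, one per gene, in the same gene order. The function name and the toy labels are illustrative and are not taken from the IONS distribution.

```python
from collections import Counter
from math import comb


def rand_indices(labels_a, labels_b):
    """Return (Rand index, adjusted Rand index) for two partitions of the
    same items, each given as a sequence of group labels."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two label sequences of equal length >= 2")
    total_pairs = comb(n, 2)
    # Pairs of items placed together in both partitions, in A, and in B.
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Agreements: together in both partitions, or separated in both.
    apart_both = total_pairs - same_a - same_b + same_both
    rand = (same_both + apart_both) / total_pairs
    # Chance correction: expected number of co-clustered pairs.
    expected = same_a * same_b / total_pairs
    max_index = (same_a + same_b) / 2
    if max_index == expected:
        ari = 1.0  # both partitions are trivial and identical
    else:
        ari = (same_both - expected) / (max_index - expected)
    return rand, ari


# Toy example: four genes, two groupings that differ for the last gene.
rand, ari = rand_indices([0, 0, 1, 1], [0, 0, 1, 2])
print(round(rand, 3), round(ari, 3))  # 0.833 0.571
```

The toy example shows why the adjusted index is the stricter of the two: a single disagreement among four genes leaves the Rand index at 0.833 but lowers the ARI to 0.571.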
Implementation
The IONS program, which finds ortholog subsets according to the method described above, was written in Perl. The required input is a CSV file (Additional file 1: Sample of input and output files) describing the genomes and containing, on each line, the coding DNA sequence (CDS) name, the species abbreviation, the chromosome letter, the relative position, the family name, and, optionally, the strand. The program is available on the mini web site: http://web.me.com/philogene/IONS-method/IONS_2011.html.
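The input layout described above can be read with a few lines of Python, as sketched below. The sketch assumes a comma-delimited file without a header row, with the fields in the order listed and an integer relative position; the actual sample in Additional file 1 may differ. The field keys, the gene names, and the family names are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Assumed column order, following the description of the input file.
FIELDS = ["cds", "species", "chrom", "pos", "family", "strand"]


def read_genes(handle):
    """Parse rows of CDS name, species, chromosome, position, family
    and an optional strand into dictionaries."""
    genes = []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        row = [cell.strip() for cell in row]
        if len(row) < 5:
            raise ValueError(f"expected at least 5 fields, got {row!r}")
        record = dict(zip(FIELDS, row))
        record["pos"] = int(record["pos"])
        record["strand"] = record.get("strand") or None  # strand is optional
        genes.append(record)
    return genes


def group_by_family(genes):
    """List the CDS names belonging to each family."""
    families = defaultdict(list)
    for gene in genes:
        families[gene["family"]].append(gene["cds"])
    return dict(families)


# Invented sample rows; the last one omits the optional strand.
sample = io.StringIO(
    "GENE_A1,SPA,A,1,FAM1,+\n"
    "GENE_A2,SPA,A,2,FAM2,-\n"
    "GENE_B1,SPB,C,7,FAM1\n"
)
genes = read_genes(sample)
families = group_by_family(genes)
print(families)  # {'FAM1': ['GENE_A1', 'GENE_B1'], 'FAM2': ['GENE_A2']}
```

Grouping the genes by family in this way corresponds to the first step of the algorithm, in which the database is read to identify the list of genes belonging to each family.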
Discussion
Relevance of the method
The IONS method subdivides precompiled sets of homologs (based on sequence similarity) using gene neighborhood in an iterative process that gradually decreases neighborhood size until a series of homologs with only one gene per species is obtained. The results are equivalent to those of the YGOB 26 or SYNERGY, 27 but the IONS method has several advantages:
The method is automated, in contrast to the YGOB method, which requires a time-consuming manual curation for each new genome.
There is no dataset-dependent parameter. While the parameters of SYNERGY must be redefined for each new dataset, which can lead to contradictions between the orthologs found in current and subsequent results, the IONS method is applicable without reconfiguration. The addition of new species will not change the composition of existing groups of Ortho1 and Orthox; it will only offer the opportunity to complement them or to identify new groups.
The method is applicable to any predetermined families of homologs and is versatile enough to be used either with any existing package that defines families of homologs or with an existing database of families as input. Another option would have been to develop a full package integrating both the delineation of families and the search for orthologs. The limitations of this option are the impossibility of taking advantage of new developments in the definition of families and the difficulty of using predetermined classifications such as eggNOG, 30 on which our method was tested, giving results similar to those for the Génolevures dataset. Some standard methods of family determination are proposed on the website.
The algorithm is based on a conservative approach that favors the most stringent criteria and minimizes the number of false positives.
A distinctive feature of the method is that it allows some flexibility in parameterization, such as the neighborhood size and the synteny constraint.
Neighborhood size
The initial number of neighbors considered on each side of the query gene was arbitrarily set to 15 based on current knowledge of the size of Hemiascomycetes syntenic blocks. 18 This choice also seems to be suitable for novel yeast species because the mean syntenic block size ranges between 14 and 26 genes. 1
Note that this initial number of neighbors is not critical because the process is iterative. Evolutionary mechanisms are not the same in different parts of a genome, so we could not expect a single neighborhood size to be appropriate for all gene families. The sequential procedure is a way to circumvent this limitation. The size of the neighborhood used to fix a SONS may vary from subset to subset. In some cases, a subset is defined using 15 neighbors on both sides of the query gene; in other cases, the iterative process leads to the subdivision of the initial subset into smaller units using fewer neighbors (the criteria to stop the subdivision are described in the “Sequential procedure” section).
Synteny constraint
The fact that only one neighbor has to be in common to assign two query genes to the same cluster may seem a rather low requirement. The algorithm allows modification of this criterion (requiring more neighbors in common), which may slightly decrease the false-positive rate (7 genes of 12,691 in our comparison with YGOB as the gold standard). However, a stricter criterion strongly decreases the number of subsets of orthologs with one gene in each species because extensive chromosome rearrangements may occur for the most evolutionarily distant species. In our test case on Hemiascomycetes, this number decreased from 1,697 identified subsets with a criterion of one neighbor to 948 with a requirement of two neighbors in common and to 428 with three neighbors. The IONS method will also benefit from the intensification of sequencing efforts because the method was designed to integrate new data quickly.
Evolutionary span
The rationale of the method is that the considered evolutionary span of the analyzed species is short enough to retain information on sequence similarity and neighborhood. Otherwise, new species are required. Indeed, at large evolutionary distances, while protein-sequence divergence becomes limited by saturation and functional constraints, extensive chromosome rearrangements may occur, 1 shuffling the traces of co-ancestrality. Any method of orthology detection must confront this limitation. The only solution is to diminish the evolutionary distance between genomes by filling the evolutionary gaps with newly sequenced genomes. The constant diminution of the cost of sequencing will certainly contribute to this objective and, as already mentioned, our method easily accommodates new species.
The IONS procedure reaches its maximum efficiency when the families of homologs used as inputs are accurately and comprehensively calculated. If families are not accurate (for example, if a gene product is missing from its family), the current version of our program will not be able to find an ortholog that was placed in the wrong family. Because the method is conservative, a lack of information will never lead to wrong results but will decrease the number of identified SONS.
Using high coverage genome sequences allows avoidance of the problem of false gene losses. 31 High coverage also limits the probability of genome assembly errors that could lead to cases of false negatives in which orthology is not detected between two genes because of a lack of shared neighborhood. The probability of false positives resulting from assembly errors is close to zero because it would require a consistent misalignment of two regions in two different species.
The method was designed to yield a single final result, but the record of intermediate steps allows further analysis. For example, when the phylogenetic history is complicated by a whole genome duplication generating ohnologs, it is possible to easily identify the two SONS corresponding to the same set of ohnologs.
Perspectives
Species that are phylogenetically distant may present considerable sequence and synteny divergence, which makes it difficult to detect similarity at the nucleotide level and thus to classify gene products accurately into families. The addition of new species belonging to the same phylum will probably reduce sequence divergence, allowing a better classification of genes into families and improving results. The quality of the mapping, sequencing, and identification of the coding regions of these new species is crucial: comprehensive identification of genes and of their location relative to each other, as well as an accurate classification into families of homologs, is required to take advantage of the IONS method.
Conclusions
The identification of orthologs is a major issue in comparative genomics. The combination of both similarity and neighborhood evidence facilitates the identification of orthologs. The IONS method was developed using Hemiascomycetes genome sequences produced by the Génolevures Consortium. The performance of IONS is comparable to that of more labor-intensive methods such as YGOB. The automatic nature of the procedure paves the way for easy application to new genomes.
Disclosures
Author(s) have provided signed confirmations to the publisher of their compliance with all applicable legal and ethical obligations in respect to declaration of conflicts of interest, funding, authorship and contributorship, and compliance with ethical requirements in respect to treatment of human and animal test subjects. If this article contains identifiable human subject(s) author(s) were required to supply signed patient consent prior to publication. Author(s) have confirmed that the published article is unique and not under consideration nor published by any other publication and that they have consent to reproduce any copyrighted material. The peer reviewers declared no conflicts of interest.
Acknowledgements
We would like to thank the Génolevures Consortium, coordinated by Jean-Luc Souciet, for access to the database of protein sequences from
