Abstract
The detection of orthologs is a key approach in genomics, useful to understand gene evolution and phylogenetic relationships and essential for gene function prediction. However, a reliable annotation of the encoded protein regions is still a limiting aspect in genomics, mainly due to the lack of confirmatory experimental evidence at proteome level. Nevertheless, the current ortholog collections are generally based on protein sequence comparisons, in addition to the availability of large transcriptome sequence collections. We developed
Introduction
The detection of ortholog genes is a relevant issue in molecular biology, useful for structural, functional, and evolutionary inferences.1–14 Ortholog genes, ie, 2 gene copies from 2 different species that derived from a common ancestor and that diverged after a speciation event, are usually investigated for a wide range of applications in comparative genomics, phylogenetic analysis, genome annotation, and function prediction.4,15 In particular, it is common to investigate orthology relationships based on sequence similarities for transferring functional information from model organisms to still uncharacterized genes in newly sequenced genomes.16–18 These efforts pave the way for understanding speciation and the gain and loss of gene functionalities, highlighting peculiarities or conservation among species. The increasing number of fully sequenced genomes further pushed the flourishing of computational methods to detect orthologs1,2,5,6,10,13–15,17,19–22 and to identify blocks of syntenic regions among species.23–28 Most of the approaches for inferring orthology can be grouped into graph-based methods, which define orthologs based exclusively on sequence similarity, or tree-based methods, which classify all the splits of a given gene tree as duplication or speciation, trying to reconcile the splits to the phylogeny of the analyzed species.8,21,22 Graph-based methods include 2 steps: first, pairs of ortholog genes are detected, and graphs, with nodes representing genes and edges representing relationships, are defined; second, clusters of ortholog genes are organized based on the structure of the graphs. The simplest approaches identify orthologs between genes in pairs of genomes.29
The key assumption is that orthologs are those genes among homologs with the minimum divergence. Orthologs from 2 genomes are all those genes detected as being the reciprocal best hit of each other.19,30,31 This widespread methodology is generally defined as the search for the Bidirectional Best Hit (BBH). It establishes that genes xi and yj, from species X and Y, are best putative orthologs if xi is the best hit of yj, and yj is the best hit of xi, when considering all versus all similarity searches.32
The BBH detection between 2 genomes is computationally efficient because sequence alignments can be computed by well-established approaches, either based on dynamic programming33 or on more efficient heuristic algorithms, such as the BLAST set of programs.34
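The BBH rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that each all-versus-all search has already been reduced to (query, subject, score) tuples, and the function names `best_hits` and `bidirectional_best_hits` are hypothetical. Ties are resolved by keeping the first hit encountered, which is also an assumption.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumed input format: one (query, subject, score) tuple per hit.
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, Tuple[str, float]]:
    """Return, for each query, the subject with the highest score.

    On ties the first hit seen is kept (an assumption of this sketch).
    """
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return best


def bidirectional_best_hits(
    x_vs_y: Iterable[Hit], y_vs_x: Iterable[Hit]
) -> Set[Tuple[str, str]]:
    """Pairs (x, y) such that y is the best hit of x AND x is the best hit of y."""
    best_xy = best_hits(x_vs_y)
    best_yx = best_hits(y_vs_x)
    return {
        (x, y)
        for x, (y, _score) in best_xy.items()
        if y in best_yx and best_yx[y][0] == x
    }


# Toy data with placeholder identifiers.
x_vs_y = [("x1", "y1", 250.0), ("x1", "y2", 90.0),
          ("x2", "y2", 120.0), ("x3", "y2", 80.0)]
y_vs_x = [("y1", "x1", 240.0), ("y2", "x2", 118.0),
          ("y2", "x3", 60.0), ("y3", "x1", 50.0)]
print(sorted(bidirectional_best_hits(x_vs_y, y_vs_x)))
# → [('x1', 'y1'), ('x2', 'y2')]
```

In the toy data, x3 has y2 as its best hit, but y2 prefers x2, so the pair (x3, y2) is not reported. This is the reciprocity condition at work.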
The BBH detection has some constraints. Some genes in a species can have more than 1 ortholog in another species. This can reflect a gene duplication after a speciation event, whereas the ortholog counterpart in the other genome remains in single copy, namely, a singleton gene.35 Remm et al36 referred to these duplicated genes after a speciation event as in-paralogs, developing a dedicated algorithm for their detection called
Another constraint of graph-based methods is that they work with only 2 species at a time and are not effective at large evolutionary distances,31 because low sequence similarities may not be detected at all. Tree-based methods, in contrast, can work on more species and provide more information than pairs or groups of orthologs, such as evolutionary distances and the order of duplication and speciation events, but they are computationally much more expensive than graph-based algorithms.22,29 Moreover, they may be less reliable when large evolutionary distances occur.
The BBH detection is faster and easy to automate when based on heuristic approaches.22 To compare results from graph-based methods and tree-based ones, statistical approaches were considered.7,8,15,21,38 By these measures, neither of the 2 methods proved to be optimal. Overall, many BBH algorithms reached higher sensitivity at the cost of specificity, whereas the tree-based methods showed the opposite trend. At short evolutionary distances, instead, graph-based methods and tree-based methods produced similar sets of orthologs.22 A recent study,15 however, showed that more complex tree reconstruction/reconciliation methods are sometimes outperformed by pairwise comparison approaches such as BBH. This suggests that tree reconciliation, although more specific in theory, does not prove to be the best methodology in practice. These issues probably explain why many researchers prefer simple BBH implementations to a more complex tree-based approach.15,22
Protein sequences have been the most widely exploited reference to identify orthology relationships, mainly because these analyses are useful to find functional counterparts among species. However, orthology should refer to evolutionary relationships traceable at genome level. Similarity at protein sequence level does not always represent similarity at gene level. Indeed, it does not allow the exploitation of similarities embedded in untranslated regions (UTRs), and it may lead to misinterpretations due to sequencing errors or to open reading frame (ORF) misassignments, which are difficult to detect because of the lack of consistent confirmatory data at proteome level. In a study by Trachana et al,39 gene annotation emerged as the largest single influencer of the quality of orthology detection procedures, affecting up to 30% of the performance of these methods.
To overcome the limitations due to the poor quality of protein sequence predictions, especially in preliminary gene annotations, we developed
We tested our approach considering 2 plant species,
Materials and Methods
Data sets
Transcriptome and proteome collections for
Similarity detection
Reciprocal all-against-all sequence similarity searches of the protein and messenger RNA (mRNA) collections of the 2 species were performed using the BLASTp and tBLASTx programs of the BLAST package,44 respectively. The analysis was performed fixing the Expect-value (E-value, E) cutoff at 10^-3.
Moreover, an all-against-all protein sequence similarity search using the BLASTp program was performed to validate and compare the results from simple BLASTp-based comparisons and reference ortholog databases. A less stringent E-value cutoff at 1000 was used to include even low similarity in the assignment of reciprocal hits.
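As an illustration of how an E-value cutoff such as those above could be applied downstream, the following sketch filters hits read from BLAST tabular output. It assumes the standard 12-column tabular layout (query, subject, percent identity, length, mismatches, gap openings, query start/end, subject start/end, E-value, bit score); the text does not state which output format was actually used. The names `BlastHit` and `filter_hits` are hypothetical, and treating the cutoff as inclusive is also an assumption.

```python
from typing import Iterable, Iterator, NamedTuple


class BlastHit(NamedTuple):
    query: str
    subject: str
    evalue: float
    bitscore: float


def filter_hits(lines: Iterable[str], max_evalue: float = 1e-3) -> Iterator[BlastHit]:
    """Yield hits whose E-value is at or below max_evalue.

    Assumes tab-separated lines in the standard 12-column BLAST tabular
    layout, with the E-value in column 11 and the bit score in column 12.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comment lines
            continue
        fields = line.split("\t")
        evalue = float(fields[10])
        if evalue <= max_evalue:
            yield BlastHit(fields[0], fields[1], evalue, float(fields[11]))


# Placeholder records, not real data.
sample = [
    "x1\ty1\t45.2\t120\t60\t2\t1\t120\t5\t124\t2e-30\t118",
    "x1\ty2\t28.0\t50\t36\t0\t10\t59\t3\t52\t0.5\t24.3",
]
for hit in filter_hits(sample):
    print(hit.query, hit.subject, hit.evalue)
```

Passing a much larger `max_evalue` (for example 1000) would keep low-similarity hits as well, mirroring the less stringent search described above.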
Algorithm description
To identify BBHs and eBBHs based on transcript collections, we developed Transcriptologs, a dedicated method consisting of 2 procedures, namely,

Pseudocode of the alignment reconstruction algorithm we developed.

Pseudocode of BBH algorithm we developed. BBH indicates Bidirectional Best Hit; eBBH,
The method considers the 2 files resulting from the reciprocal tBLASTx transcript similarity searches (eg, Species1_vs_Species2.txt and Species2_vs_Species1.txt). For a query sequence xi and a subject sequence yj, from the sets of sequences X and Y of the 2 species under comparison, the tBLASTx results may include several different alignments, each defined by different fragments fm all belonging to the same frame. To define more extended alignments, we designed a dedicated procedure (alignment_reconstruction, Figure 1) that selects the alignment fragments corresponding to the best scored alignment and then adds other fragments from alignments in different reading frames on the same strand sh, if present. The fragments are added exclusively if they do not overlap regions already considered in the procedure of alignment reconstruction (Figure 3). The score of the extended final alignment is defined as the sum of the scores of the single alignment fragments added during the reconstruction.
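The reconstruction procedure described above can be sketched as follows. This is a hedged illustration rather than the Transcriptologs code: the `Fragment` structure and the function names are hypothetical, the seed alignment is assumed to be the frame pair that contains the single highest-scoring fragment, candidate fragments are assumed to be tried in descending score order, and an overlap on either the query or the subject coordinates is assumed to disqualify a candidate.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Fragment:
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    q_frame: int  # reading frame on the query, e.g. +3 or -1
    s_frame: int  # reading frame on the subject
    score: float


def _overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two closed intervals share at least one position."""
    a_lo, a_hi = sorted((a_start, a_end))  # minus-strand hits may have start > end
    b_lo, b_hi = sorted((b_start, b_end))
    return a_lo <= b_hi and b_lo <= a_hi


def _same_strand(a: Fragment, b: Fragment) -> bool:
    """Same strand on both query and subject (the sign of the frame gives the strand)."""
    return (a.q_frame > 0) == (b.q_frame > 0) and (a.s_frame > 0) == (b.s_frame > 0)


def reconstruct_alignment(fragments: List[Fragment]) -> Tuple[List[Fragment], float]:
    """Return the fragments of the extended alignment and their summed score."""
    if not fragments:
        return [], 0.0
    seed = max(fragments, key=lambda f: f.score)
    seed_frames = (seed.q_frame, seed.s_frame)
    # Step 1: keep every fragment that shares the frame pair of the seed.
    chosen = [f for f in fragments if (f.q_frame, f.s_frame) == seed_frames]
    # Step 2: try fragments from other frames, best score first.
    others = sorted(
        (f for f in fragments if (f.q_frame, f.s_frame) != seed_frames),
        key=lambda f: f.score,
        reverse=True,
    )
    for cand in others:
        if not _same_strand(cand, seed):
            continue
        clash = any(
            _overlaps(cand.q_start, cand.q_end, c.q_start, c.q_end)
            or _overlaps(cand.s_start, cand.s_end, c.s_start, c.s_end)
            for c in chosen
        )
        if not clash:
            chosen.append(cand)
    return chosen, sum(f.score for f in chosen)


# Toy fragments with invented coordinates and scores.
frags = [
    Fragment(10, 300, 5, 295, 3, 1, 150.0),     # seed, frames +3/+1
    Fragment(400, 600, 380, 580, 2, 3, 80.0),   # other frames, no overlap: added
    Fragment(250, 450, 240, 440, 1, 2, 60.0),   # overlaps the seed: rejected
    Fragment(700, 900, 700, 900, -1, 1, 70.0),  # opposite query strand: rejected
]
kept, total = reconstruct_alignment(frags)
print(len(kept), total)  # → 2 230.0
```

The toy example mirrors the situation in which a +3/+1 fragment and a +2/+3 fragment are combined because they do not share overlapping segments; the coordinates and scores are invented for illustration.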

Improvement example of the total alignment length. If we have to align 2 sequences AT1G50940.1 and Sb01g002210.2 (highlighted in green), the tBLASTx program provides different alignment fragments (highlighted in gray), each one corresponding to a given reading frame (highlighted in red) of the 2 sequences. In this example, the algorithm we designed is able to rebuild an entire alignment using an alignment fragment with a reading frame of +3/+1 and an alignment fragment with a reading frame of +2/+3 because they do not share overlapping segments of the aligned sequences.
When selecting reciprocal hits, we also implemented the possibility of setting a tolerance threshold around the score associated with the BBH to define eBBHs. This makes it possible to identify other sequences yk that are similar, within an established range, to the query sequence xi. Therefore, the method can detect not only the best hit that is bidirectional but also other bidirectional hits whose scores fall within a chosen range of the best one (Figure 2). Transcriptologs software is free and available at https://github.com/LucaAmbrosino/Transcriptologs.git.
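One possible reading of the tolerance threshold is sketched below. It rests on assumptions not stated in the text: the tolerance is taken as a fraction of each query's best score, and a pair is reported only if each sequence falls within that range of the other's best hit. The names `near_best_hits` and `extended_bbhs` are hypothetical, and the actual Transcriptologs definition may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Assumed input format: one (query, subject, score) tuple per hit.
Hit = Tuple[str, str, float]


def near_best_hits(hits: Iterable[Hit], tolerance: float) -> Dict[str, Set[str]]:
    """For each query, the subjects scoring at least (1 - tolerance) * best score."""
    by_query: Dict[str, Dict[str, float]] = defaultdict(dict)
    for query, subject, score in hits:
        previous = by_query[query].get(subject)
        if previous is None or score > previous:
            by_query[query][subject] = score  # keep the top score per pair
    result: Dict[str, Set[str]] = {}
    for query, scores in by_query.items():
        cutoff = max(scores.values()) * (1.0 - tolerance)
        result[query] = {s for s, sc in scores.items() if sc >= cutoff}
    return result


def extended_bbhs(
    x_vs_y: Iterable[Hit], y_vs_x: Iterable[Hit], tolerance: float = 0.0
) -> Set[Tuple[str, str]]:
    """Pairs (x, y) that are near-best hits of each other in both directions.

    With tolerance = 0 this reduces to plain BBHs (plus any exact ties).
    """
    near_xy = near_best_hits(x_vs_y, tolerance)
    near_yx = near_best_hits(y_vs_x, tolerance)
    return {
        (x, y)
        for x, ys in near_xy.items()
        for y in ys
        if x in near_yx.get(y, set())
    }


# Toy data with placeholder identifiers.
x_vs_y = [("x1", "y1", 200.0), ("x1", "y2", 190.0), ("x2", "y2", 100.0)]
y_vs_x = [("y1", "x1", 198.0), ("y2", "x1", 185.0), ("y2", "x2", 195.0)]
print(sorted(extended_bbhs(x_vs_y, y_vs_x)))
# → [('x1', 'y1'), ('x2', 'y2')]
print(sorted(extended_bbhs(x_vs_y, y_vs_x, tolerance=0.1)))
# → [('x1', 'y1'), ('x1', 'y2'), ('x2', 'y2')]
```

With a 10% tolerance the pair (x1, y2) is recovered because each sequence scores within 10% of the other's best hit, although neither is the other's top hit.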
Results and Discussion
Comparison of reference databases
To compare the orthology relationships available from reference ortholog collections, namely, Ensembl Plant BioMart5 and PLAZA,11 with those from a simple BLASTp analysis, we performed a protein-based all-against-all similarity search between

Comparison of results detected by BioMart, PLAZA, and an in-house BLASTp analysis. Venn diagram showing (A) the number of Arabidopsis genes that have a relationship with a Sorghum counterpart, (B) the number of Sorghum genes that have a relationship with an Arabidopsis gene, and (C) the number of exact relationships between Arabidopsis and Sorghum genes.
Although the looser BLASTp E-value cutoff was set to include all the possible relationships among the respective protein sequence collections, interestingly, the resulting hits did not cover all of those described in the considered databases. Conversely, we detected a large number of relationships with highly significant E-values (E ⩽ 10^-100) that were not found in the public collections of orthologs. The comparison presented here highlights that ortholog collections available in open-access databases, although all based on similarity levels among protein sequences, are quite heterogeneous and could not be confirmed by simple similarity searches. The limited reproducibility of the results from publicly available collections affects their role as references and calls for methods that appropriately assess the reliability of individual relationships.
Orthology inference
Transcriptologs results were compared with protein-based sequence similarity searches performing all-against-all independent analyses. Protein sequences (BLASTp) and translated mRNA (tBLASTx) sequences were both analyzed setting an E-value cutoff at 10^-3.
We considered translated nucleotide sequences because protein similarity scoring is more sensitive than nucleotide-based scoring. Moreover, the results can be appropriately compared with those from classical protein-based approaches. In addition, this approach can also assess similarity between 2 sequences in the presence of frameshifts due to sequencing errors, annotation limits, or true evolutionary divergence.
For each detected query-subject hit pair, tBLASTx provides a list of alignment fragments grouped by frame, corresponding to different alignments, each with an associated score. The alignment reconstruction algorithm (Figure 1) attempts to reconstruct the most extended alignment between the 2 mRNAs. First, the algorithm collects all the fragments sharing the reading frame of the best-scoring BLAST alignment. Then, it adds fragments from different reading frames as long as they are on the same strand and do not overlap the fragments already collected. The new alignments and their scores, defined as the sum of the scores of the contributing fragments, are the final results of the alignment reconstruction algorithm.
In the example test we considered, the resulting 82 721 tBLASTx alignments out of 1 181 628 total matches (Table 1) were reconstructed adding at least 1 alignment fragment among those included in the tBLASTx original output. The improved algorithm led to an increase in (a) the average score values of about 54 units compared with the original tBLASTx output, (b) the average number of alignment fragments forming the final complete alignment, (c) the average number of identity matches, and (d) the average alignment length (Table 1).
Comparison of results from tBLASTx and Transcriptologs.
Mean values of the score, number of fragments, number of identities matches, and alignment length, related to the alignments that were refined by our implementation, are shown.
Subsequently, BBHs between

Comparison between Transcriptologs and BLASTp analyses. (A) Pie charts showing some features of BBHs detected only using protein sequences. (B) Venn diagram showing differences and similarities in the number of BBHs detected using protein sequences and transcript sequences. (C) Pie charts showing some features of BBHs detected only using transcript sequences. In the pie chart on the left, the number of alignments that involve UTRs is shown in green, the number of alignments obtained from at least 2 fragments having different reading frames between them is shown in orange, the number of alignments with a different reading frame in comparison with the predicted proteins is shown in gray, the number of alignments with a similarity score less than 100 is shown in blue, and the remaining number of alignments is shown in yellow. BBHs indicates Bidirectional Best Hits; UTRs, untranslated regions.

Comparison between Transcriptologs and protein Bidirectional Best Hits (BBHs). Distribution of the (A) BBH scores detected only using transcript sequences, (B) BBH E-values detected only using transcript sequences, (C) BBH scores detected only using protein sequences, and (D) BBH E-values detected only using protein sequences.
Then, among the BBHs exclusively detected by the BLASTp (2610 matches) and by the tBLASTx (2561 matches) methods, we considered the cases in which the same Arabidopsis gene found a different Sorghum ortholog when considering the transcript-based comparison or the protein-based comparison (Figure 7A), and vice versa (Figure 7B). Plotting the score distribution of the corresponding alignments based on the 2 different approaches, we noticed higher scores of the similarities detected by tBLASTx considering the 2 species (Figure 7). This highlighted that transcript-based comparisons detect alignments with a higher score when compared with the ones obtained from protein sequences, finding more valuable associations. As an example, the dot highlighted in Figure 7A by an arrow indicates the score of an

Comparison between Transcriptologs and protein BBHs. Distribution of the BBH scores detected exclusively using transcript and protein sequences, involving (A) the same

Example of improved similarity detection based on transcript. (A)
In detail, most of the protein BBHs (2479 of 2610) were also detected by the tBLASTx analysis before the selection of the BBHs (Figure 5A and B). They were not selected because they were not best reciprocal hits; these relationships would, however, have been detected by setting a tolerance threshold around the best hit. The remaining 131 relationships among the BBHs exclusively detected from protein sequences were not found by the similarity search based on transcript sequences (Figure 5A). Interestingly, in only 1 case was the score higher than 100, highlighting the minor relevance of the lost alignments associated with the protein-based approach.
Next, considering details of the transcript BBHs, most of them (2467 of 2561) were included in the results from the BLASTp analysis (Figure 5B and C). Also in this case, a small group of relationships (94) was not found by the similarity search based on protein sequences (Figure 5C). Among them, 68% (64 of 94 matches) had a score higher than 100, indicating relevant similarities to be considered for the BBH definition. To further investigate why these relationships were missed at protein sequence level, we analyzed these 64 matches in depth. We observed that (a) 38 matches had extended alignments when compared with those from proteins because they involved UTRs and (b) 12 alignments involved reading frames not corresponding to the annotated protein-coding regions. Specifically, 7 alignments involved alternative reading frames when compared with the expected protein-coding ones, probably resolving an error in the protein prediction, whereas 5 alignments were reconstructed with fragments from different reading frames (Figure 5C). These results highlight that the BBHs based on transcript sequences provide an overall better assessment of the sequence similarity of the involved genes. In addition, the presence of these 94 alignments exclusively detected by the transcript-based approach was cross-checked with results from public ortholog databases, namely, PLAZA11 and BioMart.5 Although both databases contain orthologs, the methods used for their detection differ and are not exclusively based on the comparison of protein sequences; examples are the reconciliation of the phylogenetic tree of a gene family in PLAZA and the direct genome comparison in BioMart. In total, 22 orthologs in PLAZA and 21 in BioMart out of 94 were confirmed, 16 of which are common to both resources. This highlights that the transcript-based approach can be as sensitive as more complex approaches such as the ones just cited.
Conclusions
In this work, we presented Transcriptologs, a method for the identification of orthologs exploiting transcript sequences. The Transcriptologs represent orthologs detected by BBHs defined based on revisited tBLASTx results.
As a case study, we tested Transcriptologs to define ortholog genes between
Transcriptologs was implemented for a straightforward exploitation of transcript sequences because of the widespread sequencing of transcriptomes and, therefore, the higher reliability of the transcript region in comparison with protein sequences, which are usually predicted based on similarity and/or ORF detection. Indeed, large-scale definition of transcript sequences is today easily achievable, thanks to classical (expressed sequence tag sequencing) and novel (RNAseq) technologies in comparison with proteome sequencing. Moreover, the revisiting of tBLASTx output performed by Transcriptologs overcomes possible limits in the definition of the correct coding frame. The method also exploits a wider region for the detection of similarities, including the UTRs. Therefore, as demonstrated here, it has a higher sensitivity in the detection of the BBHs.
Although classical approaches and publicly available collections are based on protein sequence similarity searches, we first showed the heterogeneity of the collections available today, whose results are often not comparable with one another. Then, considering the E-value of an alignment between 2 sequences as a surrogate of the alignment quality, we showed that the orthology relationships available from these collections are neither comparable with, nor interpretable from, the results of a BLAST similarity search.
We compared the results from protein-based BBHs and transcript-based BBHs, and we investigated the main differences between them. We highlighted that similarity searches at transcript level can lead to different results when compared with protein-based analyses. In particular, considering the quality of the alignments, we assert that orthologs detected using transcriptomic data have higher scores, taking advantage of reconstructed alignments that are extended along the transcripts and that also include regions with different coding frames. The proposed approach may also overcome sequencing errors and possible limits in the detection of similarities that could be hidden at protein level. The method may complement classical approaches because it confirms results from previous ortholog collections based on protein sequences and it can highlight new relationships, thanks to the exploitation of a higher information content.
Moreover, Transcriptologs can support a widespread analytical approach, such as the ortholog detection, exploiting more accessible and reliable data, such as those from transcript sequences. In the era of fast genome and transcript sequencing, draft gene annotations are often released without consistently undergoing human curation. Although these efforts are usually supported by incredible enrichment of transcriptome data sets, the proteome complement is still limited, and alternative approaches for ortholog detection may lead to results that are more reliable.
Footnotes
Acknowledgements
The authors thank the anonymous reviewers for their comments and suggestions, which were helpful in improving the manuscript. This work was carried out in the frame of the Cost Action FA1106 and of the Genopom Pro and HORT Projects (Ministero dell’Istruzione, dell’Università e della Ricerca [MIUR], Italy).
Peer review:
Seven peer reviewers contributed to the peer review report. Reviewers’ reports totaled 2116 words, excluding any confidential comments to the academic editor.
Funding:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Genopom Pro and HORT Projects (Ministero dell’Istruzione, dell’Università e della Ricerca [MIUR], Italy).
Declaration of conflicting interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions
LA developed the method and analyzed the data. MLC conceived the approach and directed all the work. Both the authors participated in the study and wrote the manuscript. Both the authors approved the final version of the manuscript.
