Abstract
The 3 biological domains delineated based on small subunit ribosomal RNAs (SSU rRNAs) are confronted by uncertainties regarding the relationship between Archaea and Bacteria, and the origin of Eukarya. The similarities between the paralogous valyl-tRNA and isoleucyl-tRNA synthetases in 5398 species estimated by BLASTP, which decreased from Archaea to Bacteria and further to Eukarya, were consistent with vertical gene transmission from an archaeal root of life close to
Keywords
Introduction
Molecular evolution analysis of small subunit ribosomal RNAs (SSU rRNAs) yielded a universal but unrooted tree of life (ToL) that comprises the 3 biological domains of Archaea, Bacteria, and Eukarya.1 A ToL of transfer RNAs (tRNAs) based on the genetic distances between the 20 classes of tRNA acceptors for different amino acids located the Last Universal Common Ancestor (LUCA) near the hyperthermophilic archaeal methanogen
Materials and Methods
Source of data and materials
Protein and SSU rRNA sequences were retrieved from NCBI GenBank release 231 (ftp://ftp.ncbi.nlm.nih.gov/genomes/).21,22 For species without available SSU rRNA information in NCBI, quality-checked SSU rRNA sequences were downloaded from the SILVA database release 132 (https://www.arb-silva.de/).23 For species with multiple SSU rRNA sequences, the one yielding the highest total bitscore against the SSU rRNAs of other species from the same domain (using BLASTN 24 with the “-word_size” flag set to 4) was employed for analysis. The accession numbers of the SSU rRNAs analyzed are available in File S1 in Supplementary Materials. Eukaryotic mitochondrial DNA-encoded protein sequences were retrieved from the RefSeq mitochondrial reference genomes in the NCBI Protein database (https://www.ncbi.nlm.nih.gov/protein).
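The selection of one representative SSU rRNA per species can be sketched as follows. This is a minimal illustration, assuming the BLASTN results have already been parsed into (query sequence, subject species, bitscore) tuples; the tuple layout, the function name `pick_representative`, and the toy values are hypothetical and not taken from the study.

```python
from collections import defaultdict

def pick_representative(hits):
    """Return the candidate SSU rRNA with the highest total bitscore.

    hits: iterable of (query_seq_id, subject_species, bitscore) tuples for the
    candidate sequences of ONE species (hypothetical, pre-parsed layout).
    """
    totals = defaultdict(float)
    for query_id, _subject_species, bitscore in hits:
        totals[query_id] += bitscore  # sum over all same-domain subjects
    return max(totals, key=totals.get)

# Toy data: two candidate sequences scored against two other species.
hits = [
    ("rrnA", "sp1", 900.0), ("rrnA", "sp2", 850.0),
    ("rrnB", "sp1", 910.0), ("rrnB", "sp2", 800.0),
]
best = pick_representative(hits)  # rrnA totals 1750, rrnB totals 1710
```

Summing over all subjects, rather than taking the single best hit, favors the copy that is most similar to the domain as a whole.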
Estimation of nuclear or mitochondrial proteome similarity bitscores
When comparing proteome similarities, the proteomes of all subject species were used to construct a local BLAST database using makeblastdb, 24 and every query proteome was searched against the local database using BLASTP with a BLOSUM62 matrix and thresholds set to evalue <1 × 10−5, percent identity >25%, and query coverage >50%. Only the query and subject sequences that were the best match of each other, viz when query sequence
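The threshold filter and the "best match of each other" criterion can be illustrated with a short sketch. It assumes that hits from a forward search (query against subject) and a reverse search (subject against query) are both available as records; the record fields, the helper names, and the use of a separate reverse search are assumptions made for illustration, not details confirmed by the text.

```python
def hit(query, subject, bitscore, evalue=1e-30, pident=40.0, qcov=80.0):
    """Build one hit record; field names are illustrative only."""
    return {"query": query, "subject": subject, "bitscore": bitscore,
            "evalue": evalue, "pident": pident, "qcov": qcov}

def passes_thresholds(h):
    # Thresholds quoted in the text: evalue < 1e-5, identity > 25%, coverage > 50%.
    return h["evalue"] < 1e-5 and h["pident"] > 25.0 and h["qcov"] > 50.0

def best_hits(hits):
    """Map each query to its highest-bitscore subject among passing hits."""
    best = {}
    for h in filter(passes_thresholds, hits):
        current = best.get(h["query"])
        if current is None or h["bitscore"] > current["bitscore"]:
            best[h["query"]] = h
    return {q: h["subject"] for q, h in best.items()}

def reciprocal_best(forward_hits, reverse_hits):
    """Keep (query, subject) pairs that are each other's best match."""
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    return sorted((q, s) for q, s in forward.items() if reverse.get(s) == q)

forward = [hit("a1", "b1", 200), hit("a1", "b2", 100), hit("a2", "b2", 150),
           hit("a3", "b3", 300, evalue=1e-3)]  # a3 fails the evalue threshold
reverse = [hit("b1", "a1", 200), hit("b2", "a1", 100), hit("b2", "a2", 90),
           hit("b3", "a3", 300)]
pairs = reciprocal_best(forward, reverse)
```

In the toy data, a2 finds b2 as its best match, but b2 prefers a1, so only the a1/b1 pair survives.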
Estimation of rProt similarity bitscores
To identify rProt sequences in Gla, Trv, Sce, and Hsa (see species name abbreviations in Table 1), eukaryotic proteomes were cleared of mitochondrial or mitochondrial DNA-encoded proteins, and then searched against the Pfam database 25 using RPSBLAST 24 at a threshold set by the “-evalue” flag at 0.01. For each of the 88 rProt families analyzed (Table S1), only the protein sequence from each species that yielded the highest bitscore toward the rProt family was analyzed further. On this basis, 79, 81, 84, and 86 out of the 88 rProt families were found in the Gla, Trv, Sce, and Hsa proteomes, respectively. These eukaryotic rProts were blasted against all the prokaryotic proteomes using BLASTP. Prokaryotic proteins passing the threshold of evalue <0.05 were searched against the Pfam database using RPSBLAST, and false-positive sequences that failed to map to the targeted rProt family were removed. The similarities between the rProt sequences identified from eukaryotes and prokaryotes were estimated based on the maximum BLASTP bitscores.
Table 1. Partial list of species analyzed.
Note: C. in front of a species name stands for Candidatus. Detailed species information is given in Table S2.
Estimation of non-rProt similarity bitscores
To identify Gla-like protein families in various prokaryotes, every sequence in the Gla proteome was blasted against the 82 prokaryotic proteomes in Table 1 (except for Psy from preprint form), and the best matches passing the threshold of evalue <0.05 were mapped to the Pfam database using the NCBI Batch CD-search Tool. 26 To remove false-positive pairs, only cases where both query and subject sequences belonged to the same targeted protein family were analyzed, and the Gla sequences that were relatively rare in prokaryotes, displaying similarity bitscores toward ⩽10 out of the 82 prokaryotic proteomes tested, were classified as Gla-like proteins.
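The rarity criterion can be sketched as a simple count over proteomes. The example below assumes that, for each Gla protein, the set of prokaryotic proteomes with a validated same-family hit has already been collected; the input layout and the function name `rare_in_prokaryotes` are hypothetical.

```python
def rare_in_prokaryotes(hits_by_protein, max_proteomes=10):
    """Return proteins with validated hits in at most `max_proteomes` proteomes.

    hits_by_protein: {protein_id: set of prokaryotic proteome ids}, expected to
    contain only proteins with at least one validated hit (an assumption).
    """
    return sorted(protein for protein, proteomes in hits_by_protein.items()
                  if len(proteomes) <= max_proteomes)

# Toy data: proteome ids are just integers here.
hits_by_protein = {
    "p1": set(range(3)),    # hits in 3 proteomes: rare
    "p2": set(range(11)),   # hits in 11 proteomes: not rare
    "p3": set(range(10)),   # hits in exactly 10 proteomes: still rare (<=10)
}
gla_like = rare_in_prokaryotes(hits_by_protein)
```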
Results and Discussion
Similarity between VARS-IARS paralogues
The relative antiquity of proteins could be approximated, except for proteins that have undergone extraordinarily extensive evolution, based on the increasing divergence of paralogous proteins in time.27 Accordingly, BLASTP was performed between the intraspecies valyl-tRNA synthetase (VARS) and isoleucyl-tRNA synthetase (IARS) in the genomes of 5398 species in NCBI GenBank. When the bitscores obtained were arranged in descending order (Table S2), or in part on a distribution curve (Figure 1), Mka yielded a top bitscore of 473. BLASTP, which provided an indication of similarity but not necessarily of phylogenetic relationship,28 was a fitting tool for evaluating the intracellular divergence of VARS-IARS, which carried no phylogenetic implication: 2 neighboring species on the distribution curve could belong to 2 different biological domains. As the 119 highest scoring species were all archaeons, the top-scoring bacterium Mau gave only a bitscore of 378 and the top-scoring eukaryote Esi gave only a bitscore of 240, the smallest VARS-IARS divergences were clearly confined to Archaea, in keeping with the descent of Bacteria from Archaea, and descent of Eukarya from either Archaea or an Archaea-Bacteria collaboration. The foremost antiquity of Mka indicated by its bitscore was in accordance with the Mka-proximal LUCA identified by the genetic distances between alloacceptor tRNAs,2 and the unchanging environment throughout the ages at the hydrothermal vents inhabited by Mka. It was also consistent with the datings of the

Figure 1. Ranking of similarity bitscores of intraspecies VARS-IARS for various species in descending order (from left to right). The bitscores for 1185 archaeal, 3621 bacterial, and 592 eukaryotic species from NCBI are given in Table S2. IARS indicates isoleucyl-tRNA synthetase; NCBI, National Center for Biotechnology Information; VARS, valyl-tRNA synthetase.
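The ranking can be reproduced in miniature with the five bitscores quoted in the text (Mka 473, Mac 382, Mau 378, Pfu 369, Esi 240). The sketch below is illustrative only; the dictionary layout and variable names are assumptions, and the full data set is in Table S2.

```python
# Intraspecies VARS-IARS bitscores quoted in the text: species -> (domain, bitscore).
scores = {
    "Mka": ("Archaea", 473),
    "Mac": ("Archaea", 382),
    "Mau": ("Bacteria", 378),
    "Pfu": ("Archaea", 369),
    "Esi": ("Eukarya", 240),
}

# Rank species in descending order of bitscore, as on the distribution curve.
ranked = sorted(scores.items(), key=lambda item: item[1][1], reverse=True)
ranked_names = [species for species, _ in ranked]

# The first species met for each domain in the ranking is that domain's top scorer.
top_per_domain = {}
for species, (domain, bitscore) in ranked:
    top_per_domain.setdefault(domain, (species, bitscore))
```

The ordering places the bacterium Mau between the archaeons Mac and Pfu, which illustrates how neighbors on the curve can belong to different domains.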
The positions of some of the species analyzed in Figure 1 were indicated on the SSU rRNA tree, with their intraspecies VARS-IARS bitscores expressed in circles colored according to the thermal scale (Figure 2A).

Figure 2. Distribution of similarity bitscores relating to VARS and IARS on SSU rRNA tree. (A) Bitscores for VARS-IARS pairs. (B) Bitscores for VARS (squares), or IARS (triangles), between Gla and other organisms. For building the consensus maximum parsimony tree of SSU rRNAs for 29 archaeal, 31 bacterial, and 19 eukaryotic species using PHYLIP version 3.698, 30 the sequences were aligned in Clustal Omega. 31 One thousand sets of bootstrap-resampled sequence alignments were generated using SEQBOOT and inputted into DNAPARS to construct maximum parsimony trees. The consensus tree was produced based on the 1000 sets of maximum parsimony trees using CONSENSE. The nodes indicate more than 85% bootstrap support (black), more than 50% (gray), or less than or equal to 50% (white). IARS indicates isoleucyl-tRNA synthetase; SSU rRNA, small subunit ribosomal RNA; VARS, valyl-tRNA synthetase.
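The node coloring in the caption follows directly from the bootstrap counts. The sketch below assumes that, for each node, the number of the 1000 bootstrap trees containing that node is known; the function names are hypothetical and this is not part of PHYLIP.

```python
def support_percent(count, replicates=1000):
    """Convert the number of bootstrap trees containing a node to a percentage."""
    return 100.0 * count / replicates

def node_color(percent):
    """Color scheme from the caption: >85% black, >50% gray, otherwise white."""
    if percent > 85:
        return "black"
    if percent > 50:
        return "gray"
    return "white"

colors = [node_color(support_percent(c)) for c in (900, 850, 501, 500)]
```

Both boundaries are exclusive, so a node supported by exactly 850 of 1000 trees is gray and one supported by exactly 500 is white.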
There was a concentration of euryarchaeons with high VARS-IARS similarity in a “Primitive Archaea Cluster” centered between Pfu and Mac. In the Bacteria domain, there was likewise a concentration of species with high VARS-IARS similarity in an “Ancestral Bacteria Cluster” centered between Det and Hth. The deepest branching species in the Bacteria domain were 2 members of the
Given the relative paucity of HGT effects on VARS-IARS similarity, the parallel prominences of high VARS-IARS similarity-bitscore species in the Primitive Archaea Cluster and the Ancestral Bacteria Cluster were explicable by vertical genetic transmission of the VARS and IARS genes from an Mka-proximal root of life to the archaeal cluster, and in turn to the bacterial cluster. As the top-ranked bacterial bitscore of Mau at 378 was between those of archaeons Mac at 382 and Pfu at 369, the results indicated that the Ancestral Bacteria Cluster branched off from the Primitive Archaea Cluster near the Mka-proximal root of life. The medium VARS-IARS bitscores of Esi, Tps, Bpr, and Cme among the Eukarya (Figure 2A) also pointed to the conservation of intraspecies VARS-IARS similarity in this domain. The much higher VARS (colored squares) and IARS (colored triangles) bitscores between Gla and various bacterial species compared to archaeal species, except for the high similarity exhibited by Gla IARS toward that of Abo, suggests that Eukarya received VARS from Bacteria and IARS from Abo or a bacterium (Figure 2B).
Sequence alignments
The aligned segments of VARS and IARS (Figure 3) from Mka, Mau, and Esi, viz the archaeon, bacterium, and eukaryote displaying the highest VARS-IARS similarity within their respective domains, included 42 of 207 columns where all 6 sequences carried the same amino acid, in support of sequence conservation of this pair of paralogous genes among all 3 living domains. Together with the higher rankings of VARS-IARS similarity attained by archaeons relative to both bacteria and eukaryotes (Figure 1), the sequence conservation observed represented strong evidence for the vertical transmission of the VARS and IARS genes from Archaea to both Bacteria and Eukarya.
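Counting fully conserved columns, such as the 42 of 207 reported above, can be sketched as follows. The example assumes the aligned sequences are available as equal-length strings with "-" for gaps; the toy sequences are invented for illustration and are not taken from Figure 3.

```python
def identical_columns(aligned):
    """Return indices of columns where every sequence has the same residue."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    return [i for i in range(length)
            if aligned[0][i] != "-"                     # ignore all-gap columns
            and len({seq[i] for seq in aligned}) == 1]  # one distinct residue

# Toy alignment of three sequences (invented).
toy = ["MKV-L",
       "MRVAL",
       "MKVGL"]
conserved = identical_columns(toy)
fraction = len(conserved) / len(toy[0])
```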

Figure 3. Segments of the aligned VARS and IARS sequences of Mka, Mau, and Esi. Sequences were aligned using Clustal Omega, and the numbers indicate the positions of amino acid residues on the complete sequence alignment (Figure S1). Similar amino acids in the same column are colored in orange, and ⩾50% conserved ones in blue. Asterisks mark the 6 positions where a V or L residue is found in all 6 sequences. IARS indicates isoleucyl-tRNA synthetase; VARS, valyl-tRNA synthetase.
Process of eukaryogenesis
Extensive evidence supports that an endosymbiotic event between an archaeal parent and an alphaproteobacterium played a key role in the development of Eukarya.34,35 Proposals regarding the identity of the archaeal parent have focused on a range of archaeons including
Upon BLASTP comparison of the 79 Gla, 81 Trv, 84 Sce, and 86 Hsa rProt families with prokaryotic rProts, 69/69 Gla, 71/72 Trv, 71/72 Sce, and 71/71 Hsa rProts with prokaryotic resemblance showed higher similarity toward archaeons than toward bacteria. Thus, only 1 of the 72 rProts in Trv (rProt L29) and 1 of the 72 in Sce (rProt S4) showed higher similarity toward bacteria than toward archaeons (Figures 4A and S2), clearly indicating that eukaryogenesis was hosted by an archaeal parent instead of a bacterial parent.36,37 Those rProts in Table S1 without any prokaryotic resemblance might be derived from a prokaryote not analyzed in this study, invented by the eukaryogenic lineage, or diminished in their resemblance by evolutionary changes to beyond recognition by BLASTP.

Figure 4. Protein sequence similarities between Gla and prokaryotic species. (A) Maximum BLASTP bitscores between Gla rProts and prokaryotic rProts. (B) Bitscores of PEP-utilizing enzyme mobile domain (PF00391) between Gla and prokaryotes. (C) Bitscores between some of the Gla-like proteins from Table S3 and potentially homologous proteins in various prokaryotes. (D) Numbers of the 162 Gla-like proteins found in various prokaryotes. The color coding and order of different prokaryotic species on the x-axis in (B), (C), and (D) are the same as those in (A). PEP indicates phosphoenolpyruvate.
Among the 6502 proteins in the Gla proteome, 3203 of them showed finite similarity bitscores toward the sequences of one or more of the 82 prokaryotes tested, and the phosphoenolpyruvate (PEP)-utilizing enzyme mobile domain of Gla yielded the highest combined BLASTP bitscore of any Gla protein toward prokaryotic protein families, with Acf, Abo, and Mac (2nd, 1st, and 14th red columns from the right in Figure 4B) showing the top 3 archaeal bitscores. The bitscores were high for Tho and Hei but low for Odi and nil for Lok (3rd, 4th, 2nd, and 1st purple columns from the right) among the
Figure 4C shows the distribution of potential archaeal and bacterial homologues of some of the 162 Gla-like proteins that were either ESPs or relatively rare proteins found in less than 10 of the 82 prokaryotes analyzed (Table S3). The
Nature of archaeal parent
Eukaryogenesis could follow a

Figure 5. Inter-proteome similarity bitscores. (A) Total similarity bitscores of Gla and Trv proteomes toward individual prokaryotic proteomes. Relationships of average bitscore per best-match hit (y-axis) with the number of best-match hits (x-axis): (B) between prokaryotic and Gla proteomes and (C) between prokaryotic and Trv proteomes.
Based on the premise that the free-living archaeal parent might still retain recognizable similarity toward eukaryotes, 46 archaeal proteomes were compared regarding their relationships with the proteomes of Gla and Trv. Figure 5B and C showed that the proteome of the
When the bacterial-gene contents of different archaeons were compared regarding their abilities to acquire bacterial genes, Hla, Hgi, and Mac with their large proteomes (3704 to 4469 protein-coding genes) displayed high similarity bitscores toward a wide range of bacteria (Figure 6, left panel). However, when the bitscore of each archaeon was normalized with respect to the number of protein-coding genes in its genome, the normalized bitscores of the smaller Abo, Acf, Mte, Tvo, and Tac (each with <1600 protein-coding genes), Mfe (1283 protein-coding genes), and Mlt (1291 protein-coding genes) became more prominent (Figure 6, right panel). The medium-sized Pfu (2065 protein-coding genes) gave much the same result with or without normalization. Notably, the high similarity bitscores exhibited by these archaeal proteomes toward multiple bacterial proteomes suggest that they had efficiently adopted exogenous genes received through HGT into their own genomes. In contrast, the bacterial proteomes of Bja, Tht, Pel, Dth, Tte, and the DNA transformation-active Bsu exhibited only modest bitscores toward a smaller number of archaeons. This enhanced ability of some archaeons to adopt exogenous genes may be referred to as an

Figure 6. Similarity bitscores between archaeal proteomes (y-axis) and bacterial proteomes (x-axis) without (left) or with (right) normalization based on the number of protein-coding genes in each archaeon. Data for the heat maps are given in Table S6.
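The normalization behind the right-hand panel amounts to dividing each archaeon's bitscores by its number of protein-coding genes. In the sketch below, the gene counts for Mfe (1283) and Pfu (2065) are those quoted in the text, whereas the total bitscores are invented round numbers chosen only to show how normalization can reverse a ranking; the nested-dictionary layout is an assumption.

```python
def normalize(total_bitscores, gene_counts):
    """Divide each archaeon's bitscores by its number of protein-coding genes.

    total_bitscores: {archaeon: {bacterium: total bitscore}}
    gene_counts:     {archaeon: number of protein-coding genes}
    """
    return {archaeon: {bacterium: score / gene_counts[archaeon]
                       for bacterium, score in row.items()}
            for archaeon, row in total_bitscores.items()}

gene_counts = {"Mfe": 1283, "Pfu": 2065}   # from the text
raw = {                                    # hypothetical totals toward Bsu
    "Mfe": {"Bsu": 192450.0},
    "Pfu": {"Bsu": 206500.0},
}
normalized = normalize(raw, gene_counts)
```

With these toy numbers the larger proteome scores higher before normalization, but the smaller proteome scores higher per gene afterward.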
On account of the large variety and numbers of prokaryotic genes to be included in eukaryotic genomes (Figure 5A), it would be essential for the archaeal parent to be highly active in AGA, so that it could assemble beneficial genes from wide-ranging prokaryotic sources and incorporate them into its own genome in the course of eukaryogenesis. Besides AGA activity, Abo, the first cultivatable archaeon from the “Deep-sea hydrothermal vent euryarchaeotic 2” (DHVE2) group, and its facultatively anaerobic companion species Acf,57,62,63 possess an exceptionally flexible cell surface that can form small blebbing vesicles that bud off and anneal with other cells. While all prokaryotic cells evolve on the basis of
Similarity bitscores displayed by the proteomes of 225 different archaeons, alphaproteobacterial genera, and other bacteria toward the total mitochondrial DNA-encoded proteins of different eukaryotes indicated that the prokaryotic proteomes displaying top similarity toward each of 19 mitochondrial proteomes were all alphaproteobacterial ones (Table S7). The distributions of the bitscores of the prokaryotic proteomes toward the mitochondrial DNA-encoded proteins of

Figure 7. Similarity bitscores between mitochondrial DNA-encoded proteins and prokaryotic proteins. Total bitscores displayed by 46 archaeons, 150 alphaproteobacterial genera, and 29 other kinds of bacteria toward 3 species of mitochondrial DNA-encoded proteins are shown in the 3 panels. In each case, the 3 top-scoring prokaryotes are indicated with their individual total bitscores inside parentheses.
Conclusions
In this study,
Supplemental Material
FigureS1-S2_xyz322784b82bfeb – Supplemental material for Descent of Bacteria and Eukarya From an Archaeal Root of Life
Supplemental material, FigureS1-S2_xyz322784b82bfeb for Descent of Bacteria and Eukarya From an Archaeal Root of Life by Xi Long, Hong Xue and J Tze-Fei Wong in Evolutionary Bioinformatics
Supplemental Material
TableS1-S7_xyz3227876cd61d9 – Supplemental material for Descent of Bacteria and Eukarya From an Archaeal Root of Life
Supplemental material, TableS1-S7_xyz3227876cd61d9 for Descent of Bacteria and Eukarya From an Archaeal Root of Life by Xi Long, Hong Xue and J Tze-Fei Wong in Evolutionary Bioinformatics
Footnotes
Funding:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Innovation and Technology Commission of Hong Kong SAR (grant number ITS/113/15FP).
Declaration of conflicting interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions
JT-FW and HX conceived the study; XL collected the data and performed computational analysis; and JT-FW, HX and XL wrote the paper. All authors read and approved the final manuscript.
Data Availability
Supporting data for the present study are provided in online Supplementary Materials.
References
