Abstract
Protein domains can be regarded as sections of protein sequences capable of folding independently and performing specific functions. In addition to amino-acid level changes, protein sequences can also evolve through domain shuffling events such as domain insertion, deletion, or duplication. The evolution of protein domains can be studied by tracking domain changes in a selected set of species with known phylogenetic relationships. Here, we conduct such an analysis by defining domains as “features” or “descriptors,” and considering the species (target + outgroup) as instances or data-points in a data matrix. We then look for features (domains) that are significantly different between the target species and the outgroup species. We study the domain changes in 2 large, distinct groups of plant species: legumes (Fabaceae) and grasses (Poaceae), with respect to selected outgroup species. We evaluate 4 types of domain feature matrices: domain content, domain duplication, domain abundance, and domain versatility. The 4 types of domain feature matrices attempt to capture different aspects of domain changes through which the protein sequences may evolve—that is, via gain or loss of domains, increase or decrease in the copy number of domains along the sequences, expansion or contraction of domains, or through changes in the number of adjacent domain partners. All the feature matrices were analyzed using feature selection techniques and statistical tests to select protein domains that have significant different feature values in legumes and grasses. We report the biological functions of the top selected domains from the analysis of all the feature matrices. In addition, we also perform domain-centric gene ontology (dcGO) enrichment analysis on all selected domains from all 4 feature matrices to study the gene ontology terms associated with the significantly evolving domains in legumes and grasses. Domain content analysis revealed a striking loss of protein domains from the Fanconi anemia (FA) pathway, the pathway responsible for the repair of interstrand DNA crosslinks. The abundance analysis of domains found in legumes revealed an increase in glutathione synthase enzyme, an antioxidant required from nitrogen fixation, and a decrease in xanthine oxidizing enzymes, a phenomenon confirmed by previous studies. In grasses, the abundance analysis showed increases in domains related to gene silencing which could be due to polyploidy or due to enhanced response to viral infection. We provide a docker container that can be used to perform this analysis workflow on any user-defined sets of species, available at https://cloud.docker.com/u/akshayayadav/repository/docker/akshayayadav/protein-domain-evolution-project.
Keywords
Introduction
Protein domains are independent evolutionary units of proteins that enable proteins to evolve in a modular fashion through domain insertion, deletion, duplication, or substitution, in addition to evolution through point mutations.1,2 In this ability of protein domains to fold and function independently of other domains, they can be considered as “lego bricks” that can be recombined in various ways to build new proteins.3,4 Small proteins are usually made up of just one domain, whereas large proteins are formed by combinations of multiple domains. 5 Roughly two-thirds of the prokaryotic proteins and four-fifths of the eukaryotic proteins are multidomain proteins that are formed through recombination of 2 or more domains.6,7 The “combinability” of domains makes them prime candidates for studying evolution—both of proteins and species. For example, protein domains have been used to study evolution on genome-wide and species-wide scales by examining the protein-domain content of the species.8-10 Protein-domain content is defined by the presence or absence of protein domains in complete genomes of the species. The importance of protein domains in studying evolution can be verified from the ability of protein-domain content in reconstructing the phylogeny of life, in comparison to trees obtained from standard phylogenetic and phylogenomic approaches that utilize information from molecular markers, gene content, and gene order. 10
In this study, we examine the domain combinations present in 2 groups of plant species—the legumes (Fabaceae) and grasses (Poaceae), treating the protein domains as species “features” that may be present or absent in the focal species. Accordingly, a data matrix was defined with rows representing species, columns representing the protein domains, and the cells containing domain feature values for the respective species. We used standard feature selection and statistical testing techniques to identify protein domains that differ between the target set of species and their respective outgroups.
Gain or loss of particular domains in a group of species can provide a means of understanding trait evolution in those species.11,12 Protein domains can duplicate locally, giving significantly different counts of certain domains. This may provide some useful information about functions associated with those domains.13,14 Counts of protein domains can also increase or decrease along with the proteins that they comprise. 15 Finally, “versatile” domains can partner with multiple different domains; and versatility values can be used to study the evolution of associated functions.3,16,17 We evaluated domain evolution using these types of domain feature matrices: domain content, duplication, abundance, and versatility.
We used 2 types of statistical methods: mutual-information (MI) and nonparametric statistical tests. MI measures mutual dependence between 2 random variables by quantifying the amount of information communicated about one random variable from another random variable. 18 MI has been routinely used for selecting meaningful features, in classification and pattern recognition problems.19-21 Here, we used MI to quantify the mutual dependence between domain feature values and the classification between target and outgroup species. We also employed tests for significance of differences in domain feature values between the target and outgroup species. We applied Fisher’s exact tests 22 for feature matrices containing discrete values, and Wilcoxon rank-sum tests 23 for feature matrices containing continuous values.
Material and Methods
We used 2 sets of plant species to study the species-level changes in protein-domain characteristics for a given set of target species (Figures 1 and 2). The first set consisted of 14 legumes (from the Papilionoideae subfamily within the legume/Fabaceae family), and 10 outgroup species defined concerning the legumes (Table 1).24-45 The second set consisted of 10 grass species (Poaceae) and 9 outgroup species defined concerning the grasses (Table 2).27,36,37,39,40,42-54

Phylogeny of legumes with legume outgroups (left) and table (right) showing the 4 types of domain changes analyzed in this study using example domains mentioned in the second row of the table.

Phylogeny of grasses with grass outgroups (left) and table (right) showing the 4 types of domain changes analyzed in this study using example domains mentioned in the second row of the table.
Legumes and legume outgroups used to study protein-domain evolution in the legumes.
Grasses and grass outgroups used to study protein-domain evolution in the grasses.
All target proteomes from legumes and grasses, together with their respective outgroup proteomes, were searched against domain Hidden Markov Models (HMMs) from the Pfam database (release 32)
55
to assign domains to the protein sequences. The
Calculation of domain feature matrices
The domain content matrix was calculated to represent the presence or absence of domains in target and outgroup species. Columns of the content matrix represent individual Pfam domains and rows represent species. Each cell was assigned a value of “1” if the corresponding domain was detected in the species, else the cell was a value of “0.” Columns with domains that were present in all the target and outgroup species were uninformative and therefore removed.
The domain duplication matrix contains the most frequent copy number of each Pfam domain in species, which was calculated as the modal value of list all possible copy counts of that domain in the corresponding species. The modal value of the list was added to each domain column and corresponding species row. Columns with constant duplication values across all the species (target + outgroup) were removed from the matrix. Also, columns with domain duplication values ⩽1 across all the rows were removed.
The domain abundance matrix was built to represent the abundance value of protein domains in target and outgroup species. Here, we define the abundance value of each domain in each species as the proportion of protein sequences from the entire proteome that contains the domain. The abundance value of each domain in each species is calculated using the inverse domain frequency (IDF) function (equation 1) which is inspired by the inverse document frequency function used in text mining and natural language processing (NLP) applications
where N(
The domain versatility matrix was calculated to represent the changes in the versatility values of the domains across the species. Versatility value (equation 2) for a given domain and species combination was calculated as the reciprocal of the number of domains immediately adjacent to the given domain in protein sequences in the corresponding species. Here too, the columns with constant versatility values across all species (target + outgroup) were removed from the matrix
where
Finally, an additional “species label” column containing value “1” for target species and “0” for outgroup species was attached to all 4 domain feature matrices to represent the classification between target and outgroup species.
Statistical analysis of domain feature matrices
We applied 2 types of statistical analyses to the domain feature matrices. The MI function (equation 3) was used to calculate the MI score for each domain feature by comparing it against the species label column. The MI quantity measures how much information, on average, is communicated in the domain feature column about the classification between target and outgroup species (species label column). Feature columns of the duplication and abundance matrices were subjected to “
We also tested feature columns for significance, calculating
Results
All 4 types of domain feature matrices were calculated for 2 sets of plants—the first containing 14 legume and 10 outgroup species, and the second containing 10 grass and 9 outgroup species. For all feature matrices, we applied MI scoring and significance testing.
Domain content analysis
In legumes and grasses, 13 and 55 domains, respectively, showed significant presence/absence differences relative to their respective outgroups. The results show a loss of 12 domains and gain of the
Domains gained or lost in legumes concerning legume outgroups (top 11 by MI score).
Abbreviations: MI, mutual information; FDR, false discovery rate.
Domains gained or lost in grasses concerning grass outgroups (top 10 by MI score).
Abbreviations: MI, mutual information; FDR, false discovery rate.
Among the top 10 protein domains in grasses, 6 were detected as gained and 4 were detected as lost concerning the grass outgroups. There were 3 domains with unknown functions—
The 4 domains
Domain duplication analysis
Application of MI-scoring and Fisher’s exact tests on domain features of duplication matrices revealed a single domain (of unknown function) in legumes and 8 types of domains in grasses that were significantly different (FDR ⩽ 0.05) in their copy numbers as compared to the copy numbers observed in their respective outgroup sets. The domain
Domains with significant differences in copy numbers between grasses and grass outgroups (top 10 by MI score).
Abbreviations: MI, mutual information; FDR, false discovery rate.
The domains
Domain abundance analysis
The analysis of domain abundance matrices revealed 111 domains in legumes and 497 domains in grasses that have expanded or contracted significantly (FDR ⩽ 0.05), as compared to their respective outgroup sets. In the legumes relative to outgroups, 51 domains have expanded significantly in abundance and 60 domains have contracted. In the grasses, 196 domains have expanded significantly in abundance and 301 domains have contracted. The top 10 significantly expanded or contracted domains in legumes and grasses are listed in Tables 6 and 7.
Domains with significant differences in abundance values between legumes and legume outgroups (top 10 by MI score).
Abbreviations: MI, mutual information; FDR, false discovery rate.
Domains with significant difference in abundance values between grasses and grass outgroups (top 11 by MI score).
Abbreviations: MI, mutual information; FDR, false discovery rate.
Among the top 10 domains showing expansions or contractions in abundance in the legumes, the
The
Among the top 11 domains in grasses to have expanded or contracted in abundance relative to outgroups, only 3 have expanded—specifically, sequences containing the
Sequences containing the
Domain versatility analysis
The analysis of domain versatility matrices revealed a single domain in legumes and 12 domains (Table 8) in grasses with significantly increased or decreased versatility values with respect to their outgroup sets. In legumes, the
Domains with significant differences in versatility values between grasses and grass outgroups.
Abbreviations: MI, mutual information; FDR, false discovery rate.
The
Among the protein domains that have lost domain partners in grasses as compared to the outgroups, the
Domain-centric gene ontology enrichment analysis
To check if the significantly evolving domains (FDR ⩽ 0.05), selected from an analysis of feature matrices, map to any particular gene ontology (GO) terms, we used “dcGO,” the domain-centric ontology database that provides associations between GO terms and protein domains from Pfam. 103 The GO enrichment analysis was performed on domain lists obtained from the content, duplication, abundance, and versatility matrices from both the species sets, to check for significantly enriched GO terms from the 3 GO subontologies: biological process (BP), cellular component (CC), and molecular function (MF).
GO enrichment analysis was performed for the 13 domains from legumes and 55 domains from grasses, that were identified from the analysis of content matrices. Separate enrichment analyses were performed for domains that were detected as gained in target species and domains that were detected as lost in the target species. No GO term enrichment was found for the single
Enriched GO terms from protein domains that were detected as gained in grasses as compared to grass outgroups.
Abbreviation: FDR, false discovery rate.
As a single domain of unknown function (
In domain-centric GO analyses of domains showing significant increase in abundance values, in legumes, enrichment of 3 BP terms and 5 CC terms (Table 10) was found. There is enrichment in biological metabolic processes involving glycosyl compounds (GO:1901659, FDR = 4.80e−03), ribonucleosides (GO:0009119, FDR = 1.39e−02), and isoprenoids (GO:0008299, FDR = 1.39e−02), with involvement in organelle membranes (GO:0098805, FDR = 1.27e−03).
Enriched GO terms from protein domains that show significant increase in abundance values in legumes as compared to legume outgroups.
Abbreviation: FDR, false discovery rate.
GO analyses of domains that showed significant decrease in abundance values between legumes and legume outgroups found enrichment of 10 BP terms and 11 MF terms (Table 11). Among the BP terms, strongest enrichment was found for purine nucleobase metabolic process (GO:0006144, FDR = 9.85e−07) and hydrogen peroxide metabolic process (GO:0042743, FDR = 1.25e−03). Among the MF terms, very strong enrichment was observed for specific MF terms such as xanthine dehydrogenase activity (GO:0004854, FDR = 8.10e−10), oxidoreductase activity, acting on CH or CH2 groups, oxygen as acceptor (GO:0016727, FDR = 8.10e−10), oxidoreductase activity, acting on the aldehyde or oxo group of donors, oxygen as acceptor (GO:0016623, FDR = 8.10e−10), molybdopterin cofactor binding (GO:0043546, FDR = 8.10e−10) and 2 iron, 2 sulfur cluster binding (GO:0051537, 8.25e−08).
Enriched GO terms from protein domains that show significant decrease in abundance values in legumes as compared to legume outgroups.
Abbreviation: FDR, false discovery rate.
In grasses, GO enrichments of 16 BP, 5 CC, and 4 MF terms were found for domains that showed significant increase in abundance values in comparison to the abundance values in grass outgroups (Table 12). Strongest enrichment was observed for specific BP term chromatin silencing (GO:0006342, FDR = 2.02e−05) with relatively moderate enrichments for BPs including protein unfolding (GO:0043335, FDR = 4.26e−03), negative regulation of translational initiation (GO:0045947, FDR = 4.12e−03), positive regulation of nuclear-transcribed mRNA poly(A) tail shortening (GO:0060213, FDR = 4.26e−03), miRNA-mediated inhibition of translation (GO:0035278, FDR = 5.63e−03), small RNA loading onto RISC (GO:0070922, FDR = 5.87e−03), production of siRNA involved in RNA interference (GO:0030422, 7.51e−03), mRNA cleavage (GO:0006379, FDR = 7.51e−03) and pre-miRNA processing (GO:0031054, FDR = 7.51e−03). Enrichments in the CC terms correlated with the BP terms, with general and specific CCs like polysome (GO:0005844, FDR = 8.05e−03), RNAi effector complex (GO:0031332, FDR = 2.93e−03), microribonucleoprotein complex (GO:0035068, FDR = 2.93e−03), RISC-loading complex (GO:0070578, FDR = 2.93e−03) and mRNA cap-binding complex (GO:0005845, FDR = 3.23e−03) showing moderate enrichments. In addition to BP and CC terms, enrichment for specific MF terms such as endoribonuclease activity, cleaving siRNA-paired mRNA (GO:0070551, FDR = 2.06e−04), diphosphotransferase activity (GO:0016778, 3.14e−04) and RNA 7-methylguanosine cap binding (GO:0000340, FDR = 8.12e−04) was found, with strongest enrichment observed for MF involving ubiquitin-activating enzyme activity (GO:0004839, FDR = 2.87e−05).
Enriched GO terms from protein domains that show significant increase in abundance values in grasses as compared to grasses outgroups.
Abbreviations: FDR, false discovery rate; RISC, RNA-induced silencing complex.
For domains that showed significant decrease in abundance value in grasses, GO enrichment for 6 BP and 2 MF terms were observed (Table 13). Among the BP terms, there was moderate enrichments for the specific process, acetyl-CoA metabolic process (GO:0006084, FDR = 2.61e−03) and 2 highly specific processes, namely cellular response to azide (GO:0097185, FDR = 5.64e−03) and cellular response to copper ion starvation (GO:0035874, FDR = 5.64e−03).
Enriched GO terms from protein domains that show significant decrease in abundance values in grasses as compared to grasses outgroups.
Abbreviation: FDR, false discovery rate.
Finally, the domain-centric GO-enrichment analyses of domains that have significant different versatility values in legumes and grasses concerning their outgroup species did not show enrichment of GO terms from any of the 3 subontologies.
Discussion
In this study, we describe evolutionary patterns in species from 2 large plant families: legumes and grasses, by tracking changes in their species-level protein-domain characteristics relative to selected outgroup species. We analyzed 4 types of domain characteristics to study gain and loss of domains, changes in duplication counts of domains along the sequences, expansion and contraction of domains, and changes in the partnering tendency of domains.
The work presents a generic framework for studying evolution of a chosen set of target species using protein domains as a unit of evolution instead of entire protein sequences. The feature-selection techniques used in data science and machine learning like the MI and statistical tests like Fisher’s exact test and Wilcoxon rank-sum test can be used to select or filter-out significantly evolving domains in the target set of species relative to an outgroup set of species, which can be mapped to gain/loss or increase/decrease of particular biological functions in the target species. We have also containerized this entire analysis workflow inside a docker container which can be downloaded from the following URL: https://cloud.docker.com/u/akshayayadav/repository/docker/akshayayadav/protein-domain-evolution-project. The container is designed to accept user-defined set of target and outgroup proteomes along with the Pfam domain database and output domain sets for all 4 feature categories that have significantly different domain feature values (FDR ⩽ 0.05) in target species as compared to the outgroup species.
It should be noted that the FDR-adjusted
Domain content analysis in legumes shows a striking loss of protein domains from FA pathway, the pathway which is responsible for the repair of interstrand DNA crosslinks. The FA pathway consists of a core complex that ubiquitinates the FANCD2-FANCI complex, which then localizes to the site of DNA repair. It seems that all the proteins from FA core complex and the FANCI protein, except the FANCD2 nuclease, are lost in the legumes. Although one of the repair proteins (FANCD2) is present in legumes, the core complex protein (FANCL) that monoubiquitinates the FANCD2, is absent. As ubiquitination of the FANCD2 60 is an indispensable part of the DNA repair process, this could mean that legumes might have lost the ability to repair interstrand DNA crosslinks or that the FA-mediated repair of interstrand DNA crosslinks is carried out without the ubiquitination of FANCD2. In grasses, domains showing gains include those involved in flavonoid biosynthesis (well-studied in maize), as well as structural proteins found in gluten and male florets. The domains that were detected as lost in grasses are involved in functions such as peptidoglycan biosynthesis, wound repair in sieve tubes, and fatty acid synthesis. Fatty acid synthesis may be reduced in the sampled monocots, due to relatively greater production of carbohydrates in grass seeds, and the differences in sieve tube structure in monocots as compared to dicots. 104
Analyses of duplication feature matrices revealed a single domain of unknown function to have significantly decreased in copy number in legumes sequences. In grasses, an increase in copy number of domains such
Domains with significantly increased abundance values in legumes were found to be associated with functions involving Thylakoid formation, Glutathione metabolism, and enriched with GO terms related to biosynthetic/metabolic processes involving glycosyl compounds, ribonucleosides, and isoprenoids. For domains that showed significant decrease in abundance values in legumes, GO terms related to specific BPs and MFs involving oxidation of purine nucleobase xanthine were found to be significantly enriched. A study on xanthine oxidizing enzymes isolated from leaves of legumes confirms that these oxidoreductases do not react with molecular oxygen and are essentially dehydrogenases. 106 The decrease in abundance of domains involved in purine catabolism may also be attributed to the availability of fixed nitrogen and remobilization of nitrogen from breaking down purine rings is no longer required. 107 In grasses, domains showing significant increase in abundance values revealed domains involved in functions related to gene silencing with GO terms such as chromatin silencing, regulation of translational initiation, protein unfolding, micro/si-RNA-mediated gene regulation, showing significant enrichment. The micro-RNA-related enrichments could be attributed to the regulation of floral organ genes in grasses such as rice and maize influencing various features of flower structure. 108 An increase in gene-silencing-related domains could also be attributed to polyploidy in grasses 109 or enhanced response to viral infection. 110 On the contrary, domains with significant decrease in abundance values, in grasses, showed involvement in functions such as cell adhesion, intracellular chloroplast movement, interfascicular fiber differentiation, DNA synthesis, and pectin metabolism with enrichment of GO terms such as acetyl-CoA metabolism and response to azide.
Finally, the increase in the versatility of the zinc-binding domain in legumes could be related to root nodule symbiosis and compound leaf morphology. Nitrogen fixation through root nodule symbiosis is one of the salient features in legumes and studies have shown the involvement of nodule-specific zinc-binding domain-containing proteins in symbiosis establishment and nodule function.
111
The zinc-finger domain-containing transcription factor has also been shown to be involved in trifoliate compound leaf morphology in
This method can be effectively used to study characteristic biological functions/processes for a selected group of species by filtering out protein domains that seem to have differently evolved in the group, with respect to an outgroup set of species. By closely studying the most significantly evolved protein domains and GO terms associated with significantly evolved protein domains, we might be able to explain the molecular mechanisms responsible for characteristic biological features observed in our target group of species.
Footnotes
Funding:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported in part by the NSF project “Federated Plant Database Initiative for the Legumes,” award #1444806, and by the US Department of Agriculture, Agricultural Research Service, project 5030-21000-069-00D. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the US Department of Agriculture. USDA is an equal opportunity provider and employer.
Declaration of conflicting interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions
AY concieved and designed the research and analysis, and drafted the manuscript. SC and DFB supervised the analysis. All authors read, edited, and approved the manuscript.
