Abstract
Gene duplication has been proposed to serve as the engine of evolutionary innovation. It is well recognized that eukaryotic genomes contain a large number of duplicated genes that evolve new functions or expression patterns. However, in mollusks, the evolutionary mechanisms underlying the divergence and the functional maintenance of duplicate genes remain little understood. In the present study, we performed a comprehensive analysis of duplicate genes in the protein kinase superfamily using whole genome and transcriptome data for the Pacific oyster. A total of 64 duplicated gene pairs were identified based on a phylogenetic approach and the reciprocal best BLAST method. By analyzing gene expression from RNA-seq data from 69 different developmental and stimuli-induced conditions (nine tissues, 38 developmental stages, eight dry treatments, seven heat treatments, and seven salty treatments), we found that expression patterns were significantly correlated for a number of duplicate gene pairs, suggesting the conservation of regulatory mechanisms following divergence. Our analysis also identified a subset of duplicate gene pairs with very high expression divergence, indicating that these gene pairs may have been subjected to transcriptional subfunctionalization or neofunctionalization after the initial duplication events. Further analysis revealed a significant correlation between expression and sequence divergence (as revealed by synonymous or nonsynonymous substitution rates) under certain conditions. Taken together, these results provide evidence for duplicate gene sequence and expression divergence in the Pacific oyster, accompanying its adaptation to harsh environments. Our results provide new insights into the evolution of duplicate genes and their expression levels in the Pacific oyster.
Introduction
Gene duplication plays key roles in organismal evolution.1,2 Duplicate genes initially have identical sequences but diverge in regulatory and coding regions during subsequent evolution. Divergence in regulatory regions could result in changes in expression levels, whereas changes in coding regions may lead to the acquisition of new functions.3–5 The rapid development of next-generation sequencing technology in the past decade provides unique opportunities to study the general pattern from the whole genome and expression level. Indeed, several studies have attempted to characterize the correlation between expression patterns and genomic divergence for duplicate genes in human being, yeast, Arabidopsis, and cow.5–9 Therefore, a general picture of the patterns of expression divergence in the evolution of duplicated genes is emerging. For instance, a positive correlation between synonymous sequence divergence and expression divergence was reported for human and yeast duplicate genes. 10 Likewise, the analysis of bovine duplicate genes also revealed that expression changes were correlated with sequence divergence. 11 On the other hand, unambiguous evidence of weak correlation between synonymous sequence divergence and expression divergence has been found in a case study of Arabidopsis duplicate genes. 12 Owing to these inconsistent findings within this limited number of species, there is still a need to characterize the relationship between expression and sequence divergence, especially in the nonmodel organism.
The Pacific oyster Crassostrea gigas is a representative species of phylum Mollusca, belonging to a large taxonomic group of protostomes and the group of marine animals with the largest number of identified species. Despite the species richness of this phylum, the genomes of Mollusca have only recently been examined. The whole-genome sequence and various developmental and stress-response RNA-seq transcriptomes for Pacific oyster were released recently,13,14 rendering this species more suitable for the study of the evolution of gene duplication. Thus, several studies have discussed the structural and expression divergence of some rapidly expanding immune gene families in this species.15,16 However, the patterns of divergence of duplicate genes with roles outside of immune function have been largely ignored.
In the present study, we selected a set of genes with broad importance in cell signaling, the protein kinase superfamily, to analyze the evolutionary pattern between sequence and gene expression for duplicate genes. The protein kinase superfamily is one of the largest gene families in eukaryotes, comprising 2%–4% of all genes in human being and in several model species. 17 The protein kinases are well known for regulating the majority of cellular pathways, especially those involved in signal transduction.18,19 Thus, studying the evolutionary history of protein kinases provides a window to the evolution of many organism's signaling pathways. Therefore, we undertook the present study to identify duplicated protein kinase genes in oyster and to characterize the pattern of divergence between sequence and expression. A total of 64 putative duplicate gene pairs were identified from 320 protein kinase family members. We first show that these duplicate genes have experienced stronger selective constraints. We then find unequal distributions of correlation coefficients between duplicate genes for expression patterns under each of the five different developmental and stress-induced conditions. Finally, we investigated the relationship between sequence divergence and expression divergence and found that positive correlation exists between sequence divergence and expression divergence in each of the four conditions, suggesting that sequence divergence may generally explain the expression divergence under those conditions.
Results and Discussion
Identification of duplicate genes from the protein kinase family
Protein kinases represent one of the largest gene families in eukaryotes, and an enormous number of members have been reported in the model species. For example, there are 516, 238, and 425 kinase genes in human being, fruit fly, and nematode, respectively. 19 In the present study, a total of 320 protein kinase family members are identified from the genome of Pacific oyster, which account for about 1.1% of all predicted genes.
Based on rigorous phylogenetic and reciprocal BLAST analyses, a total of 64 pairs are identified as putative duplicated gene pairs (Fig. 1 and Table 1). Those 128 genes are located on 111 scaffolds, indicating the scattered and wide distribution on the whole genome. Nine of the pairs are present on the same scaffold instead of being present on two different scaffolds, indicating a pattern of tandem duplication. However, we note that the frequency of tandem duplication in protein kinase family may be underestimated because of incomplete genome assembly and annotations.
Phylogenetic relationship of protein kinases from Pacific oyster. NJ topology was represented and bootstrap values were shown for the clades with more than 50% support. The scale bar indicates the number of amino acid substitutions per site. The genes with red circles represent the identified duplicate paralogs. Identified duplicate protein kinase gene pairs and related information.
Comparisons of duplicate gene pairs showed that 13 pairs (20.3%) have equal exon numbers, and 14 pairs (21.9%) have almost identical exon numbers (their exon number difference ≤2) (Table 1). In the remaining 37 pairs (57.8%), the exon–intron structures are divergent between duplicate paralogs, indicating that exon–intron structural divergence is a common occurrence in oyster protein kinase genes. By analyzing 612 pairs of sibling paralogs from Arabidopsis and rice, Xu et al. 7 demonstrated that exon–intron structural variation is prevalent in three gene families and the proposed three mechanisms, including exon/intron gain/loss, exonization/pseudoexonization, and insertion/deletion, might contribute to the formation of structural changes. Their findings suggest that such structural divergences have played a vital role during the evolution of duplicate genes, and our findings are consistent with these studies.
Sequence divergence between duplicate gene pairs
To estimate the sequence divergence between duplicate gene paralogs, we calculate the synonymous (Ks) and nonsynonymous (Ka) substitution rates of coding sequences for each gene pair. The synonymous substitution rate Ks can be recognized as a proxy of divergence time between duplicated genes. The distribution of Ks has two major peaks around 0.2 and 4.1 in the density plot (Fig. 2A), indicating that those gene pairs originated at two major different stages and differed by evolutionary time. In addition, more than a half (61%) of duplicate pairs have Ks larger than 4, indicating highly diverged sequences and relatively long evolutionary time. In contrast, 25% of duplicate pairs have Ks less than 1, representing recently duplicated genes and relatively little sequence divergence.
The sequence divergence between duplicate pairs. (A) The density distribution of synonymous rate (Ks) for all duplicate gene pairs. (B) The comparisons of KaIKs and Ks values, where Ks is a proxy of divergence time between duplicated genes.
The ω (Ka/Ks) values reflect selection pressure during evolution. For all studied pairs, the ω values were lower than 1, suggesting that those pairs were all evolving under purifying selection with putative functional constraints (Fig. 2B). Moreover, in recent duplicated pairs, there are some gene pairs with ω values higher than 0.4, indicating that the evolutionary constraint might be relaxed in some degree. Those genes subjected to relaxed purifying selection may tend to accumulate more mutations, altering gene structure and expression. Intriguingly, in our recent study of the oyster TNF superfamily, we also found that recently originated duplicate genes were under purifying selection. 15
Expression patterns of duplicate gene pairs
In order to characterize the expression divergence for all gene pairs, the RNA-seq data collected from 69 developmental and stress-induced RNA-seq datasets have been analyzed (expression values and detailed information are given in Supplementary File 3). The Pearson's correlation coefficient r was calculated to quantify correlation between duplicate genes at the level of expression.
We also identified two main clades based on expression data from the RNA-seq data, which displayed opposite patterns. Clade 1 (upper in Fig. 3A), which include over two-thirds of the pairs, exhibited large positive r values under the majority of expression conditions, suggesting consistent expression patterns and similar functionality within each pair after their duplication. In contrast, Clade 2 (lower in Fig. 3A) showed a majority of negative r values, indicating that the paralogs in each pairs had divergent expression under most conditions.
Expression patterns of duplicate gene pairs. (A) The heatmap was performed using Pearson's correlation coefficient of gene expression under five conditions, and red to blue blocks indicate high-to-low correlation levels. (B) The boxplot represents the distribution of correlation coefficient values for each expression condition.
For each condition of developmental and stress-induced transcriptomes, the median value of Pearson's r was positive. This indicated that most pairs have correlated expression patterns (Fig. 3B), suggesting that the genes in these pairs evolved under some functional constraint. Nonetheless, there is still a proportion of gene pairs also exhibiting negative r values, suggesting expression divergence within those gene pairs (Fig. 3B). Compared with the other four conditions of transcriptomes, the distribution of Pearson's r value under the heat condition was slightly lower, with a considerable proportion of negative values. These results suggest that those duplicated genes may have gained novel functions via subfunctionalization and/or neofunctionalization after their duplication. We hypothesize that these protein kinase genes have evolved to adapt to various stress environments or specialized developmental roles via expression divergence in oyster.
Positively correlated sequence divergence and expression divergence
The relationship between sequence divergence and expression divergence was investigated in all the five conditions of developmental and transcriptional transcriptomes (Fig. 4). We found significant negative correlation between transformed r’ and Ka (or Ks) in four of the five conditions (P < 0.05, Fig. 4B–E). Interestingly, there is a significant positive correlation between expression divergence and sequence divergence among duplicate pairs under those conditions. However, for the set of transcriptomes comparing relative tissue abundance, the correlation was not statistically significant (P = 0.201 for Ka and P = 0.436 for Ks, Fig. 4A), suggesting less correlation between expression divergence and genome divergence. This pattern is mostly consistent with previous studies in yeast, human being, and cow.8,9,11
The relationship between the correlation coefficient (R) of gene expression and Ka (or Ks) in duplicate genes. (
Gene duplications are widely present in eukaryotic genomes, providing increased opportunities for nonreciprocal recombination and allowing redundant genes to evolve new functions. However, the fate of duplicate genes is a widely discussed topic of genome evolution. Recently, the subfunctionalization and neofunctionalization models have been invoked to explain the retention of duplicate genes.2,20 In this study, we found that most gene pairs exhibited consistent expression patterns and underwent purifying selection. Sequence and expression divergence were positively correlated under four conditions, consistent with the hypothesis of sequence divergence driving expression divergence. For the transcriptomes of tissue expression levels, we observed non-significant correlation between expression divergence and genome divergence, possibly because of genome divergence in noncoding regions not being reflective of the pattern seen in coding regions (as exemplified by Ka and Ks), a complicated expression divergence pattern, or the limited sample size used in our analysis. Therefore, we hypothesize that the functional constraints of protein kinase genes may contribute to the evolution of the duplicate paralogs in oyster. To sum up, our results provide insights into duplicate gene sequence and expression divergence in the Pacific oyster and may help to elucidate its adaptation to different environments and development processes. Our results may also help to understand the mechanisms for the retention of duplicate genes in other gene families in Pacific oyster.
Methods
Identification of duplicate gene pairs
The Pacific oyster genome sequences were downloaded from OysterDB (http://oysterdb.cn/home.html). The hidden Markov model (HMM) method was carried out to retrieve sequences containing a protein kinase domain (PF00069). The presence of a protein kinase domain (PF00229) was validated by SMART (http://smart.embl-heidelberg.de/) and Pfam (http://pfam.sanger.ac.uk/), and a total of 320 protein kinase sequences were identified from the Pacific oyster genome (see Supplementary File 1 for details). Phylogenetic analyses and reciprocal BLAST were applied to identify the duplicate gene pairs and relationships. First, phylogenetic analysis was carried out to identify the sequence pairs with close evolutionary relationships. We used the HMMalign program 21 to generate sequence alignments for protein kinase domain regions (provided as Supplementary File 2). The phylogenetic tree was reconstructed using the neighbor-joining (NJ) method from the MEGA 5.01 program 22 and a total of 75 paralog pairs (Fig. 1) were identified. Second, the sibling paralog relationship for each pair was further confirmed by reciprocal best BLAST hits analysis. The BLASTP program was used to compare each protein in identified pairs against all other proteins with the E-value cutoff of 1e-5. As a result, a total of 64 duplicate gene pairs were identified (Table 1), and their genome location and annotation were parsed from GFF files by a custom Python script.
Sequence alignment and divergence analysis
Protein sequences for each duplicated gene pair were first aligned by ClustalW. 23 Then, the PAL2NAL program was used to generate the codon alignment. 24 The synonymous (Ks) and nonsynonymous (Ka) substitution rates of coding sequences for duplicate pairs were calculated by the KaKs_Calculator 25 using the modified Yang-Nielsen algorithm (MYN). 26
Expression profile analysis
The available expression values for C. gigas protein kinase genes under five varied condition and 69 RNA-seq transcriptome datasets (nine tissues, 38 developmental stages, eight dry treatments, seven heat treatments, and seven salty treatments) were obtained from the transcriptome dataset of oyster genome project (http://oysterdb.cn/home.html, Zhang et al.
13
). The Reads per Kilobase of exon Model per million mapped reads (RPKM) values were calculated to indicate the expression levels of each gene. The pairwise expression patterns for each gene pair under different conditions were visualized through heatmap in R version 2.13. Pearson's correlation coefficient r of expression level was calculated to measure the correlation among the duplicate gene pairs at the level of expression. Following previous studies, the Pearson's correlation coefficient r was further transformed to r’ using equation
Author Contributions
Conceived and designed the experiments: DG, LW. Analyzed the data: DG, LW. Wrote the first draft of the manuscript: DG, DK, GY, LW. Contributed to the writing of the manuscript: DG, DK, XT, LW. Agreed with manuscript results and conclusions: DG, DK, XT, GY, LW. All the authors reviewed and approved the final manuscript.
Supplementary Material
Supplementary File 1
Full protein coding sequences of identified Pacific oyster protein kinase proteins genes (FASTA format).
Supplementary File 2
Amino acid alignment of identified Pacific oyster protein kinase proteins sequences (FASTA format).
Supplementary File 3
RPKM values for identified Pacific oyster protein kinase genes from five different developmental and stimuli-induced datasets (xlsx format).
Footnotes
Acknowledgments
We thank Dr. Benjamin Redelings for helpful suggestions and language editing. We also thank three anonymous reviewers for their valuable comments.
