Abstract
Synonymous mutations do not change the amino acid but do change synonymous codon usage. In the genomes of different organisms, gene conversion is biased toward GC irrespective of mutation bias. This trend is especially pronounced in coding regions and is possibly caused by a preference for G/C-ending codons over A/T-ending ones. If G/C-ending codons are advantageous, then synonymous mutations that change A/T to G/C would be “optimal” compared to those in the opposite direction. In theory, one should observe signals of positive selection on these optimal synonymous mutations. The recently released single-nucleotide polymorphism (SNP) data from the 1001 genome project of
Introduction
Synonymous mutations are mutations in the coding sequence (CDS) that do not change the amino acids (AAs). However, this does not mean that synonymous mutations are free from natural selection. 1-3 Some synonymous mutations have been reported to affect messenger RNA (mRNA) splicing, 4 and the altered splicing patterns might in turn affect biological processes. 5 These splicing-related synonymous mutations are likely to be selected against. 4 Another well-known impact of synonymous mutations is their influence on synonymous codon usage. G/C-ending synonymous codons appear more frequently in the genome than expected under neutrality. 6 Several theories have been proposed to explain the preference for G/C-ending codons. One potential advantage of G/C-ending synonymous codons is that they are decoded at higher rates during mRNA translation elongation 7-10 because of putatively higher transfer RNA (tRNA) availability. Accordingly, G/C-ending synonymous codons usually have higher codon adaptation index (CAI) 11 or tRNA adaptation index (tAI) 12 values than their A/T-ending counterparts. The CAI describes the relative usage of synonymous codons in the genome, and the tAI incorporates the tRNA copy number of the corresponding codon. Faster translation might make G/C-ending codons more likely to be maintained by natural selection.
Synonymous mutations also have other potential impacts, such as changes in the thermodynamic stability of mRNA secondary structures (measured as the minimum free energy [MFE]) or in methylation contexts. RNA structure can affect many aspects of RNA biology, such as the binding efficiency of RNA-binding proteins (RBPs) and the movement of ribosomes along RNAs. If a synonymous mutation alters the structure of an RNA, the change could be deleterious, beneficial, or neutral and is consequently subject to natural selection.
Despite the established hypotheses explaining the putative advantage of G/C-ending codons, it would be interesting to directly verify the selection force acting on those synonymous mutations that change the codon usage patterns. If the G/C-ending codons are advantageous, then the synonymous mutations that change A/T to G/C would be “optimal” compared to the opposite ones. In theory, signals of positive selection should be observed on these “optimal” synonymous mutations.
I fully take advantage of the single-nucleotide polymorphism (SNP) data from 1135
Moreover, it is necessary to test whether the observed pattern in synonymous DAF is affected by population structure. Population structure reflects the ancestry of different groups within the population. If the pattern is genuine rather than a by-product of population structure, it should be observed across all representative subpopulations.
In this study, by analyzing the frequency spectrum of synonymous mutations, I conclude that synonymous mutations in natural populations are not strictly neutral: those that increase GC content (from A/T to G/C) tend to have higher DAF and are therefore likely to be positively selected. This pattern is not affected by population structure, as it is observed in all representative subpopulations. The results suggest that synonymous mutations should not be automatically ignored in evolutionary analyses.
Materials and Methods
Data collection
I downloaded the SNP data as well as the genome-wide
Inferring the DAF using an outgroup
I aligned the CDS sequences between
If the nucleotide in
If the nucleotide in
If the nucleotide in
According to these classification criteria, the direction of mutation is from ancestral to derived. However,
The allele counts used to calculate allele frequencies are provided in the SNP file. The VCF file contains the position and mutation-type columns as well as the “info” column. The “info” column includes annotation information (genic or intergenic, coding or noncoding, missense or synonymous) and allele counts (the reference and alternative allele counts from the reads mapped to this position).
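The polarization step can be illustrated with a minimal Python sketch. It assumes the common convention that the allele matching the outgroup is ancestral and that sites where the outgroup matches neither allele are discarded; the exact criteria used in this study are not reproduced here. The function name `polarize_snp` and its arguments are hypothetical, and the allele counts are assumed to have already been read from the “info” column.

```python
from typing import Optional, Tuple


def polarize_snp(
    ref: str, alt: str, outgroup: str, ref_count: int, alt_count: int
) -> Optional[Tuple[str, str, float]]:
    """Return (ancestral, derived, DAF), or None if the site cannot be polarized.

    Assumed convention (not taken from the paper): the allele shared with
    the outgroup is ancestral; the other allele is derived.
    """
    total = ref_count + alt_count
    if total == 0:
        # No informative counts at this site.
        return None
    if outgroup == ref:
        # Reference is ancestral, so the alternative allele is derived.
        return ref, alt, alt_count / total
    if outgroup == alt:
        # Alternative is ancestral, so the mutation direction is alt -> ref.
        return alt, ref, ref_count / total
    # Outgroup matches neither allele: ancestral state is ambiguous.
    return None
```

Under this convention, a site annotated as A-to-G in the SNP file is treated as a G-to-A mutation whenever the outgroup carries G, and its DAF is computed from the reference allele count.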
Classification of synonymous mutations
Synonymous mutations were classified into 3 categories according to whether they increase (from A/T to G/C), decrease (from G/C to A/T), or maintain (the remaining) the GC content. Among the 12 types of mutations, A-to-C, T-to-C, A-to-G, and T-to-G increase the GC content. C-to-A, C-to-T, G-to-A, and G-to-T decrease the GC content. C-to-G, G-to-C, A-to-T, and T-to-A do not affect the global GC content.
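The 3 categories can be written down as a short Python sketch. The function name `classify_gc_effect` and the category labels are illustrative choices, not identifiers from the study; the rule itself follows the definition above.

```python
STRONG = {"G", "C"}  # G/C bases
WEAK = {"A", "T"}    # A/T bases


def classify_gc_effect(ancestral: str, derived: str) -> str:
    """Classify a point mutation as 'increase', 'decrease', or 'maintain'
    according to its effect on GC content."""
    a, d = ancestral.upper(), derived.upper()
    valid = STRONG | WEAK
    if a not in valid or d not in valid or a == d:
        raise ValueError(f"not a point mutation between A/C/G/T: {ancestral}>{derived}")
    if a in WEAK and d in STRONG:
        return "increase"   # A/T to G/C
    if a in STRONG and d in WEAK:
        return "decrease"   # G/C to A/T
    return "maintain"       # C<->G or A<->T


# Enumerate all 12 mutation types and group them by category.
groups = {"increase": [], "decrease": [], "maintain": []}
for anc in "ACGT":
    for der in "ACGT":
        if anc != der:
            groups[classify_gc_effect(anc, der)].append(f"{anc}-to-{der}")
```

Each category contains 4 of the 12 mutation types, matching the lists given above.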
Regression analysis
The regression analysis was performed using “lm(
Codon adaptation index 11 and tAI 12 were defined by early studies as measures of codon bias that describe the synonymous codon preference and tRNA availability of each codon. My group and others have previously investigated the selection patterns on CAI and tAI, 16-18 and the calculations here were performed with the same pipeline. I folded the CDSs of
Statistical analysis
All statistical analyses and graphic work were conducted in the R environment (http://www.R-project.org/). When comparing 2 sets of mutations (eg, missense vs synonymous mutations), a globally higher DAF spectrum for synonymous mutations would indicate stronger purifying selection on missense mutations. In other words, synonymous mutations are “less harmful” (relatively more advantageous) than missense mutations. Likewise, when comparing different sets of synonymous mutations, the group with the higher DAF is likely to be advantageous and positively selected. DAF values can be compared with nonparametric tests such as the Wilcoxon rank-sum test.
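To make the comparison concrete, the following Python sketch implements a two-sided Wilcoxon rank-sum (Mann-Whitney) test using the normal approximation with a tie correction and no continuity correction. It is an illustration only: the study used R, p-values from a statistics package may differ slightly (especially for small samples), and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic for x, two-sided p-value by normal approximation)."""
    n1, n2 = len(x), len(y)
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    # Tie correction: sum of t^3 - t over groups of tied values.
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t**3 - t for t in counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0  # all values identical: no evidence of a shift
    z = (u1 - n1 * n2 / 2) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))
    return u1, p


# Made-up DAF values for two groups of mutations (not data from the study).
daf_gc_increasing = [0.6, 0.7, 0.8, 0.9]
daf_other = [0.1, 0.2, 0.3, 0.4]
u_stat, p_value = rank_sum_test(daf_gc_increasing, daf_other)
```

A U statistic close to the product of the two sample sizes means that nearly every value in the first group exceeds every value in the second, which is the pattern expected for a group with a globally higher DAF spectrum.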
Results and Discussion
Variations in the 1135 natural inbred lines of Arabidopsis thaliana
A total of 11 609 631 SNP sites and 1 271 972 indels are included in the VCF file that I downloaded (see Materials and Methods section). According to the annotation provided in the VCF file, 1 135 084 SNPs are missense (nonsynonymous), 795 623 SNPs are synonymous, and 27 813 SNPs are nonsense mutations (that introduce in-frame stop codons in the main CDS). Apart from the mutations in the coding region, 319 647 SNPs are located in 5′ UTRs and 465 647 SNPs are located in 3′ UTRs. The remaining variations are non-exonic, including intronic and intergenic variations.
Synonymous mutations that increase GC content have higher DAF
Synonymous mutations were classified into 3 categories (see Materials and Methods section). I first looked at the fractions of the 12 types of mutations (Figure 1A). The most prevalent mutation types are C-to-T and G-to-A. This is reasonable because transitions take place more frequently than transversions. Next, I defined the DAF using the information provided in the SNP files and an outgroup (see Materials and Methods section). I found that synonymous mutations that increase the GC content (from A/T to G/C) have significantly higher DAF than other synonymous mutations (Figure 1B, Wilcoxon rank-sum tests). This indicates that these “optimal” mutations are likely to be positively selected. To separate the effect of mutation from that of selection, I tested the Spearman correlation between the fractions shown in Figure 1A and the median DAF of each of the 12 mutation types (the horizontal line in each box of Figure 1B). No correlation was observed between the two (Figure 1C), suggesting that the observed DAF pattern might not be caused by mutation bias.

Landscape of synonymous mutations among the 1135 lines. The mutations that increase the GC content (from A/T to G/C) are labeled in orange. The mutations that decrease the GC content (from G/C to A/T) are labeled in purple. (A) Fractions (
Relative contribution of different features to the frequency spectrum
To better decipher the relative contribution of different features to the observed allele frequency spectrum of synonymous mutations, I performed a multiple regression analysis:

Regression analysis of relative contribution of different features to the DAF spectrum. (A) Scatterplot of CG content of host genes (
In the multiple regression analysis, the regression coefficients show that CAI, tAI, and the change in GC content contribute considerably more to the frequency spectrum than the other features (Figure 2C). This indicates a role of GC content in determining the frequency spectrum. It can also be inferred from this result that the selection patterns on GC content itself might be related to the CAI and tAI parameters. Note that the contribution of MFE to DAF is negative, suggesting that synonymous mutations in genes with lower MFE (stronger structure) tend to have higher DAF. One speculation is that the “more structured” genes have already experienced many structural changes during evolution, so that newly emerged structural changes in them tend to be less harmful. At this stage, this observation is correlative, and the detailed reason remains unexplored. The clearest result is that an increase in GC content contributes positively to the DAF spectrum.
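The idea of comparing regression coefficients across features can be illustrated with a small Python sketch. It fits an ordinary least-squares model on z-scored variables so that the coefficients are on a common scale; whether the original analysis standardized its variables is an assumption made here, and the study itself used R's lm. The function names and the predictor names in the example are hypothetical.

```python
from typing import Dict, List, Sequence


def zscore(col: Sequence[float]) -> List[float]:
    """Center a variable and scale it to unit sample standard deviation."""
    n = len(col)
    mean = sum(col) / n
    sd = (sum((v - mean) ** 2 for v in col) / (n - 1)) ** 0.5
    if sd == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mean) / sd for v in col]


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def standardized_coefficients(
    predictors: Dict[str, Sequence[float]], response: Sequence[float]
) -> Dict[str, float]:
    """OLS on z-scored variables; no intercept is needed because all means are zero."""
    names = list(predictors)
    zx = [zscore(predictors[name]) for name in names]
    zy = zscore(response)
    k = len(names)
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(p * q for p, q in zip(zx[i], zx[j])) for j in range(k)] for i in range(k)]
    xty = [sum(p * q for p, q in zip(zx[i], zy)) for i in range(k)]
    return dict(zip(names, solve_linear_system(xtx, xty)))


# Toy data (not from the study): the response depends strongly on feature_a
# and weakly on feature_b.
feature_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
feature_b = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
toy_response = [3.0 * a + 0.5 * b for a, b in zip(feature_a, feature_b)]
betas = standardized_coefficients(
    {"feature_a": feature_a, "feature_b": feature_b}, toy_response
)
```

On standardized variables, the sign of a coefficient gives the direction of the association and its absolute value gives a rough sense of relative contribution, which is how a negative coefficient for MFE and larger coefficients for CAI, tAI, and the GC change would be read.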
No correlation is observed between Fst and the prevalence of optimal synonymous mutations
It is necessary to discuss and test whether the observed pattern on synonymous DAF is affected (or caused) by population structures.

No correlation is observed between
With the same window size (10 000 bp) of
The biased DAF spectrum is observed in all representative subpopulations
To further test the robustness of the observed DAF spectrum of different synonymous mutations, I set out to test this pattern in different subpopulations. The admixture groups defined by the original work 13 were retrieved (Figure 4A). I chose 3 representative subpopulations with adequate accessions with alternative alleles (Figure 4A and B). To correct the direction of mutations by using the outgroup

The DAF spectrum is consistent across different subpopulations. (A) The map labeled with sample collection information (from Figure 5A of Alonso-Blanco). 13 (B) DAF spectrum of the 12 mutation types in 3 representative subpopulations. The name of the subpopulation is colored with the same color shown in panel A. DAF indicates derived allele frequency.
As mentioned in the Introduction section, the advantage of these optimal synonymous mutations (from A/T to G/C) is likely conferred by faster translation during tRNA decoding. This proposed advantage might act as the selective force shaping the DAF spectrum of synonymous mutations. Nevertheless, the evolutionary analyses in this study did not include any functional tests. The observed differences in DAF are consistent with, but do not demonstrate, an advantage of the optimal synonymous mutations. At this stage, the link between the biased DAF and function remains to be explored.
Conclusions
By analyzing the frequency spectrum of synonymous mutations, I conclude that synonymous mutations in natural populations are not strictly neutral: those that increase GC content (from A/T to G/C) tend to have higher DAF and are therefore likely to be positively selected.
Footnotes
Acknowledgements
I thank all members of my lab for their constructive suggestions on this project.
Funding:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was financially supported by the National Natural Science Foundation of China (Grant no. 31770213). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Declaration of conflicting interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions
LW designed and supervised this research. LW analyzed the data and wrote this article.
Data Accessibility
All data used in this study are described in the Materials and Methods section and are freely accessible.
