Abstract
The massive worldwide spread of the SARS-CoV-2 virus is fueling the COVID-19 pandemic. Since the first whole-genome sequence was published in January 2020, a growing database of tens of thousands of viral genomes has been constructed. This offers opportunities to study pathways of molecular change in the expanding viral population that can help identify molecular culprits of virulence and virus spread. Here we investigate the genomic accumulation of mutations at various time points of the early pandemic to identify changes in mutationally highly active genomic regions that are occurring worldwide. We used the Wuhan NC_045512.2 sequence as a reference and sampled 15 342 indexed sequences from GISAID, translating them into proteins and grouping them by month of deposition. The per-position amino acid frequencies and Shannon entropies of the coding sequences were calculated for each month, and a map of intrinsic disorder regions and binding sites was generated. The analysis revealed dominant variants, most of which were located in loop regions and on the surface of the proteins. Mutation entropy decreased between March and April of 2020 after steady increases at several sites, including the D614G mutation site of the spike (S) protein that was previously found associated with higher case fatality rates and at sites of the NSP12 polymerase and the NSP13 helicase proteins. Notable expanding mutations include R203K and G204R of the nucleocapsid (N) protein inter-domain linker region and G251V of the viroporin encoded by ORF3a between March and April. The regions spanning these mutations exhibited significant intrinsic disorder, which was enhanced and decreased by the N-protein and viroporin 3a protein mutations, respectively. These results predict an ongoing mutational shift from the spike and replication complex to other regions, especially to encoded molecules known to represent major β-interferon antagonists. The study provides valuable information for therapeutics and vaccine design, as well as insight into mutation tendencies that could facilitate preventive control.
Introduction
The first case of ‘coronavirus disease 2019’ (COVID-19) was identified in the Chinese city of Wuhan in December 2019. Since then, the novel virus has rapidly spread to 188 countries and territories, infecting more than 36 million people and causing over one million deaths.1,2 COVID-19 patients develop a ‘severe acute respiratory syndrome’ analogous to that of the 2002 to 2003 SARS epidemic that spread to 23 countries, infected −8000, and killed 774 people. The COVID-19 virus was named SARS-CoV-2 by the WHO, and is the seventh coronavirus known to infect humans.2,3 Currently, there are no vaccines or antiviral drugs capable of preventing or treating human infections. 4
SARS-CoV-2 belongs to the
SARS-CoV-2 has a −30 kb genome, which was first reported on January 5, 2020. 8 The genome encodes both structural and non-structural proteins. The leader sequence and ORF1ab encode non-structural proteins (NSPs) functioning in replication and transcription. 11 Together with accessory proteins, the structural proteins are encoded in the downstream regions of the genome. They include the spike (S) protein of the viral ‘corona’, the envelope (E) protein, the membrane (M) protein, and the RNA-binding nucleocapsid (N) protein. Coronaviruses infect human cells by using the homotrimeric spike glycoprotein, known as S-protein, to bind the angiotensin-converting enzyme 2 (ACE2) receptor, which is located in the epithelia of the lung and small intestine of humans.12,13 SARS-CoV-2 has an optimized receptor-binding domain (RBD) that binds with high affinity to ACE2 in human and animals. 14 High receptor homology was supported by a series of structural and biochemical studies.14-19 When compared with SARS-CoV, S-protein amino acids at positions 455 (leucine) and 486 (phenylalanine) on SARS-CoV-2 showed an enhanced interaction with hot spot 31 and viral binding to human ACE2. 16 These studies tested the origin of the virus, challenging the suggestion that SARS-CoV-2 represents an artificially designed manipulation. Instead, it likely arose as a novel recombinant virus transmitted from both horseshoe bats and pangolins. 20 Thus, transmission to and among humans appears a result of natural selection.14-16,21
Viral genome sequence data has been collected in real-time from COVID-19 patients at significant pace. By May 7, 2020, the GISAID database (https://www.gisaid.org/) gathered 15 366 full sequences of human coronavirus. This data provided opportunities to explore when, where and how mutations happen within the SARS-CoV-2 genome. The majority of mutational studies thus far have mainly focused on the S-protein. These studies revealed that the RBD is the most variable region, with several RBD amino acids showing critical ACE2 receptor binding functions.8,14-16,22 S-protein mutations adjusted the binding efficiently to its human receptor. In contrast, mutations in other genomic regions have been rarely studied. Besides the notable S-protein, mutations on the N-protein that makes the nucleocapsid could alter virulence and virus spread. The N-protein (50 kDa) is the most abundant in both viruses and virus-infected cells, playing multiple roles in the replication and transcription of the virus, as well as in the assembly of the viral genome. 23 The N-protein binds to the viral RNA genome at its N-terminal end, forming a ribonucleoprotein complex that plays essential roles in maintaining a functional RNA conformation. 24 Due to it being highly immunogenic and possessing a normally conserved amino acid sequence, the SARS-Cov-2 N-protein is an optimal target for both diagnostic assays and vaccine formulations.23,25
Here we study pathways of molecular change in the genomes of the expanding SARS-CoV-2 population that can help identify molecular culprits of virulence and virus spread. Using the mutational entropy of nucleotide and amino acid sequences as a measure of molecular diversity over time, we study the evolutionary trajectory of crucial viral proteins such as the S- and N-proteins and several NSP and accessory proteins as we trace mutational changes in the sequences of 12 606 SARS-CoV-2 genomes. Remarkably, we find that while virulence-associated mutations in the S-protein are becoming fixed in time, nucleotide changes in the N-protein and the viroporin protein 3a that we find are associated with protein intrinsic disorder are establishing themselves as more prominent. Here we explore their significance.
Materials and Methods
Data
The SARS-CoV-2 reference sequence, accession NC_045512.2 (version March 30, 2020; previously ‘Wuhan seafood market pneumonia virus’) was obtained from the NCBI Virus repository on May 4, 2020. 8 A total of 15 366 sequences were acquired on May 7, 2020 from GISAID 26 and its initiative’s platform EpiCoV (see Supplemental Table S1 for sequence information). Metadata was acquired from NextStrain. 27 The sequence data and metadata were mapped using the GISAID EPI_ISL number as the primary key. The metadata revealed 24 non-human host sequences, which were removed. Any primary keys not appearing in both the metadata and sequences were also removed.
The remaining 15 342 sequences were then aligned with the reference sequence, removing any gaps caused by the initial inclusion of non-human host sequences. The head (<g.266) and tail (>g.29 684) sections were removed, as these sites lie outside all known coding regions and are highly variable in composition and length across sequences, providing little salient information for protein analysis. The sequences were then split into coding region sequences (CDS) corresponding to NSPs, structural and accessory protein regions, as listed by the reference sequence GenBank document. These sub-sequences were translated into proteins using BioPython,28,29 requiring replacement of all gap characters in the nucleotide sequence with ‘N’ characters, and for additional ‘N’ characters to be added to the end if the end contained a split codon. The protein sequences were also categorized by month of collection, using the NextStrain metadata. The complete data workflow is summarized in Figure 1. The blue sections are the main workflow steps, while the annotations in gray show details or notes about each section.

General data workflow of the analysis of SARS-CoV-2 genomes, including a breakdown of details that occur during each step. Main steps are indicated in blue, while details per step are indicated in gray.
Analysis
The per-position amino acid frequencies and Shannon entropies were calculated across the CDS for each sampled month of the initial period of the pandemic (December through April). The number of genomes sampled (% total) increased in time, 18 genomes (0.1%) in December, 390 in January (2.5%), 612 in February (4.0%), 11 407 in March (74.4%) and 2915 in April (19.0%). While the sample count per month varied widely, the difference should not affect the comparisons of per-location proportionality. To confirm the March dataset was not introducing noise due to the disproportionate size, the original entropy calculation as well as the entropies of 10 randomly sampled March subsets of 2915 sequences (April’s sequence count) were compared using cosine similarity. The mean similarity between the original March entropy and the sampled entropy arrays was 0.997 (min: 0.9964, max: 0.9973). Likewise, 10 entropic samples of 612 March sequences (February’s count, 5.4% compared to March) had a mean similarity of 0.9851 (min: 0.9825, max: 0.9875).
Characters not included in the IUPAC standard protein alphabet (i.e. B, Z, J, U, O, and X) were ignored. This was justified through reasoning that (a) the amount of non-standard characters per sequence-set was negligible, and (b) the independent nature of the entropy of each location removes the chance of a cascading error. This exclusion also permitted the use of all obtained sequences, as sequences with high proportions of ambiguous characters would only contribute information if a valid amino acid were present at the analyzed location.
Calculation of pairwise distance metrics used a subset of distinct sequences per protein, allowing computation in reasonable time while maintaining a global view of the dataset, as all identical sequences will share pairwise distances. Sequences containing more than 5% ambiguous characters per protein were removed to further reduce the set and remove excess noise. Identical sequences were discovered through comparing hashes (SHA256) of the remaining sequence strings, producing a subset of distinct representative sequences. Pairwise distance calculations were performed using a BLOSUM80 (Block Substitution Matrix 80%) model.
30
This model was chosen based on the similarity between sequences was expected to be ⩾80% as the set was a pre-made multiple sequence alignment containing
Downstream analyses included performing exploratory data analysis with principal component analysis (PCA), tracing mutations in crystallographic models, exploring if they fell in conserved or variable regions of the molecules, and analyzing of intrinsic disorder and binding potential in the proteins of the entire proteome. Dimensionality reduction was conducted using classical PCA on the distance matrices, capturing the landscape of observed distinct sequence variants. Mutations were traced onto published crystallographic models, or in their absence, i-Tasser
Results
Entropic variation
Informational entropy describes the amount of variation in discrete per-location nucleotide or amino acid composition data. Here we study the evolutionary diversification of SARS-CoV-2 with an entropy-based strategy that quantifies diversity and selection in viral populations with 2 independent state variables. 38 First, we focus on mutational entropy as a measure of molecular biodiversity of the SARS-CoV-2 proteome. Mutational entropy is maximal when all amino acids in an amino acid site of the viral population are equally represented. Entropy is zero when only one amino acid overtakes that site. Increases in entropy imply dilution of mutations in the viral population while decreases signal fixation of those mutations. Second, we then consider a relative measure of entropy, ‘relative entropy delta’, which measures selection pressure by comparing viral diversity at 2 different time points of the pandemic and identifying entropy reversal trends suggestive of fitness advantage unfolding in time-space or entropy expansions suggesting other evolutionary forces are at play.
An initial analysis of mutational entropy at nucleotide level (Figure 2a) led to an evaluation of entropy at amino acid level (Figure 2b). Analyses identified several genomic sites of high protein entropy during at least one of the initial months of the pandemic. These were located for example in ORF1a, ORF1b, N, M, S and some accessory proteins (Figure 2b). Only 13 out of −29 proteins of the viral proteome had mutations of significance (Figure 2c). Table 1 lists 27 distinct residues that feature an entropy greater than 0.1 bits, significant in the month of March. In all cases, the incidence of the reference amino acid (ranging 63.6-98.7%) did not increase/decrease more than 18.6%. However, 2 contrasting pathways of mutational change were evident in these sites, one in which entropy usually increased and then decreased and the other in which entropy increased constantly (Figure 3):
(1)
(2)

Analysis of mutational entropy at nucleotide (a) and amino acid (b) levels defines an evolving SARS-CoV-2 proteome of 13 proteins with significant mutational change (c). Amino acid locations are only labeled for sites with mutational entropic levels above 0.1 bits in the month of March. Molecules that exhibit significant entropic levels have their atomic 3-dimensional models unshaded in panel C.
Amino acid sites with major entropy changes between the months of March and April.
Site numbers that are underlined indicate locations where an amino acid variant is currently dominant but differs from the RefSeq entry. Site numbers in bold indicate locations where a variant is currently non-dominant but is ascendant by >1% between March and April. Relative entropic delta values in bold indicate entropic decreases.

Pathways of mutational change involve mutational entropy reversals and expansions. Entropy reversals occur when entropy increases and then decreases in the timeline of pandemic. Entropy expansions occur when there is only a pattern of increase, which signals continued diversification of amino acid sequences.
Geospatial analysis
We explored how the 27 entropically-significant mutations distributed throughout regions of the world to determine if entropy reversals and expansions were operating globally or locally at continental level (Figure 4). NSP variants originated in Europe, Oceania and North America during January and February and distributed broadly. The P153S mutation of NSP3 was a second variant that originated in April in Asia. With the exception of the P13L mutation of the N-protein, all variants of structural proteins originated preponderantly in Europe during the months of January and February but expanded globally to other continents in subsequent months. The spread of the D614G mutation in the S-protein tightly followed that of the P323L mutation of NSP12, confirming their haplotypic relationship. With 2 exceptions in ORF3a and ORF8, variants of accessory proteins appeared early during January and February and then spread throughout continents.

Distribution of the 27 entropically-significant mutations throughout regions of the world along the initial timeline of the SARS-CoV-2 pandemic. Regions included South America (SA), Oceania (O), North America (NA), Europe (E), Asia (A), and Africa (AF). The proportion of amino acid variants were plotted for each month. Blue, orange, and green bars depict initial variants for non-structural, structural, and accessory proteins, respectively.
Principal component analysis
To explore how pathways of mutational change are affecting the viral quasispecies along the timeline of the early pandemic, we performed conventional PCA analysis of genomic samples using the BioPython28,29 Cluster package, identifying mutation examples of main entropic reversals and expansions (Figure 5). Two distinct elongated clouds described the proteomic make up of evolving viruses, one depicting the departure from the sequence makeup of the reference Wuhan strain (star symbol located on the leftmost part of the cloud) that signals the start of the pandemic and another matching variants in site 614 of the S-protein or sites 203 and 204 of the N-protein. Clouds contained a patchwork of genomes collected throughout the different months of the pandemic, which diversified in sequence along the first component. This is consistent with entropic expansion caused by diversification and drift. However, the temporal incidence throughout the elongated clouds seem to contract toward the location of the reference strain, likely signaling mutational pathways of active fixation are at play. As expected, the genomes harboring the S-protein variant were more numerous than those that preserved the original mutation, which is consistent with the rapid expansion of the D614G mutation and its associated haplotype. In turn, genomes harboring the original N-protein mutations were more abundant, given the slow but steady spread of the variants in the new mutational pathway. We note that the removal of duplicate sequences decreases the variance caused by sequence frequency, causing multidimensional methods to over-represent rare and under-represent common sequences. These are only undesirable if frequency is a variable of interest. As frequency and proportion were already accounted for in the entropy analysis, PCA describes the differences in amino composition, not the likelihood of any given sequence. The inclusion of duplicate sequences would act as noise, skewing the results for the projected dimensions. Regardless of these caveats and justifications, PCA should be merely regarded as a descriptive tool, which satisfactorily performed the role for analysis of the contribution of sequence variances.

Principal component analysis of SARS-CoV-2 genomes with original and variant sites in the S- and N-proteins.
Structural analysis
To make inferences of possible molecular functions affected by the mutants, we traced mutations onto crystal and cryo-EM structures of SARS-CoV-2 proteins and when not available, viral proteins modeled with I-Tasser. 32 Supplemental Figure S1 describes results. We found that most of the 27 mutation sites were located in loop regions (81%) that were generally on the surface of the molecules (89%). Out of 27 sites, 22 were in loop (also known as turn) regions and 5 in helical regions. No sites were in strand structures. A total of 24 sites were located on the surface of the molecules suggesting an important role in intermolecular interactions. Only 2 sites were buried (the 203 and 204 mutants of the N-protein) and one faced the pore of accessory viroporin 3a protein. A total of 18 were in ordered regions of the molecules while 7 were in disordered regions (in the N, NSP3 and 3a proteins). Intrinsic disorder and binding propensity scores confirm the disorder of these regions. Sites in NSP4 and protein 3a were located in trans-membrane (TM) regions, consistent with the close association with membranes of these viral proteins.
Intrinsic Disorder
We performed a global analysis of intrinsic disorder with IUPred2A of the SARS-CoV-1 proteome (Figures 6–8; Supplemental Figure S1). It revealed that while intrinsic disorder was variable, the entire protein repertoire was ‘highly structured’. The only notable exceptions were the ‘highly unstructured’ N-protein, the disordered hypervariable (HVR) domain in NSP3, the largest encoded protein of the virus, and disordered terminal regions of viroporins. An analysis of protein intrinsic disorder showed areas exhibiting high disorder scores within the N-protein, indicating significant levels of intrinsic disorder is present in the nucleocapsid (Figure 8). Remarkably, sites 203 and 204 of the N-protein that showed increasing entropy and their entropically variable neighborhoods aligned with areas of high disorder. In contrast, sites that experienced no entropic variation (high-conservation) tended to have low disorder. These sites also had high binding site score following an analysis with Anchor2. The conservation of binding sites seems intuitive due to the important functional nature of those regions. In fact, binding site scores peaked around sites 20, 220 and 400 in intrinsic disordered regions and in sites 50, 170, 275 and 355, flanking the 2 RNA-binding domains (Figure 7). Variable areas of intrinsic disorder raise several possibilities, ranging from being truly disordered functionally unimportant regions to an area experiencing significant functional change.

Major SARS-CoV-2 protein molecules experiencing mutational entropic reversals. (A) The coronavirus spike is a trimer of S-protein protomers, each harboring an N-terminal S1 subunit sequence with an N-terminal domain (NTD) and a receptor-binding domain (RBD) and a C-terminal S2 subunit holding a ‘fusion’ region with fusion peptide (FD) and internal fusion peptide sequences, 2 heptad repeat (HR) sequences, and a transmembrane (TM) domain. The subunits are processed by host proteases upon viral entry. The SARS-CoV-2 atomic model of a dimer (PDB entry 6VXX) shows the D614G mutation of the S1 domain eliminates a hydrogen bonding interaction with site 859 of the S2 domain of another protomer (colored in orange in the inset). An RMSD versus Z-score plot describes the DALI structural neighborhood of 6VXX, which contains 2310 structures. Alignments of multiple random samples of 10 structures along a transect from 6VXX to the main cloud with low Z-scores of structural similarity (colored red in the plots) consistently show 614 is part of a loop that falls in molecular regions that are poorly conserved at both sequence (Seq) and structural (Str) levels. Blue hues indicate larger conservation than green-to-red hues in the protomer cartoon models of the alignment example. (B) The NSP12 is the main RNA dependent RNA polymerase of the virus. It is encoded by ORF1b and is responsible for the synthesis of viral RNA. Examining the SARS-CoV-2 structure in complex with NSP7 and 8 cofactors (PDB entry 6M71) revealed that the P323L mutation sits in an ‘interface’ region (spanning residues 250-365) between the N-terminal nidovirus-unique NiRAN domain with nucleotidyltransferase activity and the C-terminal polymerase domain that harbors the fingers, palm and thumb subdomains. The mutation is in a helical region at the surface of a pocket formed by the NiRAN, interface and fingers structures (inset). DALI structural neighborhood analysis (1157 structures) confirmed the site is in a region that is poorly conserved at sequence and structure levels, but borderline with the highly conserved regions that harbor polymerase activity. (C) The NSP13 is the helicase of the viral replication complex. NSP13 has an N-terminal Zn binding domain (ZBD) followed by a stalk domain and 3 Rec-A domain structures 1A, 2A and 2B, which form the triangular base of a pyramid. L504P and C541Y are in loop regions located on the surface of the middle of the 2B domain. DALI structural neighborhood analysis of PDB entry 6JYT (1356 structures) showed C541Y is in regions of the molecule that are structurally conserved, while the L504P region was variable at both sequence and structure levels.

The mutational diversification of SARS-CoV-2 viroporin encoded by ORF3a. (a) The structure of the protein 3a molecule (PDB entry 6XDC) has 2 domains, a N-terminal transmembrane domain (TD) and a C-terminal cytosolic domain (CM). Mutation Q57H is located in the first of the 3 transmembrane helices at the major hydrophilic constriction of the pore important for channel activity. Mutation G197V forms part of a loop at the surface of the CD. Terminal amino acids 1-38 and 239-275 and a 175-180 in CD could not be modeled because they were weakly resolved. They hold mutations 13 and 251. (b) View from the lumen side of the channel pore (P) in ribbon and atom stick representation. Note that the pore is only 1 Å wide. (c) A DALI structural neighborhood analysis (10 088 structural neighbors) returned significant hits to small fragments (Z ⩽ 9.2; RMSD ⩾ 1.3) that formed a single cluster in the RMSD versus Z-score plot. Structural alignment of the 92 hits with Z ⩾ 7 (red dots) revealed that all hits matched the TD structures and were well conserved at structure (Str) but less at sequence (Seq) levels. The best structural match to the TD was the Orai protein channel (PDB entry 6BBG) responsible for Ca2+ influx pathways in metazoan cells and involved in immune responses and cancer. (d) The mapping of intrinsic disorder (UIPred2, red line) and gain-loss of binding energy (Anchor2, blue line) along the sequence confirmed the significant intrinsic disorder (scores ⩾ 0.5) of the C-terminal linker. A comparison of the different mutants and reference viral strain with a delta score revealed that mutations G196V and G251V decreased disorder.

Pathways of mutational diversification of SARS-CoV-2 involve intrinsic disordered regions of the nucleocapsid (N) protein. (a) The N-protein has 2 major RNA-binding domains, an N-terminal domain (NTD) and a C-terminal domain (CTD), both connected to a central linker and flanked by terminal sequences, all of which have been reported to be intrinsically disordered regions (IDRs). Mutations were traced onto a SARS-CoV-2 N-protein structure modeled with I-Tasser. They occurred in position 13 of the N-terminal IDR and positions 193, 197, 203 and 204 of the linker IDR, all of them in loop regions of the molecule. Mutations 203 and 204 were the only sites that were buried in the molecule. (b) A DALI structural neighborhood analysis against the modeled structure (88 structural neighbors, including many from SARS-CoV-2) showed 2 clusters in the RMSD versus Z-score plot, one reflecting structural match to the NTD domain and the other to the CTD domain. Structural alignment plots of the 88 structures supported the veracity of the modeled RNA-binding domains and revealed that the NTD is more conserved at sequence (Seq) and structure (Str) levels. (c) The mapping of intrinsic disorder (UIPred2, red line) and gain-loss of binding energy (Anchor2, blue line) along the sequence confirmed the significant intrinsic disorder and binding (scores ⩾ 0.5) of linker and terminal regions. A comparison of the R203K mutant and reference viral strain with a delta score revealed that the mutation increased disorder. A similar outcome was obtained with the G204R mutant.
Discussion
RNA viruses are known for their high mutation rates, and hence, for rapid genome evolution.
39
One consequence of these rates is that viral populations become ‘quasispecies’, highly diverse collectives of closely related viruses expressing vast numbers of distinct genotypes.
40
Coronaviruses harbor the largest known non-segmented RNA genomes reported to date (−27-32 kb in length) and a proteome with a repertoire of −29 proteins, many with published atomic structures (Figure 2). They adapt to host environments by relying on the low fidelity of a replication complex, which assembles around the NSP12 polymerase with its extra N-terminal β-hairpin domain in interaction with an hexadecameric complex of disordered NSP7 and NSP8 cofactors that are currently targets of COVID-19 therapeutics (e.g.
Most entropically-significant mutations in NSPs showed entropy ‘reversals’, in which entropy increases were followed by decreases along the timeline of the pandemic (Figure 3). In turn, mutations in structural and accessory proteins followed both modes, entropy reversals and entropy expansions. Most entropically-significant mutations spread throughout regions of the world (Figure 4). We first illustrate entropy reversals with 3 high-entropy mutations, one in the structural protein of the coronavirus spike and the other 2 with mutations in NSP proteins important for virus replication:
(1)
(2)
(3)
The coordinated reversals of entropic expansions provide support to the fixation of mutations in the spike, polymerase, and helicase proteins, some of which fostered virulence and viral loads. Several mutations that we do not discuss followed this same entropic mode but achieved lower entropy levels (Table 1; Figure 3). For example, the large NSP3 scaffold and protease that initiates cleavage of the ORF1a/ORF1ab protein and mediates genome replication and transcription, 11 harbors a mutation at the hypervariable region (HVR), which decreases disorder in only the disordered domain of the protein (Supplemental Figure S2). While HVR appears dispensable for viral infection, its role is currently unknown. 11 In parallel, we also identified significant entropic regions with persistent tendencies of entropic expansion that could be important determinants of disease progression (Table 1; Figure 3). Here we focus on high-entropy mutations affecting 2 important molecules, the accessory viroporin 3a protein and the N-protein, which we posit represent new pathways of mutational change that involve intrinsic disorder:
(1)
(2)
Our study explores pathways of mutational change in highly entropic sites of the SARS-CoV-2 proteome, quantifying diversity and fixation of variants with the two-state variable strategy we modified from Pan and Deem. 38 The analysis does not explore phylogenetic relationships at genomic level with distance or parsimony-based reconstruction methods, which for example are systematically pursued in the Nextstrain portal. 27 Instead, we recognize the difficulties of studying the multidimensional landscape of rapidly evolving and recombinogenic coronavirus genomes with tree-based approaches that do not dissect processes of horizontal exchange of genetic information, do not model the short-timescale accumulation of mutations, and are poorly powered to resolve the existence of positive selection. We do trace mutation accumulation in the expanding SARS-CoV-2 population of the pandemic to uncover significant pathways of proteome diversification. A conventional PCA analysis conducted on a distance matrix comprised of distinct amino acid sequences, excluding duplicates, revealed a complex data structure with 2 distinct diversification clouds unfolding fundamentally on the first dimension as new proteome sequence variants depart from the makeup of the original Wuhan reference virus. These 2 clouds reflected the 2 mayor entropic pathways of change we uncovered in our analysis. A geospatial analysis across regions of the world reveal that these pathways manifested quickly at global level (Figure 4).
Conclusions
A number of important mutations that foster viral spread, such as the haplotype that affects the S-protein, follow an entropic reversal mode that suggest they are being actively fixed in the expanding viral population of the pandemic. Here, we describe new pathways of entropic expansion of mutations involving intrinsically disordered regions of the SARS-CoV-2 proteome and interactions with replicating genomes and endoplasmic membranes that are needed for virus assembly and release from infected cells. Pathways involve intrinsically disordered regions of the N-protein, a structural protein that forms complexes with genomic RNA, interacts with the M-protein during virus assembly, enhances the efficiency of virus transcription and assembly, and helps overcome the host innate immune response.
62
Pathways also involve viroporins. Viroporins of enveloped viruses such as SARS-CoV-2 insert into membranes to break chemoelectrical barriers by channeling ions across membranes and dissipating membrane potential, a property that stimulates budding and resembles that of depolarization-dependent exocytosis.
50
The close homology of SARS-CoV-2 protein 3a to the Orai proteins suggests membrane potential dissipation involves store-operated channeling mechanisms that control Ca2+ cellular levels. These new mutational pathways may be responsible for new symptomatic manifestations of the COVID-19 disease. For example, asymptomatic SARS-CoV-2 infection has been reported to account for −50% of total coronavirus cases and the majority of the patients with mild infection symptoms can recover by themselves.63,64 In turn, the COVID-19 disease mechanism suggests that the severe symptoms of COVID-19 involve the uncontrolled immune response of the host.
65
Indeed, mutational pathways involve virus molecules that can subvert the immune response, specifically the interferon response. For example, the N-protein, protein 3a and accessory protein 6 are the 3 β-interferon antagonists operating in coronavirus disease.
66
Mutations in 2 of these molecules are high entropy in our mutational set. In addition, the SARS-CoV-2 new virus tends to be more rapidly spreading and less lethal than SARS-CoV and MERS-CoV, a fact that demands explanation. As predicted by the director of Centers for Disease Control and Prevention of the USA on March 25,
Supplemental Material
Figure_S1_xyz5177238212247 – Supplemental material for New Pathways of Mutational Change in SARS-CoV-2 Proteomes Involve Regions of Intrinsic Disorder Important for Virus Replication and Release
Supplemental material, Figure_S1_xyz5177238212247 for New Pathways of Mutational Change in SARS-CoV-2 Proteomes Involve Regions of Intrinsic Disorder Important for Virus Replication and Release by Tre Tomaszewski, Ryan S DeVries, Mengyi Dong, Gitanshu Bhatia, Miles D Norsworthy, Xuying Zheng and Gustavo Caetano-Anollés in Evolutionary Bioinformatics
Supplemental Material
Figure_S2_xyz51772bf688ed9 – Supplemental material for New Pathways of Mutational Change in SARS-CoV-2 Proteomes Involve Regions of Intrinsic Disorder Important for Virus Replication and Release
Supplemental material, Figure_S2_xyz51772bf688ed9 for New Pathways of Mutational Change in SARS-CoV-2 Proteomes Involve Regions of Intrinsic Disorder Important for Virus Replication and Release by Tre Tomaszewski, Ryan S DeVries, Mengyi Dong, Gitanshu Bhatia, Miles D Norsworthy, Xuying Zheng and Gustavo Caetano-Anollés in Evolutionary Bioinformatics
Supplemental Material
Table_S1_xyz517728e1451f5 – Supplemental material for New Pathways of Mutational Change in SARS-CoV-2 Proteomes Involve Regions of Intrinsic Disorder Important for Virus Replication and Release
Supplemental material, Table_S1_xyz517728e1451f5 for New Pathways of Mutational Change in SARS-CoV-2 Proteomes Involve Regions of Intrinsic Disorder Important for Virus Replication and Release by Tre Tomaszewski, Ryan S DeVries, Mengyi Dong, Gitanshu Bhatia, Miles D Norsworthy, Xuying Zheng and Gustavo Caetano-Anollés in Evolutionary Bioinformatics
Footnotes
Acknowledgements
This study began as a class research project in CPSC 567, a course in bioinformatics and systems biology taught by G.C.-A. at the University of Illinois in the spring of 2020. We dedicate this work to the frontline medical professionals who have been saving the life of others with limited protective equipment, selflessly, and at their own peril. We also thank public health professionals and scientists for making real-time data and sequences readily accessible to the public. COVID-19 research in the laboratory of G.C.-A is supported by the Office of Research and Office of International Programs in the College of Agricultural, Consumer and Environmental Sciences at the University of Illinois at Urbana-Champaign.
Funding:
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Authors’ Contributions
MD proposed the research idea. TT and RSD processed the data. TT conceived experiments and analyzed genomic and intrinsic disorder data with help from GC-A. GC-A analyzed protein structural data. All co-authors wrote, edited and approved the manuscript.
Supplemental Material
Supplemental material for this article is available online.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
