Abstract
Code-message coevolution (CMC) models represent the coevolution of a genetic code and a population of protein-coding genes (“messages”). Formally, CMC models are sets of quasispecies coupled together for fitness through a shared genetic code. Although CMC models provide plausible explanations for the origin of multiple genetic code traits by natural selection, useful modern implementations of CMC models have not been available. To meet this need we present CMCpy, an object-oriented Python API and command-line executable front-end that can reproduce all published results of CMC models. CMCpy implements multiple solvers for leading eigenpairs of quasispecies models. We also present novel analytical results that extend and generalize applications of perturbation theory to quasispecies models and pioneer the application of a homotopy method for quasispecies with non-unique maximally fit genotypes. Our results therefore facilitate the computational and analytical study of a variety of evolutionary systems. CMCpy is free open-source software available from http://pypi.python.org/pypi/CMCpy/.
Introduction
Code-Message Coevolution (CMC) models were introduced to facilitate the study of co-evolutionary systems in which a large population of individuals share an evolvable genetic code coupled to many and/or long protein-coding genes (“messages”). Initial study of these models demonstrated that base mutations in protein-coding genes can significantly influence the fitness of genetic codes.1 More detailed study of CMC models has yielded formal demonstrations of how natural selection likely contributed to the origins of codon redundancy2 and non-random patterns of amino acid assignments3,4 in genetic codes. CMC models have been extended to study the effects of population structure and gene sharing on coevolving systems of codes and messages.5
In their original formulations, CMC models are deterministic evolutionary genetic models that couple together sets of large populations of asexually reproducing genotypes evolving under mutation and natural selection. One such population is called a “quasispecies”,6 a concept reviewed recently both generally7 and specifically in connection to population genetics theory.8,9 In CMC models, each of multiple quasispecies represents a large population of codons evolving to meet the same physicochemical requirements of a “site-type” in proteins under translation by a given genetic code. The coupling of multiple quasispecies through a genetic code occurs through reuse of the same codon type (or allele) in different site-types. CMC models follow deterministic trajectories that alternate between equilibration of messages to an established genetic code by mutation and selection, and locally adaptive “gradient ascent” hill-climbing of the genetic code through single codon assignments and reassignments. This “quasistatic” dynamic continues until no single codon reassignment yields higher fitness with the current message population. Published CMC models always converge to a stable local fitness optimum, a process called “code freezing.”
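This quasistatic dynamic can be sketched as a greedy hill-climb. The sketch below is illustrative only (the function names are hypothetical, not CMCpy's API), and equilibration of messages to each candidate code is abstracted into the fitness function:

```python
def quasistatic_ascent(code, neighbors, fitness):
    """Schematic of the quasistatic CMC dynamic: repeatedly accept the
    single code change (codon reassignment) that most improves fitness,
    and stop when no change helps -- the "code freezing" of the text."""
    while True:
        best = max(neighbors(code), key=fitness)
        if fitness(best) <= fitness(code):
            return code  # frozen: stable local fitness optimum
        code = best

# toy landscape: integer "codes", single-step neighbors, fitness peak at 3
frozen = quasistatic_ascent(0, lambda c: [c - 1, c + 1],
                            lambda c: -(c - 3) ** 2)
```

In the toy landscape the ascent freezes at the local (here also global) optimum, 3; in real CMC models the "neighbors" are codes differing by a single codon reassignment and the fitness of each candidate is evaluated against equilibrated messages.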
Because CMC models compound quasispecies models, analytical solutions to quasispecies models are relevant to their study. Previous applications of perturbation theory for approximate solutions to quasispecies models assume uniqueness of maximally fit genotypes and small perturbation parameters (small mutation rates).10–12
In this work, we present two analytical methods that relax these assumptions. The first is a perturbative method that clarifies, extends and generalizes the perturbative approach to quasispecies models. Our presentation provides a derivation to arbitrary order, which relaxes any restriction on the magnitude of the perturbation parameter. The second method is the first published application of homotopy methods to quasispecies models, which handles cases with non-unique maximally fit genotypes. Numerical tests show that, for two example problems, both algorithms converge. Notably, the error of the perturbative method decreases exponentially in the number of iterations, suggesting its potential to outperform the power method. Both approaches are flexible and can be extended to a variety of quasispecies mutation models and fitness schemes. Also, both methods have been formulated so as to facilitate future connections between quasispecies models and established results in eigenvalue perturbation theory13–15 and homotopy methods. 16
In addition to these analytical results, we present a new code-base implementing CMC models. The dissemination and further study of CMC models has been sorely hindered by the lack of a modern code-base implementing them. The original C/C++ CMC code-base has been rendered obsolete by the evolution of compilers and language standards since the time of its writing in the late 1990s. This created a need to reimplement the original models in an easy-to-use, powerful, and efficient high-level scripting language like Python.
Here we present CMCpy, a free open-source code-base that implements CMC models in an easy-to-use, object-oriented Python API. The code-base comes with a front-end command-line executable called cmc that can drive the exploration of a variety of CMC models and reproduce published results.
Implementation
CMCpy was developed in Python 2.7. Figure 1 shows the organization of classes in CMCpy. A rich class hierarchy allows convenient specification of a wide range of CMC models with very few lines of Python code.

Overview of class hierarchy and containment relationships in CMCpy.
An element not shown in Figure 1 is that the ArdellSellaEvolver class is an abstract base class for a variety of subclasses implementing different strategies for computing dominant eigenpairs. The default solver relies on the NumPy eig() function for its speed and accuracy, but for reasons of flexibility and verification, and to experiment with different methods, we also include a legacy central processing unit (CPU)-based power method, a multicore CPU-based power method implementation, and an experimental graphics processing unit (GPU) implementation of the power method that relies on pyCUDA.17
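A generic sketch of the power method these solvers implement (a textbook version for symmetric matrices, not CMCpy's exact code):

```python
import numpy as np

def power_method(A, tol=1e-12, max_iter=100000):
    """Approximate the leading eigenpair of a symmetric matrix A by
    repeated multiplication and renormalization; converges when the
    leading eigenvalue is simple and well separated."""
    v = np.ones(A.shape[0]) / np.sqrt(A.shape[0])  # arbitrary start vector
    lam = 0.0
    for _ in range(max_iter):
        w = A @ v
        lam_new = v @ w  # Rayleigh quotient estimate (since ||v|| = 1)
        v = w / np.linalg.norm(w)
        if abs(lam_new - lam) < tol:
            break
        lam = lam_new
    return lam, v

lam, v = power_method(np.diag([3.0, 1.0, 0.5]))
# lam approaches the largest diagonal entry, 3.0
```

The linear convergence rate of this iteration, governed by the ratio of the second-largest to the largest eigenvalue, is the baseline against which the analytical methods later in this paper are compared.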
The pyCUDA solver is implemented in a CUDA C kernel with supportive Python code. The Python code uses NumPy for simple operations, to cast data to types acceptable to the CUDA platform, and to reshape matrices into one-dimensional arrays for use by CUDA C. The CUDA C implementation of the power method approximately solves for the leading eigenvector of each site-type in parallel. Each site-type is assigned to one virtual “block” of the GPU, each of which corresponds to a physical processor when the number of blocks is below a hardware limit. This organization of the parallel workload was chosen to stay below CUDA specification limits, access GPU memory efficiently, and avoid excess complexity. Ultimately, the CUDA power method implementation is faster than the CPU power method implementation. Figure 2 shows wall-clock execution times of various methods on a 3.0 GHz Core 2 Duo with an Nvidia GeForce 460 GTX GPU running Ubuntu 12.04. The benchmark script in R that generated this figure is provided as supplementary data.

Comparison of wall-clock execution times of the cmc executable with three different eigensystem solvers using the double ring model4 with eight codons, μ = 0.1 and φ = 0.25.
CMCpy comes with a front-end command-line executable in Python called cmc. This executable provides users, including non-programmers, with the capability to reproduce (at least qualitatively) all of the published results on CMC models1–4 as well as to run individual and batch simulations of other models and parameter spaces. In Table 1 we list command-line options to the cmc executable and their corresponding model parameters.
Options to the cmc executable and corresponding model parameters.
A variety of observables and statistics are implemented, including the Normalized Encoded Range2 for one-dimensional amino acid/site-type spaces. Although CMC models are deterministic, exact quantitative differences may arise in CMCpy results depending on platform differences in floating-point representation and on differences in convergence thresholds when using power method-based eigensolvers.
It may be useful to restate the assumptions of the “Ardell-Sella” models currently implemented and available in CMCpy. These include the following:
1. The numbers and/or lengths of protein-coding genes, or “messages,” translated by a common genetic code are large with respect to every possible site-type in proteins.
2. For every possible site-type in proteins there corresponds a uniquely most fit amino acid.
3. The machineries to decode codons, and to associate any codon with any amino acid, pre-exist the evolution of the genetic code.
4. Fitness contributions of amino acids across sites are independent and multiplicative.
5. Fitness contributions of different amino acids within the same site are independent and additive.
6. Bases in messages mutate independently of one another.
7. Messages are haploid and asexually reproducing.
8. Genetic codes evolve much more slowly than messages, through discrete and independent assignments or reassignments of amino acids to codons.
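The multiplicative-across-sites fitness assumption can be made concrete with a short sketch; the data structures and function names below are hypothetical illustrations, not CMCpy's API:

```python
from math import prod

def message_fitness(message, site_types, code, site_fitness):
    """Fitness of a message under a genetic code: the contributions of
    the translated amino acids are independent and multiplicative
    across sites (an assumption of the Ardell-Sella models).
    message      -- list of codons, one per site
    site_types   -- the site-type of each site, aligned with message
    code         -- dict mapping codon -> amino acid
    site_fitness -- dict mapping (site_type, amino_acid) -> fitness in (0, 1]
    """
    return prod(site_fitness[(s, code[c])] for c, s in zip(message, site_types))

# toy example with two codons, two amino acids, and two site-types
code = {"c1": "polar", "c2": "hydrophobic"}
site_fitness = {("surface", "polar"): 1.0, ("surface", "hydrophobic"): 0.5,
                ("core", "polar"): 0.5, ("core", "hydrophobic"): 1.0}
best = message_fitness(["c1", "c2"], ["surface", "core"], code, site_fitness)
worst = message_fitness(["c2", "c1"], ["surface", "core"], code, site_fitness)
# best == 1.0 (each site encodes its fittest amino acid); worst == 0.25
```

A message whose codons decode to the fittest amino acid at every site attains the maximal product; each mismatched site multiplies the fitness down independently.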
Analytical Methods for Quasispecies Solutions
In this section we develop two different analytical methods to solve for the equilibrium growth rate and genotype distribution of a wide range of quasispecies models. The Matlab/Octave code used to implement these solutions is provided as supplementary data. A version of the homotopy method is also implemented in CMCpy for ring models. Full implementations of both methods will be incorporated into CMCpy at a later date.
Perturbative method: quasispecies with unique fittest genotype
We start with a ring mutation model discussed in prior work.1
Let μ denote the N × N mutation matrix
We assume that w1 is the unique maximum of the finite set {w1, …, wN}. We also assume wi > 0 for each i, consistent with prior work.1
Our goal in this derivation is to determine the leading eigenpair, consisting of the largest eigenvalue together with its corresponding eigenvector, of the matrix
We write
Since the matrices Q̃ and Q are related by a similarity transformation, their eigenpairs are closely related. One can check that
We focus our attention on Q because it is symmetric, i.e., Q(μ)^T = Q(μ) for all μ.
For Q(μ), the leading eigenpair (λ(μ), v(μ)) must satisfy
When μ = 0, the matrix μ reduces to the N × N identity matrix. Therefore, Q(0) = w, and the leading eigenpair of Q(0) is given by
Here ej denotes the j-th basis vector in N-dimensional space, i.e., the vector with all zeros except for a 1 in the j-th slot.
Guiding principle
Since Q(μ) is symmetric for all real μ, standard theoretical results in eigenvalue perturbation theory14 guarantee that both the eigenvalue λ(μ) and the eigenvector v(μ) are analytic functions of μ. This means that there exists M > 0 such that for μ ∈ (−M, M), the following power series expansions converge:
In this notation, (6) can be written as λ0 = w1 and v0 = e1. Note also that we have arranged the coefficients so that
Our strategy now will be to derive from (5) a recursive set of equations for the coefficients λj, vj. Once we have these coefficients for j = 0, 1, …, J, we have an approximation to the leading eigenpair.
Let I denote the N × N identity matrix. Then μ = I + μS, where
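For concreteness, one plausible realization of this decomposition for a ring model can be sketched as follows; the specific form of S here (each genotype mutating to each of its two ring neighbors with probability μ/2) is our illustrative assumption, since the equation itself is not reproduced above:

```python
import numpy as np

def ring_mutation_matrix(N, mu):
    """Build M(mu) = I + mu*S for a ring of N genotypes, where S shifts
    probability mass mu/2 to each ring neighbor.  M(0) is the identity,
    M is linear in mu, and every row sums to 1 for 0 <= mu <= 1."""
    S = np.zeros((N, N))
    for i in range(N):
        S[i, i] = -1.0
        S[i, (i - 1) % N] = 0.5
        S[i, (i + 1) % N] = 0.5
    return np.eye(N) + mu * S

M = ring_mutation_matrix(5, 0.01)
# rows of M sum to 1, M is symmetric, and M at mu = 0 is the identity
```

Any mutation matrix that is linear in μ and reduces to the identity at μ = 0 admits such a decomposition, which is all the derivation below requires.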
Perturbative solution: part I
Armed with the above facts, we differentiate (5) with respect to μ on both sides:
We then set μ = 0 and use (8), (3), (9), and (6) to obtain
Multiply this equation on the left by the row vector
Note that
This shows that if we already know (λ0, v0), we can determine λ1. To determine v1 we return to (10) except now we treat λ1 as known. Rearranging the equation, we have
We substitute our definition of w on the left-hand side to obtain
Suppose that v1 = (v1(1),v1(2),…, v1(N)). We set v1(1)=0. For the remaining components, we solve the above matrix-vector system to obtain
We have completed the loop, showing how to proceed from the zeroth-order eigenpair (λ0, v0) to the first-order eigenpair (λ1,v1).
In the next section, we show how to iterate this procedure to generate the j-th order eigenpair (λj, vj) from the previously obtained eigenpairs.
Perturbative solution: part II
We return to (5) and take j derivatives with respect to μ on both sides. Using the general Leibniz rule, we have
After differentiating, we set μ = 0 and use (8) to obtain
Applying (3) and (9) yields
We peel off the m = 0 term from the right-hand side:
Multiplying on the left by
Now on the right-hand side, we peel off the m = j term; note that this is the only term in which λj appears. Hence
We use (6) and
We have therefore shown that if we already know (λm, vm) for m = 0, 1, …, j − 1, we can solve for λj. Now, using this λj, we can solve for vj in the same way as before. We go back to (12), isolate all terms involving vj, and apply (6) to derive
The right-hand side is clearly valid only for j ≥ 1. The matrix on the left-hand side is the same one that appeared earlier:
We again set
Algorithmic improvements
Equations (13) and (14) complete the step of deriving (λj, vj) using only the previously derived eigenpairs (λm, vm) for m = 0, 1, …, j − 1, giving us a recursive solution procedure. Once we have determined (λj, vj) for j = 0, 1, …, J, we can use these coefficients in (7) and thereby obtain approximations to the leading eigenpair (λ(μ), v(μ)). Hence we view (13) and (14) as an algorithm for computing the leading eigenpair.
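A compact sketch of this recursion in Python, in a mathematically equivalent unscaled form. For illustration we assume the symmetrized matrix Q(μ) = w^1/2(I + μS)w^1/2 = w + μB with B = w^1/2 S w^1/2 (consistent with Q(0) = w), a nearest-neighbor ring form for S, and a strictly peaked fitness diagonal; this is a sketch, not CMCpy's implementation:

```python
import numpy as np

def perturbative_eigenpair(w, S, mu, J):
    """Recursively compute Taylor coefficients (lambda_j, v_j) of the
    leading eigenpair of Q(mu) = W + mu*B, B = W^(1/2) S W^(1/2), where
    W = diag(w) and w[0] is the unique maximum.  At each order j:
        lambda_j = e1^T B v_{j-1}
        (W - lambda_0 I) v_j = -B v_{j-1} + sum_{m=1..j} lambda_m v_{j-m}
    with the normalization v_j[0] = 0 for j >= 1."""
    N = len(w)
    B = np.sqrt(np.outer(w, w)) * S
    lams = [w[0]]        # lambda_0 = w_1
    vs = [np.eye(N)[0]]  # v_0 = e_1
    for j in range(1, J + 1):
        lam_j = B[0] @ vs[j - 1]
        rhs = -B @ vs[j - 1] + lam_j * vs[0] \
              + sum(lams[m] * vs[j - m] for m in range(1, j))
        v_j = np.zeros(N)
        v_j[1:] = rhs[1:] / (w[1:] - w[0])  # safe: w[0] is the unique max
        lams.append(lam_j)
        vs.append(v_j)
    lam = sum(l * mu**j for j, l in enumerate(lams))
    v = sum(v * mu**j for j, v in enumerate(vs))
    return lam, v / np.linalg.norm(v)

# illustrative ring S and a strictly peaked fitness diagonal
S = np.zeros((5, 5))
for i in range(5):
    S[i, i], S[i, (i - 1) % 5], S[i, (i + 1) % 5] = -1.0, 0.5, 0.5
w = np.array([1.0, 0.8, 0.64, 0.64, 0.8])
lam, v = perturbative_eigenpair(w, S, mu=0.01, J=14)
```

Truncating the series at order J gives an approximation whose error shrinks geometrically in J when μ is well inside the radius of convergence, which is the exponential convergence reported in the Example below.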
Turning to the numerical implementation, we now describe two improvements to the algorithm given by (13) and (14).
First, we note that in (14), we always set
Second, we note that (14) contains a binomial coefficient that becomes prohibitively large to compute for large j. A natural question is whether these large coefficients are compensated by the inverse factors of j! in (7). To quantify this, we define
We then substitute
Dividing through by j! and using (16), we derive, again for j ≥ 1,
Applying the same substitutions in (15), we derive
Using (16) in (7), we obtain
Examining the equations in the
Example
Let us give an example of the perturbative method in practice. We set N = 5, μ = 0.01, and w equal to a diagonal matrix whose entries along the diagonal are (1, φ^d, φ^2d, φ^2d, φ^d), where d = 0.2 and φ = 0.32768.
Let J denote the total number of iterations for which we run the perturbative method. Starting from
For an N-dimensional vector x = (x1, x2, …, xN), let ‖x‖∞ = max1≤i≤N |xi| denote the infinity-norm of x, with respect to which the error of the approximate solution after J iterations is
In Figure 3, we plot (in circles) the log10 of the error as a function of the number of iterations J, and (in solid black) the least-squares line of best fit to the data. From J = 1 to J = 14, the log10 errors show a strongly linear trend, confirmed by the R² = 0.9958 value for the regression line. The slope of the line is approximately −0.9944, implying

For a particular eigenvalue problem, we plot (in circles) the log10 of the error committed by the perturbative method after J iterations, where J goes from 1 to 14.
Machine epsilon in Octave is approximately 2.2204 × 10−16, and the error after J = 14 iterations is 2.2590 × 10−16. Therefore, for this particular example, Figure 3 shows that the perturbative method converges exponentially to a solution with error on the order of machine epsilon.
Extension to base/codon/word mutation models
Consider now a matrix μB defined as follows:
Matrix μB shares with matrix μ the properties of linearity in the parameter μ and reduction to the identity matrix when μ = 0. The methods of this section therefore apply directly to μB.
More biologically realistic CMC models2–4 employ codon mutation models. Let C = B^p, the p-th Cartesian product of the set B. A codon c ∈ C is a string of bases b1b2…bp of pre-specified length p with bi ∈ B.
The codon mutation models studied by Ardell and Sella assume independence of mutation of bases within codons and that all bases mutate according to the same model of evolution μB. With these assumptions, mutation from any codon c1 ∈ C to another codon c2 ∈ C is represented by a matrix μC that is the p-th Kronecker power of the matrix μB, as follows:
If λ is the leading eigenvalue of μB, then λC = λ^p is the leading eigenvalue of μC. Similarly, if v is the eigenvector corresponding to the leading eigenvalue of μB, then
Therefore, the methods of this section allow calculation of the leading eigenpairs of matrices μC as Kronecker powers of the leading eigenpairs of μB.
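This construction can be sketched with NumPy; the base matrix below is an illustrative symmetric, Jukes-Cantor-like choice, not necessarily the μB of the text:

```python
import numpy as np
from functools import reduce

def kron_power(A, p):
    """p-th Kronecker power of A: the mutation matrix of length-p words
    whose letters mutate independently according to A."""
    return reduce(np.kron, [A] * p)

# illustrative 4x4 base mutation matrix: stay with probability 1 - mu,
# otherwise mutate uniformly to one of the three other bases
mu = 0.1
B = (1 - mu) * np.eye(4) + (mu / 3) * (np.ones((4, 4)) - np.eye(4))
C = kron_power(B, 3)  # 64 x 64 codon (triplet) mutation matrix

# the leading eigenvalue of C is the cube of B's leading eigenvalue, and
# the corresponding eigenvector is the triple Kronecker power of B's
```

The Kronecker structure means the 64 × 64 codon eigenproblem never needs to be solved directly: solving the 4 × 4 base problem and raising the result to the p-th power suffices.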
Homotopy method: quasispecies with multiple most fit genotypes
We now give a second method for finding the leading eigenpair of the matrix Q defined in (3). This method, which we call the homotopy method, is motivated by the desire to handle a fitness matrix w that does not have a unique maximal element along its diagonal. The homotopy method produces accurate approximations of the leading eigenpair for such problems.
Problem formulation
Our goal is still to find the leading eigenpair of
Note that F(0) = μ and F(1) = Q. The function F smoothly deforms μ into Q; such a function is often called a homotopy in the mathematical literature.16
Note that for all ε ∈ [0, 1], the leading eigenpair (λ(ε), v(ε)) of F(ε) must satisfy
The basic idea behind the homotopy method is to use F to form a bridge between μ, a matrix whose leading eigenpair we already know, and Q, a matrix whose leading eigenpair we seek.
When ε = 0, (21) reduces to μv(0) = λ(0)v(0). By the results provided in the supplementary materials, we know that the leading eigenvalue of F(0) = μ is 1, with corresponding eigenvector
When ε = 1, (21) reduces to Qv(1) = λ(1)v(1). Thus the question is how we can use our knowledge of (λ(0), v(0)) and the function F(ε) to derive (λ(1), v(1)), the leading eigenpair of Q.
Unlike the perturbative solution, at no point will we assume that the entries of w have a unique maximum. To make the derivation easier to read, we define
Homotopy solution
We differentiate both sides of (21) once with respect to ε and obtain
Since F(ε) is symmetric for all ε, the transpose of (21) can be written v(ε)^T F(ε) = λ(ε)v(ε)^T. Thus, after multiplying (25) through on the left by v(ε)^T, the second term on the left-hand side cancels the second term on the right-hand side, leaving
Now let us substitute (26) back into (25). Solving for v′(ε), we have
We now recognize (26) and (27) as a system of ordinary differential equations (ODEs), with ε playing the role of a time-like independent variable:
We also recognize (22) as the initial conditions for this system of ODEs. Let us now describe an elementary algorithm for solving this system:
Set λ = 1, v = v(0), ε = 0, and Δε = 1/nsteps. While ε < 1:
1. Compute λ′ using (26), i.e., λ′ = (v^T P v)/(v^T v).
2. Set λ ← λ + (Δε)λ′.
3. Using this updated λ, compute v′ using (27), i.e., v′ = −[μ + εP − λI]^−1 P v.
4. Set v ← v + (Δε)v′.
5. Set ε ← ε + Δε.
The algorithm will terminate in nsteps steps, yielding an approximation to the leading eigenpair of Q that is stored in λ and v.
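This loop can be sketched in Python as follows. The ring mutation model, the symmetrized form of Q, and the per-step renormalization of v (which changes only the immaterial scale of v) are our illustrative assumptions here; this sketch is not CMCpy's implementation:

```python
import numpy as np

def homotopy_leading_eigenpair(M, Q, nsteps):
    """Euler continuation along F(eps) = M + eps*P with P = Q - M, from
    the known leading eigenpair of M (eigenvalue 1; uniform eigenvector
    when M is symmetric and doubly stochastic) to that of Q."""
    N = M.shape[0]
    P = Q - M
    lam, v = 1.0, np.ones(N) / np.sqrt(N)
    d = 1.0 / nsteps
    for k in range(nsteps):
        eps = k * d
        lam_p = (v @ P @ v) / (v @ v)                                 # lambda'
        lam += d * lam_p
        v_p = -np.linalg.solve(M + eps * P - lam * np.eye(N), P @ v)  # v'
        v += d * v_p
        v /= np.linalg.norm(v)  # rescale only; direction is unchanged
    return lam, v

# illustrative test problem patterned on the example below: N = 8 ring
# mutation model and a fitness diagonal with a non-unique maximum
N, mu = 8, 0.1
S = np.zeros((N, N))
for i in range(N):
    S[i, i], S[i, (i - 1) % N], S[i, (i + 1) % N] = -1.0, 0.5, 0.5
M = np.eye(N) + mu * S
w = np.array([0.63631836] + [0.73306514] * 7)
Q = np.sqrt(np.outer(w, w)) * M  # symmetrized quasispecies matrix
lam, v = homotopy_leading_eigenpair(M, Q, nsteps=1000)
# the residual norm(Q v - lam v) shrinks roughly in proportion to 1/nsteps
```

Because the update v′ is linear in v, renormalizing v at each step leaves the sequence of directions, and hence λ, identical to the unnormalized algorithm while avoiding overflow or underflow.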
To obtain the leading eigenvector of
Example
We now give test results for the homotopy method applied to a particular problem. We set N = 8, μ = 0.1, and w equal to a diagonal matrix whose entries along the diagonal are (a, b, b, b, b, b, b, b) with a = 0.63631836 and b = 0.73306514. We also set the parameter φ = 1.
We repeatedly run the algorithm given above for different values of nsteps; specifically, we take nsteps = 10^j for j = 1, 2, 3, 4, 5, 6. For each value of nsteps, we compute an approximation to the leading eigenpair, which we denote by (λj, vj). We then evaluate the residual error of this approximation using
In Figure 4, we plot (in circles) the log10 of the error as a function of the log10 of the number of steps. We also plot (in solid black) the least-squares line of best fit to the data; for this line, R² = 0.9899 and the slope is approximately −1.0115, implying

For a particular eigenvalue problem, we plot (in circles) the log10 of the error committed by the homotopy method using a number of steps given by nsteps = 10^j, where j goes from 1 to 6.
Note that with nsteps = 10^3, the error is approximately 2 × 10−5. This level of error is acceptable if we seek to use the homotopy method to reproduce, for example, Figure 4 in a previously published paper.4 Hence we use this value of nsteps as the default value in the CMCpy implementation of the homotopy method.
Further note that when nsteps = 10^6, even the elementary algorithm described above for solving the system of nonlinear ODEs (28) is capable of producing a residual error of approximately 2 × 10−8.
For this example, the homotopy method displays convergence that is linear in Δε; this relatively slow rate can be improved dramatically by using more sophisticated methods to solve the system of nonlinear ODEs (28), an issue we leave for future work.
Extension to base/codon/word mutation models
In order to apply the homotopy method to models where the mutation matrix is given by μB as defined earlier, there is one requirement: we must be sure that μB has a unique maximal eigenvalue of 1. For the μB matrix, it turns out that we can explicitly derive all eigenvectors and eigenvalues for general values of both μ and k. With the constraints 0 < μ < 1 and k ≥ 1, the derivation that we give in the supplementary materials proves that the μB matrices have a unique largest eigenvalue of 1. Matrices μB therefore fulfill the minimum requirements for applicability of the methods of this section, and the leading eigenpairs for corresponding matrices μC may then be calculated using Kronecker powers as before.
Future outlook
We succeeded in programming a GPU-based power method implementation that is faster than its CPU analogue; however, the performance gain that we obtained was not as great as we had hoped. Furthermore, even though our implementation is correct, we could not completely eliminate divergence in evolutionary trajectories among power method implementations arising from differences in machine number representation and precision across platforms. We believe that this arises from deviations in the way double precision floating point numbers are represented, computed on, and rounded in CUDA compute capability 1.3. Utilization of the cuBLAS library may in the future bring performance closer to NumPy's and better conform to IEEE standards.
Furthermore, while the power method converges linearly, our new perturbative method converges exponentially with the same accuracy. However, this method cannot handle the case of non-unique most fit genotypes that occurs in CMC models at their initialized state. On the other hand, our new homotopy method does handle this case, yet as currently implemented it also converges linearly, and rather more slowly than the power method (results not shown). Incorporation of more sophisticated methods for solving systems of nonlinear ordinary differential equations should dramatically improve the performance of our homotopy method. We leave further development of both methods and their implementation in CMCpy for future work; perhaps other CPU or GPU implementations of them will compete with NumPy. More generally, our analytical results greatly expand the domain of quasispecies models that can be accurately solved using analytical approaches, particularly multi-site models with biologically realistic mutation parameters.
A variety of open problems remain concerning CMC models and in the field of the evolution of the genetic code.20–24 CMCpy can easily be extended to implement the model studied by Vetsigian et al. (2006)5 with variations, or alternative observables such as the “evenness” of amino acids.23 Current models of the genetic code have not yet integrated a theory for the origin of translation per se.5,21 We believe that extensions to CMC models will better address such fundamental questions and hope that CMCpy and our analytical solutions to quasispecies models will play a role in that work.
Funding
DHA acknowledges the UC Merced division Graduate Research Council for a UC Faculty Research and Chancellor's awards that supported undergraduate BPS, as well as the NSF-funded program in Undergraduate Research in Computational Biology at UC Merced (DBI-1040962), which supported and trained PJB and paid publication costs.
Competing Interests
Author(s) disclose no potential conflicts of interest.
Author Contributions
Conceived CMCpy project: DHA. Wrote software: DHA, PJB, BPS. Analyzed data: DHA, HSB, PJB, BPS. Analytical results: HSB, with minor contributions from DHA. Wrote first draft of the manuscript: DHA, HSB. Contributed to the writing of the manuscript: DHA, HSB. Agree with manuscript results and conclusions: DHA, HSB, PJB, BPS. Developed the structure and arguments for the paper: DHA, HSB. Made critical revisions and approved final version: DHA, HSB, PJB. All authors reviewed and approved of the final manuscript.
Disclosures and Ethics
As a requirement of publication author(s) have provided to the publisher signed confirmation of compliance with legal and ethical obligations including but not limited to the following: authorship and contributorship, conflicts of interest, privacy and confidentiality and (where applicable) protection of human and animal research subjects. The authors have read and confirmed their agreement with the ICMJE authorship and conflict of interest criteria. The authors have also confirmed that this article is unique and not under consideration or published in any other publication, and that they have permission from rights holders to reproduce any copyrighted material. Any disclosures are made in this section. The external blind peer reviewers report no conflicts of interest.
Footnotes
Acknowledgements
The authors would like to thank members of the Ardell Lab, Jason Davis and Suzanne Sindi, for their comments and suggestions on this work, as well as Prof. Masakatsu Watanabe and Prof. Michael Colvin for their leadership in creating and running the Undergraduate Research in Computational Biology (URCB) program at UC Merced, without which the collaboration of undergraduate Peter Becich would not have been possible.
