Reconstruction of the median genome consisting of linear chromosomes from three given genomes is known to be intractable. There exist efficient methods for solving a relaxed version of this problem, where the median genome is allowed to have circular chromosomes. We propose a method for construction of an approximate solution to the original problem from a solution to the relaxed problem and prove a bound on its approximation error. Our method also provides insights into the combinatorial structure of genome transformations with respect to appearance of circular chromosomes.
In the course of evolution, genomes become a subject to a number of large-scale evolutionary events such as genome rearrangements that shuffle genomic architectures, and gene insertions and deletions (indels) that insert or remove continuous intervals of genes. Since these evolutionary events are rare, the number of them between two genomes is used in phylogenomic studies to measure the evolutionary distance between them. Such measurement is often based on the maximum parsimony assumption, implying that the evolutionary distance can be estimated as the minimum number of events between genomes. A convenient model for the most common genome rearrangements is given by the double-cut-and-join (DCJ) operations,1 also known as 2-breaks,2 which make two “cuts” in a genome and “glues” the resulting genomic fragments in a new order. Namely, DCJs mimic reversals (that inverse contiguous segments of chromosomes), translocations (that exchange tails of the two chromosomes), and fissions/fusions (that split/join chromosomes), while indels can be modeled by the DCJs on certain artificial circular chromosomes called prosthetic.3,4
The maximum parsimony assumption enables addressing the ancestral genome reconstruction problem, which asks to reconstruct ancestral genomes from given extant genomes, by minimizing the total distance between genomes along the branches of the phylogenetic tree. The basic case of this problem with just three given genomes is known as the genome median problem (GMP), which asks for a single ancestral genome (median genome) at the minimum total distance from the given genomes.
The GMP is NP-hard under a number of models of genome rearrangements, such as reversals-only5 and DCJ.6 While these problems can be posed for both circular genomes (consisting of circular chromosomes) and linear genomes (consisting of linear chromosomes), the DCJ model allows appearance of circular chromosomes in transformations between linear genomes. Correspondingly, a solution to the GMP under the DCJ model may contain circular chromosomes even if the given genomes are linear. Since appearance of circular chromosomes in the reconstructed ancestral genomes of extant linear genomes represents an artifact and inadequately describes the biological reality, it is important to distinguish between the GMP and the linear genome median problem (L-GMP), where the latter is restricted to linear genomes only.
To the best of our knowledge, there exist no solvers for the L-GMP, while there are some advanced GMP solvers,7–9 which allow the median genome to contain circular chromosomes. This deficiency inspired us to pose the problem of using the solution for the GMP to obtain a linear genome approximating the solution to the L-GPM. In this study, we propose an algorithm that linearizes chromosomes of a given GMP solution in a certain optimal way as described in the “Background” section. Our approach also provides insights into the combinatorial structure of genome transformations by DCJs and indels with respect to appearance of circular chromosomes. We remark that a similar linearization problem appears in adjacency-based reconstructions of median genomes and is known to be intractable,10 forcing the existing approaches10–14 to solve its relaxation and allowing the constructed median genomes to contain circular chromosomes.
The article is organized as follows. In the “Background” section, we describe the graph-theoretical representation of genomes, DCJs, and indels. In the “Main Results” section, we formulate main theorems providing an approximate solution for L-GMP. In the “Methods” section, we develop necessary machienery and prove our main theorems. We conclude the article with the “Discussion” section.
Background
DCJ-Indel distance and genome graphs
In this study, we focus on genomes with no duplicated genes. Let P be a genome, which may contain both circular and linear chromosomes. We represent a circular chromosome consisting of n genes as a graph cycle with n directed gene edges encoding genes and their strands, which alternate with n undirected edges connecting the extremities of adjacent genes. Similarly, we represent a linear chromosome consisting of n genes as a path with n directed gene edges alternating with undirected edges, where undirected edges connect extremities of adjacent genes, and two more undirected edges connect each endpoint extremity to its own special vertex labeled corresponding to a telomere (Figure 1A). We label each gene edge with the corresponding gene x and further label its tail and head endpoints with and , respectively (Figure 1A). We define the operation as and . A collection of cycles and paths representing the chromosomes of P forms the genome graph. The undirected edges in are called P-edges. We denote by the gene content of P (ie, the set of genes present in P) and by the set of regular (non-) vertices of .
(A) A genome graph for a linear genome . (B) A genome graph for a genome consisting of circular and linear chromosomes is obtained by a DCJ that splits the linear chromosome into two chromosomes. (C) A genome graph for a genome is obtained by an insertion of gene sequence . Dotted directed edges correspond to inserted genes.
A DCJ transforming a genome P into a genome corresponds to one of the following operations transforming into (Figure 1A and B):
(internal reversals, translocations)
(reversals at chromosome ends, translocations involving a whole chromosome)
(fusions)
(fissions)
where x, y, u, and v are distinct vertices from .
A DCJ scenario between genomes P and Q with equal gene content (ie, ) is a sequence of DCJs transforming P into Q. We define the DCJ distance between genomes P and Q as the length of a shortest DCJ scenario between them.
To transform a genome P into a genome Q with unequal gene content, one needs to consider gene insertion and deletion operations (indels) in addition to DCJs. An insertion transforming a genome P into a genome corresponds to one of the following operations transforming into (Figure 1A and C):
replace a P-edge with a path (including the case of either or ),
add a path ,
add a cycle ,
where the edges alternate between -edges and gene edges with , resulting in .
A deletion can be viewed as an event reversing an insertion. A deletion transforming a genome P into a genome corresponds to one of the following operations transforming into (Figure 1A and C):
replace a path with a -edge (including the case of either or ),
remove a path ,
remove a cycle ,
where the edges alternate between P-edges and gene edges , resulting in .
A DCJ-Indel scenariot between genomes P and Q is a sequence of DCJs and indels transforming P into Q, where deletions delete genes from and insertions insert genes from (ie, no gene can be inserted and then deleted, or deleted and then inserted), denoted as . We also find it convenient to represent t as , where each is a DCJ or an indel. We define the DCJ-Indel distance as the length of a shortest DCJ-Indel scenario transforming genome P into genome Q. It is easy to see that any DCJ-Indel scenario transforming P into Q can be reversed (turning each insersion into a deletion, and vice versa) to obtain a DCJ-Indel scenario transforming Q into P, implying that .
A circular chromosome C in P is a singleton with respect to genome Q if it is composed of genes absent in Q, ie, . Let be the number of singletons in P with respect to Q. The total number of singletons in P and Q with respect to each other is . The following lemma describes an important property of singletons.
Let be a shortest DCJ-Indel scenario. Let C be a singleton in with respect to . Then for any and any chromosome D in such that , we have .
Genome median problem
We pose the GMP under the DCJ-Indel model as follows.
Genome median problem (GMP)
Given genomes , find a genome M with that minimizes the DCJ-Indel median score:
Since the GMP is posed under the DCJ-Indel model, a median genome for given linear genomes may contain circular chromosomes. To address the issue of circular chromosome presence, we pose the following problem.
Linear genome median problem (L-GMP)
For given linear genomes , , and , find a linear genome M with minimizing the DCJ-Indel median score .
While we are not aware of efficient algorithms (let alone, software solvers) for solving the L-GMP, we pose the problem of constructing an approximate solution for the L-GMP from the given solution for GMP.
Results
Chromosome linearization
Let be a DCJ-Indel scenario and C be a circular chromosome in . For each , let be a collection of all circular chromosomes in such that (). We call a meta-chromosome of C in and note that itself may be viewed as a genome, for which , , and are defined. In particular, we have . Below, we describe an important property of circular chromosomes appearing in DCJ-Indel scenarios (Figure 3).
Definition 1
A circular chromosome C is linearized within a DCJ-Indel scenario (or t linearizes C) if the following three conditions hold:
(E1) C is present in P;
(E2) ;
(E3) , whereis the meta-circular chromosome of C in Q.
Equivalently, a circular chromosome C of genome P is linearized within if there exists a gene in C that resides on a linear chromosome in Q, or together with a gene from another chromosome of P resides on a circular chromosome in Q.
We extend Definition 1 to a particular event in a DCJ-Indel scenario as follows.
Definition 2
Let be a DCJ-Indel scenario that linearizes a circular chromosome C. We say that an event linearizes C within t if C is linearized within and C is not linearized within for any k<i.
The following theorem shows that for given linear genomes, all circular chromosomes in their median genome are linearized within the corresponding DCJ-Indel scenarios.
Lemma 3
Let , , and be linear genomes, and M be a genome such that . Let be a shortest DCJ-Indel scenario between M and for . Then each circular chromosome in M is linearized in at least one of the DCJ-Indel scenarios .
Proof
Assume that there is a circular chromosome C in M that is not linearized in either of . Then at least one of conditions (E2) or (E3) does not hold for each . Since , for each circular chromosome in M, we have . Hence, the condition (E2) must hold for at least one . So, there is such that the condition (E3) does not hold for the genome (ie, ), where is the meta-chromosome of C in . In other words, there exist circular chromosomes in the genome , which contradicts its linearity.□
The following theorems represent a key to proving our main results on linearization of median genomes. Proofs of these theorems are rather technical and given in the “Methods” section.
Theorem 1
Let be a DCJ-Indel scenario that linearizes a circular chromosome C. Then there exists a DCJ-Indel scenario such that r is a DCJ linearizing C within and .
For DCJ scenarios, we have a somewhat stronger result.
Theorem 2
Let be a DCJ scenario that linearizes a circular chromosome C. Then there exists a DCJ scenario such that r is a DCJ linearizing C within and .
Linearization of median genomes
For a genome P, let be the number of circular chromosomes in genome P. Our main results on linearization of median genomes are given by the following theorems.
Theorem 3
Let , , and be linear genomes, and M be a given median genome. Then for any , there exists a genome such that , , and
Proof
We prove the theorem by induction on n. If , the theorem trivially holds for .
We assume that the theorem holds for . Then there exists a genome such that , , and . Let be a circular chromosome in . Since , we have . Let be a shortest DCJ-Indel scenario between and for (Figure 2). By Lemma 3, there is at least one of the DCJ-Indel scenarios , , and that linearizes , say . By Theorem 1, we obtain a DCJ-Indel scenario of the form such that linearizes within and . Clearly, . By the triangle inequality, for , we have Hence, we have Thus, we have
□
Linear genomes , , and and their median genome M represented as vertices. A genome containing () circular chromosomes is represented as vertex, and the corresponding shortest transformations , , and are represented as directed dashed edges. We construct a shortest transformation from to composed of and such that results in a genome with and . The corresponding shortest transformations from to and are represented as bold directed edges and denoted by and .
For the GMP under the DCJ model, we can immediately improve the derived upper bound as follows.
Theorem 4
Let , , and be linear genomes with equal gene content, and M be a given median genome. Then for any , there exists a genome such that , and
Proof
The proof proceeds as the proof of Theorem 3 with the following difference. We use Theorem 2 instead of Theorem 1 to obtain a DCJ scenario of the form such that linearizes within and . Hence, we have □
Methods
This section is devoted to the proof of Theorems 1 and 2.
We call any two DCJ-Indel scenarios between the same pair of genomes equivalent. Let be a DCJ-Indel scenario that linearizes a circular chromosome C. First, in Lemma 4, we will show that there exists an event r within t that linearizes a circular chromosome C. Second, in Theorem 5, we will show that r is a DCJ. Third, we will show how to obtain equivalent pair of events (i.e., a DCJ-Indel scenario of length 2) from adjacent events in t, where and linearize C.
We will distinguish the pair of adjacent events based on their dependency. Namely, two adjacent events and in a DCJ-Indel scenario are called independent if the edges removed by are not created by . Otherwise, when removes edge(s) created by , we say that depends on . We will assume that is a DCJ if not stated otherwise. We will consider the following cases:
and are independent events (addressed in Lemma 6);
depends on a deletion (addressed in Lemma 8);
depends on a DCJ (addressed in Lemma 7);
depends on an insertion (addressed in Lemmas 9 to 11).
Eventually, results of Lemmas 6 to 11 will enable us to prove Theorems 1 and 2.
Circular chromosomes and DCJ-Indel scenarios
The following lemma shows the connection between Definitions 1 and 2.
Lemma 4
Let be a DCJ-Indel scenario that linearizes a circular chromosome C. Then there exists an event r that linearizes C within t.
Proof
Suppose that t is of the form: . For each , let be the meta-chromosome of C in . In particular, . Then, the equality
holds for but not for (since C is linearized within t). Hence, there exists such that equation (2) holds for but not for . Moreover, it is clear that and . By Definition 1, C is not linearized within and is linearized within . Thus, linearizes C within t.□
An event linearizing a circular chromosome C can also be described in terms of removing edges in genome graphs as follows (Figure 3).
Illustration of a linearized circular chromosome C within a DCJ-Indel scenario and Theorem 5. Dashed gray and black edges denote newly inserted genes and arbitrary gene sequences, respectively. Dotted edges represent genes that do not belong to meta-chromosomes of C. (A) Initial genome graph, where C is a linearized circular chromosome and D is a chromosome of any type. (B) The intermediate genome graph resulted from a DCJ-Indel scenario , where is a meta-chromosome of C and is a chromosome obtained from D. (C) The resulting graph after a fission on a circular chromosome . (D) The graph resulted from a DCJ that combines a circular chromosome and a chromosome .
Theorem 5
Let be a DCJ-Indel scenario that linearizes a circular chromosome C. Let be the meta-chromosome of C in for each . Then linearizes C within t if and only if is a DCJ with a minimal index k such that one of the following conditions holds:
removes edgesand;
removes a single edge.
Proof
Assume that is a DCJ and the above conditions (i) or (ii) holds, where k is the smallest such index. Since C is linearized within t, (E1) and (E2) hold for DCJ-Indel scenarios and . Now, we need to show that (E3) holds for , but not for . We consider the following two cases.
If condition (i) holds, then a belongs to a circular chromosome and b belongs to a chromosome (Figure 3D). If is circular, then creates a new circular chromosome such that (ie, is a fusion of circular chromosomes). By Lemma 2, we have . Since , we have , ie, (E3) holds. If is linear, then turns into a linear chromosome. Hence, we have . Since , we have , ie, (E3) holds. Since k is the smallest index and , (E3) does not hold for .
If condition (ii) holds, the proof is similar (Figure 3C).
Now, assume that linearizes C within t. Then the equality
holds. There are three possible types of , namely, insertion, deletion, and DCJ. First, we assume that is an insertion. Then . Recall that inserts genes from . In particular, since C is present in (ie, ), does not insert any genes from . Thus, we have . Since insertions cannot change the chromosome types, we have . By equation (3), we have a contradiction. Thus, is not an insertion.
Second, we assume that is a deletion. Then and . Let
Note that since , we have for all . Then . Let
Our goal is to prove that . Since A is the subset of genes removed by , . In particular, . Hence, . By equation (3), we have that . Now, let . Note that , , and . Since deletion cannot change the chromosome types, it follows that g is removed by . Then . By equation (3), , and thus we have . Since the choice of g was arbitrary, we have proved that . Note that and . Therefore, , a contradiction to linearizing C. Thus, is not a deletion.
We proved that is a DCJ. Then . Hence,
Thus, holds. Since does not change the gene content, either breaks one circular chromosome , or combines circular chromosomes and into a single circular chromosome, or combines a circular chromosome and linear chromosome into a single linear chromosome. In the first case, removes a single edge that belongs to (Figure 3C). In the last two cases, among the two edges removed by , one must belong to and the other does not belong to (Figure 3D).□
The following lemma describes an important property of meta-chromosomes.
Lemma 5
Let be a DCJ-Indel scenario, where linearizes a circular chromosome C within t. Let , and and be the meta-chromosomes of C in and , respectively. Then for any vertex , if , then .
Proof
Let be the gene corresponding to x. Note that if and only if for . Since and , cannot be inserted or removed by . Suppose that , ie, . We consider two cases depending on whether is an indel or a DCJ.
First, assume that is an indel. Since , there is a circular chromosome such that . Let be a chromosome in such that , ie, . If (ie, does not affect ), we have , implying that . If , we have either or . Since , in both cases, . Therefore, .
Second, assume that is a DCJ. Then, since does not linearize C, operates on four vertices that belong to . Since is a DCJ, . Hence, these four vertices belong to . Thus, if then .□
From Lemma 5, the following corollary follows immediately.
Corollary 1
Let be a DCJ-Indel scenario, where linearizes a circular chromosome C. Let and be vertices from such that and . Let and be the meta-chromosomes of C in and , respectively. If , then .
Independent adjacent events
In this section, we address the case , ie, and are independent events. It is easy to see that the order of any two adjacent independent events in a DCJ-Indel scenario can be changed without affecting the starting and ending genomes.15
Lemma 6
Let be a DCJ-Indel scenario that linearizes a circular chromosome C, where and are independent events. If linearizes C within t, then also linearizes C within the DCJ-Indel scenario .
Proof
Let and be the meta-chromosomes of C in and , respectively. Since linearizes C within t, by Theorem 5, is a DCJ. If removes two edges in , say and , then since and are independent, the edges and are present in . By Corollary 1, we have and . By Theorem 5, linearizes C within the DCJ-Indel scenario . If removes a single edge, the proof is similar.□
DCJ depends on a deletion
In this section, we consider case , ie, a DCJ depends on a deletion . For such pair of events the following lemma holds.
Lemma 7
Let be a DCJ-Indel scenario that linearizes a circular chromosome C, where DCJ depends on deletion . If linearizes C, then there exists a DCJ-Indel scenario , where linearizes C and is a deletion.
Proof
Let and be the meta-chromosomes of C in and , respectively. Let be the path in that is replaced with in by .
Suppose that removes two edges. Since depends on , we can assume that removes edges , in and creates , in (Figure 4A, B, and D). By Theorem 5, without loss of generality, we assume that and . We define as the DCJ that removes edges , in , and creates , in , where is the genome resulted from . Moreover, we define as the deletion that replaces a path in with an edge in (Figure 4A, C, and D). Since depends on , is present in . By Corollary 1, and . Thus, by Theorem 5, linearizes C within .
Illustration of Lemma 8. (A) Initial genome graph, where the dashed edges denote arbitrary gene sequences. is a circular chromosome linearized by within DCJ-Indel scenario , where is a deletion of gene sequence and is a DCJ. (B) The intermediate genome resulted from deletion . (C) The intermediate genome resulted from DCJ . (D) The graph resulted from the equivalent pair of DCJ-Indel scenarios and , where is linearized by DCJs and , and are deletions.
Suppose that removes a single edge a. Since depends on , we have . We define as the DCJ that removes a single edge and creates , , and as the deletion that replaces a path with an edge . The proof that linearizes C within is similar.□
DCJ depends on a DCJ
In this section, we address case , ie, DCJ depends on a DCJ . Let A be the set of edges created by , and B be the set of edges removed by . Since depends on , . We say that strongly depends on if and weakly depends on otherwise (such pairs of adjacent DCJs are also known as enchained15). In a genome graph, a pair of adjacent dependent DCJs replaces three edges with three other edges on the same six vertices (this operation is known as a 3-break2). It is easy to see that for a pair of weakly dependent DCJs, there exist equivalent pairs of weakly dependent DCJs.15 Then the following lemma holds.
Lemma 8
Let be a DCJ-Indel scenario that linearizes a circular chromosome C, where and are dependent DCJs. If linearizes C, then there exists a pair of DCJs equivalent to such that linearizes C within the DCJ-Indel scenario .
Proof
Let and be the meta-chromosomes of C in and , respectively. Let A be the set of edges created by , and B be the set of edges removed by . We consider two cases depending on whether strongly depends or weakly depends on .
First, assume that strongly depends on (ie, ). If , then let and be the edges removed by in . By Theorem 5, without loss of generality, we assume that and . Since and are created by , the edges and , or and are present in . In both cases, we have a contradiction to Corollary 1. If , the proof is similar.
For the rest of the proof, we assume that weakly depends on (ie, and ). We consider two cases depending on the number of edges removed by .
If removes two edges in , let and be these edges, and and be the edges created by in . By Theorem 5, without loss of generality, we assume that and . Since weakly depends on , either or is created by (Figure 5A, B, and E). We consider these two subcases below.
Illustration of Lemma 7. (A) Initial genome graph, where the dashed edges denote arbitrary gene sequences. The dashed edge and black undirected edge between and form a circular chromosome C that is linearized within by . (B-D) The intermediate genomes after first DCJs in the three equivalent pairs of weakly dependent DCJs. (E) The resulting genome graph after the equivalent pairs of DCJs, where C is linearized by DCJs and either or (depending on the belonging other chromosomes to meta-chromosome corresponding to C).
Suppose that is created by . If creates a single edge, then removes edges and in , a contradiction to Corollary 1. Thus, we assume that removes two edges, say and . By Corollary 1, both and belong to . Since and , by Corollary 1, . We define as a DCJ that removes and in and creates and in . We further define as a DCJ that removes and in (Figure 5A, C, D, and E) and creates and in . Then by Theorem 5, linearizes C within .
Suppose that is created by . Let us first assume that removes two edges and . Since , by Corollary 1, and do not belong to . Moreover, since and , we have . We define as the DCJ that removes and in and creates and in . We define as the DCJ that removes and in and creates and in (Figure 5). By Theorem 5, linearizes C within . If removes a single edge, then the proof is similar.
If removes a single edge in , then by Theorem 5, . Since creates , it removes two edges. We assume that these edges are and . By Corollary 1, and belong to . We define as a DCJ that removes a single edge in , say , and creates two edges and in . We define as a DCJ that removes and in and creates and in . By Theorem 5, linearizes C within . It is easy to see that by construction, in all cases, is equivalent to , which completes the proof.□
DCJ depends on an insertion
In this section, we consider case , ie, a DCJ depends on an insertion . We say that strongly depends on if removes two edges created by . If removes one edge created by , we say that weakly depends on . In contrast to cases and , when weakly depends on , there may not always exist an equivalent pair , where is a DCJ and is an insertion.
To better capture and analyze the combinatorial structure of events in a DCJ-Indel scenario t, we construct the dependency graph16 (also known as overlap graph17,18), whose vertices are labeled with events from t and there is an arc whenever an event depends on an event . We remark that a DCJ can weakly depend on at most two insertions in a DCJ-Indel scenario. The following definition describes DCJs in t for which the pair of adjacent events does not have an equivalent pair , where are DCJs and are an insertion.
Definition 3
A DCJ in a DCJ-Indel scenario t is called upper-movable if the following property holds:
If there exists exactly one insertion in t such that there is a path from α to β in , say , then removes either the first or the last edge of the path inserted by .
First, we consider the case when a DCJ depends on two insertions (Figure 6). Second, we address the case when a DCJ is upper-movable and depends on only one insertion (Figure 7). Finally, we consider the case when a DCJ is not upper-movable.
Illustration of Lemma 9. (A) Initial genome graph, where the dashed edges denote arbitrary gene sequences. is a circular chromosome linearized by within DCJ-Indel scenario , where is a DCJ and are insertions of gene sequences and . (B) The intermediate genomes before and after an insertion . (C) The intermediate genomes before and after an insertion . (D) The resulting graph after the equivalent pair of DCJ-Indel scenarios and , where is linearized by DCJs and , and are insertions.
Illustration of Lemma 10. (A) Initial genome graph, where the dashed edges denote arbitrary gene sequences. is a circular chromosome linearized by within DCJ-Indel scenario , where is an insertion of gene sequences and is a DCJ. (B) The intermediate genome after an insertion . (C) The intermediate genome after a DCJ . (D) The resulting graph after the equivalent pair of DCJ-Indel scenarios and , where is linearized by DCJs and , and are insertions.
Lemma 9
Let be a DCJ-Indel scenario that linearizes a circular chromosome C, where DCJ weakly depends on insertions and . If linearizes C, then there exists a DCJ-Indel scenario , where linearizes C and , are insertions.
Proof
Let , , and be the meta-chromosomes of C in , , and , respectively. Let and be paths inserted by and , respectively. Since DCJ weakly depends on insertions and , without loss of generality, we assume that removes and , and creates and for and (Figure 6A, B, and D). By Theorem 5, we have and . Then, all edges in belong to and all edges in do not belong to . If depends on (ie, and for some ), then all edges in and belong to , a contradiction. Thus, and are independent events. We define as a DCJ that removes and in and creates and in . We define and as insertions that replace in with a path in and in with a path in , respectively (Figure 6A, C, and D). By Corollary 1, and and, moreover, and . By Theorem 5, linearizes C within a DCJ-Indel scenario .□
Lemma 10
Let be a DCJ-Indel scenario that linearizes a circular chromosome C, where DCJ depends on insertion , and there is no such that is insertion connected by a path to in . If is upper-movable and linearizes C, then there exists a DCJ-Indel scenario , where linearizes C and is an insertion.
Proof
Let and be the meta-chromosomes of C in and , respectively. Let be a path inserted by .
Assume that strongly depends on . Let and be the edges removed by in . Then and are inserted by , and thus belong to the same chromosome. Then by Theorem 5, cannot linearize C, a contradiction.
For the rest of the proof, we assume that weakly depends on . Since there is no insertion connected by a path to in , is the only insertion in t that has a path to in . Since is upper-movable, removes or . We consider two cases depending on the number of edges removed by .
If removes two edges, then without loss of generality, we assume that removes and in and creates and in (Figure 7A, B, and D). Let us define as a DCJ that removes and in and creates and in . We define as an insertion that replaces the edge in with a path in (Figure 7A, C, and D). By Theorem 5, without loss of generality, and . By Corollary 1, and . Thus, by Theorem 5, linearizes C within .
If removes a single edge, the proof is similar.□
Lemma 11
Let be a DCJ-Indel scenario that linearizes a circular chromosome C, where DCJ weakly depends on insertion . If linearizes C within t and is not upper-movable, then there exists a DCJ-Indel scenario , where linearizes C and are insertions.
Proof
Let and be the meta-chromosomes of C in and , respectively. Let be a path inserted by . Since weakly depends on and is not upper-movable, breaks into two non-trivial subpaths. We consider two cases depending on the number of edges removed by DCJ .
If removes two edges, then without loss of generality, we assume that removes edges () and in and creates edges and in . By Theorem 5, we can assume that and . Note that and are present in . By Corollary 1, and . We define as a DCJ that removes and in and creates and in . We define and as insertions that replace edges in and in with paths in and in , respectively. By Theorem 5, linearizes C within .
If removes a single edge, the proof is similar.□
Proof of Theorems 1 and 2
We remark that for each pair of adjacent events , there is an equivalent pair of adjacent events , where are insertions and have the same type. Below we prove Theorem 6, which will imply Theorems 1 and 2.
Theorem 6
Let be a DCJ-Indel scenario that linearizes a circular chromosome C. Then there exists a DCJ-Indel scenario such that r is a DCJ linearizing C, and if C is linearized by an upper-movable DCJ within t, then , otherwise .
Proof
We prove the theorem statement by induction on . If , then by Lemma 4 and Theorem 5, the statement trivially holds.
For an integer , we assume that the theorem holds for all . Suppose that t has length n, ie, t has the form . We consider two cases depending on whether linearizes C within t.
Case 1. does not linearize C within t. By Lemma 4, there exists an event for that linearizes C within t. By induction, we obtain a DCJ-Indel scenario , where r linearizes C and . We let . It is clear that if is upper-movable, and otherwise.
Case 2. linearizes C within t. By Theorem 5, is a DCJ. We consider two cases depending on whether depends on . If does not depend on , then, by Lemma 6, we obtain a DCJ-Indel scenario , where and , and linearizes C. If depends on and is a DCJ or a deletion, then by Lemma 7 or 8, we obtain a DCJ-Indel scenario , where linearizes C. In both cases, applying the induction to , we obtain a DCJ-Indel scenario , where r linearizes C and . Now, we let . It is clear that if is upper-movable, and otherwise.
It remains to consider the case when DCJ depends on , is an insertion, which we split into two subcases depending on whether is upper-movable.
Case 2.1. is upper-movable. Here we consider two cases depending on whether there exists any insertion other than that is connected by a path to in .
Case 2.1.1. is a single insertion such that there is a path to in . By Lemma 10, we obtain a DCJ-Indel scenario , where linearizes C. Since is upper-movable in , by induction, we obtain a DCJ-Indel scenario , where r linearizes C and . We let to complete the proof.
Case 2.1.2. There exists an insertion with connected by a path to in . We consider two subcases depending on whether . If , then by Lemma 9, we obtain a DCJ-Indel scenario , where linearizes C and . Since is upper-movable in , by induction, we obtain a DCJ-Indel scenario , where r linearizes C and . We let to complete the proof. If , then we replace the pair of adjacent events in t with an equivalent pair of adjacent events , where is an insertion and has the same type as , resulting in . By Lemmas 6 to 8 for the pair of adjacent events (depending on the type of and dependency with ), we obtain a DCJ-Indel scenario , where linearizes C and . Since is upper-movable in , by induction, we obtain a DCJ-Indel scenario , where r linearizes C and . We let to complete the proof.
Case 2.2. is not upper-movable. By Lemma 11, we obtain a DCJ-Indel scenario , where linearizes C and . Since there is no insertion in the DCJ-Indel scenario connected by a path to in , is upper-movable in . By induction, we obtain a DCJ-Indel scenario , where r linearizes C and . We let to complete the proof.
Theorems 1 and 2 immediately follow from Theorem 6.
Discussion
For three given linear genomes and their DCJ median genome M (which may contain circular chromosomes), we described an algorithm that constructs a linear genome such that the approximation error of (ie, the difference in the DCJ median scores of and M) is bounded by twice the number of circular chromosomes in M.
We claim (and will prove elsewhere) that the bound in Theorem 3 is tight. We illustrate this claim with Figure 8, where each of the linear genomes , , and can be obtained from the genome M by an insertion followed by a fission. Note that all the pairwise DCJ distances between equal 4. We claim that the DCJ-Indel median score of M is 6, while any linearization of M has the DCJ-Indel median score at least 8, implying that the bound in Theorem 3 is tight.
A circular median genome M on genes of three unichromosomal linear genomes , , and on genes , , and , respectively, with specified pairwise DCJ-Indel distances, where are inserted genes.
At the same time, it was previously observed by Xu8 on simulated data that the number of circular chromosomes produced by their GMP solver is typically very small, implying negligible approximation error for our algorithm.
The proposed algorithm is implemented in the AGRP solver MGRA2.19
Footnotes
Acknowledgements
The authors thank the anonymous reviewers for their thoughtful comments regarding the earlier version of this paper. Some preliminary results of the present work appeared in the Proceedings of the 14th Workshop on Algorithms in Bioinformatics (WABI 2014).
Funding:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Science Foundation under the grant no. IIS-1462107.
Declaration of Conflicting Iinterests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions
All authors participated in the manuscript preparation.
ORCID iD
Max A Alekseyev
References
1.
YancopoulosSAttieOFriedbergR.Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics. 2005;21:3340–3346. doi:10.1093/bioinformatics/bti535.
2.
AlekseyevMAPevznerPA.Multi-break rearrangements and chromosomal evolution. Theor Comput Sci. 2008;395:193–202. doi:10.1016/j.tcs.2008.01.013.
3.
BragaMDVWillingEStoyeJ. Genomic distance with DCJ and indels. In: MoultonVSinghM, eds. Algorithms in Bioinformatics. Vol 6293. Berlin, Germany: Springer; 2010:90–101. doi:10.1007/978-3-642-15294-8_8.
CapraraA.The reversal median problem. INFORMS J Comput. 2003;15:93–113.
6.
TannierEZhengCSankoffD.Multichromosomal median and halving problems under different genomic distances. BMC Bioinformatics. 2009;10:120.
7.
XuAW.A fast and exact algorithm for the median of three problem: a graph decomposition approach. J Comput Biol. 2009;16:1369–1381.
8.
XuAW. DCJ median problems on linear multichromosomal genomes: graph representation and fast exact solutions. In: CiccarelliFDMiklosI, eds. Comparative Genomics. Vol 5817. Berlin, Germany: Springer; 2009:70–83. doi:10.1007/978-3-642-04744-27.
9.
ZhangMArndtWTangJ.An exact solver for the DCJ median problem. Paper presented at: Pacific Symposium on Biocomputing; November 5-9, 2009; Big Island, HI:138–149. Singapore: World Scientific.
10.
MaňuchJPattersonMWittlerRet al. Linearization of ancestral multichromosomal genomes. BMC Bioinformatics. 2012;13:S11.
11.
MaJZhangLSuhBBet al. Reconstructing contiguous regions of an ancestral genome. Genome Res. 2006;16:1557–1565.
12.
MaJRatanARaneyBJet al. DUPCAR: reconstructing contiguous ancestral regions with duplications. J Comput Biol. 2008;15:1007–1027.
13.
MuffatoMLouisAPoisnelCEet al. Genomicus: a database and a browser to study gene synteny in modern and ancestral genomes. Bioinformatics. 2010;26:1119–1121.
14.
AvdeyevPAlexeevNRongYet al. A unified ILP framework for genome median, halving, and aliquoting problems under DCJ. Paper presented at: Proceedings of the 15th Annual Research in Computational Molecular Biology Satellite Workshop on Comparative Genomics (RECOMB-CG); October 4-6, 2017; Barcelona, Spain. Vol 10562:156–178. Berlin, Germany: Springer. doi:10.1007/978-3-319-67979-29.
15.
BragaMDStoyeJ.The solution space of sorting by DCJ. J Comput Biol. 2010;17:1145–1165.
16.
AvdeyevPJiangSAlekseyevMA.Implicit transpositions in DCJ scenarios. Front Genet. 2017;8:212. doi:10.3389/fgene.2017.00212.
17.
Ozery-FlatoMShamirR.Sorting by translocations via reversals theory. Paper presented at: Proceedings of the 4th RECOMB International Workshop on Comparative Genomics (RECOMB-CG); September 24-26, 2006; Montreal, QC, Canada. Vol 4205:87–98. Berlin, Germany: Springer. doi:10.1007/118641278.
18.
OuangraouaABergeronA.Combinatorial structure of genome rearrangements scenarios. J Comput Biol. 2010;17:1129–1144.
19.
AvdeyevPJiangSAganezovSet al. Reconstruction of ancestral genomes in presence of gene gain and loss. J Comput Biol. 2016;23:150–164. doi:10.1089/cmb.2015.0160.