
Abstract

Reconstruction of the median genome consisting of linear chromosomes from three given genomes is known to be intractable. There exist efficient methods for solving a relaxed version of this problem, where the median genome is allowed to have circular chromosomes. We propose a method for construction of an approximate solution to the original problem from a solution to the relaxed problem and prove a bound on its approximation error. Our method also provides insights into the combinatorial structure of genome transformations with respect to appearance of circular chromosomes.

Keywords

Double cut and join, indels, genome median problem, circular chromosome

Introduction

In the course of evolution, genomes are subject to a number of large-scale evolutionary events such as genome rearrangements that shuffle genomic architectures, and gene insertions and deletions (indels) that insert or remove continuous intervals of genes. Since these evolutionary events are rare, the number of them between two genomes is used in phylogenomic studies to measure the evolutionary distance between them. Such measurement is often based on the maximum parsimony assumption, implying that the evolutionary distance can be estimated as the minimum number of events between genomes. A convenient model for the most common genome rearrangements is given by the double-cut-and-join (DCJ) operations,¹ also known as 2-breaks,² which make two “cuts” in a genome and “glue” the resulting genomic fragments in a new order. Namely, DCJs mimic reversals (that invert contiguous segments of chromosomes), translocations (that exchange tails of two chromosomes), and fissions/fusions (that split/join chromosomes), while indels can be modeled by DCJs on certain artificial circular chromosomes called prosthetic.^3,4

The maximum parsimony assumption enables addressing the ancestral genome reconstruction problem, which asks to reconstruct ancestral genomes from given extant genomes, by minimizing the total distance between genomes along the branches of the phylogenetic tree. The basic case of this problem with just three given genomes is known as the genome median problem (GMP), which asks for a single ancestral genome (median genome) at the minimum total distance from the given genomes.

The GMP is NP-hard under a number of models of genome rearrangements, such as reversals-only⁵ and DCJ.⁶ While these problems can be posed for both circular genomes (consisting of circular chromosomes) and linear genomes (consisting of linear chromosomes), the DCJ model allows appearance of circular chromosomes in transformations between linear genomes. Correspondingly, a solution to the GMP under the DCJ model may contain circular chromosomes even if the given genomes are linear. Since appearance of circular chromosomes in the reconstructed ancestral genomes of extant linear genomes represents an artifact and inadequately describes the biological reality, it is important to distinguish between the GMP and the linear genome median problem (L-GMP), where the latter is restricted to linear genomes only.

To the best of our knowledge, there exist no solvers for the L-GMP, while there are some advanced GMP solvers,^7–9 which allow the median genome to contain circular chromosomes. This deficiency inspired us to pose the problem of using a solution to the GMP to obtain a linear genome approximating the solution to the L-GMP. In this study, we propose an algorithm that linearizes chromosomes of a given GMP solution in a certain optimal way as described in the “Background” section. Our approach also provides insights into the combinatorial structure of genome transformations by DCJs and indels with respect to appearance of circular chromosomes. We remark that a similar linearization problem appears in adjacency-based reconstructions of median genomes and is known to be intractable,¹⁰ forcing the existing approaches^10–14 to solve its relaxation and allowing the constructed median genomes to contain circular chromosomes.

The article is organized as follows. In the “Background” section, we describe the graph-theoretical representation of genomes, DCJs, and indels. In the “Results” section, we formulate the main theorems providing an approximate solution for the L-GMP. In the “Methods” section, we develop the necessary machinery and prove our main theorems. We conclude the article with the “Discussion” section.

Background

DCJ-Indel distance and genome graphs

In this study, we focus on genomes with no duplicated genes. Let P be a genome, which may contain both circular and linear chromosomes. We represent a circular chromosome consisting of n genes as a graph cycle with n directed gene edges encoding genes and their strands, which alternate with n undirected edges connecting the extremities of adjacent genes. Similarly, we represent a linear chromosome consisting of n genes as a path with n directed gene edges alternating with $n + 1$ undirected edges, where $n - 1$ undirected edges connect extremities of adjacent genes, and two more undirected edges connect each endpoint extremity to its own special vertex labeled $\infty$ corresponding to a telomere (Figure 1A). We label each gene edge with the corresponding gene x and further label its tail and head endpoints with $x^{t}$ and $x^{h}$, respectively (Figure 1A). We define the operation $\bar{\cdot}$ as $\bar{x^{t}} = x^{h}$ and $\bar{x^{h}} = x^{t}$. A collection of cycles and paths representing the chromosomes of P forms the genome graph $S (P)$. The undirected edges in $S (P)$ are called P-edges. We also write $S (P)$ for the gene content of P (ie, the set of genes present in P), with the intended meaning clear from context, and denote by $V (P)$ the set of regular (non-$\infty$) vertices of the genome graph.
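The graph representation above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the names `INF`, `extremities`, and `p_edges`, the encoding of a chromosome as a pair `(genes, is_circular)`, and the vertex labels such as `"2t"` and `"2h"` are hypothetical choices, not part of the original text. Gene edges are left implicit, and only the undirected P-edges are stored.

```python
INF = "inf"  # assumed placeholder for a telomere vertex labeled with infinity


def extremities(gene):
    """Return the (entry, exit) vertices of a signed gene read left to right."""
    name = str(abs(gene))
    tail, head = name + "t", name + "h"
    return (tail, head) if gene > 0 else (head, tail)


def p_edges(genome):
    """Return the set of undirected P-edges of a genome.

    `genome` is a list of (genes, is_circular) pairs, where `genes` is a
    sequence of nonzero signed integers.
    """
    edges = set()
    for genes, is_circular in genome:
        ends = [extremities(g) for g in genes]
        # n - 1 edges between extremities of adjacent genes
        for (_, left_exit), (right_entry, _) in zip(ends, ends[1:]):
            edges.add(frozenset((left_exit, right_entry)))
        if is_circular:
            # one more edge closes the cycle (n edges in total)
            edges.add(frozenset((ends[-1][1], ends[0][0])))
        else:
            # two telomeric edges (n + 1 edges in total)
            edges.add(frozenset((INF, ends[0][0])))
            edges.add(frozenset((ends[-1][1], INF)))
    return edges


linear_edges = p_edges([((2, 3, 4, 1), False)])
circular_edges = p_edges([((3, 4), True)])
```

Here `linear_edges` corresponds to the linear genome of Figure 1A and contains five P-edges, while `circular_edges` contains the two P-edges of a circular chromosome on genes 3 and 4.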

Figure 1.

(A) A genome graph for a linear genome $(+ 2 + 3 + 4 + 1)$. (B) A genome graph for a genome $(+ 2 + 1) \{+ 3 + 4\}$ consisting of circular and linear chromosomes, obtained by a DCJ that splits the linear chromosome into two chromosomes. (C) A genome graph for a genome $(+ 2 + 5 + 6 + 3 + 4 + 1)$, obtained by an insertion of gene sequence $+ 5, + 6$. Dotted directed edges correspond to inserted genes.

A DCJ transforming a genome P into a genome $P'$ corresponds to one of the following operations transforming $S (P)$ into $S (P')$ (Figure 1A and B):

$\{x, y\}, \{u, v\} \to \{x, u\}, \{y, v\}$ (internal reversals, translocations)

$\{x, y\}, \{u, \infty\} \to \{x, u\}, \{y, \infty\}$ (reversals at chromosome ends, translocations involving a whole chromosome)

$\{x, \infty\}, \{y, \infty\} \to \{x, y\}$ (fusions)

$\{x, y\} \to \{x, \infty\}, \{y, \infty\}$ (fissions)

where x, y, u, and v are distinct vertices from $V (P)$ .
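The four operations can be illustrated as edits of a set of P-edges. The sketch below is a hypothetical helper rather than the authors' implementation: the function name `dcj`, the string `INF`, and the representation of edges as `frozenset` pairs are assumptions made for illustration. It checks that the regular extremities being glued are exactly the ones that were cut, which all four operation types satisfy.

```python
INF = "inf"  # assumed placeholder for a telomere vertex


def dcj(edges, removed, created):
    """Return the edge set obtained by cutting `removed` and gluing `created`.

    `edges` is a set of frozensets of two vertices; `removed` and `created`
    are iterables of vertex pairs.
    """
    removed = {frozenset(e) for e in removed}
    created = {frozenset(e) for e in created}
    if not removed <= edges:
        raise ValueError("every removed edge must be present")
    if not (1 <= len(removed) <= 2 and 1 <= len(created) <= 2):
        raise ValueError("a DCJ cuts and glues at most two edges")

    def regular(es):
        return sorted(v for e in es for v in e if v != INF)

    if regular(removed) != regular(created):
        raise ValueError("the glued extremities must be the cut ones")
    return (edges - removed) | created


# P-edges of the linear genome (+2 +3 +4 +1) of Figure 1A
before = {frozenset(e) for e in [
    (INF, "2t"), ("2h", "3t"), ("3h", "4t"), ("4h", "1t"), ("1h", INF)]}

# Operation of the first type: {x, y}, {u, v} -> {x, u}, {y, v}
after = dcj(before,
            removed=[("2h", "3t"), ("4h", "1t")],
            created=[("2h", "1t"), ("3t", "4h")])

# Operation of the fourth type (fission): {x, y} -> {x, inf}, {y, inf}
split = dcj(before, removed=[("2h", "3t")],
            created=[("2h", INF), ("3t", INF)])
```

In this example, `after` consists of the P-edges of a linear chromosome on genes 2 and 1 together with a circular chromosome on genes 3 and 4, matching the transformation from Figure 1A to Figure 1B.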

A DCJ scenario between genomes P and Q with equal gene content (ie, $S (P) = S (Q)$ ) is a sequence of DCJs transforming P into Q. We define the DCJ distance $d_{DCJ} (P, Q)$ between genomes P and Q as the length of a shortest DCJ scenario between them.

To transform a genome P into a genome Q with unequal gene content, one needs to consider gene insertion and deletion operations (indels) in addition to DCJs. An insertion transforming a genome P into a genome $P'$ corresponds to one of the following operations transforming $S (P)$ into $S (P')$ (Figure 1A and C):

replace a P-edge $\{x, y\}$ with a path $(x, u_{1}, {\bar{u}}_{1}, u_{2}, \dots, u_{l}, {\bar{u}}_{l}, y)$ (including the case of either $x = \infty$ or $y = \infty$),

add a path $(\infty, u_{1}, {\bar{u}}_{1}, u_{2}, \dots, u_{l}, {\bar{u}}_{l}, \infty)$,

add a cycle $(u_{1}, {\bar{u}}_{1}, u_{2}, \dots, u_{l}, {\bar{u}}_{l}, u_{1})$,

where the edges alternate between $P'$-edges $\{{\bar{u}}_{i}, u_{i + 1}\}$ and gene edges $(u_{i}, {\bar{u}}_{i})$ with $u_{i} \notin S (P)$, resulting in $S (P') = S (P) \cup \{u_{1}, \dots, u_{l}\}$.

A deletion can be viewed as an event reversing an insertion. A deletion transforming a genome P into a genome $P'$ corresponds to one of the following operations transforming $S (P)$ into $S (P')$ (Figure 1A and C):

replace a path $(x, u_{1}, {\bar{u}}_{1}, u_{2}, \dots, u_{l}, {\bar{u}}_{l}, y)$ with a $P'$-edge $\{x, y\}$ (including the case of either $x = \infty$ or $y = \infty$),

remove a path $(\infty, u_{1}, {\bar{u}}_{1}, u_{2}, \dots, u_{l}, {\bar{u}}_{l}, \infty)$,

remove a cycle $(u_{1}, {\bar{u}}_{1}, u_{2}, \dots, u_{l}, {\bar{u}}_{l}, u_{1})$,

where the edges alternate between P-edges $\{{\bar{u}}_{i}, u_{i + 1}\}$ and gene edges $(u_{i}, {\bar{u}}_{i})$, resulting in $S (P') = S (P) \setminus \{u_{1}, \dots, u_{l}\}$.
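The first kind of insertion and deletion (replacing a P-edge with a path, and vice versa) can be sketched on edge sets as follows. The helper names `extremities`, `path_edges`, `insert_genes`, and `delete_genes` are hypothetical, gene edges are kept implicit, and the other two kinds of indels (adding or removing a whole linear or circular chromosome) are deliberately not covered by this sketch.

```python
INF = "inf"  # assumed placeholder for a telomere vertex


def extremities(gene):
    """Return the (entry, exit) vertices of a signed gene read left to right."""
    name = str(abs(gene))
    tail, head = name + "t", name + "h"
    return (tail, head) if gene > 0 else (head, tail)


def path_edges(x, y, genes):
    """Undirected edges of the path (x, u_1, u_1-bar, ..., u_l, u_l-bar, y)."""
    chain = [x]
    for g in genes:
        chain.extend(extremities(g))
    chain.append(y)
    # pairs at positions (0, 1), (2, 3), ... are the undirected edges;
    # the pairs in between are the (implicit) gene edges
    return {frozenset(chain[i:i + 2]) for i in range(0, len(chain), 2)}


def insert_genes(edges, x, y, genes):
    """Replace the P-edge {x, y} by a path through the new genes."""
    old = frozenset((x, y))
    if old not in edges:
        raise ValueError("the edge to be replaced is not present")
    return (edges - {old}) | path_edges(x, y, genes)


def delete_genes(edges, x, y, genes):
    """Replace the path through `genes` by the single edge {x, y}."""
    path = path_edges(x, y, genes)
    if not path <= edges:
        raise ValueError("the path to be deleted is not present")
    return (edges - path) | {frozenset((x, y))}


# Figure 1A -> Figure 1C: insert +5, +6 between genes 2 and 3
before = {frozenset(e) for e in [
    (INF, "2t"), ("2h", "3t"), ("3h", "4t"), ("4h", "1t"), ("1h", INF)]}
after = insert_genes(before, "2h", "3t", [5, 6])
restored = delete_genes(after, "2h", "3t", [5, 6])
```

Applying `delete_genes` with the same arguments undoes the insertion, which mirrors the statement that a deletion can be viewed as an event reversing an insertion.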

A DCJ-Indel scenario t between genomes P and Q is a sequence of DCJs and indels transforming P into Q, where deletions delete genes from $S (P) \setminus S (Q)$ and insertions insert genes from $S (Q) \setminus S (P)$ (ie, no gene can be inserted and then deleted, or deleted and then inserted), denoted as $t : P \to Q$. We also find it convenient to represent t as $P = P_{0} \overset{ϑ_{1}}{\to} P_{1} \overset{ϑ_{2}}{\to} \dots \to P_{n - 1} \overset{ϑ_{n}}{\to} P_{n} = Q$, where each $ϑ_{i}$ is a DCJ or an indel. We define the DCJ-Indel distance $d_{DI} (P, Q)$ as the length of a shortest DCJ-Indel scenario transforming genome P into genome Q. It is easy to see that any DCJ-Indel scenario transforming P into Q can be reversed (turning each insertion into a deletion, and vice versa) to obtain a DCJ-Indel scenario transforming Q into P, implying that $d_{DI} (P, Q) = d_{DI} (Q, P)$.

A circular chromosome C in P is a singleton with respect to genome Q if it is composed of genes absent in Q, ie, $S (C) \cap S (Q) = \emptyset$. Let $sng_{Q} (P)$ be the number of singletons in P with respect to Q. The total number of singletons in P and Q with respect to each other is $sng (P, Q) := sng_{P} (Q) + sng_{Q} (P)$. The following lemma describes an important property of singletons.

Lemma 1 (Compeau⁴)

For given genomes P and Q, let C be a singleton in P with respect to Q and $P^{0}$ be the genome obtained from P by removing C. Then $d_{DI} (P^{0}, Q) = d_{DI} (P, Q) - 1$ .

From Lemma 1, the DCJ-Indel distance between two genomes can be computed with the following formula:⁴

d_{DI} (P, Q) = sng (P, Q) + d_{DI} ({}^{Q}P, {}^{P}Q) ,

(1)

where ${}^{Q}P$ and ${}^{P}Q$ are obtained from P and Q by removing all singletons (ie, $sng ({}^{Q}P, {}^{P}Q) = 0$ ). We need the following lemma.
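The bookkeeping behind formula (1) can be sketched as follows. The sketch only counts and removes singletons; it does not compute the remaining distance term. The function names and the encoding of a chromosome as a pair `(genes, is_circular)` are assumptions made for illustration.

```python
def gene_content(chromosomes):
    """Set of (unsigned) genes present in the given chromosomes."""
    return {abs(g) for genes, _ in chromosomes for g in genes}


def singletons(P, Q):
    """Circular chromosomes of P composed only of genes absent in Q."""
    q_genes = gene_content(Q)
    return [c for c in P if c[1] and not (gene_content([c]) & q_genes)]


def sng(P, Q):
    """Total number of singletons in P and Q with respect to each other."""
    return len(singletons(Q, P)) + len(singletons(P, Q))


def without_singletons(P, Q):
    """Genome P with all its singletons (with respect to Q) removed."""
    drop = singletons(P, Q)
    return [c for c in P if c not in drop]


# Each chromosome is (genes, is_circular); the example genomes are made up.
P = [((1, 2), False), ((5, 6), True), ((3,), True)]
Q = [((1, -2, 3), False), ((7,), True)]

total = sng(P, Q)
P_reduced = without_singletons(P, Q)
Q_reduced = without_singletons(Q, P)
```

In this example the circular chromosome on genes 5 and 6 is a singleton in P and the circular chromosome on gene 7 is a singleton in Q, so formula (1) reads $d_{DI} (P, Q) = 2 + d_{DI} ({}^{Q}P, {}^{P}Q)$ with the reduced genomes `P_reduced` and `Q_reduced`.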

Lemma 2 (Compeau⁴)

Let $t : P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n}}{\to} P_{n}$ be a shortest DCJ-Indel scenario. Let C be a singleton in $P_{0}$ with respect to $P_{n}$. Then for any $i \in \{0, \dots, n\}$ and any chromosome D in $P_{i}$ such that $S (C) \cap S (D) \neq \emptyset$, we have $S (D) \cap S (P_{n}) = \emptyset$.

Genome median problem

We pose the GMP under the DCJ-Indel model as follows.

Genome median problem (GMP)

Given genomes $B_{1}$, $B_{2}$, and $B_{3}$, find a genome M with $S (M) \subseteq S (B_{1}) \cup S (B_{2}) \cup S (B_{3})$ that minimizes the DCJ-Indel median score:

ms_{DI} (M, B_{1}, B_{2}, B_{3}) := \sum_{i = 1}^{3} d_{DI} (B_{i}, M) .

Since the GMP is posed under the DCJ-Indel model, a median genome for given linear genomes may contain circular chromosomes. To address the issue of circular chromosome presence, we pose the following problem.

Linear genome median problem (L-GMP)

For given linear genomes $B_{1}$ , $B_{2}$ , and $B_{3}$ , find a linear genome M with $S (M) \subseteq S (B_{1}) \cup S (B_{2}) \cup S (B_{3})$ minimizing the DCJ-Indel median score $m s_{DI} (M, B_{1}, B_{2}, B_{3})$ .

While we are not aware of efficient algorithms (let alone software solvers) for solving the L-GMP, we pose the problem of constructing an approximate solution to the L-GMP from a given solution to the GMP.

Results

Chromosome linearization

Let $t : P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n}}{\to} P_{n}$ be a DCJ-Indel scenario and C be a circular chromosome in $P_{0}$. For each $i \in \{0, 1, \dots, n\}$, let $C_{i} = \{C_{i}^{1}, \dots, C_{i}^{m_{i}}\}$ be the collection of all circular chromosomes in $P_{i}$ such that $S (C_{i}^{l}) \cap S (C) \neq \emptyset$ ($l \in \{1, \dots, m_{i}\}$). We call $C_{i}$ a meta-chromosome of C in $P_{i}$ and note that $C_{i}$ itself may be viewed as a genome, for which $S (C_{i})$ and $V (C_{i})$ are defined. In particular, we have $S (C_{i}) = \bigcup_{l = 1}^{m_{i}} S (C_{i}^{l})$. Below, we describe an important property of circular chromosomes appearing in DCJ-Indel scenarios (Figure 3).
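As a small illustration of the definition, the meta-chromosome of C in an intermediate genome can be computed by filtering circular chromosomes that share a gene with C. The names `gene_content` and `meta_chromosome`, the encoding of chromosomes as `(genes, is_circular)` pairs, and the example genome are hypothetical.

```python
def gene_content(chromosomes):
    """Set of (unsigned) genes present in the given chromosomes."""
    return {abs(g) for genes, _ in chromosomes for g in genes}


def meta_chromosome(C, genome):
    """All circular chromosomes of `genome` sharing at least one gene with C."""
    c_genes = gene_content([C])
    return [D for D in genome if D[1] and gene_content([D]) & c_genes]


C = ((1, 2, 3), True)  # a circular chromosome of the starting genome
P_i = [((1, 4), True), ((-2,), True), ((3, 5), False), ((6, 7), True)]

meta = meta_chromosome(C, P_i)
meta_genes = gene_content(meta)
```

In this made-up intermediate genome, the meta-chromosome consists of the circular chromosomes on genes 1, 4 and on gene 2. The linear chromosome carrying gene 3 is excluded because it is not circular, and the circular chromosome on genes 6 and 7 is excluded because it shares no gene with C.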

Definition 1

A circular chromosome C is linearized within a DCJ-Indel scenario $t : P \to Q$ (or t linearizes C) if the following three conditions hold:

(E1) C is present in P;

(E2) $S (C) \cap S (Q) \neq \emptyset$ ;

(E3) $S (\mathcal{C}) \cap S (P) \neq S (C) \cap S (Q)$, where $\mathcal{C}$ is the meta-chromosome of C in Q.

Equivalently, a circular chromosome C of genome P is linearized within $t : P \to Q$ if there exists a gene in C that resides on a linear chromosome in Q, or together with a gene from another chromosome of P resides on a circular chromosome in Q.
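Definition 1 depends only on the two end genomes, so it can be checked directly. The sketch below is a hypothetical helper that reads the left-hand side of (E3) as the gene content of the meta-chromosome of C in Q, in agreement with equation (2) in the "Methods" section. Chromosomes are encoded as `(genes, is_circular)` pairs, and all names and example genomes are assumptions made for illustration.

```python
def gene_content(chromosomes):
    """Set of (unsigned) genes present in the given chromosomes."""
    return {abs(g) for genes, _ in chromosomes for g in genes}


def is_linearized(C, P, Q):
    """Check conditions (E1)-(E3) for a chromosome C and genomes P, Q."""
    c_genes = gene_content([C])
    if C not in P or not C[1]:            # (E1): C is a circular chromosome of P
        return False
    if not (c_genes & gene_content(Q)):   # (E2): some gene of C survives in Q
        return False
    # meta-chromosome of C in Q: circular chromosomes of Q sharing a gene with C
    meta = [D for D in Q if D[1] and gene_content([D]) & c_genes]
    # (E3): the two gene sets differ
    return gene_content(meta) & gene_content(P) != c_genes & gene_content(Q)


C = ((1, 2), True)
P = [C, ((3, 4), False)]

on_linear = [((3, 1, 2, 4), False)]      # genes of C moved to a linear chromosome
unchanged = [((1, 2), True), ((3, 4), False)]
merged = [((1, 2, 3, 4), True)]          # C merged with genes 3, 4 on a circle
grown = [((1, 2, 9), True), ((3, 4), False)]  # only an inserted gene joined C
deleted = [((3, 4), False)]              # C removed entirely
```

The first and third target genomes illustrate the two alternatives of the equivalent formulation above, while the remaining three are cases where C is not linearized.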

We extend Definition 1 to a particular event in a DCJ-Indel scenario as follows.

Definition 2

Let $t : P = P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n}}{\to} P_{n} = Q$ be a DCJ-Indel scenario that linearizes a circular chromosome C. We say that an event $ϑ_{i}$ linearizes C within t if C is linearized within $(ϑ_{1}, \dots, ϑ_{i})$ and C is not linearized within $(ϑ_{1}, \dots, ϑ_{k})$ for any $k < i$.

The following lemma shows that for given linear genomes, all circular chromosomes in their median genome are linearized within the corresponding DCJ-Indel scenarios.

Lemma 3

Let $B_{1}$, $B_{2}$, and $B_{3}$ be linear genomes, and M be a genome such that $S (M) \subseteq S (B_{1}) \cup S (B_{2}) \cup S (B_{3})$. Let $t_{i}$ be a shortest DCJ-Indel scenario between M and $B_{i}$ for $i \in \{1, 2, 3\}$. Then each circular chromosome in M is linearized in at least one of the DCJ-Indel scenarios $t_{1}, t_{2}, t_{3}$.

Proof

Assume that there is a circular chromosome C in M that is not linearized in any of $t_{1}, t_{2}, t_{3}$. Then at least one of the conditions (E2) and (E3) does not hold for each $Q \in \{B_{1}, B_{2}, B_{3}\}$. Since $S (M) \subseteq S (B_{1}) \cup S (B_{2}) \cup S (B_{3})$, for each circular chromosome $C'$ in M, we have $S (C') \subseteq S (B_{1}) \cup S (B_{2}) \cup S (B_{3})$. Hence, the condition (E2) must hold for at least one $Q \in \{B_{1}, B_{2}, B_{3}\}$. So, there is $l \in \{1, 2, 3\}$ such that (E2) holds but (E3) does not hold for the genome $B_{l}$ (ie, $S (\mathcal{C}) \cap S (M) = S (C) \cap S (B_{l})$), where $\mathcal{C}$ is the meta-chromosome of C in $B_{l}$. Since the right-hand side is nonempty by (E2), $\mathcal{C}$ is nonempty. In other words, there exist circular chromosomes $\mathcal{C} = \{C'_{1}, \dots, C'_{k}\}$ in the genome $B_{l}$, which contradicts its linearity.□

The following theorems represent a key to proving our main results on linearization of median genomes. Proofs of these theorems are rather technical and given in the “Methods” section.

Theorem 1

Let $t : P \to Q$ be a DCJ-Indel scenario that linearizes a circular chromosome C. Then there exists a DCJ-Indel scenario $\tilde{t} : P \overset{r}{\to} P' \overset{\tilde{t}'}{\to} Q$ such that r is a DCJ linearizing C within $\tilde{t}$ and $| \tilde{t}' | \leq | t |$ .

For DCJ scenarios, we have a somewhat stronger result.

Theorem 2

Let $t : P \to Q$ be a DCJ scenario that linearizes a circular chromosome C. Then there exists a DCJ scenario $\tilde{t} : P \overset{r}{\to} P' \overset{\tilde{t}'}{\to} Q$ such that r is a DCJ linearizing C within $\tilde{t}$ and $| \tilde{t}' | = | t | - 1$ .

Linearization of median genomes

For a genome P, let $cchr (P)$ be the number of circular chromosomes in genome P. Our main results on linearization of median genomes are given by the following theorems.

Theorem 3

Let $B_{1}$ , $B_{2}$ , and $B_{3}$ be linear genomes, and M be a given median genome. Then for any $n \leq cchr (M)$ , there exists a genome $\hat{M}$ such that $cchr (\hat{M}) = cchr (M) - n$ , $S (M) = S (\hat{M})$ , and

m s_{DI} (\hat{M}, B_{1}, B_{2}, B_{3}) - m s_{DI} (M, B_{1}, B_{2}, B_{3}) \leq 2 n .

Proof

We prove the theorem by induction on n. If $n = 0$ , the theorem trivially holds for $\hat{M} = M$ .

We assume that the theorem holds for some $n < cchr (M)$ and prove it for $n + 1$. By the induction hypothesis, there exists a genome $M'$ such that $cchr (M') = cchr (M) - n$, $S (M) = S (M')$, and $ms_{DI} (M', B_{1}, B_{2}, B_{3}) - ms_{DI} (M, B_{1}, B_{2}, B_{3}) \leq 2 n$. Let $C'$ be a circular chromosome in $M'$, which exists since $cchr (M') > 0$. Since $S (M') = S (M) \subseteq S (B_{1}) \cup S (B_{2}) \cup S (B_{3})$, we have $S (C') \subseteq S (B_{1}) \cup S (B_{2}) \cup S (B_{3})$. Let $t'_{i}$ be a shortest DCJ-Indel scenario between $M'$ and $B_{i}$ for $i \in \{1, 2, 3\}$ (Figure 2). By Lemma 3, at least one of the DCJ-Indel scenarios $t'_{1}$, $t'_{2}$, and $t'_{3}$ linearizes $C'$, say $t'_{1}$. By Theorem 1, we obtain a DCJ-Indel scenario ${\tilde{t}}_{1}$ of the form $M' \overset{ϑ}{\to} \hat{M} \overset{\tilde{t}'_{1}}{\to} B_{1}$ such that $ϑ$ is a DCJ that linearizes $C'$ within ${\tilde{t}}_{1}$ and $| \tilde{t}'_{1} | \leq | t'_{1} |$. Since $ϑ$ is a DCJ, we have $S (\hat{M}) = S (M')$, and $cchr (\hat{M}) = cchr (M') - 1$ (Figure 2). Clearly, $d_{DI} (\hat{M}, B_{1}) \leq | \tilde{t}'_{1} | \leq | t'_{1} | = d_{DI} (M', B_{1})$. By the triangle inequality, for $i = 2, 3$, we have $d_{DI} (\hat{M}, B_{i}) \leq d_{DI} (\hat{M}, M') + d_{DI} (M', B_{i}) = 1 + | t'_{i} |$. Hence, we have $ms_{DI} (\hat{M}, B_{1}, B_{2}, B_{3}) - ms_{DI} (M', B_{1}, B_{2}, B_{3}) \leq 2$. Thus, we have

\begin{aligned} & ms_{DI} (\hat{M}, B_{1}, B_{2}, B_{3}) - ms_{DI} (M, B_{1}, B_{2}, B_{3}) \\ & \quad = ms_{DI} (\hat{M}, B_{1}, B_{2}, B_{3}) - ms_{DI} (M', B_{1}, B_{2}, B_{3}) + ms_{DI} (M', B_{1}, B_{2}, B_{3}) - ms_{DI} (M, B_{1}, B_{2}, B_{3}) \\ & \quad \leq 2 + 2 n = 2 \cdot (n + 1) . \end{aligned}

□
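Taking $n = cchr (M)$ in Theorem 3 makes the approximation guarantee for the L-GMP explicit. In the derivation below, $M_{L}$ denotes an optimal solution to the L-GMP; this symbol is introduced here only for illustration and does not appear in the original text. Since $M_{L}$ is also a feasible solution to the GMP, its median score is at least that of M, while the genome $\hat{M}$ obtained for $n = cchr (M)$ is linear and has the same gene content as M, so it is feasible for the L-GMP.

```latex
% M_L: an optimal solution to the L-GMP (notation assumed for this sketch)
% \hat{M}: the genome given by Theorem 3 for n = cchr(M), hence cchr(\hat{M}) = 0
\begin{aligned}
ms_{DI} (\hat{M}, B_{1}, B_{2}, B_{3})
  &\leq ms_{DI} (M, B_{1}, B_{2}, B_{3}) + 2 \, cchr (M) \\
  &\leq ms_{DI} (M_{L}, B_{1}, B_{2}, B_{3}) + 2 \, cchr (M) .
\end{aligned}
```

In other words, under these assumptions the median score of the linearized genome exceeds the optimal L-GMP score by at most twice the number of circular chromosomes in M.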

Figure 2.

Linear genomes $B_{1}$, $B_{2}$, and $B_{3}$ and their median genome M represented as vertices. A genome $M'$ containing $cchr (M) - n$ ($n < cchr (M)$) circular chromosomes is represented as a vertex, and the corresponding shortest transformations $t'_{1}$, $t'_{2}$, and $t'_{3}$ are represented as directed dashed edges. We construct a shortest transformation from $M'$ to $B_{1}$ composed of $ϑ$ and $\tilde{t}'_{1}$ such that $ϑ$ results in a genome $\hat{M}$ with $cchr (\hat{M}) = cchr (M') - 1$ and $| \tilde{t}'_{1} | \leq | t'_{1} |$. The corresponding shortest transformations from $\hat{M}$ to $B_{2}$ and $B_{3}$ are represented as bold directed edges and denoted by ${\hat{t}}_{2}$ and ${\hat{t}}_{3}$.

For the GMP under the DCJ model, we can immediately improve the derived upper bound as follows.

Theorem 4

Let $B_{1}$ , $B_{2}$ , and $B_{3}$ be linear genomes with equal gene content, and M be a given median genome. Then for any $n \leq cchr (M)$ , there exists a genome $\hat{M}$ such that $cchr (\hat{M}) = cchr (M) - n$ , and

m s_{DCJ} (\hat{M}, B_{1}, B_{2}, B_{3}) - m s_{DCJ} (M, B_{1}, B_{2}, B_{3}) \leq n .

Proof

The proof proceeds as the proof of Theorem 3 with the following difference. We use Theorem 2 instead of Theorem 1 to obtain a DCJ scenario ${\tilde{t}}_{1}$ of the form $M' \overset{ϑ}{\to} \hat{M} \overset{{\tilde{t}}_{1}'}{\to} B_{1}$ such that $ϑ$ linearizes $C'$ within ${\tilde{t}}_{1}$ and $| \tilde{t}'_{1} | = | t'_{1} | - 1$ . Hence, we have $m s_{DCJ} (\hat{M}, B_{1}, B_{2}, B_{3}) - m s_{DCJ} (M', B_{1}, B_{2}, B_{3}) \leq 1 .$ □

Methods

This section is devoted to the proofs of Theorems 1 and 2.

We call any two DCJ-Indel scenarios between the same pair of genomes equivalent. Let $t : P \to Q$ be a DCJ-Indel scenario that linearizes a circular chromosome C. First, in Lemma 4, we will show that there exists an event r within t that linearizes C. Second, in Theorem 5, we will show that r is a DCJ. Third, we will show how to obtain an equivalent pair of events (ie, a DCJ-Indel scenario of length 2) $(α', β')$ from adjacent events $(α, β)$ in t, where $β$ and $α'$ linearize C.

We will distinguish the pair of adjacent events based on their dependency. Namely, two adjacent events $α$ and $β$ in a DCJ-Indel scenario are called independent if the edges removed by $β$ are not created by $α$ . Otherwise, when $β$ removes edge(s) created by $α$ , we say that $β$ depends on $α$ . We will assume that $β$ is a DCJ if not stated otherwise. We will consider the following cases:

(1) $α$ and $β$ are independent events (addressed in Lemma 6);

(2) $β$ depends on a deletion $α$ (addressed in Lemma 7);

(3) $β$ depends on a DCJ $α$ (addressed in Lemma 8);

(4) $β$ depends on an insertion $α$ (addressed in Lemmas 9 to 11).

Eventually, results of Lemmas 6 to 11 will enable us to prove Theorems 1 and 2.

Circular chromosomes and DCJ-Indel scenarios

The following lemma shows the connection between Definitions 1 and 2.

Lemma 4

Let $t : P \to Q$ be a DCJ-Indel scenario that linearizes a circular chromosome C. Then there exists an event r that linearizes C within t.

Proof

Suppose that t is of the form: $P = P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n}}{\to} P_{n} = Q$. For each $i \in \{0, 1, \dots, n\}$, let $C_{i}$ be the meta-chromosome of C in $P_{i}$. In particular, $C_{0} = \{C\}$. Then, the equality

S (C_{i}) \cap S (P) = S (C) \cap S (P_{i})

(2)

holds for $i = 0$ but not for $i = n$ (since C is linearized within t). Hence, there exists a smallest index $k \in \{1, \dots, n\}$ such that equation (2) does not hold for $i = k$, so that it holds for all $i < k$. Moreover, it is clear that $S (C) \cap S (P_{k - 1}) \neq \emptyset$ and $S (C) \cap S (P_{k}) \neq \emptyset$. By Definition 1, C is not linearized within $(ϑ_{1}, \dots, ϑ_{j})$ for any $j < k$ and is linearized within $(ϑ_{1}, \dots, ϑ_{k})$. Thus, $r = ϑ_{k}$ linearizes C within t.□

An event linearizing a circular chromosome C can also be described in terms of removing edges in genome graphs as follows (Figure 3).

Figure 3.

Illustration of a linearized circular chromosome C within a DCJ-Indel scenario $(ϑ_{1}, \dots, ϑ_{k})$ and Theorem 5. Dashed gray and black edges denote newly inserted genes and arbitrary gene sequences, respectively. Dotted edges represent genes that do not belong to meta-chromosomes of C. (A) Initial genome graph, where C is a linearized circular chromosome and D is a chromosome of any type. (B) The intermediate genome graph resulting from a DCJ-Indel scenario $(ϑ_{1}, \dots, ϑ_{k - 1})$, where $C'$ is a meta-chromosome of C and $D'$ is a chromosome obtained from D. (C) The resulting graph after a fission $ϑ_{k}$ on a circular chromosome $C'$. (D) The graph resulting from a DCJ $ϑ_{k}$ that combines a circular chromosome $C'$ and a chromosome $D'$.

Theorem 5

Let $t : P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n}}{\to} P_{n}$ be a DCJ-Indel scenario that linearizes a circular chromosome C. Let $C_{i}$ be the meta-chromosome of C in $P_{i}$ for each $i \in \{0, 1, \dots, n\}$. Then $ϑ_{k}$ linearizes C within t if and only if $ϑ_{k}$ is a DCJ with a minimal index k such that one of the following conditions holds:

(i) $ϑ_{k}$ removes edges $a \in S (C_{k - 1})$ and $b \notin S (C_{k - 1})$;

(ii) $ϑ_{k}$ removes a single edge $a \in S (C_{k - 1})$.

Proof

Assume that $ϑ_{k}$ is a DCJ and condition (i) or (ii) above holds, where k is the smallest such index. Since C is linearized within t, (E1) and (E2) hold for DCJ-Indel scenarios $(ϑ_{1}, \dots, ϑ_{k - 1})$ and $(ϑ_{1}, \dots, ϑ_{k})$. Now, we need to show that (E3) holds for $(ϑ_{1}, \dots, ϑ_{k})$, but not for $(ϑ_{1}, \dots, ϑ_{k - 1})$. We consider the following two cases.

If condition (i) holds, then a belongs to a circular chromosome $C' \in C_{k - 1}$ and b belongs to a chromosome $D' \notin C_{k - 1}$ (Figure 3D). If $D'$ is circular, then $ϑ_{k}$ creates a new circular chromosome $C ″ \in C_{k}$ such that $S (C ″) = S (C') \cup S (D')$ (ie, $ϑ_{k}$ is a fusion of circular chromosomes). By Lemma 2, we have $S (D') \cap S (P_{0}) \neq \emptyset$ . Since $S (D') \cap S (C_{k - 1}) = \emptyset$ , we have $S (C_{k}) \cap S (P_{0}) \neq S (C) \cap S (P_{k})$ , ie, (E3) holds. If $D'$ is linear, then $ϑ_{k}$ turns $C'$ into a linear chromosome. Hence, we have $S (C') \cap S (C_{k}) = \emptyset$ . Since $S (C) \cap S (C') \neq \emptyset$ , we have $S (C_{k}) \cap S (P_{0}) \neq S (C) \cap S (P_{k})$ , ie, (E3) holds. Since k is the smallest index and $S (P_{k - 1}) = S (P_{k})$ , (E3) does not hold for $(ϑ_{1}, \dots, ϑ_{k - 1})$ .

If condition (ii) holds, the proof is similar (Figure 3C).

Now, assume that $ϑ_{k}$ linearizes C within t. Then the equality

S (C_{k - 1}) \cap S (P_{0}) = S (C) \cap S (P_{k - 1})

(3)

holds. There are three possible types of $ϑ_{k}$, namely, insertion, deletion, and DCJ. First, we assume that $ϑ_{k}$ is an insertion. Then $S (P_{k - 1}) \subset S (P_{k})$. Recall that $ϑ_{k}$ inserts genes from $S (P_{n}) \setminus S (P_{0})$. In particular, since C is present in $P_{0}$ (ie, $S (C) \subseteq S (P_{0})$), $ϑ_{k}$ does not insert any genes from $S (C)$. Thus, we have $S (C) \cap S (P_{k - 1}) = S (C) \cap S (P_{k})$. Since insertions cannot change the chromosome types, we have $S (C_{k - 1}) \cap S (P_{0}) = S (C_{k}) \cap S (P_{0})$. Together with equation (3), these equalities give $S (C_{k}) \cap S (P_{0}) = S (C) \cap S (P_{k})$, which contradicts the assumption that $ϑ_{k}$ linearizes C. Thus, $ϑ_{k}$ is not an insertion.

Second, we assume that $ϑ_{k}$ is a deletion. Then $S (P_{k}) \subset S (P_{k - 1})$ and $S (C_{k}) \subseteq S (C_{k - 1})$ . Let

\begin{aligned} A &= (S (C) \cap S (P_{k - 1})) \setminus (S (C) \cap S (P_{k})) \\ &= S (C) \cap (S (P_{k - 1}) \setminus S (P_{k})) . \end{aligned}

Note that since $S (C) \cap S (P_{n}) \neq \emptyset$, we have $S (C) \cap S (P_{i}) \neq \emptyset$ for all $i \in \{0, 1, \dots, n\}$. Then $A \neq S (C) \cap S (P_{k - 1})$. Let

\begin{aligned} B &= (S (C_{k - 1}) \cap S (P_{0})) \setminus (S (C_{k}) \cap S (P_{0})) \\ &= (S (C_{k - 1}) \setminus S (C_{k})) \cap S (P_{0}) . \end{aligned}

Our goal is to prove that $A = B$. Since A is the subset of genes removed by $ϑ_{k}$, $A \cap S (P_{k}) = \emptyset$. In particular, $A \cap S (C_{k}) = \emptyset$. Hence, $A \cap (S (C_{k}) \cap S (P_{0})) = \emptyset$. By equation (3), we have that $A \subseteq B$. Now, let $g \in B$. Note that $g \in S (P_{0})$, $g \in S (C_{k - 1})$, and $g \notin S (C_{k})$. Since a deletion cannot change the chromosome types, it follows that g is removed by $ϑ_{k}$. Then $g \notin S (P_{k})$. By equation (3), $g \in S (C) \cap S (P_{k - 1})$, and thus we have $g \in A$. Since the choice of g was arbitrary, we have proved that $A = (S (C_{k - 1}) \cap S (P_{0})) \setminus (S (C_{k}) \cap S (P_{0}))$. Note that $S (C_{k}) \cap S (P_{0}) \subset S (C_{k - 1}) \cap S (P_{0})$ and $S (C) \cap S (P_{k}) \subset S (C) \cap S (P_{k - 1})$. Therefore, $S (C_{k}) \cap S (P_{0}) = S (C) \cap S (P_{k})$, a contradiction to $ϑ_{k}$ linearizing C. Thus, $ϑ_{k}$ is not a deletion.

We proved that $ϑ_{k}$ is a DCJ. Then $S (P_{k - 1}) = S (P_{k})$ . Hence,

\begin{aligned} S (C_{k - 1}) \cap S (P_{0}) &= S (C) \cap S (P_{k - 1}) \\ &= S (C) \cap S (P_{k}) \neq S (C_{k}) \cap S (P_{0}) . \end{aligned}

Thus, $S (C_{k - 1}) \neq S (C_{k})$ holds. Since $ϑ_{k}$ does not change the gene content, $ϑ_{k}$ either breaks one circular chromosome $C' \in C_{k - 1}$ , or combines circular chromosomes $C' \in C_{k - 1}$ and $C ″ \notin C_{k - 1}$ into a single circular chromosome, or combines a circular chromosome $C' \in C_{k - 1}$ and linear chromosome into a single linear chromosome. In the first case, $ϑ_{k}$ removes a single edge that belongs to $S (C_{k - 1})$ (Figure 3C). In the last two cases, among the two edges removed by $ϑ_{k}$ , one must belong to $S (C_{k - 1})$ and the other does not belong to $S (C_{k - 1})$ (Figure 3D).□
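The edge-based criterion of Theorem 5 can be phrased as a simple membership test: among the one or two edges removed by the DCJ, exactly one must belong to the meta-chromosome. The sketch below is a hypothetical helper with assumed names. It checks only conditions (i) and (ii) for a single DCJ and does not verify that the index of the event is minimal.

```python
INF = "inf"  # assumed placeholder for a telomere vertex


def removes_like_linearizing_dcj(removed, meta_edges):
    """Check condition (i) or (ii) for the edges removed by a DCJ.

    `removed` holds the one or two edges cut by the DCJ and `meta_edges`
    is the set of P-edges of the meta-chromosome before the DCJ.
    """
    removed = {frozenset(e) for e in removed}
    if len(removed) not in (1, 2):
        raise ValueError("a DCJ removes one or two edges")
    inside = {e for e in removed if e in meta_edges}
    # (i): two edges, exactly one inside; (ii): a single edge, inside
    return len(inside) == 1


# P-edges of a circular chromosome on genes 3, 4 (the meta-chromosome);
# the other chromosome, a linear one on genes 2, 1, has P-edges
# {inf, 2t}, {2h, 1t}, {1h, inf}.
meta_edges = {frozenset(("3h", "4t")), frozenset(("4h", "3t"))}

integrates = removes_like_linearizing_dcj([("3h", "4t"), ("2h", "1t")], meta_edges)
fission = removes_like_linearizing_dcj([("4h", "3t")], meta_edges)
internal = removes_like_linearizing_dcj([("3h", "4t"), ("4h", "3t")], meta_edges)
elsewhere = removes_like_linearizing_dcj([("2h", "1t")], meta_edges)
```

The first two calls correspond to the situations of Figure 3D and Figure 3C, respectively, while the last two describe DCJs that act entirely inside or entirely outside the meta-chromosome.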

The following lemma describes an important property of meta-chromosomes.

Lemma 5

Let $t : P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n}}{\to} P_{n}$ be a DCJ-Indel scenario, where $ϑ_{n}$ linearizes a circular chromosome C within t. Let $k \in \{0, \dots, n - 1\}$, and $C_{k}$ and $C_{k + 1}$ be the meta-chromosomes of C in $P_{k}$ and $P_{k + 1}$, respectively. Then for any vertex $x \in V (P_{k}) \cap V (P_{k + 1})$, if $x \in V (C_{k + 1})$, then $x \in V (C_{k})$.

Proof

Let $g_{x}$ be the gene corresponding to x. Note that $x \in V (C_{i})$ if and only if $g_{x} \in S (C_{i})$ for $i \in \{k, k + 1\}$. Since $g_{x} \in S (P_{k})$ and $g_{x} \in S (P_{k + 1})$, $g_{x}$ cannot be inserted or removed by $ϑ_{k + 1}$. Suppose that $x \in V (C_{k + 1})$, ie, $g_{x} \in S (C_{k + 1})$. We consider two cases depending on whether $ϑ_{k + 1}$ is an indel or a DCJ.

First, assume that $ϑ_{k + 1}$ is an indel. Since $g_{x} \in S (C_{k + 1})$ , there is a circular chromosome $C' \in C_{k + 1}$ such that $g_{x} \in S (C')$ . Let $C ″$ be a chromosome in $P_{k}$ such that $g_{x} \in S (C ″)$ , ie, $S (C ″) \cap S (C') \neq \emptyset$ . If $C ″ = C'$ (ie, $ϑ_{k + 1}$ does not affect $C'$ ), we have $C ″ \in C_{k}$ , implying that $g_{x} \in S (C_{k})$ . If $C ″ \neq C'$ , we have either $S (C') \subset S (C ″)$ or $S (C ″) \subset S (C')$ . Since $C' \in C_{k + 1}$ , in both cases, $C' \in C_{k}$ . Therefore, $g_{x} \in S (C_{k})$ .

Second, assume that $ϑ_{k + 1}$ is a DCJ. Then, since $ϑ_{k + 1}$ does not linearize C, $ϑ_{k + 1}$ operates on four vertices that belong to $V (C_{k + 1})$ . Since $ϑ_{k + 1}$ is a DCJ, $S (C_{k}) = S (C_{k + 1})$ . Hence, these four vertices belong to $V (C_{k})$ . Thus, if $g_{x} \in S (C_{k + 1})$ then $g_{x} \in S (C_{k})$ .□

From Lemma 5, the following corollary follows immediately.

Corollary 1

Let $t : P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n}}{\to} P_{n}$ be a DCJ-Indel scenario, where $ϑ_{n}$ linearizes a circular chromosome C. Let $k \in \{0, \dots, n - 1\}$ and $x, y, z$ be vertices from $V (P_{k}) \cap V (P_{k + 1})$ such that $\{x, y\} \in S (P_{k})$ and $\{x, z\} \in S (P_{k + 1})$. Let $C_{k}$ and $C_{k + 1}$ be the meta-chromosomes of C in $P_{k}$ and $P_{k + 1}$, respectively. If $\{x, z\} \in S (C_{k + 1})$, then $\{x, y\} \in S (C_{k})$.

Independent adjacent events

In this section, we address the case $(1)$ , ie, $α$ and $β$ are independent events. It is easy to see that the order of any two adjacent independent events in a DCJ-Indel scenario can be changed without affecting the starting and ending genomes.¹⁵
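The commutation of independent events can be checked on a small example. In the sketch below, an event is modeled as a pair `(removed, created)` of edge lists. This modeling, the function names, and the example genome are assumptions made for illustration only.

```python
INF = "inf"  # assumed placeholder for a telomere vertex


def as_sets(event):
    """Turn an event (removed, created) into two sets of frozenset edges."""
    removed, created = event
    return {frozenset(e) for e in removed}, {frozenset(e) for e in created}


def apply(edges, event):
    """Apply an event to a set of P-edges."""
    removed, created = as_sets(event)
    if not removed <= edges:
        raise ValueError("the event removes an edge that is not present")
    return (edges - removed) | created


def independent(alpha, beta):
    """True if beta removes no edge created by alpha."""
    return not (as_sets(alpha)[1] & as_sets(beta)[0])


# P-edges of the linear genome (+1 +2 +3 +4 +5)
genome = {frozenset(e) for e in [
    (INF, "1t"), ("1h", "2t"), ("2h", "3t"),
    ("3h", "4t"), ("4h", "5t"), ("5h", INF)]}

alpha = ([("1h", "2t")], [("1h", INF), ("2t", INF)])                    # fission
beta = ([("3h", "4t"), ("4h", "5t")], [("3h", "4h"), ("4t", "5t")])    # reversal of gene 4
gamma = ([("1h", INF), ("5h", INF)], [("1h", "5h")])                    # fusion using an edge created by alpha

alpha_then_beta = apply(apply(genome, alpha), beta)
beta_then_alpha = apply(apply(genome, beta), alpha)
```

Here `alpha` and `beta` are independent and both orders give the same edge set, whereas `gamma` depends on `alpha` because it removes the telomeric edge at `1h` that `alpha` creates.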

Lemma 6

Let $t : P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 2}}{\to} P_{n - 2} \overset{ϑ_{n - 1}}{\to} P_{n - 1} \overset{ϑ_{n}}{\to} P_{n}$ be a DCJ-Indel scenario that linearizes a circular chromosome C, where $ϑ_{n - 1}$ and $ϑ_{n}$ are independent events. If $ϑ_{n}$ linearizes C within t, then $ϑ_{n}$ also linearizes C within the DCJ-Indel scenario $P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 2}}{\to} P_{n - 2} \overset{ϑ_{n}}{\to} P' \overset{ϑ_{n - 1}}{\to} P_{n}$ .

Proof

Let $C_{n - 2}$ and $C_{n - 1}$ be the meta-chromosomes of C in $P_{n - 2}$ and $P_{n - 1}$, respectively. Since $ϑ_{n}$ linearizes C within t, by Theorem 5, $ϑ_{n}$ is a DCJ. If $ϑ_{n}$ removes two edges in $S (P_{n - 1})$, say $\{x, y\} \in S (C_{n - 1})$ and $\{z, w\} \notin S (C_{n - 1})$, then since $ϑ_{n - 1}$ and $ϑ_{n}$ are independent, the edges $\{x, y\}$ and $\{z, w\}$ are present in $S (P_{n - 2})$. By Corollary 1, we have $\{x, y\} \in S (C_{n - 2})$ and $\{z, w\} \notin S (C_{n - 2})$. By Theorem 5, $ϑ_{n}$ linearizes C within the DCJ-Indel scenario $P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 2}}{\to} P_{n - 2} \overset{ϑ_{n}}{\to} P' \overset{ϑ_{n - 1}}{\to} P_{n}$. If $ϑ_{n}$ removes a single edge, the proof is similar.□

DCJ depends on a deletion

In this section, we consider case $(2)$, ie, a DCJ $β$ depends on a deletion $α$. For such a pair of events, the following lemma holds.

Lemma 7

Let $t : P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 2}}{\to} P_{n - 2} \overset{ϑ_{n - 1}}{\to} P_{n - 1} \overset{ϑ_{n}}{\to} P_{n}$ be a DCJ-Indel scenario that linearizes a circular chromosome C, where DCJ $ϑ_{n}$ depends on deletion $ϑ_{n - 1}$ . If $ϑ_{n}$ linearizes C, then there exists a DCJ-Indel scenario $P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 2}}{\to} P_{n - 2} \overset{ϑ'_{n - 1}}{\to} P' \overset{ϑ'_{n}}{\to} P_{n}$ , where $ϑ'_{n - 1}$ linearizes C and $ϑ'_{n}$ is a deletion.

Proof

Let $C_{n - 2}$ and $C_{n - 1}$ be the meta-chromosomes of C in $P_{n - 2}$ and $P_{n - 1}$ , respectively. Let $(x, u_{1}, {\bar{u}}_{1}, \dots, u_{l}, {\bar{u}}_{l}, y)$ be the path in $S (P_{n - 2})$ that is replaced with ${x, y}$ in $S (P_{n - 1})$ by $ϑ_{n - 1}$ .

Suppose that $ϑ_{n}$ removes two edges. Since $ϑ_{n}$ depends on $ϑ_{n - 1}$ , we can assume that $ϑ_{n}$ removes edges ${x, y}$ , ${z, w}$ in $S (P_{n - 1})$ and creates ${x, z}$ , ${y, w}$ in $S (P_{n})$ (Figure 4A, B, and D). By Theorem 5, without loss of generality, we assume that ${x, y} \in S (C_{n - 1})$ and ${z, w} \notin S (C_{n - 1})$ . We define $ϑ'_{n - 1}$ as the DCJ that removes edges ${x, u_{1}}$ , ${z, w}$ in $S (P_{n - 2})$ , and creates ${x, z}$ , ${u_{1}, w}$ in $S (P')$ , where $P'$ is the genome resulting from $ϑ'_{n - 1}$ . Moreover, we define $ϑ'_{n}$ as the deletion that replaces a path $(w, u_{1}, {\bar{u}}_{1}, \dots, u_{l}, {\bar{u}}_{l}, y)$ in $S (P')$ with an edge ${y, w}$ in $S (P_{n})$ (Figure 4A, C, and D). Since $ϑ_{n}$ depends on $ϑ_{n - 1}$ , ${z, w}$ is present in $S (P_{n - 2})$ . By Corollary 1, ${x, u_{1}} \in S (C_{n - 2})$ and ${z, w} \notin S (C_{n - 2})$ . Thus, by Theorem 5, $ϑ'_{n - 1}$ linearizes C within $P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 2}}{\to} P_{n - 2} \overset{ϑ'_{n - 1}}{\to} P' \overset{ϑ'_{n}}{\to} P_{n}$ .

Figure 4.

Illustration of Lemma 7. (A) Initial genome graph, where the dashed edges denote arbitrary gene sequences. $C_{1}$ is a circular chromosome linearized by $r_{2}$ within DCJ-Indel scenario $(r_{1}, r_{2})$ , where $r_{1}$ is a deletion of gene sequence $(g_{k + 1}, g_{k + 2})$ and $r_{2}$ is a DCJ. (B) The intermediate genome resulting from deletion $r_{1}$ . (C) The intermediate genome resulting from DCJ $r'_{1}$ . (D) The graph resulting from the equivalent pair of DCJ-Indel scenarios $(r_{1}, r_{2})$ and $(r'_{1}, r'_{2})$ , where $C_{1}$ is linearized by DCJs $r'_{1}$ and $r_{2}$ , and $r_{1}, r'_{2}$ are deletions.

Suppose that $ϑ_{n}$ removes a single edge a. Since $ϑ_{n}$ depends on $ϑ_{n - 1}$ , we have $a = {x, y}$ . We define $ϑ'_{n - 1}$ as the DCJ that removes a single edge ${x, u_{1}}$ and creates ${x, \infty}$ , ${u_{1}, \infty}$ , and $ϑ'_{n}$ as the deletion that replaces a path $(\infty, u_{1}, \dots, {\bar{u}}_{l}, y)$ with an edge ${y, \infty}$ . The proof that $ϑ'_{n - 1}$ linearizes C within $P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 2}}{\to} P_{n - 2} \overset{ϑ'_{n - 1}}{\to} P' \overset{ϑ'_{n}}{\to} P_{n}$ is similar.□
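The two-edge construction in this proof can be replayed on a small instance. The sketch below assumes $l = 2$ and models a genome as a set of adjacency edges only (gene edges $(u_{i}, {\bar{u}}_{i})$ are implicit); the names `edge`, `path_adjacencies`, `deletion`, and `run`, and the labels such as `u1'` for ${\bar{u}}_{1}$, are hypothetical. It checks only that both scenarios end in the same genome, not the linearization property itself.

```python
def edge(a, b):
    """Undirected adjacency edge between two gene extremities."""
    return frozenset((a, b))

def path_adjacencies(path):
    """Adjacency edges of a path (x, u1, u1', ..., ul, ul', y).

    Pairs (u_i, u_i') are gene edges, so only every other consecutive pair counts.
    """
    return [edge(path[i], path[i + 1]) for i in range(0, len(path), 2)]

def deletion(path):
    """Event (removed, created) replacing `path` by one edge between its ends."""
    return (path_adjacencies(path), [edge(path[0], path[-1])])

def run(adjacencies, events):
    """Apply events in order, checking that each removed edge is present."""
    for removed, created in events:
        if not set(removed) <= adjacencies:
            raise ValueError("event removes an edge that is absent")
        adjacencies = (adjacencies - set(removed)) | set(created)
    return adjacencies

# P_{n-2} for l = 2: the path x .. y plus an unrelated edge {z, w}.
path = ["x", "u1", "u1'", "u2", "u2'", "y"]
start = set(path_adjacencies(path)) | {edge("z", "w")}

# Original order: deletion, then the DCJ on {x, y} and {z, w}.
original = [
    deletion(path),
    ([edge("x", "y"), edge("z", "w")], [edge("x", "z"), edge("y", "w")]),
]
# Reordered: DCJ on {x, u1} and {z, w}, then deletion of the path w .. y.
reordered = [
    ([edge("x", "u1"), edge("z", "w")], [edge("x", "z"), edge("u1", "w")]),
    deletion(["w", "u1", "u1'", "u2", "u2'", "y"]),
]

print(run(start, original) == run(start, reordered))  # True
```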

DCJ depends on a DCJ

In this section, we address case $(3)$ , ie, DCJ $β$ depends on a DCJ $α$ . Let A be the set of edges created by $α$ , and B be the set of edges removed by $β$ . Since $β$ depends on $α$ , $A \cap B \neq \emptyset$ . We say that $β$ strongly depends on $α$ if $A = B$ and weakly depends on $α$ otherwise (such pairs of adjacent DCJs are also known as enchained¹⁵). In a genome graph, a pair of adjacent dependent DCJs replaces three edges with three other edges on the same six vertices (this operation is known as a 3-break²). It is easy to see that for a pair of weakly dependent DCJs, there exist equivalent pairs of weakly dependent DCJs.¹⁵ Then the following lemma holds.

Lemma 8

Let $t : P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 2}}{\to} P_{n - 2} \overset{ϑ_{n - 1}}{\to} P_{n - 1} \overset{ϑ_{n}}{\to} P_{n}$ be a DCJ-Indel scenario that linearizes a circular chromosome C, where $ϑ_{n - 1}$ and $ϑ_{n}$ are dependent DCJs. If $ϑ_{n}$ linearizes C, then there exists a pair of DCJs $(ϑ'_{n - 1}, ϑ'_{n})$ equivalent to $(ϑ_{n - 1}, ϑ_{n})$ such that $ϑ'_{n - 1}$ linearizes C within the DCJ-Indel scenario $P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 2}}{\to} P_{n - 2} \overset{ϑ'_{n - 1}}{\to} P' \overset{ϑ'_{n}}{\to} P_{n}$ .

Proof

Let $C_{n - 2}$ and $C_{n - 1}$ be the meta-chromosomes of C in $P_{n - 2}$ and $P_{n - 1}$ , respectively. Let A be the set of edges created by $ϑ_{n - 1}$ , and B be the set of edges removed by $ϑ_{n}$ . We consider two cases depending on whether $ϑ_{n}$ strongly depends or weakly depends on $ϑ_{n - 1}$ .

First, assume that $ϑ_{n}$ strongly depends on $ϑ_{n - 1}$ (ie, $A = B$ ). If $| A | = 2$ , then let ${x, y}$ and ${z, w}$ be the edges removed by $ϑ_{n}$ in $P_{n - 1}$ . By Theorem 5, without loss of generality, we assume that ${x, y} \in S (C_{n - 1})$ and ${z, w} \notin S (C_{n - 1})$ . Since ${x, y}$ and ${z, w}$ are created by $ϑ_{n - 1}$ , the edges ${x, z}$ and ${y, w}$ , or ${x, w}$ and ${y, z}$ are present in $S (P_{n - 2})$ . In both cases, we have a contradiction to Corollary 1. If $| A | = 1$ , the proof is similar.

For the rest of the proof, we assume that $ϑ_{n}$ weakly depends on $ϑ_{n - 1}$ (ie, $A \cap B \neq \emptyset$ and $A \neq B$ ). We consider two cases depending on the number of edges removed by $ϑ_{n}$ .

If $ϑ_{n}$ removes two edges in $P_{n - 1}$ , let ${x, y}$ and ${z, w}$ be these edges, and ${x, z}$ and ${y, w}$ be the edges created by $ϑ_{n}$ in $P_{n}$ . By Theorem 5, without loss of generality, we assume that ${x, y} \in S (C_{n - 1})$ and ${z, w} \notin S (C_{n - 1})$ . Since $ϑ_{n}$ weakly depends on $ϑ_{n - 1}$ , either ${x, y}$ or ${z, w}$ is created by $ϑ_{n - 1}$ (Figure 5A, B, and E). We consider these two subcases below.

Figure 5.

Illustration of Lemma 8. (A) Initial genome graph, where the dashed edges denote arbitrary gene sequences. The dashed edge and black undirected edge between $w_{1}$ and $w_{2}$ form a circular chromosome C that is linearized within $(r_{1}, r_{2})$ by $r_{2}$ . (B-D) The intermediate genomes after the first DCJs in the three equivalent pairs of weakly dependent DCJs. (E) The resulting genome graph after the equivalent pairs of DCJs, where C is linearized by DCJs $r_{2}$ and either $r_{3}$ or $r_{5}$ (depending on which other chromosomes belong to the meta-chromosome corresponding to C).

Suppose that ${x, y}$ is created by $ϑ_{n - 1}$ . If $ϑ_{n - 1}$ creates a single edge, then $ϑ_{n - 1}$ removes edges ${x, \infty}$ and ${y, \infty}$ in $S (P_{n - 2})$ , a contradiction to Corollary 1. Thus, we assume that $ϑ_{n - 1}$ removes two edges, say ${x, x_{1}}$ and ${y, y_{1}}$ . By Corollary 1, both ${x, x_{1}}$ and ${y, y_{1}}$ belong to $S (C_{n - 2})$ . Since ${z, w} \in S (P_{n - 2})$ and ${z, w} \notin S (C_{n - 1})$ , by Corollary 1, ${z, w} \notin S (C_{n - 2})$ . We define $ϑ'_{n - 1}$ as a DCJ that removes ${z, w}$ and ${x, x_{1}}$ in $S (P_{n - 2})$ and creates ${x, z}$ and ${w, x_{1}}$ in $S (P')$ . We further define $ϑ'_{n}$ as a DCJ that removes ${w, x_{1}}$ and ${y, y_{1}}$ in $S (P')$ (Figure 5A, C, D, and E) and creates ${y, w}$ and ${x_{1}, y_{1}}$ in $S (P_{n})$ . Then by Theorem 5, $ϑ'_{n - 1}$ linearizes C within $t'$ .

Suppose that ${z, w}$ is created by $ϑ_{n - 1}$ . Let us first assume that $ϑ_{n - 1}$ removes two edges ${z, z_{1}}$ and ${w, w_{1}}$ . Since ${z, w} \notin S (C_{n - 1})$ , by Corollary 1, ${z, z_{1}}$ and ${w, w_{1}}$ do not belong to $S (C_{n - 2})$ . Moreover, since ${x, y} \in S (P_{n - 2})$ and ${x, y} \in S (C_{n - 1})$ , we have ${x, y} \in S (C_{n - 2})$ . We define $ϑ'_{n - 1}$ as the DCJ that removes ${x, y}$ and ${z, z_{1}}$ in $S (P_{n - 2})$ and creates ${x, z}$ and ${y, z_{1}}$ in $S (P')$ . We define $ϑ'_{n}$ as the DCJ that removes ${w, w_{1}}$ and ${y, z_{1}}$ in $S (P')$ and creates ${y, w}$ and ${w_{1}, z_{1}}$ in $S (P_{n})$ (Figure 5). By Theorem 5, $ϑ'_{n - 1}$ linearizes C within $t'$ . If $ϑ_{n - 1}$ removes a single edge, then the proof is similar.

If $ϑ_{n}$ removes a single edge ${x, y}$ in $P_{n - 1}$ , then by Theorem 5, ${x, y} \in S (C_{n - 1})$ . Since $ϑ_{n - 1}$ creates ${x, y}$ , it removes two edges. We assume that these edges are ${x, x_{1}}$ and ${y, y_{1}}$ . By Corollary 1, ${x, x_{1}}$ and ${y, y_{1}}$ belong to $S (C_{n - 2})$ . We define $ϑ'_{n - 1}$ as a DCJ that removes a single edge in $S (P_{n - 2})$ , say ${x, x_{1}}$ , and creates two edges ${x, \infty}$ and ${x_{1}, \infty}$ in $S (P')$ . We define $ϑ'_{n}$ as a DCJ that removes ${y, y_{1}}$ and ${x_{1}, \infty}$ in $S (P')$ and creates ${y, \infty}$ and ${x_{1}, y_{1}}$ in $S (P_{n})$ . By Theorem 5, $ϑ'_{n - 1}$ linearizes C within $t'$ . It is easy to see that by construction, in all cases, $(ϑ'_{n - 1}, ϑ'_{n})$ is equivalent to $(ϑ_{n - 1}, ϑ_{n})$ , which completes the proof.□
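The equivalence claimed for the two-edge subcases can be verified mechanically. The sketch below assumes a genome is just a set of adjacency edges and an event is a pair (removed edges, created edges); the names `edge`, `run`, `vertices`, and `cases` are hypothetical. For each subcase it checks that the original and the reordered pair of DCJs end in the same genome, and that the net effect is a 3-break (three edges replaced by three edges on the same six vertices).

```python
def edge(a, b):
    """Undirected adjacency edge between two gene extremities."""
    return frozenset((a, b))

def run(adjacencies, events):
    """Apply events (removed, created) in order."""
    for removed, created in events:
        if not set(removed) <= adjacencies:
            raise ValueError("event removes an edge that is absent")
        adjacencies = (adjacencies - set(removed)) | set(created)
    return adjacencies

def vertices(edges):
    """All extremities touched by a collection of edges."""
    return frozenset().union(*edges)

second_dcj = ([edge("x", "y"), edge("z", "w")], [edge("x", "z"), edge("y", "w")])

cases = {
    # Subcase: {x, y} is created by the first DCJ.
    "xy created": (
        {edge("x", "x1"), edge("y", "y1"), edge("z", "w")},
        [([edge("x", "x1"), edge("y", "y1")], [edge("x", "y"), edge("x1", "y1")]),
         second_dcj],
        [([edge("z", "w"), edge("x", "x1")], [edge("x", "z"), edge("w", "x1")]),
         ([edge("w", "x1"), edge("y", "y1")], [edge("y", "w"), edge("x1", "y1")])],
    ),
    # Subcase: {z, w} is created by the first DCJ.
    "zw created": (
        {edge("x", "y"), edge("z", "z1"), edge("w", "w1")},
        [([edge("z", "z1"), edge("w", "w1")], [edge("z", "w"), edge("z1", "w1")]),
         second_dcj],
        [([edge("x", "y"), edge("z", "z1")], [edge("x", "z"), edge("y", "z1")]),
         ([edge("w", "w1"), edge("y", "z1")], [edge("y", "w"), edge("w1", "z1")])],
    ),
}

results = {}
for name, (start, original, reordered) in cases.items():
    end = run(start, original)
    same_end = end == run(start, reordered)
    three_break = (len(start - end) == 3 and len(end - start) == 3
                   and vertices(start - end) == vertices(end - start))
    results[name] = (same_end, three_break)

print(results)  # {'xy created': (True, True), 'zw created': (True, True)}
```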

DCJ depends on an insertion

In this section, we consider case $(4)$ , ie, a DCJ $β$ depends on an insertion $α$ . We say that $β$ strongly depends on $α$ if $β$ removes two edges created by $α$ . If $β$ removes one edge created by $α$ , we say that $β$ weakly depends on $α$ . In contrast to cases $(2)$ and $(3)$ , when $β$ weakly depends on $α$ , there may not always exist an equivalent pair $(α', β')$ , where $α'$ is a DCJ and $β'$ is an insertion.

To better capture and analyze the combinatorial structure of events in a DCJ-Indel scenario t, we construct the dependency graph¹⁶ $DG (t)$ (also known as overlap graph^17,18), whose vertices are labeled with events from t and in which there is an arc $(δ, γ)$ whenever an event $γ$ depends on an event $δ$ . We remark that a DCJ can weakly depend on at most two insertions in a DCJ-Indel scenario. The following definition describes DCJs $β$ in t for which the pair of adjacent events $(α, β)$ does not have an equivalent pair $(α', β')$ , where $α', β$ are DCJs and $α, β'$ are insertions.

Definition 3

A DCJ $β$ in a DCJ-Indel scenario t is called upper-movable if the following property holds:

If there exists exactly one insertion $α$ in t such that there is a path from α to β in $DG (t)$ , say $(α, γ, \dots, β)$ , then $γ$ removes either the first or the last edge of the path inserted by $α$ .
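To make the dependency graph concrete, the following is a minimal sketch of one way to build it. It assumes that an event is recorded as (kind, removed edges, created edges) and that $γ$ depends on $δ$ when $γ$ removes an edge most recently created by $δ$; the function names `dependency_graph` and `insertions_reaching` and the toy scenario are hypothetical. The second helper lists the insertions from which a given event is reachable, which is the quantity the definition above asks about.

```python
from collections import defaultdict

def edge(a, b):
    """Undirected adjacency edge between two gene extremities."""
    return frozenset((a, b))

def dependency_graph(events):
    """Return arcs (i, j) meaning: event j removes an edge that event i created.

    `events` is a list of (kind, removed_edges, created_edges) in scenario order.
    """
    creator = {}  # edge currently present -> index of the event that created it
    arcs = set()
    for j, (_kind, removed, created) in enumerate(events):
        for e in removed:
            i = creator.pop(e, None)
            if i is not None:
                arcs.add((i, j))
        for e in created:
            creator[e] = j
    return arcs

def insertions_reaching(events, arcs, target):
    """Indices of insertions from which `target` is reachable along arcs."""
    predecessors = defaultdict(set)
    for i, j in arcs:
        predecessors[j].add(i)
    seen, stack = set(), [target]
    while stack:
        for i in predecessors[stack.pop()]:
            if i not in seen:
                seen.add(i)
                stack.append(i)
    return sorted(i for i in seen if events[i][0] == "insertion")

# Toy scenario: an insertion, an unrelated DCJ, and a DCJ that uses both.
events = [
    ("insertion", [edge("u", "v")], [edge("u", "u1"), edge("u1'", "v")]),
    ("dcj", [edge("a", "b"), edge("c", "d")], [edge("a", "c"), edge("b", "d")]),
    ("dcj", [edge("u", "u1"), edge("a", "c")], [edge("u", "a"), edge("u1", "c")]),
]

arcs = dependency_graph(events)
print(sorted(arcs))                          # [(0, 2), (1, 2)]
print(insertions_reaching(events, arcs, 2))  # [0]
```

In this toy scenario the last DCJ is reachable from exactly one insertion and removes the first edge of the inserted path, so it would satisfy the property in Definition 3.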

First, we consider the case when a DCJ depends on two insertions (Figure 6). Second, we address the case when a DCJ is upper-movable and depends on only one insertion (Figure 7). Finally, we consider the case when a DCJ is not upper-movable.

Figure 6.

Illustration of Lemma 9. (A) Initial genome graph, where the dashed edges denote arbitrary gene sequences. $C_{1}$ is a circular chromosome linearized by $r_{3}$ within DCJ-Indel scenario $(r_{1}, r_{2}, r_{3})$ , where $r_{3}$ is a DCJ and $r_{1}, r_{2}$ are insertions of gene sequences $(g_{l}, g_{l + 1})$ and $(g_{j}, g_{j + 1})$ . (B) The intermediate genomes before and after an insertion $r_{2}$ . (C) The intermediate genomes before and after an insertion $r'_{2}$ . (D) The resulting graph after the equivalent pair of DCJ-Indel scenarios $(r_{1}, r_{2}, r_{3})$ and $(r'_{1}, r'_{2}, r'_{3})$ , where $C_{1}$ is linearized by DCJs $r'_{1}$ and $r_{3}$ , and $r_{1}, r_{2}, r'_{2}, r'_{3}$ are insertions.

Figure 7.

Illustration of Lemma 10. (A) Initial genome graph, where the dashed edges denote arbitrary gene sequences. $C_{1}$ is a circular chromosome linearized by $r_{2}$ within DCJ-Indel scenario $(r_{1}, r_{2})$ , where $r_{1}$ is an insertion of gene sequences $(g_{l}, g_{l + 1})$ and $r_{2}$ is a DCJ. (B) The intermediate genome after an insertion $r_{1}$ . (C) The intermediate genome after a DCJ $r'_{1}$ . (D) The resulting graph after the equivalent pair of DCJ-Indel scenarios $(r_{1}, r_{2})$ and $(r'_{1}, r'_{2})$ , where $C_{1}$ is linearized by DCJs $r'_{1}$ and $r_{2}$ , and $r_{1}, r'_{2}$ are insertions.

Lemma 9

Let $t : P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 3}}{\to} P_{n - 3} \overset{ϑ_{n - 2}}{\to} P_{n - 2} \overset{ϑ_{n - 1}}{\to} P_{n - 1} \overset{ϑ_{n}}{\to} P_{n}$ be a DCJ-Indel scenario that linearizes a circular chromosome C, where DCJ $ϑ_{n}$ weakly depends on insertions $ϑ_{n - 1}$ and $ϑ_{n - 2}$ . If $ϑ_{n}$ linearizes C, then there exists a DCJ-Indel scenario $P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 3}}{\to} P_{n - 3} \overset{ϑ'_{n - 2}}{\to} P' \overset{ϑ'_{n - 1}}{\to} P ″ \overset{ϑ'_{n}}{\to} P_{n}$ , where $ϑ'_{n - 2}$ linearizes C and $ϑ'_{n - 1}$ , $ϑ'_{n}$ are insertions.

Proof

Let $C_{n - 3}$ , $C_{n - 2}$ , and $C_{n - 1}$ be the meta-chromosomes of C in $P_{n - 3}$ , $P_{n - 2}$ , and $P_{n - 1}$ , respectively. Let $P_{1} = (x, u_{1}, {\bar{u}}_{1}, \dots, u_{l}, {\bar{u}}_{l}, y)$ and $P_{2} = (z, v_{1}, {\bar{v}}_{1}, \dots, v_{k}, {\bar{v}}_{k}, w)$ be paths inserted by $ϑ_{n - 2}$ and $ϑ_{n - 1}$ , respectively. Since DCJ $ϑ_{n}$ weakly depends on insertions $ϑ_{n - 1}$ and $ϑ_{n - 2}$ , without loss of generality, we assume that $ϑ_{n}$ removes ${{\bar{u}}_{p - 1}, u_{p}}$ and ${{\bar{v}}_{q - 1}, v_{q}}$ , and creates ${{\bar{u}}_{p - 1}, {\bar{v}}_{q - 1}}$ and ${u_{p}, v_{q}}$ for $p \in {2, \dots, l}$ and $q \in {2, \dots, k}$ (Figure 6A, B, and D). By Theorem 5, we have ${{\bar{u}}_{p - 1}, u_{p}} \in S (C_{n - 1})$ and ${{\bar{v}}_{q - 1}, v_{q}} \notin S (C_{n - 1})$ . Then, all edges in $P_{1}$ belong to $S (C_{n - 1})$ and all edges in $P_{2}$ do not belong to $S (C_{n - 1})$ . If $ϑ_{n - 1}$ depends on $ϑ_{n - 2}$ (ie, $z = {\bar{u}}_{s - 1}$ and $w = u_{s}$ for some $s \in {2, \dots, l}$ ), then all edges in $P_{1}$ and $P_{2}$ belong to $S (C_{n - 1})$ , a contradiction. Thus, $ϑ_{n - 1}$ and $ϑ_{n - 2}$ are independent events. We define $ϑ'_{n - 2}$ as a DCJ that removes ${x, y}$ and ${z, w}$ in $P_{n - 3}$ and creates ${x, z}$ and ${y, w}$ in $P'$ . We define $ϑ'_{n - 1}$ and $ϑ'_{n}$ as insertions that replace ${x, z}$ in $P'$ with a path $(x, u_{1}, {\bar{u}}_{1}, \dots, u_{p - 1}, {\bar{u}}_{p - 1}, {\bar{v}}_{q - 1}, v_{q - 1}, \dots, {\bar{v}}_{1}, v_{1}, z)$ in $P ″$ and ${y, w}$ in $P ″$ with a path $(y, {\bar{u}}_{l}, u_{l}, \dots, {\bar{u}}_{p}, u_{p}, v_{q}, {\bar{v}}_{q}, \dots, v_{k}, {\bar{v}}_{k}, w)$ in $P_{n}$ , respectively (Figure 6A, C, and D). By Corollary 1, ${x, u_{1}} \in S (C_{n - 2})$ and ${z, w} \notin S (C_{n - 2})$ and, moreover, ${x, y} \in S (C_{n - 3})$ and ${z, w} \notin S (C_{n - 3})$ . By Theorem 5, $ϑ'_{n - 2}$ linearizes C within a DCJ-Indel scenario $P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 3}}{\to} P_{n - 3} \overset{ϑ'_{n - 2}}{\to} P' \overset{ϑ'_{n - 1}}{\to} P ″ \overset{ϑ'_{n}}{\to} P_{n}$ .□
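The construction in this proof can be replayed on the smallest admissible instance, $l = k = 2$ with $p = q = 2$. The sketch below again assumes a genome is a set of adjacency edges with implicit gene edges; the helper names and the labels (`u1'` for ${\bar{u}}_{1}$, and so on) are hypothetical. It confirms only that "two insertions, then a DCJ" and "a DCJ, then two insertions" reach the same genome with the same number of events.

```python
def edge(a, b):
    """Undirected adjacency edge between two gene extremities."""
    return frozenset((a, b))

def path_adjacencies(path):
    """Adjacency edges of a path; pairs (g, g') in between are gene edges."""
    return [edge(path[i], path[i + 1]) for i in range(0, len(path), 2)]

def insertion(path):
    """Event (removed, created) replacing the edge between the ends of `path` by `path`."""
    return ([edge(path[0], path[-1])], path_adjacencies(path))

def run(adjacencies, events):
    """Apply events in order, checking that each removed edge is present."""
    for removed, created in events:
        if not set(removed) <= adjacencies:
            raise ValueError("event removes an edge that is absent")
        adjacencies = (adjacencies - set(removed)) | set(created)
    return adjacencies

start = {edge("x", "y"), edge("z", "w")}  # the relevant part of P_{n-3}

# Original scenario: insert (u1, u2) into {x, y}, insert (v1, v2) into {z, w},
# then a DCJ that cuts both inserted paths in the middle.
original = [
    insertion(["x", "u1", "u1'", "u2", "u2'", "y"]),
    insertion(["z", "v1", "v1'", "v2", "v2'", "w"]),
    ([edge("u1'", "u2"), edge("v1'", "v2")], [edge("u1'", "v1'"), edge("u2", "v2")]),
]

# Reordered scenario: the DCJ on {x, y} and {z, w} first, then two insertions
# whose paths mix genes from both original insertions.
reordered = [
    ([edge("x", "y"), edge("z", "w")], [edge("x", "z"), edge("y", "w")]),
    insertion(["x", "u1", "u1'", "v1'", "v1", "z"]),
    insertion(["y", "u2'", "u2", "v2", "v2'", "w"]),
]

print(run(start, original) == run(start, reordered))  # True
print(len(original), len(reordered))                  # 3 3
```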

Lemma 10

Let $t : P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 2}}{\to} P_{n - 2} \overset{ϑ_{n - 1}}{\to} P_{n - 1} \overset{ϑ_{n}}{\to} P_{n}$ be a DCJ-Indel scenario that linearizes a circular chromosome C, where DCJ $ϑ_{n}$ depends on insertion $ϑ_{n - 1}$ , and there is no $α \in {ϑ_{1}, \dots, ϑ_{n - 2}}$ such that $α$ is an insertion connected by a path to $ϑ_{n}$ in $DG (t)$ . If $ϑ_{n}$ is upper-movable and linearizes C, then there exists a DCJ-Indel scenario $P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 2}}{\to} P_{n - 2} \overset{ϑ'_{n - 1}}{\to} P' \overset{ϑ'_{n}}{\to} P_{n}$ , where $ϑ'_{n - 1}$ linearizes C and $ϑ'_{n}$ is an insertion.

Proof

Let $C_{n - 2}$ and $C_{n - 1}$ be the meta-chromosomes of C in $P_{n - 2}$ and $P_{n - 1}$ , respectively. Let $P = (u, u_{1}, {\bar{u}}_{1}, \dots, u_{l}, {\bar{u}}_{l}, v)$ be a path inserted by $ϑ_{n - 1}$ .

Assume that $ϑ_{n}$ strongly depends on $ϑ_{n - 1}$ . Let ${x, y}$ and ${z, w}$ be the edges removed by $ϑ_{n}$ in $P_{n - 1}$ . Then ${x, y}$ and ${z, w}$ are inserted by $ϑ_{n - 1}$ , and thus belong to the same chromosome. Then by Theorem 5, $ϑ_{n}$ cannot linearize C, a contradiction.

For the rest of the proof, we assume that $ϑ_{n}$ weakly depends on $ϑ_{n - 1}$ . Since there is no insertion $α \in {ϑ_{1}, \dots, ϑ_{n - 2}}$ connected by a path to $ϑ_{n}$ in $DG (t)$ , $ϑ_{n - 1}$ is the only insertion in t that has a path to $ϑ_{n}$ in $DG (t)$ . Since $ϑ_{n}$ is upper-movable, $ϑ_{n}$ removes ${u, u_{1}}$ or ${{\bar{u}}_{l}, v}$ . We consider two cases depending on the number of edges removed by $ϑ_{n}$ .

If $ϑ_{n}$ removes two edges, then without loss of generality, we assume that $ϑ_{n}$ removes ${u, u_{1}}$ and ${x, y}$ in $S (P_{n - 1})$ and creates ${u, x}$ and ${u_{1}, y}$ in $S (P_{n})$ (Figure 7A, B, and D). Let us define $ϑ'_{n - 1}$ as a DCJ that removes ${u, v}$ and ${x, y}$ in $S (P_{n - 2})$ and creates ${u, x}$ and ${y, v}$ in $S (P')$ . We define $ϑ'_{n}$ as an insertion that replaces the edge ${y, v}$ in $S (P')$ with a path $(y, u_{1}, {\bar{u}}_{1}, \dots, u_{l}, {\bar{u}}_{l}, v)$ in $S (P_{n})$ (Figure 7A, C, and D). By Theorem 5, without loss of generality, ${u, u_{1}} \in S (C_{n - 1})$ and ${x, y} \notin S (C_{n - 1})$ . By Corollary 1, ${u, v} \in S (C_{n - 2})$ and ${x, y} \notin S (C_{n - 2})$ . Thus, by Theorem 5, $ϑ'_{n - 1}$ linearizes C within $P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 2}}{\to} P_{n - 2} \overset{ϑ'_{n - 1}}{\to} P' \overset{ϑ'_{n}}{\to} P_{n}$ .

If $ϑ_{n}$ removes a single edge, the proof is similar.□

Lemma 11

Let $t : P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 2}}{\to} P_{n - 2} \overset{ϑ_{n - 1}}{\to} P_{n - 1} \overset{ϑ_{n}}{\to} P_{n}$ be a DCJ-Indel scenario that linearizes a circular chromosome C, where DCJ $ϑ_{n}$ weakly depends on insertion $ϑ_{n - 1}$ . If $ϑ_{n}$ linearizes C within t and is not upper-movable, then there exists a DCJ-Indel scenario $P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 2}}{\to} P_{n - 2} \overset{ϑ'_{n - 1}}{\to} P' \overset{ϑ'_{n}}{\to} P ″ \overset{ϑ'_{n + 1}}{\to} P_{n}$ , where $ϑ'_{n - 1}$ linearizes C and $ϑ'_{n}, ϑ'_{n + 1}$ are insertions.

Proof

Let $C_{n - 2}$ and $C_{n - 1}$ be the meta-chromosomes of C in $P_{n - 2}$ and $P_{n - 1}$ , respectively. Let $P = (u, u_{1}, {\bar{u}}_{1}, \dots, u_{l}, {\bar{u}}_{l}, v)$ be a path inserted by $ϑ_{n - 1}$ . Since $ϑ_{n}$ weakly depends on $ϑ_{n - 1}$ and is not upper-movable, $ϑ_{n}$ breaks $P$ into two non-trivial subpaths. We consider two cases depending on the number of edges removed by DCJ $ϑ_{n}$ .

If $ϑ_{n}$ removes two edges, then without loss of generality, we assume that $ϑ_{n}$ removes edges ${{\bar{u}}_{k}, u_{k + 1}}$ ( $k \in {1, \dots, l - 1}$ ) and ${x, y}$ in $S (P_{n - 1})$ and creates edges ${{\bar{u}}_{k}, x}$ and ${u_{k + 1}, y}$ in $S (P_{n})$ . By Theorem 5, we can assume that ${{\bar{u}}_{k}, u_{k + 1}} \in S (C_{n - 1})$ and ${x, y} \notin S (C_{n - 1})$ . Note that ${x, y}$ and ${u, v}$ are present in $S (P_{n - 2})$ . By Corollary 1, ${u, v} \in S (C_{n - 2})$ and ${x, y} \notin S (C_{n - 2})$ . We define $ϑ'_{n - 1}$ as a DCJ that removes ${x, y}$ and ${u, v}$ in $S (P_{n - 2})$ and creates ${x, u}$ and ${y, v}$ in $S (P')$ . We define $ϑ'_{n}$ and $ϑ'_{n + 1}$ as insertions that replace edges ${x, u}$ in $S (P')$ and ${y, v}$ in $S (P ″)$ with paths $(u, u_{1}, {\bar{u}}_{1}, \dots, u_{k}, {\bar{u}}_{k}, x)$ in $S (P ″)$ and $(y, u_{k + 1}, {\bar{u}}_{k + 1} \dots, u_{l}, {\bar{u}}_{l}, v)$ in $S (P_{n})$ , respectively. By Theorem 5, $ϑ'_{n - 1}$ linearizes C within $P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 2}}{\to} P_{n - 2} \overset{ϑ'_{n - 1}}{\to} P' \overset{ϑ'_{n}}{\to} P ″ \overset{ϑ'_{n + 1}}{\to} P_{n}$ .

If $ϑ_{n}$ removes a single edge, the proof is similar.□
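The two-edge constructions of Lemmas 10 and 11 differ only in where the DCJ cuts the inserted path, which the following sketch makes explicit. It assumes an inserted path with $l = 2$, a genome modeled as a set of adjacency edges, and a break position `k` counting adjacency edges along the path from 0; all function names are hypothetical. Cutting the first or last edge (`k = 0` or `k = 2`) leaves one non-trivial piece and a reordered scenario of the same length, whereas an interior cut (`k = 1`) needs two insertions and lengthens the scenario by one.

```python
def edge(a, b):
    """Undirected adjacency edge between two gene extremities."""
    return frozenset((a, b))

def path_adjacencies(path):
    """Adjacency edges of a path; pairs (g, g') in between are gene edges."""
    return [edge(path[i], path[i + 1]) for i in range(0, len(path), 2)]

def insertion(path):
    """Event (removed, created) replacing the edge between the ends of `path` by `path`."""
    return ([edge(path[0], path[-1])], path_adjacencies(path))

def run(adjacencies, events):
    """Apply events in order, checking that each removed edge is present."""
    for removed, created in events:
        if not set(removed) <= adjacencies:
            raise ValueError("event removes an edge that is absent")
        adjacencies = (adjacencies - set(removed)) | set(created)
    return adjacencies

def original_scenario(path, k, x, y):
    """Insertion of `path`, then a DCJ cutting its k-th adjacency edge and {x, y}."""
    a, b = path[2 * k], path[2 * k + 1]
    return [insertion(path), ([edge(a, b), edge(x, y)], [edge(a, x), edge(b, y)])]

def reordered_scenario(path, k, x, y):
    """DCJ on {u, v} and {x, y} first, then one insertion per non-trivial piece."""
    u, v = path[0], path[-1]
    events = [([edge(u, v), edge(x, y)], [edge(u, x), edge(y, v)])]
    for piece in (path[: 2 * k + 1] + [x], [y] + path[2 * k + 1:]):
        if len(piece) > 2:  # a piece with no genes needs no insertion
            events.append(insertion(piece))
    return events

path = ["u", "u1", "u1'", "u2", "u2'", "v"]
start = {edge("u", "v"), edge("x", "y")}

lengths = {}
for k in range(3):
    before = original_scenario(path, k, "x", "y")
    after = reordered_scenario(path, k, "x", "y")
    assert run(start, before) == run(start, after)
    lengths[k] = (len(before), len(after))

print(lengths)  # {0: (2, 2), 1: (2, 3), 2: (2, 2)}
```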

Proof of Theorems 1 and 2

We remark that for each pair of adjacent events $(α, β)$ in which $β$ is an insertion, there is an equivalent pair of adjacent events $(α', β')$ , where $α'$ is an insertion and $β'$ has the same type as $α$ . Below we prove Theorem 6, which will imply Theorems 1 and 2.

Theorem 6

Let $t : P \to Q$ be a DCJ-Indel scenario that linearizes a circular chromosome C. Then there exists a DCJ-Indel scenario $P \overset{r}{\to} P' \overset{t'}{\to} Q$ such that r is a DCJ linearizing C, and if C is linearized by an upper-movable DCJ within t, then $| t' | = | t | - 1$ , otherwise $| t' | = | t |$ .

Proof

We prove the theorem statement by induction on $| t |$ . If $| t | = 1$ , then by Lemma 4 and Theorem 5, the statement trivially holds.

For an integer $n \geq 2$ , we assume that the theorem holds for all $| t | < n$ . Suppose that t has length n, ie, t has the form $t : P = P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n}}{\to} P_{n} = Q$ . We consider two cases depending on whether $ϑ_{n}$ linearizes C within t.

Case 1. $ϑ_{n}$ does not linearize C within t. By Lemma 4, there exists an event $ϑ_{k}$ for $k < n$ that linearizes C within t. By induction, we obtain a DCJ-Indel scenario $t_{1} : P_{0} \overset{r}{\to} P'_{1} \overset{ϑ'_{2}}{\to} \dots \overset{ϑ'_{l}}{\to} P_{k} \overset{ϑ_{k + 1}}{\to} \dots \overset{ϑ_{n}}{\to} P_{n}$ , where r linearizes C and $| t | \leq | t_{1} | \leq | t | + 1$ . We let $t' = (ϑ'_{2}, \dots, ϑ'_{l}, ϑ_{k + 1}, \dots, ϑ_{n})$ . It is clear that $| t' | = | t | - 1$ if $ϑ_{k}$ is upper-movable, and $| t' | = | t |$ otherwise.

Case 2. $ϑ_{n}$ linearizes C within t. By Theorem 5, $ϑ_{n}$ is a DCJ. We consider two cases depending on whether $ϑ_{n}$ depends on $ϑ_{n - 1}$ . If $ϑ_{n}$ does not depend on $ϑ_{n - 1}$ , then, by Lemma 6, we obtain a DCJ-Indel scenario $t_{1} : P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 2}}{\to} P_{n - 2} \overset{ϑ'_{n - 1}}{\to} P' \overset{ϑ'_{n}}{\to} P_{n}$ , where $ϑ'_{n - 1} = ϑ_{n}$ and $ϑ'_{n} = ϑ_{n - 1}$ , and $ϑ'_{n - 1}$ linearizes C. If $ϑ_{n}$ depends on $ϑ_{n - 1}$ and $ϑ_{n - 1}$ is a DCJ or a deletion, then by Lemma 7 or 8, we obtain a DCJ-Indel scenario $t_{1} : P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 2}}{\to} P_{n - 2} \overset{ϑ'_{n - 1}}{\to} P' \overset{ϑ'_{n}}{\to} P_{n}$ , where $ϑ'_{n - 1}$ linearizes C. In both cases, applying the induction to $t_{1}$ , we obtain a DCJ-Indel scenario $t_{2} : P_{0} \overset{r}{\to} P'_{1} \overset{{ϑ ″}_{2}}{\to} \dots \overset{{ϑ ″}_{l}}{\to} P' \overset{ϑ'_{n}}{\to} P_{n}$ , where r linearizes C and $| t | \leq | t_{2} | \leq | t | + 1$ . Now, we let $t' = ({ϑ ″}_{2}, \dots, {ϑ ″}_{l}, ϑ'_{n})$ . It is clear that $| t' | = | t | - 1$ if $ϑ'_{n - 1}$ is upper-movable, and $| t' | = | t |$ otherwise.

It remains to consider the case when DCJ $ϑ_{n}$ depends on $ϑ_{n - 1}$ and $ϑ_{n - 1}$ is an insertion, which we split into two subcases depending on whether $ϑ_{n}$ is upper-movable.

Case 2.1. $ϑ_{n}$ is upper-movable. Here we consider two cases depending on whether there exists any insertion other than $ϑ_{n - 1}$ that is connected by a path to $ϑ_{n}$ in $DG (t)$ .

Case 2.1.1. $ϑ_{n - 1}$ is the only insertion from which there is a path to $ϑ_{n}$ in $DG (t)$ . By Lemma 10, we obtain a DCJ-Indel scenario $t_{1} : P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 2}}{\to} P_{n - 2} \overset{ϑ'_{n - 1}}{\to} P' \overset{ϑ'_{n}}{\to} P_{n}$ , where $ϑ'_{n - 1}$ linearizes C. Since $ϑ'_{n - 1}$ is upper-movable in $t_{1}$ , by induction, we obtain a DCJ-Indel scenario $t_{2} : P_{0} \overset{r}{\to} P'_{1} \overset{{ϑ ″}_{2}}{\to} \dots \overset{{ϑ ″}_{n - 2}}{\to} P'_{n - 2} \overset{{ϑ ″}_{n - 1}}{\to} P' \overset{ϑ'_{n}}{\to} P_{n}$ , where r linearizes C and $| t_{2} | = | t |$ . We let $t' = ({ϑ ″}_{2}, \dots, {ϑ ″}_{n - 1}, ϑ'_{n})$ to complete the proof.

Case 2.1.2. There exists an insertion $ϑ_{i}$ with $i < n - 1$ connected by a path to $ϑ_{n}$ in $DG (t)$ . We consider two subcases depending on whether $i = n - 2$ . If $i = n - 2$ , then by Lemma 9, we obtain a DCJ-Indel scenario $t_{1} : P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 3}}{\to} P_{n - 3} \overset{ϑ'_{n - 2}}{\to} P' \overset{ϑ'_{n - 1}}{\to} P ″ \overset{ϑ'_{n}}{\to} P_{n}$ , where $ϑ'_{n - 2}$ linearizes C and $| t_{1} | = | t |$ . Since $ϑ'_{n - 2}$ is upper-movable in $t_{1}$ , by induction, we obtain a DCJ-Indel scenario $t_{2} : P_{0} \overset{r}{\to} P'_{1} \overset{{ϑ ″}_{2}}{\to} \dots \overset{{ϑ ″}_{n - 2}}{\to} P' \overset{ϑ'_{n - 1}}{\to} P ″ \overset{ϑ'_{n}}{\to} P_{n}$ , where r linearizes C and $| t_{2} | = | t |$ . We let $t' = ({ϑ ″}_{2}, \dots, {ϑ ″}_{n - 2}, ϑ'_{n - 1}, ϑ'_{n})$ to complete the proof. If $i \neq n - 2$ , then we replace the pair of adjacent events $(ϑ_{n - 2}, ϑ_{n - 1})$ in t with an equivalent pair of adjacent events $(ϑ'_{n - 2}, ϑ'_{n - 1})$ , where $ϑ'_{n - 2}$ is an insertion and $ϑ'_{n - 1}$ has the same type as $ϑ_{n - 2}$ , resulting in $t_{1} : P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 3}}{\to} P_{n - 3} \overset{ϑ'_{n - 2}}{\to} P' \overset{ϑ'_{n - 1}}{\to} P_{n - 1} \overset{ϑ_{n}}{\to} P_{n}$ . By Lemmas 6 to 8 for the pair of adjacent events $(ϑ'_{n - 1}, ϑ_{n})$ (depending on the type of $ϑ'_{n - 1}$ and dependency with $ϑ_{n}$ ), we obtain a DCJ-Indel scenario $t_{2} : P_{0} \overset{ϑ_{1}}{\to} \dots \overset{ϑ_{n - 3}}{\to} P_{n - 3} \overset{ϑ'_{n - 2}}{\to} P' \overset{{ϑ ″}_{n - 1}}{\to} P ″ \overset{{ϑ ″}_{n}}{\to} P_{n}$ , where ${ϑ ″}_{n - 1}$ linearizes C and $| t_{2} | = | t |$ . Since ${ϑ ″}_{n - 1}$ is upper-movable in $t_{2}$ , by induction, we obtain a DCJ-Indel scenario $t_{3} : P_{0} \overset{r}{\to} P'_{1} \overset{{ϑ'''}_{2}}{\to} \dots \overset{{ϑ'''}_{n - 1}}{\to} P ″ \overset{{ϑ ″}_{n}}{\to} P_{n}$ , where r linearizes C and $| t_{3} | = | t |$ . We let $t' = ({ϑ'''}_{2}, \dots, {ϑ'''}_{n - 1}, {ϑ ″}_{n})$ to complete the proof.

Case 2.2. $ϑ_{n}$ is not upper-movable. By Lemma 11, we obtain a DCJ-Indel scenario $t_{1} : P_{0} \overset{ϑ_{1}}{\to} P_{1} \dots \overset{ϑ_{n - 2}}{\to} P_{n - 2} \overset{ϑ'_{n - 1}}{\to} P' \overset{ϑ'_{n}}{\to} P ″ \overset{ϑ'_{n + 1}}{\to} P_{n}$ , where $ϑ'_{n - 1}$ linearizes C and $| t_{1} | = | t | + 1$ . Since there is no insertion $α$ in the DCJ-Indel scenario $t_{1}$ connected by a path to $ϑ'_{n - 1}$ in $DG (t_{1})$ , $ϑ'_{n - 1}$ is upper-movable in $t_{1}$ . By induction, we obtain a DCJ-Indel scenario $t_{2} : P_{0} \overset{r}{\to} P'_{1} \overset{{ϑ ″}_{2}}{\to} \dots \overset{{ϑ ″}_{n - 1}}{\to} P' \overset{ϑ'_{n}}{\to} P ″ \overset{ϑ'_{n + 1}}{\to} P_{n}$ , where r linearizes C and $| t_{2} | = | t | + 1$ . We let $t' = ({ϑ ″}_{2}, \dots, {ϑ ″}_{n - 1}, ϑ'_{n}, ϑ'_{n + 1})$ to complete the proof.

Theorems 1 and 2 immediately follow from Theorem 6.

Discussion

For three given linear genomes and their DCJ median genome M (which may contain circular chromosomes), we described an algorithm that constructs a linear genome $M'$ such that the approximation error of $M'$ (ie, the difference in the DCJ median scores of $M'$ and M) is bounded by twice the number of circular chromosomes in M.

We claim (and will prove elsewhere) that the bound in Theorem 3 is tight. We illustrate this claim with Figure 8, where each of the linear genomes $B_{1}$ , $B_{2}$ , and $B_{3}$ can be obtained from the genome M by an insertion followed by a fission. Note that all the pairwise DCJ-Indel distances between $B_{1}$ , $B_{2}$ , and $B_{3}$ equal 4. We claim that the DCJ-Indel median score of M is 6, while any linearization of M has a DCJ-Indel median score of at least 8, implying that the bound in Theorem 3 is tight.
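The arithmetic behind this claim can be spelled out, taking the stated scores as given and writing $c (M)$ (a shorthand introduced here for illustration, not notation from the text) for the number of circular chromosomes in M. Each $B_{i}$ is two events away from M, so the median score of M is at most $3 \cdot 2 = 6$ , and with $c (M) = 1$ the gap to any linearization $M'$ attains the bound of twice the number of circular chromosomes:

$$\mathrm{score} (M') - \mathrm{score} (M) \geq 8 - 6 = 2 = 2 \cdot c (M).$$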

Figure 8.

A circular median genome M on genes ${g_{1}, g_{2}, g_{3}}$ of three unichromosomal linear genomes $B_{1}$ , $B_{2}$ , and $B_{3}$ on genes ${g_{1}, g_{2}, g_{3}, g_{4}, g_{5}}$ , ${g_{1}, g_{2}, g_{3}, g_{6}, g_{7}}$ , and ${g_{1}, g_{2}, g_{3}, g_{8}, g_{9}}$ , respectively, with specified pairwise DCJ-Indel distances, where $g_{4}, g_{5}, g_{6}, g_{7}, g_{8}, g_{9}$ are inserted genes.

At the same time, it was previously observed by Xu⁸ on simulated data that the number of circular chromosomes produced by their GMP solver is typically very small, implying negligible approximation error for our algorithm.

The proposed algorithm is implemented in the AGRP solver MGRA2.¹⁹

Footnotes

Acknowledgements

The authors thank the anonymous reviewers for their thoughtful comments regarding the earlier version of this paper. Some preliminary results of the present work appeared in the Proceedings of the 14th Workshop on Algorithms in Bioinformatics (WABI 2014).

Funding:

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Science Foundation under the grant no. IIS-1462107.

Declaration of Conflicting Interests:

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Author Contributions

All authors participated in the manuscript preparation.

ORCID iD

Max A Alekseyev

References

1. Yancopoulos, Attie, Friedberg. Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics. 2005;21:3340–3346. doi:10.1093/bioinformatics/bti535.

2. Alekseyev, Pevzner PA. Multi-break rearrangements and chromosomal evolution. Theor Comput Sci. 2008;395:193–202. doi:10.1016/j.tcs.2008.01.013.

3. Braga MDV, Willing, Stoye. Genomic distance with DCJ and indels. In: Moulton, Singh, eds. Algorithms in Bioinformatics. Vol 6293. Berlin, Germany: Springer; 2010:90–101. doi:10.1007/978-3-642-15294-8_8.

4. Compeau. DCJ-indel sorting revisited. Algorithm Mol Biol. 2013;8:6. doi:10.1186/1748-7188-8-6.

5. Caprara. The reversal median problem. INFORMS J Comput. 2003;15:93–113.

6. Tannier, Zheng, Sankoff. Multichromosomal median and halving problems under different genomic distances. BMC Bioinformatics. 2009;10:120.

7. Xu AW. A fast and exact algorithm for the median of three problem: a graph decomposition approach. J Comput Biol. 2009;16:1369–1381.

8. Xu. DCJ median problems on linear multichromosomal genomes: graph representation and fast exact solutions. In: Ciccarelli, Miklos, eds. Comparative Genomics. Vol 5817. Berlin, Germany: Springer; 2009:70–83. doi:10.1007/978-3-642-04744-2_7.

9. Zhang, Arndt, Tang. An exact solver for the DCJ median problem. Paper presented at: Pacific Symposium on Biocomputing; November 5-9, 2009; Big Island, HI:138–149. Singapore: World Scientific.

10. Maňuch, Patterson, Wittler, et al. Linearization of ancestral multichromosomal genomes. BMC Bioinformatics. 2012;13:S11.

11. Ma, Zhang, Suh, et al. Reconstructing contiguous regions of an ancestral genome. Genome Res. 2006;16:1557–1565.

12. Ma, Ratan, Raney, et al. DUPCAR: reconstructing contiguous ancestral regions with duplications. J Comput Biol. 2008;15:1007–1027.

13. Muffato, Louis, Poisnel, et al. Genomicus: a database and a browser to study gene synteny in modern and ancestral genomes. Bioinformatics. 2010;26:1119–1121.

14. Avdeyev, Alexeev, Rong, et al. A unified ILP framework for genome median, halving, and aliquoting problems under DCJ. Paper presented at: Proceedings of the 15th Annual Research in Computational Molecular Biology Satellite Workshop on Comparative Genomics (RECOMB-CG); October 4-6, 2017; Barcelona, Spain. Vol 10562:156–178. Berlin, Germany: Springer. doi:10.1007/978-3-319-67979-2_9.

15. Braga, Stoye. The solution space of sorting by DCJ. J Comput Biol. 2010;17:1145–1165.

16. Avdeyev, Jiang, Alekseyev MA. Implicit transpositions in DCJ scenarios. Front Genet. 2017;8:212. doi:10.3389/fgene.2017.00212.

17. Ozery-Flato, Shamir. Sorting by translocations via reversals theory. Paper presented at: Proceedings of the 4th RECOMB International Workshop on Comparative Genomics (RECOMB-CG); September 24-26, 2006; Montreal, QC, Canada. Vol 4205:87–98. Berlin, Germany: Springer. doi:10.1007/11864127_8.

18. Ouangraoua, Bergeron. Combinatorial structure of genome rearrangements scenarios. J Comput Biol. 2010;17:1129–1144.

19. Avdeyev, Jiang, Aganezov, et al. Reconstruction of ancestral genomes in presence of gene gain and loss. J Comput Biol. 2016;23:150–164. doi:10.1089/cmb.2015.0160.

Linearization of Median Genomes Under the Double-Cut-and-Join-Indel Model
