
Abstract

We study the algorithmic problem of finding the most “scale-free-like” spanning tree of a connected graph. This problem is motivated by the fundamental problem of genomic epidemiology: given viral genomes sampled from infected individuals, reconstruct the transmission network (“who infected whom”). We use two possible objective functions for this problem and introduce the corresponding algorithmic problems, termed the m-SF (m-scale-free) and s-SF Spanning Tree problems. We prove that those problems are APX- and NP-hard, respectively, even in the classes of cubic and bipartite graphs. We propose two integer linear programming (ILP) formulations for the s-SF Spanning Tree problem, and experimentally assess their performance using simulated and experimental data. In particular, we demonstrate that the ILP-based approach allows for accurate reconstruction of transmission histories of several hepatitis C outbreaks.

1. Introduction

Viral outbreaks continue to be major causes of morbidity and mortality. The ongoing pandemic of the coronavirus SARS-CoV-2 (Huang et al., 2020) is a vivid example, but long-standing epidemics of HIV, hepatitis B virus, and hepatitis C virus (HCV) are hardly less damaging (Kilmarx, 2009; Hajarizadeh et al., 2013). Viral epidemics are complex processes defined by evolutionary dynamics of pathogens and social dynamics of susceptible populations (e.g., individual behaviors, social interactions, and mobility patterns).

Recent advances in sequencing technologies invigorated the field of genomic epidemiology (Armstrong et al., 2019; Knyazev et al., 2020) that aims to use viral genomic data to understand the epidemiological dynamics of pathogens. The fundamental algorithmic problem of genomic epidemiology could be formulated as follows:

Given viral genomes sampled from n infected individuals, infer a transmission network indicating who of them infected whom (Knyazev et al., 2020). If each individual is supposed to be infected only once, then a transmission network is a tree called a transmission tree.

This problem has been approached by a variety of methods (Jombart et al., 2011, 2014; Sledzieski et al., 2019; Wertheim et al., 2014; Campo et al., 2016; De Maio et al., 2016; Klinkenberg et al., 2017; Skums et al., 2018). One family of methods is based on the so-called network approach. It is particularly popular among researchers of HIV and HCV and has been adopted as a standard methodology for outbreak investigations carried out by the CDC (Wertheim et al., 2014; Campo et al., 2016; Campbell et al., 2017; Kosakovsky Pond et al., 2018; Ramachandran et al., 2018; Ragonnet-Cronin et al., 2019). This approach usually consists of two stages. First, a weighted relatedness graph G_R is constructed. Its vertices represent infected hosts, and edges connect the hosts whose viral populations are close to each other according to a selected population genetics measure. Often G_R itself supplies enough information for epidemiologists and provides a fast and scalable alternative to phylogenetic trees when applied to next-generation sequencing (NGS) data (Wertheim et al., 2014; Campo et al., 2016; Ragonnet-Cronin et al., 2019). However, it usually contains many edges that do not represent actual transmissions. Thus, at the second stage, the transmission tree is inferred as a spanning tree of G_R.

Under the maximum parsimony criterion, the most likely transmission network is a minimum spanning tree of G_R (Jombart et al., 2011). However, experiments demonstrated that this approach is not accurate (Jombart et al., 2014). Furthermore, genomic data alone are often insufficient to resolve ambiguities in transmission tree inference, and incorporation of additional evidence is necessary (Jombart et al., 2014; Villandre et al., 2016; Jha et al., 2017). Such evidence usually comes in the form of epidemiological information, such as sample collection times and exposure intervals. However, HIV, HCV, and many other infections tend to be initially asymptomatic, and consequently, sampling times may not accurately reflect the infection times. In addition, in outbreaks with high transmission rates (e.g., HIV/HCV among injection drug users), susceptible hosts are almost constantly exposed to the virus, which makes exposure intervals useless. Another important drawback of many existing methods is their implicit assumption that transmission tree edges are independent. In reality, this is not the case: for example, certain hosts (so-called superspreaders) infect more people than an average person (Galvani and May, 2005).

Skums et al. (2018) proposed an alternative approach. It is known that, for viruses whose transmission is associated with behavioral risk factors, transmission trees have properties of so-called scale-free graphs (Leigh Brown et al., 2011; Wertheim et al., 2014). Those graphs have specific features, including power-law degree distribution, small diameter, and the presence of high-degree vertices (hubs). This observation gives rise to the following informally defined algorithmic problem (scale-free spanning tree problem): find the most “scale-free-like” spanning tree T of the graph G_R. In addition, constraints on the weight of T could be imposed. This approach was the basis of the Bayesian framework and the Markov Chain Monte Carlo algorithm for the transmission network inference described by Skums et al. (2018) and implemented as a tool called QUENTIN. Although QUENTIN is efficient in practice, it is a heuristic, and the questions about the computational complexity of the problem and the possibility of solving it exactly were left open.

In this article, we present the first detailed study of the scale-free spanning tree problem. Our major contributions are as follows.

(1) We propose two rigorous formulations of the scale-free spanning tree problem further referred to as m-SF Spanning Tree and s-SF Spanning Tree problems. They are based on two related objective functions and, to the best of our knowledge, have not been previously studied.

(2) We establish the computational complexity of both problems by demonstrating that they are NP-hard or APX-hard, even when restricted to cubic graphs and bipartite graphs.

(3) We propose two integer linear programming (ILP) formulations for the problems, and perform computational experiments to assess their performance using simulated data. Then we apply an ILP approach to real genomic data from several epidemiologically curated HCV outbreaks investigated by the CDC (Campo et al., 2016; Skums et al., 2018) and demonstrate that it allows for accurate inference of transmission trees.

2. Preliminaries

2.1. Problem formulations

We consider only finite undirected simple graphs and use standard graph-theoretic terminology; see, for example, Chartrand et al. (2016). Let $G = (V, E)$ be a connected graph. For a vertex $x \in V (G)$ , the neighborhood $N_{G} (x)$ of x is the set of all vertices that are adjacent to x in G. The degree of x is defined as ${deg}_{G} x = | N_{G} (x) |$ . Several definitions of scale-free graphs, with different degrees of mathematical rigor, are known in the literature. We utilize the rigorous combinatorial characterization that has been introduced by Li et al. (2005) using the so-called s-metric of a graph. This graph invariant is defined as follows: $s (G) = \sum_{u v \in E (G)} {deg}_{G} u {deg}_{G} v .$ (1)

The same parameter is known in mathematical chemistry as the second Zagreb index (Das and Gutman, 2004; Borovicanin et al., 2017). Li et al. (2005) demonstrated that a higher s-metric indicates with high probability the presence of most of the expected properties of scale-free graphs. The intuition behind these results is that in a graph with a high s-metric, a large number of edges should be incident to high-degree vertices, thus forcing the graph to resemble preferential attachment graphs, the standard Barabási and Albert (1999) model for scale-free networks. Therefore, another mathematical chemistry parameter called the first Zagreb index (Borovicanin et al., 2017) or m-metric also can serve as a measure of “scale-freeness” of a graph: $m (G) = \sum_{v \in V (G)} {deg}_{G}^{2} v .$ (2)

Thus, we can formulate m-SF Spanning Tree and s-SF Spanning Tree problems: given a connected graph G, find the spanning tree T of G such that $m (T)$ (respectively, $s (T)$ ) is maximal. The respective maximum values of $m (T)$ and $s (T)$ are called first and second SF-dimensions of G and denoted by $τ_{1} (G)$ and $τ_{2} (G)$ . By $T^{s o p t}$ and $T^{m o p t}$ , we denote an s-optimal tree and an m-optimal tree of G, respectively.
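The two objective functions and the two problems can be illustrated on a toy instance. The following is a minimal sketch, not the method used in this article: it enumerates all spanning trees by brute force, which is feasible only for very small graphs. The function names and the example graph (a 4-cycle with one chord) are assumptions made for illustration.

```python
from itertools import combinations

def degrees(edges):
    """Return a dict vertex -> degree for an undirected edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

def m_metric(edges):
    """First Zagreb index: sum of squared vertex degrees."""
    return sum(d * d for d in degrees(edges).values())

def s_metric(edges):
    """Second Zagreb index, Equation (1): sum of deg(u) * deg(v) over edges uv."""
    deg = degrees(edges)
    return sum(deg[u] * deg[v] for u, v in edges)

def is_spanning_tree(vertices, edges):
    """True if the given n - 1 edges are acyclic (union-find), hence span the graph."""
    parent = {v: v for v in vertices}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False
        parent[ru] = rv
    return True

def sf_dimensions(vertices, edges):
    """Brute-force first and second SF-dimensions (exponential time)."""
    trees = [t for t in combinations(edges, len(vertices) - 1)
             if is_spanning_tree(vertices, t)]
    return max(map(m_metric, trees)), max(map(s_metric, trees))

# Hypothetical toy graph: the 4-cycle 0-1-2-3 with the chord 0-2.
V = [0, 1, 2, 3]
E = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
tau1, tau2 = sf_dimensions(V, E)
print(tau1, tau2)
```

In this example both maxima are attained by a star centered at a vertex of degree 3, whereas every spanning path has strictly smaller m-metric and s-metric.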

A somewhat related problem has been studied by Kincaid et al. (2016): find a spanning subgraph with prescribed vertex degrees such that its s-metric is maximum. This problem is polynomially solvable in general, but becomes NP-hard when the output spanning subgraph is required to be connected.

2.2. Mathematical preliminaries

2.2.1. Subgraph counting

Here we establish the characterizations for the m-metric and s-metric in terms of numbers of small subgraphs in a graph. This technique is used to establish complexity results in Section 3 and ILP formulations in Section 4.

Proposition 1. For any graph G, $m (G) = 2 γ_{2} (G) + 2 γ_{1} (G), s (G) = 3 γ_{Δ} (G) + γ_{3} (G) + 2 γ_{2} (G) + γ_{1} (G),$

where $γ_{Δ} (G)$ is the number of triangles and $γ_{t} (G)$ is the number of paths of length t in G, respectively.

Proof. We prove only the second equality; the first one can be proved similarly. Let $A = [a_{i j}]$ be the adjacency matrix of G and d be its degree vector. We have $s (G) = \frac{1}{2} d^{T} \cdot A \cdot d$ and $d = A \cdot 1$ , where $1 = (1, \dots, 1)^{T}$ is the all-ones vector. Therefore $s (G) = \frac{1}{2} 1^{T} \cdot A^{3} \cdot 1 = \frac{1}{2} \sum_{i = 1}^{n} \sum_{j = 1}^{n} a_{i j}^{(3)},$

where $a_{i j}^{(3)}$ denotes the $(i, j)$-entry of the matrix A³.

It is known that $a_{i j}^{(3)}$ equals the number of walks of length 3 between vertices i and j. Thus, $s (G)$ is equal to one-half of the total number of 3-walks in G. An edge $v_{1} v_{2}$ produces exactly two such walks: $W_{11} = (v_{1}, v_{2}, v_{1}, v_{2})$ and $W_{12} = (v_{2}, v_{1}, v_{2}, v_{1})$ . Each 2-path $\{v_{1} v_{2}, v_{2} v_{3}\}$ produces four 3-walks: $W_{21} = (v_{1}, v_{2}, v_{3}, v_{2})$ , $W_{22} = (v_{2}, v_{3}, v_{2}, v_{1})$ , $W_{23} = (v_{2}, v_{1}, v_{2}, v_{3})$ , and $W_{24} = (v_{3}, v_{2}, v_{1}, v_{2})$ . Each 3-path $\{v_{1} v_{2}, v_{2} v_{3}, v_{3} v_{4}\}$ produces two 3-walks: $W_{31} = (v_{1}, v_{2}, v_{3}, v_{4})$ and $W_{32} = (v_{4}, v_{3}, v_{2}, v_{1})$ . Finally, each triangle with vertex set $\{v_{1}, v_{2}, v_{3}\}$ produces six 3-walks: $W_{Δ 1} = (v_{1}, v_{2}, v_{3}, v_{1})$ , $W_{Δ 2} = (v_{1}, v_{3}, v_{2}, v_{1})$ , $W_{Δ 3} = (v_{2}, v_{3}, v_{1}, v_{2})$ , $W_{Δ 4} = (v_{2}, v_{1}, v_{3}, v_{2})$ , $W_{Δ 5} = (v_{3}, v_{1}, v_{2}, v_{3})$ , and $W_{Δ 6} = (v_{3}, v_{2}, v_{1}, v_{3})$ . As every 3-walk of G has one of these forms, the statement of the proposition follows.□
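The identities of Proposition 1 can be checked numerically on a small graph. The sketch below counts paths and triangles by exhaustive enumeration (practical only for a handful of vertices) and compares the result with the metrics computed directly from the degrees. The helper names and the example graph (a triangle with a pendant path attached) are assumptions made for illustration.

```python
from itertools import combinations, permutations

def count_paths(adj, length):
    """Number of paths with `length` edges and distinct vertices, each counted once."""
    total = 0
    for seq in permutations(adj, length + 1):
        if all(seq[i + 1] in adj[seq[i]] for i in range(length)):
            total += 1
    return total // 2  # every path is enumerated in both directions

def count_triangles(adj):
    """Number of vertex triples that are pairwise adjacent."""
    return sum(1 for a, b, c in combinations(adj, 3)
               if b in adj[a] and c in adj[a] and c in adj[b])

def metrics(adj):
    """Return (m-metric, s-metric) computed directly from vertex degrees."""
    m = sum(len(nb) ** 2 for nb in adj.values())
    # Each edge is visited twice in the double loop, hence the division by 2.
    s = sum(len(adj[u]) * len(adj[v]) for u in adj for v in adj[u]) // 2
    return m, s

# Hypothetical example: triangle 0-1-2 with the pendant path 2-3-4.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
g1, g2, g3 = (count_paths(adj, t) for t in (1, 2, 3))
tri = count_triangles(adj)
m, s = metrics(adj)

print(m, 2 * g2 + 2 * g1)
print(s, 3 * tri + g3 + 2 * g2 + g1)
```

For a tree the triangle term vanishes, so only the path counts contribute to the s-metric.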

2.2.2. Neighbor switching

This is a tree rearrangement technique that is used for obtaining structural and complexity results. Let T be a tree and $(u, v)$ be a pair of distinct vertices $u, v \in V (T)$ , where ${deg}_{T} u = p \geq 2$ and ${deg}_{T} v = t \geq 2$ . We denote the unique $u - v$ path in T by $P_{T} (u, v)$ , and the neighbors of u and v lying on $P_{T} (u, v)$ by $u^{+}$ and $v^{-}$ , respectively. In case u and v are not adjacent, the neighbor of $u^{+}$ distinct from u and lying on $P_{T} (u, v)$ is denoted by $u^{+ +}$ . Let $A = N_{T} (u) ∖ \{u^{+}\} = \{a_{1}, \dots, a_{p - 1}\}$ , and let the set $N_{T} (v) ∖ \{v^{-}\}$ be partitioned into two subsets $B = \{b_{1}, \dots, b_{q}\}$ and $C = \{c_{1}, \dots, c_{r}\}$ , where $B \neq \emptyset$ . Furthermore, let ${deg}_{T} u^{+} = α$ and ${deg}_{T} v^{-} = β$ . Define numbers D_A, D_B, and D_C as follows: $D_{A} = \sum_{i = 1}^{p - 1} {deg}_{T} a_{i}, D_{B} = \sum_{j = 1}^{q} {deg}_{T} b_{j}, D_{C} = \sum_{k = 1}^{r} {deg}_{T} c_{k} .$ (3)

Given the pair $(u, v)$ , the neighbor switch $S_{v \to u}^{B}$ is a transformation producing a new tree $\tilde{T}$ from T by replacing the edges $v b_{1}, \dots, v b_{q}$ with new edges $u b_{1}, \dots, u b_{q}$ (Fig. 1). This operation changes only degrees of the vertices u and v, namely ${deg}_{\tilde{T}} u = p + q$ , ${deg}_{\tilde{T}} v = r + 1$ .

FIG. 1. Neighbor switch.

Lemma 2. Suppose that $S_{v \to u}^{B} (T) = \tilde{T}$ . If $p \geq r + 1$ , $D_{A} > D_{C}$ and, in case u and v are not adjacent, additionally $α \geq β$ , then $s (\tilde{T}) > s (T)$ .

Proof. We prove the lemma for the case when u and v are not adjacent, that is, $u \neq v^{-}$ and $v \neq u^{+}$ (the proof for the other case is similar). Denote by X (resp., Y) the set of edges of T (resp., $\tilde{T}$ ) incident to u or v. Let us denote by $λ (X)$ (resp., $\tilde{λ} (Y)$ ) the contribution to $s (T)$ (resp., $s (\tilde{T})$ ) from the edges of X (resp., Y). Then $s (\tilde{T}) - s (T) = \tilde{λ} (Y) - λ (X) .$ (4)

Using Equation (3) one can easily calculate $\begin{matrix} λ (X) = {deg}_{T} u {deg}_{T} u^{+} + {deg}_{T} v^{-} {deg}_{T} v + \sum_{i = 1}^{p - 1} {deg}_{T} u {deg}_{T} a_{i} + \sum_{j = 1}^{q} {deg}_{T} v {deg}_{T} b_{j} \\ + \sum_{k = 1}^{r} {deg}_{T} v {deg}_{T} c_{k} = p α + β t + p D_{A} + t D_{B} + t D_{C} . \end{matrix}$

After substituting $t = q + r + 1$ , we obtain $λ (X) = p α + β q + β (r + 1) + p D_{A} + q D_{B} + (r + 1) D_{B} + q D_{C} + (r + 1) D_{C} .$ (5)

Similarly, $\tilde{λ} (Y) = p α + q α + β (r + 1) + p D_{A} + q D_{A} + p D_{B} + q D_{B} + (r + 1) D_{C} .$ (6)

Using equalities (4)–(6) we obtain $\begin{matrix} s (\tilde{T}) - s (T) = \tilde{λ} (Y) - λ (X) = q α + q D_{A} + p D_{B} - β q - (r + 1) D_{B} - q D_{C} \\ = q (α - β) + D_{B} (p - r - 1) + q (D_{A} - D_{C}) . \end{matrix}$ (7)

Since $α \geq β$ and $p \geq r + 1$ , it follows that $q (α - β) + D_{B} (p - r - 1) \geq 0$ . On the other hand, since $q \geq 1$ and $D_{A} > D_{C}$ , we have $q (D_{A} - D_{C}) > 0$ and therefore $s (\tilde{T}) - s (T) > 0$ .

If $B = N_{T} (v) ∖ {v^{-}}$ , then the neighbor switch produces a tree $\tilde{T}$ with v being a leaf. In this case $S_{v \to u}^{B}$ is a total neighbor switch. For our goals it suffices to prove the following corollary.

Corollary 3. If $\tilde{T}$ is obtained from T by a total neighbor switch $S_{v \to u}^{B}$ and, in case of u and v not being adjacent, additionally $α \geq β$ or $p \geq β$ , then $s (\tilde{T}) > s (T)$ .

Proof. We check that all conditions of Lemma 2 are satisfied for the total neighbor switch. Indeed, since $D_{A} \geq p - 1 \geq 1$ (recall ${deg}_{T} u = p \geq 2$ ) and $D_{C} = r = 0$ , we have $D_{A} > D_{C}$ and $p \geq r + 1$ . If u and v are not adjacent, we still require that $α \geq β$ , as in Lemma 2. However, this condition can be replaced if we rewrite Equation (7) as follows: $s (\tilde{T}) - s (T) = q (α - β) + D_{B} (p - 1) + q D_{A} = q (α + D_{A} - β) + D_{B} (p - 1) .$

Note that the latter expression is positive in case of $p \geq β$ , since $α \geq 2$ and $D_{A} \geq p - 1 \geq 1$ .

In the same way we can compare the trees T and $\tilde{T} = S_{v \to u}^{B} (T)$ in terms of their m-metrics.

Lemma 4. Suppose that $S_{v \to u}^{B} (T) = \tilde{T}$ and $p > r + 1$ . Then $m (\tilde{T}) > m (T)$ .

Proof. The idea is similar to the proof of Lemma 2. Since the neighbor switch changes only degrees of vertices u and v, $m (\tilde{T}) - m (T) = {deg}_{\tilde{T}}^{2} u + {deg}_{\tilde{T}}^{2} v - {deg}_{T}^{2} u - {deg}_{T}^{2} v = 2 q (p - r - 1)$ , which proves the lemma, since $q \geq 1$ .
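The neighbor switch and the two difference formulas derived above, Equation (7) for the s-metric and $2 q (p - r - 1)$ for the m-metric, can be sketched as follows. This is an illustrative check on one hand-made tree with non-adjacent u and v, not part of the proofs; the tree, the vertex labels, and the function names are assumptions.

```python
from collections import deque

def path(adj, u, v):
    """Unique u-v path in a tree, found by breadth-first search from u."""
    prev = {u: None}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in prev:
                prev[y] = x
                queue.append(y)
    out = [v]
    while out[-1] != u:
        out.append(prev[out[-1]])
    return out[::-1]

def neighbor_switch(adj, u, v, B):
    """Return a copy of the tree in which every edge v-b (b in B) is replaced by u-b."""
    new = {x: set(nb) for x, nb in adj.items()}
    for b in B:
        new[v].discard(b)
        new[b].discard(v)
        new[u].add(b)
        new[b].add(u)
    return new

def m_metric(adj):
    return sum(len(nb) ** 2 for nb in adj.values())

def s_metric(adj):
    return sum(len(adj[x]) * len(adj[y]) for x in adj for y in adj[x]) // 2

def predicted_gains(adj, u, v, B):
    """Right-hand side of Equation (7) and 2q(p - r - 1); assumes u, v are not adjacent."""
    deg = {x: len(nb) for x, nb in adj.items()}
    P = path(adj, u, v)
    u_plus, v_minus = P[1], P[-2]
    A = adj[u] - {u_plus}
    C = adj[v] - {v_minus} - set(B)
    p, q, r = deg[u], len(B), len(C)
    alpha, beta = deg[u_plus], deg[v_minus]
    D_A = sum(deg[a] for a in A)
    D_B = sum(deg[b] for b in B)
    D_C = sum(deg[c] for c in C)
    ds = q * (alpha - beta) + D_B * (p - r - 1) + q * (D_A - D_C)
    dm = 2 * q * (p - r - 1)
    return ds, dm

# Hypothetical tree: u = 0, v = 2, path 0-1-2, A = {5, 6}, B = {3}, C = {4}.
T = {0: {1, 5, 6}, 1: {0, 2}, 2: {1, 3, 4}, 3: {2, 7},
     4: {2}, 5: {0}, 6: {0}, 7: {3}}
T_new = neighbor_switch(T, 0, 2, {3})
ds, dm = predicted_gains(T, 0, 2, {3})
print(s_metric(T_new) - s_metric(T), ds)
print(m_metric(T_new) - m_metric(T), dm)
```

Here $p = 3$ , $q = r = 1$ , $α = β = 2$ , $D_{A} = D_{B} = 2$ , and $D_{C} = 1$ , so the hypotheses of Lemmas 2 and 4 hold and both metrics increase.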

For further results we need weaker modifications of Lemmas 2 and 4 for the case ${deg}_{T} u = p \geq 1$ (and therefore $D_{A} \geq 0$ ). Recall ${deg}_{T} v = t \geq 2$ since we still require at least one vertex to switch.

Lemma 5. Suppose $\tilde{T}$ is obtained from T by a total neighbor switch $S_{v \to u}^{B}$ , then the following propositions hold:

(a) $m (\tilde{T}) \geq m (T)$ ;

(b) $s (\tilde{T}) \geq s (T)$ (unless u and v are not adjacent with $α < β$ ).

Now we consider a special case, when u is a vertex of maximum degree in T and all vertices in $N_{T} (v) ∖ \{v^{-}\}$ are leaves. In addition, let $B, C \neq \emptyset$ . We introduce a double neighbor switch $S_{v \to u, u^{+}}^{B, C} (T) = S_{v \to u^{+}}^{C} (S_{v \to u}^{B} (T))$ . The reason to treat this two-step switch as a single operation is that the first switch by itself might decrease the s-metric; however, this decrease is compensated by the second switch.

Lemma 6. If $\hat{T}$ is obtained from T by a double neighbor switch $S_{v \to u, u^{+}}^{B, C} (T)$ , then $s (\hat{T}) > s (T)$ .

Proof. In case u and v are adjacent, that is, $u^{+} = v$ , a double switch $S_{v \to u, u^{+}}^{B, C} (T)$ gets reduced to the first neighbor switch $S_{v \to u}^{B} (T)$ , which produces a tree with a higher s-metric due to Lemma 2. Therefore, assume u and v are not adjacent. Consider the first switch and let $\tilde{T} = S_{v \to u}^{B} (T)$ . From Equation (7), since $D_{B} = q$ and $D_{C} = r$ , we obtain $s (\tilde{T}) - s (T) = q (α + D_{A} - β) + q (p - r - 1) - q r .$ (8)

Next let $\hat{T} = S_{v \to u^{+}}^{C} (\tilde{T})$ be obtained by the total neighbor switch. To avoid reassigning of notations we denote $D_{E} = \sum_{w \in N_{\tilde{T}} (u^{+}) ∖ {u^{+ +}}} {deg}_{\tilde{T}} w$ and $γ = {deg}_{\tilde{T}} u^{+ +}$ . Other notations stay the same from the first switch. Again from Equation (7) we get $s (\hat{T}) - s (\tilde{T}) = r (γ - β) + r (γ - 1) + r D_{E} .$ (9)

Summation of Equations (8) and (9) gives $s (\hat{T}) - s (T) = q (α + D_{A} - β) + q (p - r - 1) + r (γ + D_{E} - β - q) + r (γ - 1),$

where $D_{A} \geq p - 1$ , $α \geq 2$ , $γ \geq 2$ , and $D_{E} \geq {deg}_{\tilde{T}} u = p + q$ . Furthermore, since u is a vertex of maximum degree in T, $p \geq β$ and $p \geq q + r + 1 > r + 1$ (recall $q, r > 0$ ), which proves the lemma.

2.3. Bounds in terms of the maximum degree

There exist bounds for both SF-dimensions of a graph in terms of its order only (de Caen, 1998; Das, 2003; Das and Gutman, 2004). However, they are not particularly efficient, when used as ILP cuts. Here we provide the adjusted upper bounds that turned out to be more useful for that purpose. Let $Δ (G)$ denote the maximum vertex degree of G and $S_{m, k}$ denote a double star, that is, a tree obtained from two disjoint stars $K_{1, m}$ and $K_{1, k}$ with m and k leaves, respectively, by adding an edge joining their central vertices.

Theorem 7. For any graph G of order $n \geq 2$ , $\begin{matrix} τ_{1} (G) \leq m (S_{Δ (G) - 1, n - Δ (G) - 1}) = 2 Δ^{2} (G) + n^{2} - 2 n Δ (G) + n - 2, \\ τ_{2} (G) \leq s (S_{Δ (G) - 1, n - Δ (G) - 1}) = n (n - Δ (G) - 1) + Δ^{2} (G) . \end{matrix}$

Proof. We provide the proof for the second SF-dimension only (the other proof is similar). Suppose $T^{s o p t}$ is an s-optimal tree of G and $T^{s o p t} \neq S_{Δ (G) - 1, n - Δ (G) - 1}$ . We prove the statement by performing a sequence of neighbor switches on $T^{s o p t}$ , with each of them increasing s-metric, so that the resulting tree is $S_{Δ (G) - 1, n - Δ (G) - 1}$ .

Let u be a vertex of maximum degree in $T^{s o p t}$ . Then every vertex v of $T^{s o p t}$ satisfies ${deg}_{T^{s o p t}} v \leq {deg}_{T^{s o p t}} u \leq {deg}_{G} u \leq Δ (G)$ . Let $T : = T^{s o p t}$ . We divide the sequence of neighbor switches into three stages.

Stage 1: For each vertex v with all vertices in $N_{T} (v) ∖ \{v^{-}\}$ (where $v^{-} \in P_{T} (u, v)$ ) being leaves, we perform either the total neighbor switch $T : = S_{v \to u}^{B} (T)$ or the double neighbor switch $T : = S_{v \to u, u^{+}}^{B, C} (T)$ , until the degree of u becomes equal to $Δ (G) .$

One can observe that a double neighbor switch is needed to ensure that ${deg}_{T} u$ can be increased exactly to $Δ (G)$ . Since ${deg}_{T} u$ increases after each switch, only a finite number of switches is required. In case the tree T obtained after the first stage differs from $S_{Δ (G) - 1, n - Δ (G) - 1}$ , we perform the second stage if there exist at least two vertices w₁ and w₂ in $N_{T} (u)$ with ${deg}_{T} w_{1} \geq {deg}_{T} w_{2} \geq 2$ , and jump directly to Stage 3 otherwise.

Stage 2: For each distinct w₁ and w₂ in $N_{T} (u)$ with ${deg}_{T} w_{1} \geq {deg}_{T} w_{2} \geq 2$ perform a total neighbor switch $T : = S_{w_{2} \to w_{1}}^{B} (T)$ .

After each iteration, the number of vertices in $N_{T} (u)$ with degree at least two decreases by one. Thus, Stage 2 terminates after a finite number of switches leaving at most one vertex $w \in N_{T} (u)$ with degree at least two. Finally if T still differs from $S_{Δ (G) - 1, n - Δ (G) - 1}$ , the third stage is required.

Stage 3: While there exists vertex v in $N_{T} (w) ∖ {u}$ with ${deg}_{T} v \geq 2$ perform a total neighbor switch $T : = S_{v \to w}^{B} (T)$ .

Since the number of neighbors of w with degrees at least two decreases after each switch, Stage 3 terminates after a finite number of steps with all neighbors of w, except for u, being leaves, that is, $T = S_{Δ (G) - 1, n - Δ (G) - 1}$ . Note that each iteration of Stages 1–3 produces a tree with a higher s-metric due to Lemmas 2, 6 and Corollary 3.
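The closed-form values in Theorem 7 can be verified by constructing the double star explicitly and computing both metrics from the degrees. The sketch below does this for a range of orders and maximum degrees; the function names and the vertex labeling are assumptions made for illustration.

```python
def double_star(m, k):
    """Edge list of S_{m,k}: centers 0 and 1 joined by an edge, carrying m and k leaves."""
    edges = [(0, 1)]
    edges += [(0, 2 + i) for i in range(m)]
    edges += [(1, 2 + m + i) for i in range(k)]
    return edges

def metrics(edges):
    """Return (m-metric, s-metric) of the graph given by an edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    m = sum(d * d for d in deg.values())
    s = sum(deg[u] * deg[v] for u, v in edges)
    return m, s

def theorem7_bounds(n, delta):
    """Upper bounds on the first and second SF-dimensions from Theorem 7."""
    bound1 = 2 * delta ** 2 + n ** 2 - 2 * n * delta + n - 2
    bound2 = n * (n - delta - 1) + delta ** 2
    return bound1, bound2

# Compare the formulas with the metrics of S_{delta-1, n-delta-1} for small n.
checks = [metrics(double_star(d - 1, n - d - 1)) == theorem7_bounds(n, d)
          for n in range(2, 15) for d in range(1, n)]
print(all(checks))
print(theorem7_bounds(10, 4))
```

For example, with $n = 10$ and $Δ (G) = 4$ the two centers have degrees 4 and 6, which gives the bounds 60 and 66.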

3. Hardness Results

In this section, we study the computational complexity of both the m-SF and the s-SF Spanning Tree problem. The following known fact is used:

Theorem 8 (Kleitman and West, 1991). Any connected graph of order n with minimum vertex degree at least 3 has a spanning tree with at least $n ∕ 4 + 2$ leaves.

We start by investigating the complexity of our problems for cubic graphs.

Theorem 9. The m-SF Spanning Tree problem is $A P X$ -hard for cubic graphs.

Proof. Let G be a cubic graph on n vertices and T be a spanning tree with $ℓ = ℓ (T)$ leaves and $n_{i} = n_{i} (T)$ vertices of degree i, $i \in {2, 3}$ . Then $m (T) = ℓ + 4 n_{2} + 9 n_{3},$ (10)

with the numbers n_i satisfying the equalities $ℓ + n_{2} + n_{3} = n$ and $ℓ + 2 n_{2} + 3 n_{3} = 2 (n - 1) .$ Deriving n₂ and n₃ from these equalities gives us $n_{2} = n + 2 - 2 ℓ, n_{3} = ℓ - 2 .$ (11)

After substituting these expressions into Equation (10), we get $m (T) = 2 ℓ + 4 n - 10 .$ (12)

Thus, finding a spanning tree with maximum m-metric in this case is polynomially equivalent to finding a spanning tree with maximum number of leaves (MaxLeaf problem). For cubic graphs, the latter problem was shown to be APX-hard by Bonsma (2012). Thus, we prove the APX-hardness of the m-SF Spanning Tree problem by providing an L-reduction (Papadimitriou and Yannakakis, 1991) from MaxLeaf.

Given an optimization problem P and an instance I of this problem, we use $o p t_{P} (I)$ to denote the optimum value of I, and $v a l_{P} (I, S)$ to denote the value of a feasible solution S of instance I. Let A and B be two optimization problems. The problem A is said to be L-reducible to B if there exist polynomial-time computable functions f, g and constants $α, β > 0$ such that

(L1) f maps any instance I of A to an instance $f (I)$ of B such that $o p t_{B} (f (I)) \leq α \cdot o p t_{A} (I)$ ;

(L2) for any instance I of A and a solution $S'$ of the instance $f (I)$ , g maps $S'$ to a solution S for I such that $| v a l_{A} (I, S) - o p t_{A} (I) | \leq β \cdot | v a l_{B} (f (I), S') - o p t_{B} (f (I)) |$ .

Let $T^{m o p t}$ be an m-optimal spanning tree of G and $ℓ^{*}$ be the maximum number of leaves in spanning trees of G. Note that $ℓ^{*} \geq n ∕ 4 + 2$ by Theorem 8, and therefore, $n \leq 4 ℓ^{*} - 8$ . Then using Equation (12) we get $τ_{1} (G) = m (T^{m o p t}) = 2 ℓ (T^{m o p t}) + 4 n - 10 \leq 2 ℓ^{*} + 16 ℓ^{*} - 32 \leq 18 ℓ^{*} .$

Moreover, for every spanning tree T of G we have $\frac{1}{2} | m (T) - m (T^{m o p t}) | = | ℓ (T) - ℓ^{*} |$ . As a result, Equation (12) implies an L-reduction with identity mappings f and g and constants $α = 18$ and $β = \frac{1}{2}$ , thus proving the theorem.
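Equation (12) can be checked exhaustively on a small cubic graph. The sketch below uses the 3-dimensional hypercube Q3 (an assumption chosen for illustration, since it is cubic and has only 8 vertices), enumerates all of its spanning trees by brute force, and compares the m-metric of each tree with $2 ℓ + 4 n - 10$ . The function names are hypothetical.

```python
from itertools import combinations

def cube_graph():
    """Vertices and edges of the hypercube Q3: vertices are 3-bit integers."""
    vertices = list(range(8))
    edges = [(u, u ^ (1 << b)) for u in vertices for b in range(3)
             if u < (u ^ (1 << b))]
    return vertices, edges

def is_acyclic(vertices, edges):
    """Union-find cycle test; n - 1 acyclic edges on n vertices form a spanning tree."""
    parent = {v: v for v in vertices}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False
        parent[ru] = rv
    return True

def tree_stats(vertices, tree):
    """Return (number of leaves, m-metric) of a spanning tree."""
    deg = {v: 0 for v in vertices}
    for u, v in tree:
        deg[u] += 1
        deg[v] += 1
    leaves = sum(1 for d in deg.values() if d == 1)
    m = sum(d * d for d in deg.values())
    return leaves, m

V, E = cube_graph()
n = len(V)
stats = [tree_stats(V, t) for t in combinations(E, n - 1) if is_acyclic(V, t)]
identity_holds = all(m == 2 * leaves + 4 * n - 10 for leaves, m in stats)
max_leaves = max(leaves for leaves, _ in stats)
tau1 = max(m for _, m in stats)
print(len(stats), identity_holds, max_leaves, tau1)
```

Because the m-metric is an increasing linear function of the number of leaves, the tree maximizing one also maximizes the other, which is what the L-reduction with identity mappings relies on.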

Theorem 10. The s-SF Spanning Tree problem is $N P$ -hard for cubic graphs.

Proof. For the reduction, we use the following problem proved to be NP-complete by Lemke (1988):

Instance: A connected cubic graph G of order n.

Question: Is there a spanning tree of G without vertices of degree 2?

According to Equation (11), $n_{2} = n_{2} (T) = n + 2 - 2 ℓ (T)$ . Thus, the answer for the problem's question is negative if n is odd. Hence, we concentrate only on the case when $n \geq 4$ is even, thus n₂ is even as well. We show that among all trees T of order n with $Δ (T) \leq 3$ , the trees without vertices of degree 2 have the highest s-metric. Indeed, the following claim holds:

Claim 11. If $Δ (T) \leq 3$ and $n \geq 4$ is even, then $s (T) \leq 6 n - 15$ . The equality holds if and only if T has no vertices of degree 2.

Proof. If T has no vertices of degree 2, then Equation (11) implies $ℓ = ℓ (T) = \frac{n + 2}{2}$ . Furthermore, $s (T) = 3 m_{1} + 9 m_{3}$ , where m₁ is the number of edges incident to a leaf and m₃ is the number of edges with both ends of degree 3. Obviously, $m_{1} = ℓ$ and $m_{3} = n - 1 - ℓ$ , thus yielding $s (T) = 6 n - 15$ .

Now suppose that T has $n_{2} \geq 2$ vertices of degree 2. Let u and v be two vertices of degree 2 lying on a path $P_{T} (u, v)$ and ${deg}_{T} u^{+} \geq {deg}_{T} v^{-}$ . Iteratively applying a total neighbor switch $S_{v \to u}^{B}$ for all pairs of vertices u and v of degree 2, we obtain a tree with higher s-metric (due to Corollary 3) and without vertices of degree 2. This proves the claim. □

Thus, $τ_{2} (G) = 6 n - 15$ if and only if G has a spanning tree without vertices of degree 2. This concludes the proof.
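Claim 11 can be sanity-checked for very small orders by enumerating all labeled trees. The sketch below decodes every Prüfer sequence of length $n - 2$ for $n \in \{4, 6\}$ , keeps the trees of maximum degree at most 3, and tests both the bound and the equality condition. This is an illustration under the stated assumptions, not a substitute for the proof; the function names are hypothetical.

```python
import heapq
from itertools import product

def prufer_to_edges(seq, n):
    """Decode a Prüfer sequence over vertices 0..n-1 into the edge list of a tree."""
    deg = [1] * n
    for x in seq:
        deg[x] += 1
    leaves = [v for v in range(n) if deg[v] == 1]
    heapq.heapify(leaves)
    edges = []
    for x in seq:
        leaf = heapq.heappop(leaves)  # smallest current leaf
        edges.append((leaf, x))
        deg[x] -= 1
        if deg[x] == 1:
            heapq.heappush(leaves, x)
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges

def check_claim(n):
    """True if every tree of order n with maximum degree <= 3 satisfies Claim 11."""
    bound = 6 * n - 15
    for seq in product(range(n), repeat=n - 2):
        edges = prufer_to_edges(seq, n)
        deg = [0] * n
        for u, v in edges:
            deg[u] += 1
            deg[v] += 1
        if max(deg) > 3:
            continue
        s = sum(deg[u] * deg[v] for u, v in edges)
        has_no_degree_two = 2 not in deg
        # The bound must hold, with equality exactly for trees without degree-2 vertices.
        if s > bound or (s == bound) != has_no_degree_two:
            return False
    return True

results = {n: check_claim(n) for n in (4, 6)}
print(results)
```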

Next, we consider bipartite graphs.

Theorem 12. The m-SF Spanning Tree and s-SF Spanning Tree problems are $N P$ -hard for bipartite graphs.

Proof. We present a polynomial-time reduction from the NP-complete 3-Dimensional Matching (3-DM) problem (Garey and Johnson, 1979):

Instance: Pairwise disjoint sets X, Y, Z of cardinality n, and a collection $ℳ$ of m three-element sets, where each $M \in ℳ$ includes exactly one element from each of X, Y, and Z.

Question: Is there a set of pairwise disjoint members of $ℳ$ (a perfect 3-dimensional matching), whose union is $X \cup Y \cup Z$ ?

Let $Q = (X, Y, Z, ℳ)$ be an instance of 3-DM. We construct a graph $G = G_{Q}$ on $3 n + m + 1$ vertices as follows. The vertex set of G is the disjoint union ${r} \cup A \cup B$ , where $A = ℳ$ , $B = X \cup Y \cup Z$ , and r is the special root vertex. The edge set includes all edges ra, $a \in A$ , as well as the edges Mx, My, and Mz for each $M = {x, y, z} \in A$ (Fig. 2). We may assume that G is connected. Note also that G is a bipartite graph with the parts A and ${r} \cup B$ .

For a vertex v of G and a subset $W \subseteq V (G)$ let us denote by $(v : W)$ the set of edges connecting v to vertices in W.

FIG. 2. An example of the graph G for $n = 3$ , $X = \{x_{1}, x_{2}, x_{3}\}$ , $Y = \{y_{1}, y_{2}, y_{3}\}$ , $Z = \{z_{1}, z_{2}, z_{3}\}$ , and $ℳ = \{\{x_{1}, y_{2}, z_{1}\}, \{x_{3}, y_{2}, z_{3}\}, \{x_{2}, y_{1}, z_{1}\}, \{x_{1}, y_{2}, z_{3}\}, \{x_{3}, y_{1}, z_{2}\}, \{x_{2}, y_{3}, z_{1}\}\}$ . Here each vertex labeled $\{p, q, r\}$ represents a set $\{x_{p}, y_{q}, z_{r}\}$ .

Lemma 13. There are spanning trees T₁ and T₂ in G, both containing all edges of $(r : A)$ , with $m (T_{1}) = τ_{1} (G)$ and $s (T_{2}) = τ_{2} (G)$ .

Proof. We provide the proof for the s-metric, the proof for the m-metric is similar. Among the optimal spanning trees of G, let T₂ be the one with the maximum number of edges from $(r : A)$ . We claim that T₂ contains all these edges.

Suppose for a contradiction that the set $C \subseteq A$ of all vertices that are adjacent to r in T₂ differs from A. Then there must be a vertex $b \in B$ with $P_{T_{2}} (r, b)$ having two edges, such that set $D = N_{T_{2}} (b) \cap {A ∖ C}$ is nonempty. By Lemma 5, since ${deg}_{T_{2}} r^{+} = {deg}_{T_{2}} b^{-}$ and ${deg}_{T_{2}} r \geq 1$ , we can apply total neighbor switch $S_{b \to r}^{D}$ to construct a spanning tree $T'_{2}$ from T₂ with $s (T'_{2}) \geq s (T_{2})$ , and the root r having more neighbors in $T'_{2}$ than it has in T₂.

Any spanning tree T of G containing all edges of $(r : A)$ has $m + 3 n$ edges, $3 n (m - 1)$ paths of length three (each of the $3 n$ edges of the tree connecting A and B induces exactly $m - 1$ such paths), and $m (m - 1) ∕ 2 + 3 n$ paths of length two that are not formed by a pair of edges between A and B. There are $3 δ_{4} + δ_{3}$ remaining paths of length two, where $δ_{i}$ is the number of vertices in A that have degree i in the tree. Indeed, a vertex $v \in A$ with $j \in {0, 1, 2, 3}$ neighbors from B in the tree contributes no such path in case of $j \in {0, 1}$ , one such path in case of $j = 2$ , and three such paths in case of $j = 3$ . Thus by Proposition 1 $m (T) = m^{2} + m + 12 n + 6 δ_{4} + 2 δ_{3}, s (T) = m^{2} + 3 m n + 6 n + 6 δ_{4} + 2 δ_{3} .$

Since $| B | = 3 n$ , we have $3 δ_{4} + 2 δ_{3} \leq 3 n$ and $6 δ_{4} + 2 δ_{3} \leq 6 δ_{4} + 4 δ_{3} \leq 6 n$ . Hence, $6 δ_{4} + 2 δ_{3} \leq 6 n$ with equality holding if and only if $δ_{3} = 0$ and $δ_{4} = n$ .

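A perfect 3-DM $ℳ^{*} = \{M_{1}, \dots, M_{n}\}$ induces the spanning tree $T_{ℳ^{*}}$ that contains all edges from $(r : A)$ and edges $a x, a y, a z$ for each $a = \{x, y, z\} \in ℳ^{*}$ . For this tree we have $δ_{4} = n$ , $δ_{3} = 0$ , and $m (T_{ℳ^{*}}) = t_{1} (n, m) : = m^{2} + m + 18 n, s (T_{ℳ^{*}}) = t_{2} (n, m) : = m^{2} + 3 m n + 12 n .$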

Conversely, every spanning tree T that contains all edges from $(r : A)$ and satisfies $m (T) = t_{1} (n, m)$ or $s (T) = t_{2} (n, m)$ (and thus $δ_{4} = n$ ) arises from a perfect 3-DM.

By Lemma 13, the graph G satisfies $τ_{1} (G) \geq t_{1} (n, m)$ (resp., $τ_{2} (G) \geq t_{2} (n, m)$ ) if and only if there is a spanning tree T of G that contains all edges from $(r : A)$ and whose m-metric (resp., s-metric) is equal to $t_{1} (n, m)$ (resp., $t_{2} (n, m))$ . The latter is true if and only if Q has a perfect 3-DM. □
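The reduction can be traced on the instance shown in Fig. 2. The sketch below finds a perfect 3-dimensional matching by brute force, builds the induced spanning tree (all edges from r to A plus the three edges of each matched triple), and compares its metrics with the expressions for $m (T)$ and $s (T)$ in terms of $δ_{4}$ and $δ_{3}$ derived in the proof. The vertex encoding and the function names are assumptions made for illustration.

```python
from itertools import combinations

def find_perfect_matching(triples, universe):
    """Brute-force search for n triples covering the 3n elements (tiny instances only)."""
    n = len(universe) // 3
    for idx in combinations(range(len(triples)), n):
        covered = set().union(*(triples[i] for i in idx))
        if covered == universe:  # n triples covering 3n elements are pairwise disjoint
            return idx
    return None

def matching_tree(triples, matching):
    """Edges r-a for every triple a, plus a-x, a-y, a-z for each matched triple."""
    edges = [("r", ("A", i)) for i in range(len(triples))]
    for i in matching:
        edges += [(("A", i), b) for b in triples[i]]
    return edges

def metrics(edges):
    """Return (m-metric, s-metric, degree dict) for an edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    m = sum(d * d for d in deg.values())
    s = sum(deg[u] * deg[v] for u, v in edges)
    return m, s, deg

# The 3-DM instance of Fig. 2.
triples = [("x1", "y2", "z1"), ("x3", "y2", "z3"), ("x2", "y1", "z1"),
           ("x1", "y2", "z3"), ("x3", "y1", "z2"), ("x2", "y3", "z1")]
universe = {f"{c}{i}" for c in "xyz" for i in (1, 2, 3)}
n_elems, m_sets = 3, len(triples)

matching = find_perfect_matching(triples, universe)
tree = matching_tree(triples, matching)
m_val, s_val, deg = metrics(tree)

# Number of vertices of A with degree 4 and degree 3 in the tree.
d4 = sum(1 for i in range(m_sets) if deg[("A", i)] == 4)
d3 = sum(1 for i in range(m_sets) if deg[("A", i)] == 3)
m_formula = m_sets ** 2 + m_sets + 12 * n_elems + 6 * d4 + 2 * d3
s_formula = m_sets ** 2 + 3 * m_sets * n_elems + 6 * n_elems + 6 * d4 + 2 * d3
print(matching, m_val, m_formula, s_val, s_formula)
```

In this instance the element $y_{3}$ occurs in a single triple, which forces the matching and makes the example easy to follow by hand.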

4. ILP Formulations

Here we describe two ILP models for the s-SF Spanning Tree problem (for the m-SF Spanning Tree problem the approach is similar). For a given spanning tree T of a graph $G = (V, E)$ of order n, consider the indicator variables $(x_{e})_{e \in E}$ : $x_{e} = \begin{cases} 1, & e \in E (T); \\ 0, & \text{otherwise} . \end{cases}$ (13)

Using Proposition 1, we can represent $s (T)$ as $s (T) = \sum_{{e_{i}, e_{j}, e_{k}} \in Γ_{3} (G)} x_{e_{i}} x_{e_{j}} x_{e_{k}} + 2 \sum_{{e_{i}, e_{j}} \in Γ_{2} (G)} x_{e_{i}} x_{e_{j}} + \sum_{e \in E (G)} x_{e},$ (14)

where $Γ_{i} (G)$ denotes the set of all paths of length i in G. To linearize (14), we introduce Boolean variables $y_{i j k}$ and $y_{i j}$ and the following constraints: $\begin{matrix} y_{i j k} \leq x_{e_{i}}, & y_{i j} \leq x_{e_{i}}, \\ y_{i j k} \leq x_{e_{j}}, & y_{i j} \leq x_{e_{j}}, \\ y_{i j k} \leq x_{e_{k}}, & y_{i j} \geq x_{e_{i}} + x_{e_{j}} - 1, \\ y_{i j k} \geq x_{e_{i}} + x_{e_{j}} + x_{e_{k}} - 2, & \end{matrix}$ (15)

for every ${e_{i}, e_{j}, e_{k}} \in Γ_{3} (G)$ and ${e_{i}, e_{j}} \in Γ_{2} (G)$ , which are equivalent to $y_{i j k} = x_{e_{i}} x_{e_{j}} x_{e_{k}}$ and $y_{i j} = x_{e_{i}} x_{e_{j}}$ . Thus the objective function (14) can be rewritten as $s (T) = \sum_{{e_{i}, e_{j}, e_{k}} \in Γ_{3} (G)} y_{i j k} + 2 \sum_{{e_{i}, e_{j}} \in Γ_{2} (G)} y_{i j} + \sum_{e \in E (G)} x_{e} .$ (16)
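The equivalence between the linear constraints (15) and the products of binary variables can be confirmed by enumerating all 0/1 assignments. The sketch below lists, for each assignment of the x variables, the binary values of y that the constraints admit, and checks that the only admissible value is the product. The function names are assumptions made for illustration.

```python
from itertools import product

def feasible_y3(xi, xj, xk):
    """Binary values of y_ijk admitted by constraints (15) for fixed x values."""
    return [y for y in (0, 1)
            if y <= xi and y <= xj and y <= xk and y >= xi + xj + xk - 2]

def feasible_y2(xi, xj):
    """Binary values of y_ij admitted by constraints (15) for fixed x values."""
    return [y for y in (0, 1)
            if y <= xi and y <= xj and y >= xi + xj - 1]

# For every 0/1 assignment, exactly one y is feasible and it equals the product.
ok3 = all(feasible_y3(*x) == [x[0] * x[1] * x[2]]
          for x in product((0, 1), repeat=3))
ok2 = all(feasible_y2(*x) == [x[0] * x[1]]
          for x in product((0, 1), repeat=2))
print(ok3, ok2)
```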

We use two types of constraints to describe the spanning trees. The first type is the extended formulation of Martin (1991), which uses auxiliary variables $z_{(v, w)}^{r}, z_{(w, v)}^{r} \geq 0 \text{ for every } r \in V (G), v w \in E (G),$ (17)

where $z_{(v, r)}^{r} = 0$ for every $r \in V (G)$ and $v r \in E (G)$ . A 0/1-vector x describes a spanning tree of G if and only if these variables satisfy the constraints $\begin{matrix} x_{v w} - z_{(v, w)}^{r} - z_{(w, v)}^{r} = 0, & r \in V (G), v w \in E (G), \\ \sum_{v w \in E (G)} z_{(v, w)}^{r} = 1, & r, w \in V (G), r \neq w, \\ \sum_{v r \in E (G)} z_{(v, r)}^{r} = 0, & r \in V (G) . \end{matrix}$ (18)

The second type exploits the Miller–Tucker–Zemlin (MTZ) constraints (Miller et al., 1960). We introduce the auxiliary variables $\begin{aligned} & z_{(v, w)}, z_{(w, v)} \in \{0, 1\} && \text{for every } vw \in E(G), \\ & t_{v} \in [0, n - 1] && \text{for every } v \in V(G), \end{aligned}$ (19)

and constraints $\begin{aligned} x_{vw} - z_{(v, w)} - z_{(w, v)} &= 0, && vw \in E(G), \\ \sum_{vw \in E(G)} z_{(v, w)} &= 1, && w \in V(G) \setminus \{r\}, \\ \sum_{vr \in E(G)} z_{(v, r)} &= 0, \\ t_{v} - t_{w} + n z_{(v, w)} &\leq n - 1, && v, w \in V(G),\ vw \in E(G), \end{aligned}$ (20)

where $r \in V(G)$ is some fixed vertex. Finally, we add the constraint $s(T) = \sum_{\{e_{i}, e_{j}, e_{k}\} \in \Gamma_{3}(G)} y_{ijk} + 2 \sum_{\{e_{i}, e_{j}\} \in \Gamma_{2}(G)} y_{ij} + \sum_{e \in E(G)} x_{e} \leq n(n - \Delta(G) - 1) + \Delta^{2}(G),$ (21)

defined by Theorem 7, which turns out to significantly improve the algorithm running times. Maximization of the objective (16) subject to the constraints (15), (18), and (21) is referred to below as the Martin formulation, and maximization of (16) subject to (15), (20), and (21) as the MTZ formulation.
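For very small graphs, the optimum of either formulation can be cross-checked by exhaustive enumeration. The sketch below is not the ILP itself: it enumerates all spanning trees, evaluates $s(T)$ through the path-counting expression (14), and compares it with the sum of degree products over tree edges. The latter is assumed here to be the definition of $s(T)$, following the s-metric of Li et al. (2005); all function names are illustrative.

```python
from itertools import combinations

def paths_2_and_3(vertices, edges):
    """Edge sets of all paths of length 2 and 3 in G (Gamma_2 and Gamma_3)."""
    adj = {v: set() for v in vertices}
    for v, w in edges:
        adj[v].add(w)
        adj[w].add(v)
    e = lambda a, b: frozenset((a, b))
    gamma2 = {frozenset((e(a, b), e(b, c)))
              for b in vertices for a, c in combinations(sorted(adj[b]), 2)}
    gamma3 = set()
    for b, c in edges:                      # (b, c) is the middle edge
        for a in adj[b] - {c}:
            for d in adj[c] - {b}:
                if a != d:                  # excludes triangles
                    gamma3.add(frozenset((e(a, b), e(b, c), e(c, d))))
    return gamma2, gamma3

def is_spanning_tree(vertices, tree):
    """n - 1 edges and no cycle (checked with union-find)."""
    parent = {v: v for v in vertices}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for v, w in tree:
        rv, rw = find(v), find(w)
        if rv == rw:
            return False
        parent[rv] = rw
    return len(tree) == len(vertices) - 1

def s_by_paths(tree, gamma2, gamma3):
    """Expression (14) with x_e = 1 exactly for the tree edges."""
    x = {frozenset(ed) for ed in tree}
    return sum(p <= x for p in gamma3) + 2 * sum(p <= x for p in gamma2) + len(x)

def s_by_degrees(vertices, tree):
    """Assumed definition: sum of d(v) * d(w) over tree edges vw."""
    deg = {v: 0 for v in vertices}
    for v, w in tree:
        deg[v] += 1
        deg[w] += 1
    return sum(deg[v] * deg[w] for v, w in tree)

def best_s_tree(vertices, edges):
    gamma2, gamma3 = paths_2_and_3(vertices, edges)
    best = None
    for tree in combinations(edges, len(vertices) - 1):
        if is_spanning_tree(vertices, tree):
            s = s_by_paths(tree, gamma2, gamma3)
            assert s == s_by_degrees(vertices, tree)
            if best is None or s > best[0]:
                best = (s, tree)
    return best

V = [0, 1, 2, 3, 4]
E = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4), (2, 4)]
print(best_s_tree(V, E))
```

On this example (n = 5, maximum degree 3) the best value is 14, which coincides with the right-hand side of (21): 5(5 - 3 - 1) + 9 = 14. Enumeration grows exponentially with the number of edges, so it is only useful as a sanity check.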

5. Experimental Results

In this section, we investigate the practical aspects of scale-free spanning tree problems by conducting computational experiments on various simulated and experimental data sets to evaluate the performance of the ILP models. All computations below were performed on a standard laptop with a 2.0 GHz dual-core processor and 16 GB of RAM, and ILP problems were solved using Gurobi 8.1.

5.1. Synthetic data

5.1.1. Synthetic graphs

We used graphs from the following synthetic data sets:

Erdős–Rényi graphs constructed by adding each possible edge uniformly and independently with the probability $p = 4.25/n$. The number of nodes n in our experiments varied from 10 to 40 (corresponding to the sizes of HCV outbreaks analyzed later).

$n \times m$ grid graphs (Cartesian products of paths P_n and P_m) with $n, m = 4, \dots, 7$ .

Scale-free graphs of two types generated using the NetworkX library (Hagberg et al., 2008): those based on the classical Barabási and Albert (1999) model and those constructed with NetworkX default parameters. The latter graphs are usually denser.
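The first two graph classes are simple enough to generate with the standard library alone. The following sketch shows one way to do so; resampling Erdős–Rényi graphs until a connected one is obtained is an assumption made here (the problem is defined for connected graphs), not a procedure stated in the text, and the function names are illustrative. Scale-free graphs are left to NetworkX.

```python
import random
from itertools import combinations

def is_connected(vertices, edges):
    """Depth-first search from an arbitrary vertex."""
    vertices = list(vertices)
    adj = {v: [] for v in vertices}
    for v, w in edges:
        adj[v].append(w)
        adj[w].append(v)
    seen, stack = {vertices[0]}, [vertices[0]]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(vertices)

def connected_erdos_renyi(n, seed=0, c=4.25, max_tries=10000):
    """G(n, p) with p = c / n, resampled until connected (assumption)."""
    rng = random.Random(seed)
    p = min(1.0, c / n)
    for _ in range(max_tries):
        edges = [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]
        if is_connected(range(n), edges):
            return edges
    raise RuntimeError("no connected sample found")

def grid_graph(n, m):
    """Edges of the Cartesian product of paths P_n and P_m."""
    edges = []
    for i in range(n):
        for j in range(m):
            if i + 1 < n:
                edges.append(((i, j), (i + 1, j)))
            if j + 1 < m:
                edges.append(((i, j), (i, j + 1)))
    return edges

print(len(connected_erdos_renyi(10)), len(grid_graph(4, 4)))
```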

For all synthetic data sets except for grid graphs, we generated 10 graphs per node number. Figures 3 and 4 show the running times of the ILP solver on the MTZ and Martin formulations compared with those of the published tool QUENTIN (Skums et al., 2018) for all four simulated graph classes.¹ The results demonstrate that, for those graph classes, the ILP algorithms on average perform much better than in the worst case and produce optimal results in a reasonable amount of time. Moreover, for the considered graph sizes, they outperform QUENTIN. For Erdős–Rényi graphs and grids (Fig. 3), which are characterized by relatively large sets of feasible solutions, the Martin formulation was superior to MTZ and QUENTIN, while for Barabási–Albert scale-free graphs (Fig. 4a), the MTZ formulation led to the faster algorithm. In general, the ILP approach solves the problem within minutes or a few hours for small-to-medium-sized problems (up to several dozen vertices) on Erdős–Rényi graphs and grids, and for medium-sized problems (several hundred vertices) on scale-free graphs.

FIG. 3.

Running times of ILP solver and QUENTIN on Erdős–Rényi graphs (a) and grids (b). ILP, integer linear programming.

FIG. 4.

Running times of ILP solver on Barabási–Albert (a) and NetworkX (b) scale-free graphs.

5.1.2. Simulated outbreaks

We simulated outbreaks over scale-free Barabási–Albert contact networks of $n = 10$–$30$ nodes using the following model. The infection spreads over each network according to the susceptible-infected (SI) model (Newman, 2010) with the transmission rate $\rho = 10^{-2}$. Each infected individual is assumed to carry a viral sequence of length $m = 13200$, and at each transmission event, the source's sequence is transmitted to the recipient. Sequence evolution is described by a skyline model with a piecewise constant decreasing mutation rate, that is, viral sequences mutate at the basic rate of $\mu = 10^{-5}$ changes/position/time unit, and the mutation rate decreases by 30% every $\tau = 100$ time units. This model captures the decrease of the speed of intrahost evolution as the infection progresses from an acute to a persistent stage (De Maio et al., 2016; Icer et al., 2020).
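The skyline mutation rate can be written as a step function of time. The sketch below assumes that the 30% decrease compounds multiplicatively after each full interval of length $\tau$ and that time is measured from the start of the host's infection; neither detail is stated explicitly above, and the function names are illustrative.

```python
def mutation_rate(t, mu=1e-5, decay=0.30, tau=100):
    """Per-position rate at time t: mu on [0, tau), then reduced by `decay` per interval."""
    return mu * (1.0 - decay) ** int(t // tau)

def expected_mutations(t_end, seq_len=13200, mu=1e-5, decay=0.30, tau=100):
    """Expected number of changes in one sequence over [0, t_end), summing the constant pieces."""
    total, t = 0.0, 0.0
    while t < t_end:
        step = min(tau - (t % tau), t_end - t)   # stay inside the current constant piece
        total += mutation_rate(t, mu, decay, tau) * step * seq_len
        t += step
    return total

print(mutation_rate(0), mutation_rate(250), expected_mutations(150))
```

Under these assumptions, a sequence of length 13200 accumulates about 13.2 expected changes in the first 100 time units and about 4.62 more in the following 50.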

For each simulated outbreak, we compared the performance of the ILP algorithm for the Martin formulation with the standard approach based on phylogenetic trait inference (Sagulenko et al., 2018). First, we constructed a maximum likelihood phylogeny using MEGA (Kumar et al., 2018). Each patient was encoded by a discrete trait, and the marginal likelihood ancestral traits were reconstructed using the Felsenstein pruning algorithm (Felsenstein, 2004) with the pairwise between-trait transition rates equal to $\rho$. Inferred transmission links then correspond to trait changes along the phylogeny branches. The genetic relatedness network G_R used as an input for the ILP was constructed using a threshold-based approach suggested by Kosakovsky Pond et al. (2018). Two vertices of G_R are adjacent if the Hamming distance between the corresponding sequences does not exceed a threshold t, which was estimated as the minimal integer such that the graph G_R is connected. The obtained graph was further sparsified by applying the same procedure to each of its biconnected components.
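The threshold selection can be sketched as follows, assuming aligned sequences of equal length. Connectivity can only change at observed distance values, so it suffices to try those in increasing order. The subsequent sparsification within biconnected components is omitted, and the function names are illustrative.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def is_connected(n, edges):
    adj = {v: [] for v in range(n)}
    for v, w in edges:
        adj[v].append(w)
        adj[w].append(v)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n

def relatedness_network(seqs):
    """Return (t, edges) for the smallest threshold t that makes the graph connected."""
    n = len(seqs)
    dist = {(i, j): hamming(seqs[i], seqs[j]) for i, j in combinations(range(n), 2)}
    for t in sorted(set(dist.values())):
        edges = [e for e, d in dist.items() if d <= t]
        if is_connected(n, edges):
            return t, edges
    return 0, []   # a single sequence: nothing to connect

print(relatedness_network(["AAAA", "AAAT", "AATT", "TTTT"]))
# → (2, [(0, 1), (0, 2), (1, 2), (2, 3)])
```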

The results of the comparison are shown in Figure 5. We measured accuracy as the proportion of correctly inferred transmission links and of correctly inferred transmission ancestries (i.e., ancestor/descendant pairs). The s-SF-based ILP clearly outperformed the phylogenetic approach: the average transmission link detection accuracy was 82.44% for the former and 72.61% for the latter, while the average transmission ancestry detection accuracies were 97.48% and 73.96%, respectively.
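The two accuracy measures can be illustrated on a toy example. The sketch below assumes that both the true and the inferred transmission trees are rooted and given as maps from each host to its source, and that accuracy is the fraction of true pairs that are recovered; the exact normalization used in the experiments is not specified above, so this is only one plausible reading, with illustrative function names.

```python
def links(parent):
    """Directed transmission links (source, recipient)."""
    return {(p, c) for c, p in parent.items() if p is not None}

def ancestries(parent):
    """All (ancestor, descendant) pairs of the rooted tree."""
    pairs = set()
    for c in parent:
        a = parent[c]
        while a is not None:
            pairs.add((a, c))
            a = parent[a]
    return pairs

def accuracy(true_pairs, inferred_pairs):
    """Fraction of true pairs that also appear in the inferred set."""
    return len(true_pairs & inferred_pairs) / len(true_pairs)

true_tree = {0: None, 1: 0, 2: 0, 3: 1}       # host -> source, None marks the root
inferred_tree = {0: None, 1: 0, 2: 1, 3: 1}   # host 2 attached to the wrong source

print(accuracy(links(true_tree), links(inferred_tree)))            # 2 of 3 links
print(accuracy(ancestries(true_tree), ancestries(inferred_tree)))  # → 1.0
```

In this example one link is wrong, yet every true ancestor/descendant pair is still recovered, which shows how ancestry accuracy can exceed link accuracy.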

FIG. 5.

Accuracy of s-SF ILP model compared with the phylogenetic trait inference algorithm.

5.2. Data from hepatitis C outbreaks

We applied the concept of scale-free spanning trees to the graphs arising from the benchmark data set consisting of several epidemiologically curated HCV outbreaks investigated by the CDC (Campo et al., 2016; Glebova et al., 2017; Skums et al., 2018). This data set comprises HCV quasispecies populations sampled from 81 infected individuals involved in 10 viral outbreaks. Each population consists of RNA sequences of HCV hypervariable region 1 (HVR1) of length 264 bp. Transmission histories of the outbreaks (“who infected whom”) are known as a result of epidemiological investigations. In this case, we are dealing with intrahost viral populations rather than single sequences, and therefore, we compared the proposed approach with QUENTIN, which has been specifically designed to handle such data (Skums et al., 2018).

For each outbreak, the genetic relatedness network G_R was constructed using the threshold-based approach suggested by Campo et al. (2016). Two vertices of G_R are adjacent if the minimal Hamming distance between the sets of sequences sampled from the corresponding patients does not exceed the threshold t. The threshold value was estimated as described in Subsection 5.1.2. Next, the ILP algorithm for the Martin formulation was applied to G_R. For all outbreaks, the ILP problem was solved to optimality.

We tested the accuracy of inference of transmission links and identification of the superspreaders (the sources of the majority of infections). The results are reported in Table 1. The superspreaders correspond to vertices of highest degree in s-optimal and m-optimal trees for 9 out of 10 outbreaks. It should be noted that all algorithms incorrectly identified a superspreader for the same outbreak. It is the only outbreak where the virus was transmitted via a nonsocial interaction (namely, through blood transfusions), while all other outbreaks were associated with unsafe injection practices or sexual contacts. For those outbreaks, both ILP approaches correctly recovered 92% of transmission links and all ancestor/descendant pairs, thus outperforming QUENTIN.

Table 1.

Results on Experimental Data with Different Models

Method	(A)	(B)	(C)
QUENTIN	0.90	0.78	0.98
s-SF	0.90	0.92	1.00
m-SF	0.90	0.92	1.00

(A) Superspreader inference accuracy, (B) accuracy of transmission link inference, and (C) accuracy of transmission ancestry inference.

6. Discussion

In genomic epidemiology, reconstruction of viral transmission histories from genomic data is fundamental for the investigation of outbreaks and the understanding of epidemic spread. Genomic analysis has become one of the major tools for outbreak investigation and surveillance of transmission dynamics (Armstrong et al., 2019; Knyazev et al., 2020). Naturally, graphs are the primary models used in such studies (Wertheim et al., 2014; Campo et al., 2016; Ragonnet-Cronin et al., 2019). In many settings, graph-based methods have been shown to be more efficient at ascertaining transmission links than methods based on binary phylogenies (Wertheim et al., 2014), as phylogenetic clades are not easily resolvable into transmission clusters and pairs (Lewis et al., 2008; Hughes et al., 2009; Kouyos et al., 2010), while the statistical support for a clade does not necessarily indicate statistical support for a relationship between individual genomes inside the clade (Volz et al., 2012; Wertheim et al., 2014). However, in many cases, transmission links cannot be inferred from genomic data alone (Jombart et al., 2014; Villandre et al., 2016). This creates the need to introduce additional constraints on the reconstructed transmission networks or to use more complicated objectives.

As a result, the associated algorithmic problems become harder. In this article, we studied one such problem, the scale-free spanning tree problem, which arises in epidemiological studies of viruses whose spread is highly influenced by social networks of contacts between susceptible individuals. This includes HIV, HCV, and other pathogens transmitted through sexual contact or needle sharing. We demonstrated that this problem in its two possible algorithmic formulations is NP-hard, even if restricted to relatively simple graph classes. However, it admits an ILP formulation that allows the problem to be solved efficiently for small-to-medium networks. This is enough for the vast majority of HIV and HCV outbreaks, which involve dozens of infected individuals.

However, some outbreaks involve hundreds or even thousands of hosts, and in such cases, more scalable algorithmic solutions are needed. Thus, an important open problem is to establish whether a constant-factor or logarithmic-factor approximation exists for the m-SF Spanning Tree and s-SF Spanning Tree problems. In this context, it would be interesting to explore the relationships between scale-free spanning tree problems and the max-leaf spanning tree problem. The latter is a well-studied combinatorial problem (Griggs et al., 1989; Galbiati et al., 1994), which seems to be the closest to our problem. Indeed, both problems aim to find a “star-like” spanning tree; furthermore, several of the reductions used in our hardness proofs exploit this relationship. Importantly, Lu and Ravi (1998) and Reich (2016) showed that the max-leaf spanning tree problem is approximable within a constant factor. Although the problems are far from being equivalent, it seems reasonable for future studies to try to adapt the algorithmic machinery developed for the max-leaf spanning tree problem to the scale-free spanning tree problem.

Footnotes

Author Disclosure Statement

The authors declare they have no conflicting financial interests.

Funding Information

Y.O. was partially supported by the BRFFR grant (Project F20UKA-005). The work of V.K. and K.K. was supported by the German National Science Foundation via DFG-Research Training Group 2297 (Mathematical Complexity Reduction—MathCoRe). P.S. was supported by the National Institutes of Health grant 1R01EB025022 and by the National Science Foundation grant 2047828.

References

Armstrong, G.L., MacCannell, D.R., Taylor, et al. 2019. Pathogen genomics in public health. N. Engl. J. Med. 381, 2569–2580.

Barabási, A.-L., and Albert 1999. Emergence of scaling in random networks. Science, 286, 509–512.

Bonsma 2012. Max-leaves spanning tree is APX-hard for cubic graphs. J. Discrete Algorithms. 12, 14–23.

Borovicanin, Das, K.C., Furtula, et al. 2017. Bounds for Zagreb indices. MATCH Commun. Math. Comput. Chem. 78, 17–100.

Campbell, E.M., Jia, Shankar, et al. 2017. Detailed transmission network analysis of a large opiate-driven outbreak of HIV infection in the United States. J. Infect. Dis. 216, 1053–1062.

Campo, D.S., Xia, G.-L., Dimitrova, et al. 2016. Accurate genetic detection of hepatitis C virus transmissions in outbreak settings. J. Infect. Dis. 213, 957–965.

Chartrand, Lesniak, and Zhang 2016. Graphs & Digraphs. CRC Press, Taylor & Francis Group, Boca Raton, FL.

Das, K.C. 2003. Sharp bounds for the sum of the squares of the degrees of a graph. Kragujev. J. Math. 25, 19–41.

Das, K.C., and Gutman 2004. Some properties of the second Zagreb index. MATCH Commun. Math. Comput. Chem. 52, 103–112.

de Caen 1998. An upper bound on the sum of squares of degrees in a graph. Discrete Math. 185, 245–248.

De Maio, Wu, C.-H., and Wilson, D.J. 2016. SCOTTI: Efficient reconstruction of transmission within outbreaks with the structured coalescent. PLoS Comput. Biol. 12, e1005130.

Felsenstein 2004. Inferring phylogenies. Sinauer Associates, Sunderland, MA.

Galbiati, Maffioli, and Morzenti 1994. A short note on the approximability of the maximum leaves spanning tree problem. Inform. Process. Lett. 52, 45–49.

Galvani, A.P., and May, R.M. 2005. Dimensions of superspreading. Nature, 438, 293–295.

Garey, M.R., and Johnson, D.S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman & Co., New York, NY.

Glebova, Knyazev, Melnyk, et al. 2017. Inference of genetic relatedness between viral quasispecies from sequencing data. BMC Genomics. 18, 918.

Griggs, J.R., Kleitman, D.J., and Shastri 1989. Spanning trees with many leaves in cubic graphs. J. Graph Theory, 13, 669–695.

Hagberg, Swart, and Chult 2008. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference. Online publication, Pasadena, California; pp. 11–16.

Hajarizadeh, Grebely, and Dore, G.J. 2013. Epidemiology and natural history of HCV infection. Nat. Rev. Gastroenterol. Hepatol. 10, 553–562.

Huang, Wang, Li, et al. 2020. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet, 395, 497–506.

Hughes, G.J., Fearnhill, Dunn, et al. 2009. Molecular phylodynamics of the heterosexual HIV epidemic in the United Kingdom. PLoS Pathogens, 5, e1000590.

Icer Baykal, Lara, Khudyakov, et al. 2020. Quantitative differences between intra-host HCV populations from persons with recently established and persistent infections. Virus Evol. 7, veaa103.

Jha, Skums, Zelikovsky, et al. 2017. Modeling the spread of HIV and HCV infections based on identification and characterization of high-risk communities using social media, 425–430. In International Symposium on Bioinformatics Research and Applications. Springer, Cham.

Jombart, Cori, Didelot, et al. 2014. Bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data. PLoS Comput. Biol. 10, e1003457.

Jombart, Eggo, Dodd, et al. 2011. Reconstructing disease outbreaks from genetic data: A graph approach. Heredity, 106, 383–390.

Kilmarx, P.H. 2009. Global epidemiology of HIV. Curr. Opin. HIV AIDS. 4, 240–246.

Kincaid, R.K., Kunkler, S.J., Lamar, M.D., et al. 2016. Algorithms and complexity results for finding graphs with extremal Randić index. Networks, 67, 338–347.

Kleitman, D.J., and West, D.B. 1991. Spanning trees with many leaves. SIAM J. Discrete Math. 4, 99–106.

Klinkenberg, Backer, J.A., Didelot, et al. 2017. Simultaneous inference of phylogenetic and transmission trees in infectious disease outbreaks. PLoS Comput. Biol. 13, e1005495.

Knyazev, Hughes, Skums, et al. 2020. Epidemiological data analysis of viral quasispecies in the next-generation sequencing era. Brief. Bioinformatics, 22, 96–108.

Kosakovsky Pond, S.L., Weaver, Leigh Brown, A.J., et al. 2018. HIV-TRACE (TRansmission Cluster Engine): A tool for large scale molecular epidemiology of HIV-1 and other rapidly evolving pathogens. Mol. Biol. Evol. 35, 1812–1819.

Kouyos, R.D., Von Wyl, Yerly, et al. 2010. Molecular epidemiology reveals long-term changes in HIV type 1 subtype B transmission in Switzerland. J. Infect. Dis. 201, 1488–1497.

Kumar, Stecher, Li, et al. 2018. MEGA X: Molecular evolutionary genetics analysis across computing platforms. Mol. Biol. Evol. 35, 1547–1549.

Leigh Brown, A.J., Lycett, S.J., Weinert, et al. 2011. Transmission network parameters estimated from HIV sequences for a nationwide epidemic. J. Infect. Dis. 204, 1463–1469.

Lemke 1988. The maximum leaf spanning tree problem for cubic graphs is NP-complete. IMA Preprint Series No. 428.

Lewis, Hughes, G.J., Rambaut, et al. 2008. Episodic sexual transmission of HIV revealed by molecular phylodynamics. PLoS Med. 5, e50.

Li, Alderson, Doyle, J.C., et al. 2005. Towards a theory of scale-free graphs: Definition, properties, and implications. Internet Math. 2, 431–523.

Lu, H.I., and Ravi 1998. Approximating maximum leaf spanning trees in almost linear time. J. Algorithms. 29, 132–141.

Martin, R.K. 1991. Using separation algorithms to generate mixed integer model reformulations. Oper. Res. Lett. 10, 119–128.

Miller, C.E., Tucker, A.W., and Zemlin, R.A. 1960. Integer programming formulation of traveling salesman problems. J. ACM, 7, 326–329.

Newman 2010. Networks. Oxford University Press, New York, NY.

Papadimitriou and Yannakakis 1991. Optimization, approximation, and complexity classes. J. Comput. Syst. Sci. 43, 425–440.

Ragonnet-Cronin, Hu, Y.W., Morris, S.R., et al. 2019. HIV transmission networks among transgender women in Los Angeles County, CA, USA: A phylogenetic analysis of surveillance data. Lancet HIV. 6, e164–e172.

Ramachandran, Thai, Forbi, J.C., et al. 2018. A large HCV transmission network enabled a fast-growing HIV outbreak in rural Indiana, 2015. EBioMedicine, 37, 374–381.

Reich 2016. Complexity of the maximum leaf spanning tree problem on planar and regular graphs. Theoret. Comput. Sci. 626, 134–143.

Sagulenko, Puller, and Neher, R.A. 2018. TreeTime: Maximum-likelihood phylodynamic analysis. Virus Evol. 4, vex042.

Skums, Zelikovsky, Singh, et al. 2018. QUENTIN: Reconstruction of disease transmissions from viral quasispecies genomic data. Bioinformatics, 34, 163–170.

Sledzieski, Zhang, Mandoiu, et al. 2019. TreeFix-TP: Phylogenetic error-correction for infectious disease transmission network inference. bioRxiv. 1:813931.

Villandre, Stephens, D.A., Labbe, et al. 2016. Assessment of overlap of phylogenetic transmission clusters and communities in simple sexual contact networks: Applications to HIV-1. PLoS One, 11, e0148459.

Volz, E.M., Koopman, J.S., Ward, M.J., et al. 2012. Simple epidemiological dynamics explain phylogenetic clustering of HIV from patients with recent infection. PLoS Comput. Biol. 8, e1002552.

Wertheim, J.O., Leigh Brown, A.J., Hepler, N.L., et al. 2014. The global transmission network of HIV-1. J. Infect. Dis. 209, 304–313.

Scale-Free Spanning Trees and Their Application in Genomic Epidemiology
