We study the algorithmic problem of finding the most “scale-free-like” spanning tree of a connected graph. This problem is motivated by the fundamental problem of genomic epidemiology: given viral genomes sampled from infected individuals, reconstruct the transmission network (“who infected whom”). We use two possible objective functions for this problem and introduce the corresponding algorithmic problems termedm-SF (-scale free) ands-SF Spanning Tree problems. We prove that those problems are APX- and NP-hard, respectively, even in the classes of cubic and bipartite graphs. We propose two integer linear programming (ILP) formulations for thes-SF Spanning Tree problem, and experimentally assess its performance using simulated and experimental data. In particular, we demonstrate that the ILP-based approach allows for accurate reconstruction of transmission histories of several hepatitis C outbreaks.
1. Introduction
Viral outbreaks continue to be major causes of morbidity and mortality. The ongoing pandemic of the coronavirus SARS-CoV-2 (Huang et al., 2020) is a vivid example, but long-standing epidemics of HIV, hepatitis B virus, and hepatitis C virus (HCV) are hardly less damaging (Kilmarx, 2009; Hajarizadeh et al., 2013). Viral epidemics are complex processes defined by evolutionary dynamics of pathogens and social dynamics of susceptible populations (e.g., individual behaviors, social interactions, and mobility patterns).
Recent advances in sequencing technologies invigorated the field of genomic epidemiology (Armstrong et al., 2019; Knyazev et al., 2020) that aims to use viral genomic data to understand the epidemiological dynamics of pathogens. The fundamental algorithmic problem of genomic epidemiology could be formulated as follows:
Given viral genomes sampled from n infected individuals, infer a transmission network indicating who of them infected whom (Knyazev et al., 2020). If each individual is supposed to be infected only once, then a transmission network is a tree called a transmission tree.
This problem has been approached by a variety of methods (Jombart et al., 2011, 2014; Sledzieski et al., 2019; Wertheim et al., 2014; Campo et al., 2016; De Maio et al., 2016; Klinkenberg et al., 2017; Skums et al., 2018). One family of methods is based on the so-called network approach. It is particularly popular among researchers of HIV and HCV and has been adopted as a standard methodology for outbreak investigations carried out by the CDC (Wertheim et al., 2014; Campo et al., 2016; Campbell et al., 2017; Kosakovsky Pond et al., 2018; Ramachandran et al., 2018; Ragonnet-Cronin et al., 2019). This approach usually consists of two stages. First, a weighted relatedness graph GR is constructed. Its vertices represent infected hosts, and edges connect the hosts whose viral populations are close to each other according to a selected population genetics measure. Often GR itself supplies enough information for epidemiologists and provides a fast and scalable alternative to phylogenetic trees when applied to next-generation sequencing (NGS) data (Wertheim et al., 2014; Campo et al., 2016; Ragonnet-Cronin et al., 2019). However, usually it contains many edges that do not represent actual transmissions. Thus, at the second stage, the transmission tree is inferred as the spanning tree of GR.
Under the maximum parsimony criterion, the most likely transmission network is a minimum spanning tree of GR (Jombart et al., 2011). However, experiments demonstrated that this approach is not accurate (Jombart et al., 2014). Furthermore, genomic data alone often do not allow to resolve ambiguities in transmission tree inference, and incorporation of additional evidence is necessary (Jombart et al., 2014; Villandre et al., 2016; Jha et al., 2017). Such evidence usually comes in the form of epidemiological information, such as sample collection times and exposure intervals. However, HIV, HCV, and many other infections tend to be initially asymptomatic, and consequently, sampling times may not accurately reflect the infection times. In addition, in outbreaks with high transmission rates (e.g., HIV/HCV among injection drug users), susceptible hosts are almost constantly exposed to the virus, which makes exposure intervals useless. Another important drawback of many existing methods is their implicit assumption that transmission tree edges are independent. In reality, it is not the case, as, for example, certain hosts (so-called superspreaders) infect more people than an average person (Galvani and May, 2005).
Skums et al. (2018) proposed an alternative approach. It is known that for viruses, whose transmissions are associated with behavioral risk factors, their transmission trees have properties of so-called scale-free graphs (Leigh Brown et al., 2011; Wertheim et al., 2014). Those graphs have specific features, including power-law degree distribution, small diameter, and the presence of high-degree vertices (hubs). This observation gives rise to the following informally defined algorithmic problem (scale-free spanning tree problem): find the most “scale-free-like” spanning tree T of the graph GR. In addition, constraints on the weight of T could be imposed. This approach was the basis of the Bayesian framework and the Markov Chain Monte Carlo algorithm for the transmission network inference described by Skums et al. (2018) and implemented as a tool called QUENTIN. Although QUENTIN is efficient in practice, it is a heuristic, and the questions about computational complexity and possibility of the exact solution of the problem were left open.
In this article, we present the first detailed study of the scale-free spanning tree problem. Our major contributions are as follows.
We propose two rigorous formulations of the scale-free spanning tree problem further referred to as m-SF Spanning Tree and s-SF Spanning Tree problems. They are based on two related objective functions and, to the best of our knowledge, have not been previously studied.
We establish the computational complexity of both problems by demonstrating that they are NP-hard or APX-hard, even when restricted to cubic graphs and bipartite graphs.
We propose two integer linear programming (ILP) formulations for the problems, and perform computational experiments to assess their performance using simulated data. Then we apply an ILP approach to real genomic data from several epidemiologically curated HCV outbreaks investigated by the CDC (Campo et al., 2016; Skums et al., 2018) and demonstrate that it allows for accurate inference of transmission trees.
2. Preliminaries
2.1. Problem formulations
We consider only finite undirected simple graphs and use standard graph-theoretic terminologies, see, for example, Chartrand et al. (2016). Let be a connected graph. For a vertex , the neighborhood of x is the set of all vertices that are adjacent to x in G. The degree of x is defined as . Several definitions of scale-free graphs of different degrees of mathematical rigor are known in a literature. We utilize the rigorous combinatorial characterization that has been introduced by Li et al. (2005) using the so-called s-metric of a graph. This graph invariant is defined as follows:
The same parameter is known in mathematical chemistry as second Zagreb index (Das and Gutman, 2004; Borovicanin et al., 2017). Li et al. (2005) demonstrated that a higher s-metric indicates with high probability the presence of most of the expected properties of scale-free graphs. The intuition behind these results is that in a graph with a high s-metric, a large number of edges should be incident to high-degree vertices, thus forcing them to resemble preferential attachment graphs—a standard Barabási and Albert (1999) model for scale-free networks. Therefore, another mathematical chemistry parameter called the first Zagreb index (Borovicanin et al., 2017) or m-metric also can serve as a measure of “scale-freeness” of a graph:
Thus, we can formulate m-SF Spanning Tree and s-SF Spanning Tree problems: given a connected graph G, find the spanning tree T of G such that (respectively, ) is maximal. The respective maximum values of and are called first and second SF-dimensions of G and denoted by and . By and , we denote an s-optimal tree and an m-optimal tree of G, respectively.
A somehow related problem has been studied by Kincaid et al. (2016): find a spanning subgraph with prescribed vertex degrees such that its s-metric is maximum. This problem is polynomially solvable in general, but becomes NP-hard, when the output spanning subgraph is required to be connected.
2.2. Mathematical preliminaries
2.2.1. Subgraph counting
Here we establish the characterizations for the m-metric and s-metric in terms of numbers of small subgraphs in a graph. This technique is used to establish complexity results in Section 3 and ILP formulations in Section 4.
Proposition 1.For any graph G,
whereis the number of triangles andis the number of paths of length t in G, respectively.
Proof. We prove only the second equality, the first one can be proved similarly. Let be the adjacency matrix of G and d be its degree vector. We have and , where . Therefore
where denotes -entry in the matrix A3.
It is known that equals the number of walks of length 3 between vertices i and j. Thus, is equal to one-half of the total number of three-walks in G. An edge produces exactly two such walks: and . Each 2-path produces four 3-walks: , , , and . Each 3-path produces two 3-walks: and . Finally, each triangle with vertex set produces six 3-walks: , , , , , and . As every three-walk of G has one of these forms, the statement of the lemma follows.□
2.2.2. Neighbor switching
This is a tree rearrangement technique that is used for obtaining structural and complexity results. Let T be a tree and be a pair of distinct vertices , where and . We denote the unique path in T by , and neighbors of u and v laying on by and , respectively. In case u and v are not adjacent, the neighbor of distinct from u and laying on is denoted by . Let , and let the set be partitioned into two subsets and , where . Furthermore, let and . Define numbers DA, DB, and DC as follows:
Given the pair , the neighbor switch is a transformation producing a new tree from T by replacing the edges with new edges (Fig. 1). This operation changes only degrees of the vertices u and v, namely , .
Neighbor switch.
Lemma 2.Suppose that. If,and, in case u and v are not adjacent, additionally, then.
Proof. We prove lemma when u and v are not adjacent, that is, and (the proof for the other case is similar). Define by X (resp., Y) the set of edges of T (resp., ) incident to u or v. Let us denote by (resp., ) the contribution to (resp., ) from the edges of X (resp., Y). Then
Using Equation (3) one can easily calculate
After substituting , we obtain
Similarly,
Using equalities (4)–(6) we obtain
Since and , it follows that . On the contrary, since and , we have and therefore .
If , then the neighbor switch produces a tree with v being a leaf. In this case is a total neighbor switch. For our goals it suffices to prove the following corollary.
Corollary 3.Ifis obtained from T by a total neighbor switchand, in case of u and v not being adjacent, additionallyor, then.
Proof. We check that all conditions of Lemma 2 are satisfied for the total neighbor switch. Indeed, since (recall ) and , we have and . If u and v are not adjacent, we still require that , as in Lemma 2. However, this condition can be replaced if we rewrite Equation (7) as follows:
Note that the latter expression is positive in case of , since and .
In the same way we can compare the trees T and in terms of their m-metrics.
Lemma 4.Suppose thatand. Then.
Proof. The idea is similar to the proof of Lemma 2. Since the neighbor switch changes only degrees of vertices u and v, , which proves the lemma, since .
For further results we need weaker modifications of Lemmas 2 and 4 for the case (and therefore ). Recall since we still require at least one vertex to switch.
Lemma 5.Supposeis obtained from T by a total neighbor switch, then the following propositions hold:
(a);
(b)(unless u and v are not adjacent with).
Now we consider a special case, when u is a vertex of maximum degree in T and all vertices in are leaves. In addition, let . We introduce a double neighbor switch. The reason to treat this two-step switch as a single operation is that the first switch itself might cause the descend of s-metric, however, the decrease would be compensated by the second switch.
Lemma 6.Ifis obtained from T by a double neighbor switch, then.
Proof. In case u and v are adjacent, that is, , a double switch gets reduced to the first neighbor switch , which produces a tree with a higher s-metric due to Lemma 2. Therefore, assume u and v are not adjacent. Consider the first switch and let . From Equation (7), since and , we obtain
Next let be obtained by the total neighbor switch. To avoid reassigning of notations we denote and . Other notations stay the same from the first switch. Again from Equation (7) we get
Summation of Equations (8) and (9) gives
where , , , and . Furthermore, since u is a vertex of maximum degree in T, and (recall ), which proves the lemma.
2.3. Bounds in terms of the maximum degree
There exist bounds for both SF-dimensions of a graph in terms of its order only (de Caen, 1998; Das, 2003; Das and Gutman, 2004). However, they are not particularly efficient, when used as ILP cuts. Here we provide the adjusted upper bounds that turned out to be more useful for that purpose. Let denote the maximum vertex degree of G and denote a double star, that is, a tree obtained from two disjoint stars and with m and k leaves, respectively, by adding an edge joining their central vertices.
Theorem 7.For any graph G of order,
Proof. We provide the proof for the second SF-dimension only (the other proof is similar). Suppose is an s-optimal tree of G and . We prove the statement by performing a sequence of neighbor switches on , with each of them increasing s-metric, so that the resulting tree is .
Let u be a vertex of maximum degree in . Then for every v in follows . Let . We divide the sequence of neighbor switches into three stages.
Stage 1: For each vertex v with all vertices in(where) being leaves, we either perform the total neighbor switchor double neighbor switchuntil the degree of u is not equal to
One can observe that a double neighbor switch is needed to ensure that can be increased exactly to . Since increases after each switch, only the finite number of switches is required. In case the tree T obtained after the first stage differs from , we perform the second stage if there exist at least two vertices w1 and w2 in with or jump directly to Stage 3 otherwise.
Stage 2: For each distinct w1and w2inwithperform a total neighbor switch.
After each iteration, the number of vertices in with degree at least two decreases by one. Thus, Stage 2 terminates after a finite number of switches leaving at most one vertex with degree at least two. Finally if T still differs from , the third stage is required.
Stage 3: While there exists vertex v inwithperform a total neighbor switch.
Since the number of neighbors of w with degrees at least two decreases after each switch, Stage 3 terminates after finite number of steps with all neighbors of except for u, being leaves, that is, . Note that each iteration of Stages 1–3 produces a tree with a higher s-metric due to Lemmas 2, 6 and Corollary 3.
3. Hardness Results
In this section, we study the computational complexity of both the m-SF and the s-SF Spanning Tree problem. The following known fact is used:
Theorem 8 (Kleitman and West, 1991). Any connected graph of order n with minimum vertex degree at least 3 has a spanning tree with at leastleaves.
We start by investigating the complexity of our problems for cubic graphs.
Theorem 9.The m-SF Spanning Treeproblem is-hard for cubic graphs.
Proof. Let G be a cubic graph on n vertices and T be a spanning tree with leaves and vertices of degree i, . Then
with the numbers ni satisfying the equalities and Deriving n2 and n3 from these equalities gives us
After substituting these expressions into Equation (10), we get
Thus, finding a spanning tree with maximum m-metric in this case is polynomially equivalent to finding a spanning tree with maximum number of leaves (MaxLeaf problem). For cubic graphs, the latter problem was shown to be APX-hard by Bonsma (2012). Thus, we prove the APX-hardness of the m-SF Spanning Tree problem by providing an L-reduction (Papadimitriou and Yannakakis, 1991) from MaxLeaf.
Given an optimization problem P and an instance I of this problem, we use to denote the optimum value of I, and to denote the value of a feasible solution S of instance I. Let A and B be two optimization problems. The problem A is said to be L-reducible to B if there exist polynomial-time computable functions f, g and constants such that
(L1) f maps any instance I of A to an instance of B such that ;
(L2) for any instance I of A and a solution of the instance , g maps to a solution S for I such that .
Let be an m-optimal spanning tree of G and be the maximum number of leaves in spanning trees of G. Note that by Theorem 8, and therefore, . Then using Equation (12) we get
Moreover, for every spanning tree T of G we have . As a result, Equation (12) implies an L-reduction with identity mappings f and g and constants and , thus proving the theorem.
Theorem 10.The s-SF Spanning Treeproblem is-hard for cubic graphs.
Proof. For the reduction, we use the following problem proved to be NP-complete by Lemke (1988):
Instance: A connected cubic graph G of order n.
Question: Is there a spanning tree of G without vertices of degree 2?
According to Equation (11), . Thus, the answer for the problem's question is negative if n is odd. Hence, we concentrate only on the case when is even, thus n2 is even as well. We show that among all trees T of order n with , the trees without vertices of degree 2 have the highest s-metric. Indeed, the following claim holds:
Claim 11.Ifandare even, then. The equality holds if and only if T has no vertices of degree 2.
Proof. If T has no vertices of degree 2, then Equation (11) implies . Furthermore, , where m1 is the number of edges incident to a leaf and m3 is the number of edges with both ends of degree 3. Obviously, and , thus yielding .
Now suppose that T has vertices of degree 2. Let u and v be two vertices of degree 2 lying on a path and . Iteratively applying a total neighbor switch for all pairs of vertices u and v of degree 2, we obtain a tree with higher s-metric (due to Corollary 3) and without vertices of degree 2. This proves the claim. □
Thus, if and only if G has a spanning tree without vertices of degree 2. This concludes the proof.
Proof. We present a polynomial-time reduction from the NP-complete 3-Dimensional Matching (3-DM) problem (Garey and Johnson, 1979):
Instance: Pairwise disjoint sets X, Y, Z of cardinality n, and a collection of m three-element sets, where each includes exactly one element from each of X, Y, and Z.
Question: Is there a set of pairwise disjoint members of (a perfect 3-dimensional matching), whose union is ?
Let be an instance of 3-DM. We construct a graph on vertices as follows. The vertex set of G is the disjoint union , where , , and r is the special root vertex. The edge set includes all edges ra, , as well as the edges Mx, My, and Mz for each (Fig. 2). We may assume that G is connected. Note also that G is a bipartite graph with the parts A and .
For a vertex v of G and a subset let us denote by the set of edges connecting v to vertices in W.
An example of the graph G for , , , , and . Here each vertex labeled represents a set .
Lemma 13.There are spanning trees T1and T2in G, both containing all edges of, withand.
Proof. We provide the proof for the s-metric, the proof for the m-metric is similar. Among the optimal spanning trees of G, let T2 be the one with the maximum number of edges from . We claim that T2 contains all these edges.
Suppose for a contradiction that the set of all vertices that are adjacent to r in T2 differs from A. Then there must be a vertex with having two edges, such that set is nonempty. By Lemma 5, since and , we can apply total neighbor switch to construct a spanning tree from T2 with , and the root r having more neighbors in than it has in T2.
Any spanning tree T of G containing all edges of has edges, paths of length three (each of the edges of the tree connecting A and B induces exactly such paths), and paths of length two that are not formed by a pair of edges between A and B. There are remaining paths of length two, where is the number of vertices in A that have degree i in the tree. Indeed, a vertex with neighbors from B in the tree contributes no such path in case of , one such path in case of , and three such paths in case of . Thus by Proposition 1
Since , we have and . Hence, with equality holding if and only if and .
A perfect 3-DM induces the spanning tree that contains all edges from and edges for each . For this tree we have and
Conversely, every spanning tree T that contains all edges from and or (and thus ) arises from a perfect 3-DM.
By Lemma 13, the graph G satisfies (resp., ) if and only if there is a spanning tree T of G that contains all edges from and whose m-metric (resp., s-metric) is equal to (resp., . The latter is true if and only if Q has a perfect 3-DM. □
4. ILP Formulations
Here we describe two ILP models for the s-SF Spanning Tree problem (for the m-SF Spanning Tree problem the approach is similar). For a given spanning tree T of a graph of order n, consider the indicator variables :
Using Proposition 1, we can represent as
where denotes the set of all paths of length i in G. To linearize (14), we introduce Boolean variables and and the following constraints:
for every and , which are equivalent to and . Thus the objective function (14) can be rewritten as
We use two types of constraints to describe the spanning trees. The first type is the extended formulation of Martin (1991), which uses auxiliary variables
where for every and . A 0/1-vector x describes a spanning tree of G if and only if these variables satisfy the constraints
The second type exploits the Miller–Tucker–Zemlin (MTZ) constraints (Miller et al., 1960). We introduce the auxiliary variables
and constraints
where is some fixed vertex. Finally we add the additional constraint
defined by Theorem 7, which turns out to significantly improve the algorithm running times. Maximization of the objective (16) subject to the constraints (15), (18), (21) is further referred to as Martin formulation, while maximization of Equation (16) subject to Equations (15), (20), (21) as MTZ formulation.
5. Experimental Results
In this section, we investigate the practical aspects of scale-free spanning tree problems by conducting computational experiments for various simulated and experimental data sets to evaluate the performance of the ILP models. All computations below were performed on a standard laptop with 2.0 GHz dual core processor and 16 GB of RAM, and ILP problems were solved using Gurobi 8.1.
5.1. Synthetic data
5.1.1. Synthetic graphs
We used graphs from the following synthetic data sets:
Erdős-Rényi graphs constructed by adding each possible edge uniformly and independently with the probability . The number of nodes n in our experiments varied from 10 to 40 (corresponding to the sizes of HCV outbreaks analyzed later).
grid graphs (Cartesian products of paths Pn and Pm) with .
Scale-free graphs of two types generated using NetworkX library (Hagberg et al., 2008): those based on the classical Barabási and Albert (1999) model and those constructed with NetworkX default parameters. The latter graphs are usually denser.
For all synthetic data sets except for grid graphs, we generated 10 graphs per node number. Figures 3 and 4 illustrate the running times of the ILP solver on both the MTZ formulation and the Martin formulation compared with the published tool QUENTIN (Skums et al., 2018) runtimes for all four simulated graph classes.1 The results demonstrate that for those graph classes, the ILP algorithms in average perform much better than in the worst case and are able to produce optimal results in a reasonable amount of time. Moreover, for considered graph sizes, they outperform QUENTIN. For Erdős–Rényi graphs and grids (Fig. 3), which are characterized by relatively large sets of feasible solutions, the Martin formulation was superior to MTZ and QUENTIN, while for Barabási–Albert scale-free graphs (Fig. 4a), the MTZ formulation was leading to the faster algorithm. In general, the ILP approach allows to solve the problem within minutes or few hours for small-to-medium-sized problems (up to several dozens of vertices) on Erdős–Rényi graphs and grids, and for medium-sized problems (several hundred vertices) on scale-free graphs.
Running times of ILP solver and QUENTIN on Erdős–Rényi graphs (a) and grids (b). ILP, integer linear programming.
Running times of ILP solver on Barabási–Albert (a) and NetworkX (b) scale-free graphs.
5.1.2. Simulated outbreaks
We simulated outbreaks over scale-free Barabási–Albert contact networks of nodes using the following model. The infection spreads over each network according to the susceptible infected (SI) model (Newman, 2010) with the transmission rate . Each infected individual is assumed to carry a viral sequence of length , and at each transmission event, the source's sequence is transmitted to the recipient. Sequence evolution is described by a skyline model with the piecewise constant decreasing mutation rate, that is, viral sequences mutate at the basic rate of changes/position/time unit, and the mutation rate is decreasing by every time units. This model captures the decrease of the speed of intrahost evolution as the infection progresses from an acute to a persistent stage (De Maio et al., 2016; Icer et al., 2020).
For each simulated outbreak, we compared the performance of the ILP algorithm for the Martin formulation, with the standard approach based on the phylogenetic trait inference (Sagulenko et al., 2018). First, we constructed a maximum likelihood phylogeny using MEGA (Kumar et al., 2018). Each patient was encoded by a discrete trait, and the marginal likelihood ancestral traits were reconstructed using the Felsenstein pruning algorithm (Felsenstein, 2004) with the pairwise between-trait transition rates equal to . Inferred transmission links then correspond to trait changes along the phylogeny branches. The genetic relatedness network GR used as an input for the ILP was constructed using a threshold-based approach suggested by Kosakovsky Pond et al. (2018). A pair of vertices of GR are adjacent, if the Hamming distance between the corresponding sequences does not exceed a threshold t that was estimated as the minimal integer such that the graph GR is connected. The obtained graph was further sparsed out by applying the same procedure to each of its biconnected component.
The results of algorithms' comparison are shown in Figure 5. We measured algorithm accuracy by the proportion of correctly inferred transmission links and transmission ancestries (i.e., pairs, ancestor/descendant). s-SF-based ILP clearly outperformed the phylogenetic approach: the average transmission link detection accuracy was for the former and for the latter, while the average transmission ancestry detection accuracies were and , respectively.
Accuracy of s-SF ILP model compared with the phylogenetic trait inference algorithm.
5.2. Data from hepatitis C outbreaks
We applied the concept of scale-free spanning trees to the graphs arising from the benchmark data set consisting of several epidemiologically curated HCV outbreaks investigated by the CDC (Campo et al., 2016; Glebova et al., 2017; Skums et al., 2018). This data set comprises HCV quasispecies populations sampled from 81 infected individuals involved in 10 viral outbreaks. Each population consists of RNA sequences of HCV hypervariable region 1 (HVR1) of length 264 bp. Transmission histories of the outbreaks (“who infected whom”) are known as a result of epidemiological investigations. In this case, we are dealing with intrahost viral populations rather than single sequences, and therefore, we compared the proposed approach with QUENTIN, which has been specifically designed to handle such data (Skums et al., 2018).
For each outbreak, the genetic relatedness network GR was constructed using the threshold-based approach suggested by Campo et al. (2016). The vertices of GR are adjacent, if the minimal Hamming distance between the sets of sequences sampled from these patients does not exceed the threshold t. The threshold value was estimated as described in Subsection 5.1.2. Next, the ILP algorithm for the Martin formulation has been applied to GR. For all outbreaks, the ILP problem has been solved to optimality.
We tested the accuracy of inference of transmission links and identification of the superspreaders (the sources of majority of infections). The results are reported in Table 1. The superspreaders correspond to vertices of highest degrees in s-optimal and m-optimal trees for 9 out of 10 outbreaks. It should be noted that all algorithms incorrectly identified a superspreader for the same outbreak. It is the only outbreak where the virus was transmitted via a nonsocial interaction (namely, through blood transfusions), while all other outbreaks were associated with unsafe injection practices or sexual contacts. For those outbreaks, both ILP approaches correctly recovered of transmission links and all ancestor/descendant pairs, thus outperforming QUENTIN.
Results on Experimental Data with Different Models
Methods
Evaluation metric
(A)
(B)
(C)
QUENTIN
0.9
0.78
0.98
s-SF
0.9
0.92
1.0
m-SF
0.9
0.92
1.0
(A) Superspreader inference accuracy, (B) accuracy of transmission link inference, and (C) accuracy of transmission ancestry inference.
6. Discussion
In genomic epidemiology, reconstruction of viral transmission histories from genomic data is fundamental for the investigation of outbreaks and understanding of epidemic spread. Genomic analysis has become one of the major tools for the investigation of outbreaks and surveillance of transmission dynamics (Armstrong et al., 2019; Knyazev et al., 2020). Naturally, graphs are the primary models used in such studies (Wertheim et al., 2014; Campo et al., 2016; Ragonnet-Cronin et al., 2019). In many settings, graph-based methods have been shown to be more efficient to ascertain transmission links compared with methods based on binary phylogenies (Wertheim et al., 2014), as phylogenetic clades are not easily resolvable into transmission clusters and pairs (Lewis et al., 2008; Hughes et al., 2009; Kouyos et al., 2010), while the statistical support for a clade does not necessarily indicate the statistical support for a relationship between individual genomes inside a clade (Volz et al., 2012; Wertheim et al., 2014). However, in many cases, transmission links cannot be inferred using the genomic data alone (Jombart et al., 2014; Villandre et al., 2016). It leads to the need to introduce additional constraints on the reconstructed transmission networks or utilize more complicated objectives.
As a result, the associated algorithmic problems become harder. In this article, we studied one such problem—scale-free spanning tree problem—that arises in epidemiological studies of viruses whose spread is highly influenced by social networks of contacts between susceptible individuals. This includes HIV, HCV, and other pathogens transmitted through sexual contact or needle sharing. We demonstrated that this problem in its two possible algorithmic formulations is NP-hard, even if restricted to relatively simple graph classes. However, it admits an ILP formulation allowing to efficiently solve the problem for small-to-medium networks. It is often enough for the vast majority of outbreaks of HIV and HCV that involve dozens of infected individuals.
However, some outbreaks involve hundreds or even thousands of hosts, and in such cases, more scalable algorithmic solutions are needed. Thus, an important open problem is to establish whether constant or logarithmic approximation exists for the m-SF Spanning Tree and s-SF Spanning Tree problems. In this context, it would be interesting to explore the relationships between scale-free spanning tree problems and max-leaf spanning tree problems. The latter is a well-studied combinatorial problem (Griggs et al., 1989; Galbiati et al., 1994), which seems to be the closest to our problem. Indeed, both problems aim to find a “star-like” spanning tree; furthermore, several reduction schemes for the proof of NP-completeness used by us exploit this relationship. Importantly, Lu and Ravi (1998) and Reich (2016) showed that the max-leaf spanning tree problem is approximable within a constant factor. Although the problems are far from being equivalent, it may seem reasonable for future studies to try to adopt algorithmic machinery developed for the max-leaf spanning tree problem to the scale-free spanning tree problem.
Footnotes
Author Disclosure Statement
The authors declare they have no conflicting financial interests.
Funding Information
Y.O. was partially supported by the BRFFR grant (Project F20UKA-005). The work of V.K. and K.K. was supported by the German National Science Foundation via DFG-Research Training Group 2297 (Mathematical Complexity Reduction—MathCoRe). P.S. was supported by the National Institutes of Health grant 1R01EB025022 and by the National Science Foundation grant 2047828.
References
1.
ArmstrongG.L., MacCannellD.R., TaylorJ., et al.2019. Pathogen genomics in public health. N. Engl. J. Med. 381, 2569–2580.
2.
BarabásiA.-L., and AlbertR.1999. Emergence of scaling in random networks. Science, 286, 509–512.
3.
BonsmaP.2012. Max-leaves spanning tree is APX-hard for cubic graphs. J. Discrete Algorithms. 12, 14–23.
4.
BorovicaninB., DasK.C., FurtulaB., et al.2017. Bounds for Zagreb indices. MATCH Commun. Math. Comput. Chem. 78, 17–100.
5.
CampbellE.M., JiaH., ShankarA., et al.2017. Detailed transmission network analysis of a large opiate-driven outbreak of HIV infection in the United States. J. Infect. Dis. 216, 1053–1062.
6.
CampoD.S., XiaG.-L., DimitrovaZ., et al.2016. Accurate genetic detection of hepatitis C virus transmissions in outbreak settings. J. Infect. Dis. 213, 957–965.
7.
ChartrandG., LesniakL., and ZhangP.2016. Graphs & Digraphs. CRC Press, Taylor & Francis Group, Boca Raton, FL.
8.
DasK.C.2003. Sharp bounds for the sum of the squares of the degrees of a graph. Kragujev. J. Math. 25, 19–41.
9.
DasK.C., and GutmanI.2004. Some properties of the second Zagreb index. MATCH Commun. Math. Comput. Chem. 52, 103–112.
10.
de CaenD.1998. An upper bound on the sum of squares of degrees in a graph. Discrete Math. 185, 245–248.
11.
De MaioN., WuC.-H., and WilsonD.J.2016. SCOTTI: Efficient reconstruction of transmission within outbreaks with the structured coalescent. PLoS Comput. Biol. 12, e1005130.
GalbiatiG., MaffioliF., and MorzentiA.1994. A short note on the approximability of the maximum leaves spanning tree problem. Inform. Process. Lett. 52, 45–49.
14.
GalvaniA.P., and MayR.M.2005. Dimensions of superspreading. Nature, 438, 293–295.
15.
GareyM.R., and JohnsonD.S.1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman & Co., New York, NY.
16.
GlebovaO., KnyazevS., MelnykA., et al.2017. Inference of genetic relatedness between viral quasispecies from sequencing data. BMC Genomics. 18, 918.
17.
GriggsJ.R., KleitmanD.J., and ShastriA.1989. Spanning trees with many leaves in cubic graphs. J Graph Theory, 13, 669–695.
18.
HagbergA., SwartP., and ChultD.2008. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference. Online publication, Pasadena, California; pp. 11–16.
19.
HajarizadehB., GrebelyJ., and DoreG.J.2013. Epidemiology and natural history of HCV infection. Nat. Rev. Gastroenterol. Hepatol. 10, 553–562.
20.
HuangC., WangY., LiX., et al.2020. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet, 395, 497–506.
21.
HughesG.J., FearnhillE., DunnD., et al.2009. Molecular phylodynamics of the heterosexual HIV epidemic in the United Kingdom. PLoS Pathogens, 5, e1000590.
22.
Icer BaykalP., LaraJ., KhudyakovY., et al.2020. Quantitative differences between intra-host HCV populations from persons with recently established and persistent infections. Virus Evol. 7, veaa103.
23.
JhaD., SkumsP., ZelikovskyA., et al.2017. Modeling the spread of HIV and HCV infections based on identification and characterization of high-risk communities using social media, 425–430. In International Symposium on Bioinformatics Research and Applications. Springer, Cham.
24.
JombartT., CoriA., DidelotX., et al.2014. Bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data. PLoS Comput. Biol. 10, e1003457.
25.
JombartT., EggoR., DoddP., et al.2011. Reconstructing disease outbreaks from genetic data: A graph approach. Heredity, 106, 383–390.
26.
KilmarxP.H.2009. Global epidemiology of HIV. Curr. Opin. HIV AIDS. 4, 240–246.
27.
KincaidR.K., KunklerS.J., LamarM.D., et al.2016. Algorithms and complexity results for finding graphs with extremal Randić index. Networks, 67, 338–347.
28.
KleitmanD.J., and WestD.B.1991. Spanning trees with many leaves. SIAM J Discrete Math. 4, 99–106.
29.
KlinkenbergD., BackerJ.A., DidelotX., et al.2017. Simultaneous inference of phylogenetic and transmission trees in infectious disease outbreaks. PLoS Comput. Biol. 13, e1005495.
30.
KnyazevS., HughesL., SkumsP., et al.2020. Epidemiological data analysis of viral quasispecies in the next-generation sequencing era. Brief. Bioinformatics, 22, 96–108.
31.
Kosakovsky PondS.L., WeaverS., Leigh BrownA.J., et al.2018. HIV-TRACE (TRansmission Cluster Engine): A tool for large scale molecular epidemiology of HIV-1 and other rapidly evolving pathogens. Mol. Biol. Evol. 35, 1812–1819.
32.
KouyosR.D., Von WylV., YerlyS., et al.2010. Molecular epidemiology reveals long-term changes in HIV type 1 subtype B transmission in Switzerland. J. Infect. Dis. 201, 1488–1497.
33.
KumarS., StecherG., LiM., et al.2018. MEGA X: Molecular evolutionary genetics analysis across computing platforms. Mol. Biol. Evol. 35, 1547–1549.
34.
Leigh BrownA.J., LycettS.J., WeinertL., et al.2011. Transmission network parameters estimated from HIV sequences for a nationwide epidemic. J. Infect. Dis. 204, 1463–1469.
35.
LemkeP.1988. The maximum leaf spanning tree problem for cubic graphs is NP-complete. IMA Preprint Series No. 428.
36.
LewisF., HughesG.J., RambautA., et al.2008. Episodic sexual transmission of HIV revealed by molecular phylodynamics. PLoS Med. 5, e50.
37.
LiL., AldersonD., DoyleJ. C., et al.2005. Towards a theory of scale-free graphs: Definition, properties, and implications. Internet Math. 2, 431–523.
38.
LuH.I., and RaviR.1998. Approximating maximum leaf spanning trees in almost linear time. J. Algorithms. 29, 132–141.
39.
MartinR.K.1991. Using separation algorithms to generate mixed integer model reformulations. Oper. Res. Lett. 10, 119–128.
40.
MillerC.E., TuckerA.W., and ZemlinR.A.1960. Integer programming formulation of traveling salesman problems. J. ACM, 7, 326–329.
41.
NewmanM.2010. Networks. Oxford University Press, New York, NY.
42.
PapadimitriouC., and YannakakisM.1991. Optimization, approximation, and complexity classes. J. Comput. Syst. Sci. 43, 425–440.
43.
Ragonnet-CroninM., HuY.W., MorrisS.R., et al.2019. HIV transmission networks among transgender women in Los Angeles County, CA, USA: A phylogenetic analysis of surveillance data. Lancet HIV. 6, e164–e172.
44.
RamachandranS., ThaiH., ForbiJ.C., et al.2018. A large HCV transmission network enabled a fast-growing HIV outbreak in rural Indiana, 2015. EBioMedicine, 37, 374–381.
45.
ReichA.2016. Complexity of the maximum leaf spanning tree problem on planar and regular graphs. Theoret.Comput. Sci. 626, 134–143.
SkumsP., ZelikovskyA., SinghR., et al.2018. QUENTIN: Reconstruction of disease transmissions from viral quasispecies genomic data. Bioinformatics, 34, 163–170.
48.
SledzieskiS., ZhangC., MandoiuI., et al.2019. TreeFix-TP: Phylogenetic error-correction for infectious disease transmission network inference. bioRxiv. 1:813931.
49.
VillandreL., StephensD.A., LabbeA., et al.2016. Assessment of overlap of phylogenetic transmission clusters and communities in simple sexual contact networks: Applications to HIV-1. PLoS One, 11, e0148459.
50.
VolzE.M., KoopmanJ.S., WardM.J., et al.2012. Simple epidemiological dynamics explain phylogenetic clustering of HIV from patients with recent infection. PLoS Comput.Biol. 8, e1002552.
51.
WertheimJ.O., Leigh BrownA.J., HeplerN.L., et al.2014. The global transmission network of HIV-1. J. Infect. Dis. 209, 304–313.