Abstract

Understanding the mutational history of tumor cells is a critical endeavor in unraveling the mechanisms that drive the onset and progression of cancer. Modeling tumor cell evolution with labeled trees motivates researchers to develop different measures to compare labeled trees. Although the Robinson–Foulds (RF) distance is widely used for comparing species trees, its applicability to labeled trees reveals certain limitations. This study introduces the k-RF dissimilarity measures, tailored to address the challenges of labeled tree comparison. The RF distance is succinctly expressed as n-RF in the space of labeled trees with n nodes. Like the RF distance, the k-RF is a pseudometric for multiset-labeled trees and becomes a metric in the space of 1-labeled trees. By setting k to a small value, the k-RF dissimilarity can capture analogous local regions in two labeled trees with different sizes or different labels.

1. INTRODUCTION

In the realm of evolutionary biology, trees serve as fundamental mathematical concepts, offering a versatile framework for modeling the evolution of various entities, including organisms, species, and genes. Beyond their application in understanding biological evolution, trees find practical utility in medical diagnosis within the health care domain. The diversity of tree models has given rise to the significant challenge of effectively comparing different trees to evaluate various inference methods. This challenge has spurred researchers to define robust measures within the space of targeted trees. For example, mutation/clonal trees are introduced to model tumor evolution. In this representation, nodes denote cellular populations and are labeled with the gene mutations present in those populations (Karpov et al., 2019; Schwartz and Schäffer, 2017).

The growth and metastasis of tumors vary from patient to patient, and such variations are significant for cancer treatment. As a result, dissimilarity measures for mutation tree comparison have become a focus of recent research (DiNardo et al., 2020; Jahn et al., 2021; Karpov et al., 2019; Llabrés et al., 2021).

In earlier work on phylogenetic trees, various measures have been proposed to compare two phylogenetic trees. Some examples of such measures are Robinson–Foulds (RF) distance (Robinson and Foulds, 1981), nearest-neighbor interchange (NNI) (Li et al., 1996; Robinson, 1971), Quartet distance (Estabrook et al., 1985), and Path distance (Steel and Penny, 1993; Williams and Clifford, 1971). They are defined under the assumption that the involved trees share the same label set. Consequently, they may not be useful when applied to trees where all nodes are labeled, especially when using different label sets.

1.1. Related work on comparison of labeled trees

To overcome the constraints associated with the above-mentioned measures in the comparison of mutation trees, computational biologists have introduced new dissimilarity metrics for mutation trees. Some of these measures are Common Ancestor Set (CASet) distance (DiNardo et al., 2020), Distinctly Inherited Set Comparison (DISC) distance (DiNardo et al., 2020), and Multi-Labeled Tree Dissimilarity measure (Karpov et al., 2019). Although these distance measures enable efficient comparison of clonal trees, they are defined based on the assumption that mutations cannot occur more than once and mutations will not be lost in the course of tumor evolution. As a result, these metrics exhibit multiple limitations when applied to the comparison of trees used to model complex tumor evolution, wherein mutations may indeed occur multiple times and subsequently be lost.

Apart from the three measures discussed earlier, a few additional dissimilarity metrics have been introduced to facilitate the comparison of mutation trees, including Parent–Child distance (Govek et al., 2018) and Ancestor–Descendant distance (Govek et al., 2018). These measures are metric for “1-mutation” trees, in which nodes are each labeled by one distinct mutation.

There are also measures for mutation trees that are defined by generalizing popular measures for phylogenetic trees. Here, researchers aim to extend the definition of an existing distance, which was mostly used to compare phylogenetic trees, to mutation trees. For example, the generalized NNI (Jahn et al., 2021) is defined by some minor modifications of NNI. Another example is the Path distance (Govek et al., 2018). Although these measures are applicable to mutation trees, they are only well defined for mutation trees with the same label sets (Govek et al., 2018; Jahn et al., 2021).

The generalized RF (GRF) distance is another distance introduced recently (Llabrés et al., 2021; Llabrés et al., 2020). This measure can be used to compare not only mutation trees or clonal trees but also species trees and even phylogenetic networks. A useful property of GRF is that the intersections of clusters or clones of the compared trees contribute significantly to its value. In contrast, the intersection is not quantified in the RF distance, as one only checks whether two clusters or clones of the two involved trees are identical when the RF distance between two trees is computed. As a result, the GRF has a better resolution than the RF distance (Llabrés et al., 2020).

There are some other generalizations of the RF distance, such as the Bourque distance (Jahn et al., 2021). This measure is able to compare mutation trees with the same or different label sets, and it has linear time complexity. However, like the above distances, it does not allow for multiple occurrences of mutations during the tumor history (Jahn et al., 2021). Other generalizations of the RF distance have also been proposed for gene trees (Briand et al., 2022; Briand et al., 2020).

The aforementioned dissimilarity measures do not apply to some evolutionary models, such as the Dollo model (Farris, 1977) and the Camin–Sokal model (Camin and Sokal, 1965). This is because mutations may get lost after they are gained in the Dollo model, and the same mutation may occur more than once during the tumor history in the Camin–Sokal model (Llabrés et al., 2020). As far as we know, the only measure introduced to address the problem is the Triplet-based Distance (Ciccolella et al., 2021). This distance allows the comparison of mutation trees in which nodes have nonempty subsets of mutations as their labels. In addition, it allows multiple occurrences and losses of mutations during the tumor history (Ciccolella et al., 2021). Despite its applicability to a larger group of labeled trees, the Triplet-based Distance does not apply to labeled trees in which multiple copies of a mutation are observed in the label of a single node.

1.2. Our contributions to tree comparison

In this study, we develop the k-RF dissimilarity measures designed for the comparison of labeled trees. They are first defined for 1-labeled trees (Section 3). Subsequently, we extend these measures to multiset-labeled trees (Section 5). We delve into the mathematical properties of the k-RF measures in Sections 4 and 5. In particular, k-RF is a metric for 1-labeled trees. We also assess the validity of the k-RF measures through comparisons with CASet, DISC, and GRF (Section 5), and the evaluation of their performance in the context of tree clustering (Section 6).

2. CONCEPTS AND NOTATIONS

A (directed) graph consists of a set of nodes and a set of (directed) edges. In graphs, each edge is a pair of distinct nodes. In directed graphs, each edge is a pair of ordered distinct nodes.

Let G be a (directed) graph. V(G) and E(G) are used to denote its node and edge set, respectively. If G is undirected, (u,v) will still be used to denote an edge between u and v with the understanding that (u,v) = (v,u). Let $u, v \in V (G)$ . If $(u, v) \in E (G)$ , we say that u and v are adjacent, the edge (u,v) is incident to u and v, or u and v are two endpoints of (u,v).

The degree of v is defined as the number of edges incident to v. In addition, if G is directed, the indegree and outdegree of v are defined as the number of edges (x,y) such that y = v and x = v, respectively. The nodes of degree 1 are called the leaves in an undirected graph, whereas the nodes of indegree 1 and outdegree 0 are called the leaves in a directed graph. We use Leaf (G) to denote the leaf set for G. Nonleaf nodes are called internal nodes.

A path of length k from u to v consists of a sequence of nodes $u_{0}, u_{1}, \dots, u_{k}$ such that $u_{0} = u$ , $u_{k} = v$ and $(u_{i - 1}, u_{i}) \in E (G)$ for $i = 1, 2, \dots, k$ . The distance from u to v, denoted as $d_{G} (u, v)$ , is the length of the shortest paths from u to v, and it is set to $\infty$ if there is no path from u to v.

If G is undirected, $d_{G} (u, v) = d_{G} (v, u)$ for any $u, v \in V (G)$ . The diameter of G is defined as $\max_{u, v \in V (G)} d_{G} (u, v)$ and is denoted by diam(G). If G is directed, its diameter is defined as the diameter of its undirected version that has the node set V(G) and edge set $E (G) \cup \{(u, v) \mid (v, u) \in E (G)\}$ .

2.1. Trees

A tree T is a graph in which there is a unique path between any two distinct nodes. A binary tree is a tree in which every internal node has degree 3. A line tree is a tree in which every internal node has degree 2. The number of leaves in a line tree is 2.

A directed tree is a directed graph that is a tree if we ignore the orientations of edges.

2.2. Rooted trees

A directed tree is called a rooted tree if it has a special root node from which the edges are directed away. In a rooted tree, indegree of each nonroot node is 1, which implies that there is exactly one path from its root to any other node.

Let T be a rooted tree, $u, v \in V (T)$ such that $u \neq v$ . We say v is a child of u and u is the parent of v if $(u, v) \in E (T)$ . In general, we say v is a descendant of u, and u is an ancestor of v if u is in the unique path from root(T) to v. The set of all children, ancestors and descendants of u are denoted by $C_{T} (u)$ , $A_{T} (u)$ and $D_{T} (u)$ , respectively. Note that $u \notin A_{T} (u)$ and $u \notin D_{T} (u)$ .

A rooted tree is called a binary rooted tree if the root has indegree 0 and outdegree 1, and every other internal node has indegree 1 and outdegree 2.

A rooted tree is called a rooted line tree if each internal node has exactly one child. A rooted tree is called a rooted caterpillar tree if the set of children of each internal node contains at most one internal node.

2.3. Labeled trees

Suppose L is a set and $P (L)$ denotes the set of all subsets of L. We say a tree or rooted tree T is labeled by subsets of L if T is equipped with a map $ℓ : V (T) \to P (L)$ , where $\cup_{v \in V (T)} ℓ (v) = L$ , and $ℓ (v) \neq \emptyset$ for any $v \in V (T)$ . In particular, we say T is 1-labeled on L if $ℓ (v)$ is a singleton for each $v \in V (T)$ , and $ℓ$ is injective. Moreover, for a 1-labeled tree T on L and $C \subseteq V (T)$ , we define $L (C) = \{a \in L \mid \exists x \in C : ℓ (x) = \{a\}\}$ .

2.4. Phylogenetic and mutation trees

Let X be a finite taxon set. A phylogenetic tree (respectively, rooted phylogenetic tree) on X is a binary tree (respectively, binary rooted tree) in which only leaves are labeled by the elements of X, and two distinct leaves have different labels.

A mutation tree on a set M of mutations is a rooted tree in which nodes are labeled with nonempty subsets of M.

2.5. Dissimilarity measures for trees

Let $\mathcal{T}$ be a set of trees. A dissimilarity measure on $\mathcal{T}$ is a real function d such that $d (T', T'') = d (T'', T') \geq 0$ for any $T', T'' \in \mathcal{T}$ . It captures the intuition that the more two trees differ, the higher their measure value is, and vice versa. It is called a pseudometric if it satisfies the triangle inequality. A pseudometric d is called a metric if $d (S, T) \neq 0$ unless S and T are the same tree.

3. THE k-RF MEASURE FOR 1-LABELED TREES

In this section, we first recall the definition of the RF distance and then present k-RF dissimilarity measures for 1-labeled trees for arbitrary k.

3.1. The k-RF measure for 1-labeled unrooted trees

Suppose X is a finite set and T is a 1-labeled tree on X. Each $e = (u, v) \in E (T)$ induces a pair of label subsets on X: $P_{T} (e) = \{L (B_{e} (u)), L (B_{e} (v))\},$ (1) where $B_{e} (u) = \{w \mid d_{T} (w, u) < d_{T} (w, v)\}$ and $B_{e} (v) = \{w \mid d_{T} (w, v) < d_{T} (w, u)\} .$ (2)

We further define: $P (T) = \{P_{T} (e) \mid e \in E (T)\} .$ (3)

The RF distance of two 1-labeled trees S and T is defined as: $d_{R F} (S, T) = |P (S) Δ P (T)| .$ (4)

Example 1. For the three 1-labeled trees in Figure 1 , $d_{R F} (S, T) = 4$ but $d_{R F} (S', T) = 12$ although T and $S'$ have the same topology and their label sets are different in only one label.

FIG. 1.

Three 1-labeled trees in Example 1 to illustrate that the Robinson–Foulds distance exhibits a heavy penalty against trees with different labels. Although T and $S'$ differ only in the label of one node, the RF distance is 4 for S and T, but 12 for $S'$ and T. RF, Robinson–Foulds.

The above example illustrates that if two 1-labeled trees have different label sets, their local similarity is not captured by the RF distance. One of the widely used measures for the comparison of sets is the Jaccard distance. It is defined as a fraction whose numerator is the size of the symmetric difference of two sets and whose denominator is the size of their union. Two 1-labeled trees are identical if and only if they have the same set of edges with the understanding that each node is uniquely determined by its label. Hence, we aim to use $|E (S) Δ E (T)|$ and its generalization to measure the dissimilarity between 1-labeled trees S and T.

Let $k \geq 0$ be an integer and let T be a 1-labeled tree. Each edge $e = (u, v)$ induces the following pair of subsets of labels: $P_{T} (e, k) = \{L (B_{e} (u, k)), L (B_{e} (v, k))\},$ where $B_{e} (x, k) = \{w \in B_{e} (x) \mid d_{T} (w, x) \leq k\}$ for $x = u, v$ . (5)

Clearly, $B_{e} (u, \infty) = B_{e} (u)$ and $B_{e} (u, 0) = \{u\}$ . We further define: $P_{k} (T) = \{P_{T} (e, k) \mid e \in E (T)\} .$ (6)

Definition 1. Let k ≥ 0 and let S and T be two 1-labeled trees. The k-RF dissimilarity score of S and T is defined as: $d_{k - R F} (S, T) = |P_{k} (S) Δ P_{k} (T)| .$ (7)

Example 2. Continuing with Example 1, we have $d_{1 - R F} (S', T) = 4$ , as $P_{T} (e_{i}, 1)$ for $1 \leq i \leq 6$ are: $\begin{matrix} \{\{g\}, \{e, f\}\}, \{\{e, g\}, \{a, f\}\}, \{\{e, f\}, \{a, d\}\}, \\ \{\{a, f\}, \{b, c, d\}\}, \{\{b\}, \{a, c, d\}\}, \{\{c\}, \{a, b, d\}\}, \end{matrix}$

respectively, and $P_{S'} (e'_{i}, 1)$ for $1 \leq i \leq 6$ are: $\begin{matrix} \{\{h\}, \{e, f\}\}, \{\{e, h\}, \{a, f\}\}, \{\{e, f\}, \{a, d\}\}, \\ \{\{a, f\}, \{b, c, d\}\}, \{\{b\}, \{a, c, d\}\}, \{\{c\}, \{a, b, d\}\}, \end{matrix}$

respectively. We also have $d_{1 - R F} (S, T) = 8$ . Thus, 1-RF captures the difference of the trees better than the RF distance.
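The computation in Example 2 can be reproduced with a minimal Python sketch of Definition 1. The function names and the edge-list input format are our own choices for illustration. The edge lists for T and $S'$ are inferred from the pairs listed in Example 2 (the path g–e–f–a–d with two further leaves b and c attached to d, and g replaced by h in $S'$), which is an assumption because Figure 1 is not reproduced here. A value of k at least as large as the diameter plays the role of $k = \infty$.

```python
from collections import defaultdict

def side_labels(adj, start, blocked, k):
    """Labels within distance k of `start` on its side of the edge (start, blocked)."""
    seen, frontier = {start}, [start]
    for _ in range(k):
        # Expand one level, never crossing the edge towards `blocked`.
        frontier = [y for x in frontier for y in adj[x]
                    if y != blocked and y not in seen]
        if not frontier:
            break
        seen.update(frontier)
    return frozenset(seen)

def induced_pairs(edges, k):
    """P_k(T): the unordered pairs of label subsets induced by the edges (Equation 5)."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return {frozenset((side_labels(adj, u, v, k), side_labels(adj, v, u, k)))
            for u, v in edges}

def k_rf_unrooted(edges_s, edges_t, k):
    """Size of the symmetric difference of P_k(S) and P_k(T) (Equation 7)."""
    return len(induced_pairs(edges_s, k) ^ induced_pairs(edges_t, k))

# Trees inferred from the pairs in Example 2 (assumption, see lead-in).
T = [("g", "e"), ("e", "f"), ("f", "a"), ("a", "d"), ("d", "b"), ("d", "c")]
S_prime = [("h", "e"), ("e", "f"), ("f", "a"), ("a", "d"), ("d", "b"), ("d", "c")]

print(k_rf_unrooted(S_prime, T, 1))  # 4, as in Example 2
print(k_rf_unrooted(S_prime, T, 6))  # 12, the RF distance of Example 1
```

Because each node is identified with its label, a tree is passed simply as a list of label pairs, and Python's set symmetric difference `^` implements $Δ$ directly.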

3.2. The k-RF measure for 1-labeled rooted trees

Let k ≥ 0 be an integer and let T be a 1-labeled rooted tree. For a node $w \in V (T)$ , we define $B_{k} (w)$ and $D_{k} (w)$ as: $B_{k} (w) = \{x \in V (T) \mid \exists y \in A_{T} (w) \cup \{w\} : d (y, w) + d (y, x) \leq k\},$ (8) $D_{k} (w) = \{w\} \cup \{x \in D_{T} (w) \mid d (w, x) \leq k\} .$ (9)

For each $e = (u, v) \in E (T)$ , we define: $P_{T} (e, k) = (L (D_{k} (v)), L (B_{k} (u) ∖ D_{k} (v))),$ (10) $P_{k} (T) = \{P_{T} (e, k) \mid e \in E (T)\} .$ (11)

Definition 2. Let k ≥ 0. The k-RF dissimilarity between two 1-labeled rooted trees S and T is defined as: $d_{k - R F} (S, T) = |P_{k} (S) Δ P_{k} (T)| .$ (12)

Example 3. Consider the two 1-labeled rooted trees S and T in Figure 2 . We have:

FIG. 2.

Two 1-labeled rooted trees used to illustrate the 1-RF in Example 3.

$\begin{matrix} P_{T} (e_{1}, 1) = (\{f, h\}, \{b, d\}), P_{T} (e_{2}, 1) = (\{c, f, g\}, \{b, h\}), \\ P_{T} (e_{3}, 1) = (\{c\}, \{f, g, h\}), P_{T} (e_{4}, 1) = (\{g\}, \{c, f, h\}) = P_{S} (ē_{6}, 1), \\ P_{T} (e_{5}, 1) = (\{a, d, e\}, \{b, h\}), P_{T} (e_{6}, 1) = (\{a\}, \{b, d, e\}) = P_{S} (ē_{3}, 1), \\ P_{T} (e_{7}, 1) = (\{e\}, \{a, b, d\}) = P_{S} (ē_{4}, 1), \\ P_{S} (ē_{1}, 1) = (\{b, d\}, \{c, f\}), P_{S} (ē_{2}, 1) = (\{a, d, e\}, \{b, c\}), \\ P_{S} (ē_{5}, 1) = (\{f, g, h\}, \{b, c\}), P_{S} (ē_{7}, 1) = (\{h\}, \{c, f, g\}) . \end{matrix}$

This implies that $d_{1 - R F} (S, T) = 8$ .
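The rooted definition can be checked in the same way. The following minimal Python sketch evaluates Equations (8) to (12) directly: $D_{k} (v)$ is collected by descending from v, and $B_{k} (u)$ by walking from u up through its ancestors y and collecting the descendants of y within distance $k - d (y, u)$ . The function names and the (parent, child) edge-list format are hypothetical. The two trees are inferred from the pairs listed in Example 3 and are therefore an assumption, since Figure 2 is not reproduced here.

```python
def descendants_within(children, v, depth):
    """D_depth(v): v together with its descendants at distance <= depth (Equation 9)."""
    found, frontier = {v}, [v]
    for _ in range(depth):
        frontier = [c for x in frontier for c in children.get(x, [])]
        if not frontier:
            break
        found.update(frontier)
    return found

def rooted_pairs(edges, k):
    """P_k(T) for a rooted tree given as (parent, child) edges (Equations 10 and 11)."""
    children, parent = {}, {}
    for u, v in edges:
        children.setdefault(u, []).append(v)
        parent[v] = u
    pairs = set()
    for u, v in edges:
        d_v = descendants_within(children, v, k)
        # B_k(u): for each y in A(u) or y = u at distance i <= k above u,
        # collect the nodes x with d(y, x) <= k - i (Equation 8).
        b_u, y, i = set(), u, 0
        while y is not None and i <= k:
            b_u |= descendants_within(children, y, k - i)
            y, i = parent.get(y), i + 1
        pairs.add((frozenset(d_v), frozenset(b_u - d_v)))
    return pairs

def k_rf_rooted(edges_s, edges_t, k):
    """Size of the symmetric difference of P_k(S) and P_k(T) (Equation 12)."""
    return len(rooted_pairs(edges_s, k) ^ rooted_pairs(edges_t, k))

# Trees inferred from the pairs in Example 3 (assumption, see lead-in).
T = [("b", "h"), ("b", "d"), ("h", "f"), ("f", "c"), ("f", "g"), ("d", "a"), ("d", "e")]
S = [("c", "b"), ("c", "f"), ("b", "d"), ("d", "a"), ("d", "e"), ("f", "g"), ("f", "h")]

print(k_rf_rooted(S, T, 1))  # 8, as in Example 3
```

Unlike the unrooted case, the induced pairs are ordered, so they are stored as tuples rather than as sets of two subsets.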

4. CHARACTERIZATION OF k-RF FOR 1-LABELED TREES

To assess k-RF measures, we initially give their mathematical properties. Subsequently, we present experimental findings regarding their frequency distribution.

4.1. Mathematical properties

Proposition 1. Let S and T be two 1-labeled trees.

(a) For any k ≥ 1, $d_{k - R F} (S, T) = |E (S)| + |E (T)|$ if S and T share at most 2 labels and there are at least two edges in either S or T.

(b) Assume that $L (S) \neq L (T)$ . For $k < \min \{\mathrm{diam} (T), \mathrm{diam} (S)\}$ , $k + 1 \leq d_{k - R F} (S, T) \leq |E (S)| + |E (T)|$ . In addition, the second inequality becomes an equality if $k \geq \min \{\mathrm{diam} (T), \mathrm{diam} (S)\}$ and $|L (S)| = |L (T)|$ .

(c) $d_{0 - R F} (S, T) = |E (S) Δ E (T)|$ .

(d) If $k \geq \max \{\mathrm{diam} (S), \mathrm{diam} (T)\} - 1$ , then $d_{k - R F} (S, T) = d_{R F} (S, T)$ .

Proof. (a) Note that if k ≥ 1 and $|E (T)| \geq 2$ , each $P_{T} (e, k)$ involves at least three labels. If L(S) and L(T) have at most two common elements, $P_{T} (e, k) \neq P_{S} (e', k)$ for every $e \in E (T)$ and $e' \in E (S)$ . Thus, we have $P_{k} (S) \cap P_{k} (T) = \emptyset$ , implying that $d_{k - R F} (S, T) = |P_{k} (S) Δ P_{k} (T)| = |P_{k} (T)| + |P_{k} (S)| = |E (S)| + |E (T)| .$

(b) The second inequality follows from the fact that $d_{k - R F} (S, T) = |P_{k} (S) Δ P_{k} (T)| \leq |P_{k} (T)| + |P_{k} (S)|$ and $|P_{k} (X)| = |E (X)|$ for $X = S, T$ . We prove the first inequality as follows.

Let $k < \min \{\mathrm{diam} (T), \mathrm{diam} (S)\}$ . Since S and T are 1-labeled, we identify a node with its label in the trees. Without loss of generality, we may assume that there is a node $v \in V (T) ∖ V (S)$ . Define $N_{T}^{(k)} (v) = \{u \mid d_{T} (u, v) \leq k\}$ .

If $N_{T}^{(k)} (v) = V (T)$ , then $|N_{T}^{(k)} (v)| = |V (T)| \geq \mathrm{diam} (T) + 1 \geq k + 2$ , as $k < \mathrm{diam} (T)$ . This also implies that for every $(x, y) \in E (T)$ , $d_{T} (v, x) \leq k$ and $d_{T} (v, y) \leq k$ .

If $N_{T}^{(k)} (v) \neq V (T)$ , there exists a node w at distance at least k + 1 from v. Since T is connected, we let $P (v, w)$ be a shortest path from v to w. Clearly, the first $k + 1$ nodes in $P (v, w)$ (including v) are all in $N_{T}^{(k)} (v)$ , that is, each of the first k + 1 edges of $P (v, w)$ has at least one endpoint in $N_{T}^{(k)} (v)$ .

In summary, we have proved that there are at least k + 1 edges (x,y) such that either $d_{T} (v, x) \leq k$ or $d_{T} (v, y) \leq k$ . For each of these edges e, v appears in at least one label subset of $P_{T} (e, k)$ and thus $P_{T} (e, k) \notin P_{k} (S)$ . Therefore, $d_{k - R F} (S, T) \geq |P_{k} (T) ∖ P_{k} (S)| \geq k + 1$ .

If $|L (S)| = |L (T)|$ and $k \geq \min \{\mathrm{diam} (T), \mathrm{diam} (S)\}$ , then $N_{T}^{(k)} (v) = V (T)$ . Therefore, the induced pair $P_{T} (e, k)$ contains v for every edge e of T. In contrast, the induced pair $P_{S} (e, k)$ does not contain v for any edge e of S. Thus, $P_{k} (S) \cap P_{k} (T) = \emptyset$ and $d_{k - R F} (S, T) = |P_{k} (S)| + |P_{k} (T)| = |E (S)| + |E (T)|$ .

(c) Note that we may represent each node of a 1-labeled tree with its unique label. As a result, $P_{T} (e, 0) = e$ and $P_{S} (ē, 0) = ē$ for $e \in E (T)$ and $ē \in E (S)$ . Thus, (c) follows.

(d) It follows from the definition of the k-RF.

Lemma 1. Let k ≥ 0 be an integer. k-RF satisfies the non-negativity, symmetry and triangle inequality conditions.

Proof. Let k ≥ 0. The non-negativity and symmetry conditions are trivial. The triangle inequality $d_{k - R F} (T_{1}, T_{2}) \leq d_{k - R F} (T_{1}, T_{3}) + d_{k - R F} (T_{3}, T_{2})$ is derived from the inclusion $P_{k} (T_{1}) Δ P_{k} (T_{2}) \subseteq (P_{k} (T_{1}) Δ P_{k} (T_{3})) \cup (P_{k} (T_{3}) Δ P_{k} (T_{2}))$ for any three 1-labeled trees $T_{1}, T_{2}, T_{3}$ .
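
For completeness, the set-theoretic step behind the triangle inequality can be spelled out as follows, where A, B, and C are shorthand introduced here for $P_{k} (T_{1})$ , $P_{k} (T_{2})$ , and $P_{k} (T_{3})$ . Let $x \in A Δ B$ and assume, by symmetry, that $x \in A ∖ B$ . If $x \in C$ , then $x \in C ∖ B \subseteq C Δ B$ ; if $x \notin C$ , then $x \in A ∖ C \subseteq A Δ C$ . Hence $A Δ B \subseteq (A Δ C) \cup (C Δ B)$ , and therefore $|A Δ B| \leq |(A Δ C) \cup (C Δ B)| \leq |A Δ C| + |C Δ B| .$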

Remark 1. Proposition 1 and Lemma 1 can be proved in the same manner for 1-labeled rooted trees.

Proposition 2. The 0-RF is a metric on the space of all 1-labeled rooted trees.

Proof. Let S and T be two 1-labeled rooted trees. By Remark 1, it is enough to show that S and T are identical if $d_{0 - R F} (S, T) = 0$ . By identifying a node with its label in S and T, we obtain that $P_{0} (S) = E (S)$ and $P_{0} (T) = E (T)$ . If $d_{0 - R F} (S, T) = 0$ , $|E (T) Δ E (S)| = 0$ and thus $E (T) = E (S)$ , that is, S and T are identical.

Lemma 2. Let T be a 1-labeled rooted tree with at least two nodes and $ℒ$ be a subset of Leaf (T). Define T′ to be the subtree obtained by the removal of all the leaves of $ℒ$ . Then, for any k, $P_{k} (T') = \{(A ∖ ℒ, B ∖ ℒ) \mid (A, B) \in P_{k} (T), A ∖ ℒ \neq \emptyset\} .$ (13)

Proof. Since T is 1-labeled, we identify a node of T with its label in the following discussion. With this convention, for any subset S of nodes, $L (S) = S$ .

Let $Ē (T)$ denote the subset of edges incident to a leaf of $ℒ$ , that is, $Ē (T) = \{(x, y) \in E (T) \mid y \in ℒ\}$ . Then, $V (T) = V (T') ⨄ ℒ, E (T) = E (T') ⨄ Ē (T) .$

If $(u, v) \in Ē (T)$ , $v \in ℒ \subseteq \mathrm{Leaf} (T)$ and thus $D_{k} (v) = \{v\} \subseteq ℒ .$

For an edge $e = (u, v) \in E (T')$ , $P_{T} (e, k) = (D_{k} (v), B_{k} (u) ∖ D_{k} (v))$ . By Equations (8) and (9), $D_{k} (v) ∖ ℒ = D_{k} (v) \cap V (T') \neq \emptyset$ and $(B_{k} (u) ∖ D_{k} (v)) ∖ ℒ = [B_{k} (u) \cap V (T')] ∖ [D_{k} (v) \cap V (T')] .$ Therefore, $(D_{k} (v) ∖ ℒ, (B_{k} (u) ∖ D_{k} (v)) ∖ ℒ) = P_{T'} (e, k) .$

This has proved Equation (13).

Proposition 3. Let k ≥ 1 be an integer. k-RF is a metric in the space of all 1-labeled rooted trees.

Proof. Let S and T be two 1-labeled rooted trees. By Remark 1, it is enough to show that $d_{k - R F} (S, T) = 0$ (equivalently, $P_{k} (T) = P_{k} (S)$ ) implies that S and T are identical. To this end, we prove that E(T) can be uniquely determined by $P_{k} (T)$ using mathematical induction.

Since $|E (T)| = |P_{k} (T)|$ , T is a single node if and only if E(T) is empty if and only if $P_{k} (T)$ is empty. Therefore, the single-node tree is uniquely determined by $P_{k} (T)$ .

Assume that S is uniquely determined by $P_{k} (S)$ for every 1-labeled rooted tree S such that $|V (S)| < n$ , where $n \geq 2$ . Consider a 1-labeled rooted tree T such that $|V (T)| = n$ .

For a leaf $v \in \mathrm{Leaf} (T)$ , there is a unique edge $e = (u, v)$ entering v. Note that k ≥ 1. Since $D_{k} (v) = \{v\}$ if and only if v is a leaf, we can identify each leaf v from the pair $P_{T} (e, k) = (P_{1}, P_{2}) \in P_{k} (T)$ such that $P_{1} = \{v\}$ .

For $v \in V (T) ∖ \mathrm{Leaf} (T)$ , there is a unique edge $e = (u, v)$ entering v. Since k ≥ 1, the children of v are all leaves if and only if $D_{k} (v) = \{v\} \cup C_{T} (v)$ if and only if $D_{k} (v) ∖ \mathrm{Leaf} (T) = \{v\}$ . Therefore, we can identify each node v whose children are all leaves from the ordered pairs $(P_{1}, P_{2}) \in P_{k} (T)$ such that $P_{1} ∖ \mathrm{Leaf} (T)$ contains only v.

Let V′ be the set of all nodes whose children are all leaves and $D_{T} (V') = \cup_{x \in V'} C_{T} (x)$ . Since $V'$ is nonempty, $D_{T} (V') \neq \emptyset$ . Define $E' (T) = \{(x, y) \in E (T) \mid x \in V', y \in D_{T} (V')\}$ .

For the tree T′ obtained from T by the removal of the leaves of $D_{T} (V')$ , $|V (T')| = |V (T)| - |D_{T} (V')| < n .$ By Equation (13), $P_{k} (T')$ can be efficiently constructed from $P_{k} (T)$ . By the induction hypothesis, $E (T')$ is uniquely determined by $P_{k} (T')$ . As a result, $E (T) = E (T') \cup E' (T)$ is determined.

This concludes the proof.

Corollary 1. Let k ≥ 0. The k-RF is a metric in the space of all 1-labeled trees.

Proof. If k = 0, the statement follows from the same proof as for Proposition 2. Now, let S and T be two 1-labeled trees and k ≥ 1. By Lemma 1, it is enough to show that if $d_{k - R F} (S, T) = 0$ (equivalently, $P_{k} (T) = P_{k} (S)$ ), then S and T are identical. This can be proved in a manner similar to Proposition 3.

Lemma 3. Let k ≥ 0 and let T be a 1-labeled rooted tree with n nodes. All subsets $D_{i} (w) = \{w\} \cup \{x \in D_{T} (w) \mid d (w, x) \leq i\}$ and $L (D_{i} (w))$ for all nodes w and i ≤ k can be computed in at most $2 (k + 1) n$ set operations.

Proof. Since T is 1-labeled, we can identify a node of T with its label. In this way, $D_{i} (w) = L (D_{i} (w))$ for all nodes w and i ≤ k. By ordering the n labels, we represent each subset of labels (and each subset of nodes) as an n-bit 0–1 string, where the i-th bit is 1 if and only if the i-th label (node) is in the subset.

The statement is obvious in the case k = 0, since $D_{0} (w) = \{w\}$ and, clearly, all the $D_{0} (w)$ for $w \in V (T)$ can be computed in at most 2n set operations. We assume k > 0 and prove the statement by induction as follows.

Assume that all the $D_{k - 1} (w)$ for $w \in V (T)$ have been computed in at most 2kn set operations. Assume w has $d_{w}$ children $u_{1}, u_{2}, \dots, u_{d_{w}}$ . Then, $D_{k} (w) = \{w\} \cup (\cup_{i = 1}^{d_{w}} D_{k - 1} (u_{i})) .$

This implies that $D_{k} (w)$ for all w can be computed from all $D_{k - 1} (w)$ using set operations. In total, we can compute all subsets in at most $2 n - 1 + 2 k n \leq 2 (k + 1) n$ set operations.
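
The recurrence in the proof of Lemma 3 can be sketched in Python with integers used as n-bit strings. The function names, the child-list dictionary, and the four-node example tree are hypothetical choices made for illustration, and the sketch does not attempt to match the exact operation count stated in the lemma.

```python
def descendant_masks(children, nodes, k):
    """Return levels[i][w] = bitmask of D_i(w) for 0 <= i <= k.

    Bit j of a mask is 1 if and only if nodes[j] belongs to the subset.
    """
    bit = {v: 1 << j for j, v in enumerate(nodes)}
    levels = [dict(bit)]                      # D_0(w) = {w}
    for _ in range(k):
        prev, cur = levels[-1], {}
        for w in nodes:
            mask = bit[w]                     # {w}
            for u in children.get(w, ()):     # union of D_{i-1}(u) over children u
                mask |= prev[u]
            cur[w] = mask
        levels.append(cur)
    return levels

def decode(mask, nodes):
    """Turn a bitmask back into the set of nodes it represents."""
    return {v for j, v in enumerate(nodes) if (mask >> j) & 1}

# Hypothetical example: root r with children a and b; a has one child c.
nodes = ["r", "a", "b", "c"]
children = {"r": ["a", "b"], "a": ["c"]}
levels = descendant_masks(children, nodes, 2)
print(decode(levels[1]["r"], nodes))  # the set {r, a, b}
print(decode(levels[2]["r"], nodes))  # all four nodes
```

Each level performs one bitwise union per edge of the tree, which is the source of the linear dependence on both k and n in the lemma.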

Lemma 4. Let k ≥ 0 and T be a 1-labeled rooted tree with n nodes. Using $L (D_{i} (w))$ for $w \in V (T), 0 \leq i \leq k$ , we can compute $L (B_{k} (w))$ for all w in $O (k n)$ set operations, where $B_{k} (w)$ is defined in Equation (8).

Proof. Since T is a 1-labeled rooted tree, we identify a node with its label. In this way, we just need to show that $B_{k} (w)$ for all nodes w can be computed in $O (k n)$ set operations.

Let r be the root of T. For any node $w \in V (T)$ , let the unique path from r to w be $w_{0} = r, w_{1}, \dots, w_{t} = w .$

Then, we have that $B_{k} (w_{t}) = \cup_{i = 0}^{\min (k, t)} D_{k - i} (w_{t - i}) .$

Given the subsets $D_{i} (u)$ for all $i \leq k$ and $u \in V (T)$ , the above formula implies that $B_{k} (w_{t})$ can be computed in at most k set operations. In total, we can compute all $B_{k} (w_{t})$ for all $w \in V (T)$ in $O (k n)$ set operations.
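
The union formula for $B_{k} (w_{t})$ can be read off from Equation (8) as follows (a short derivation added for clarity, using the convention of Section 2 that $d (y, x) = \infty$ when there is no directed path from y to x). The nodes of $A_{T} (w_{t}) \cup \{w_{t}\}$ are exactly $w_{t - i}$ for $0 \leq i \leq t$ , with $d (w_{t - i}, w_{t}) = i$ . For $y = w_{t - i}$ , the condition $d (y, w_{t}) + d (y, x) \leq k$ becomes $d (w_{t - i}, x) \leq k - i$ , which requires $i \leq k$ and holds exactly when $x \in D_{k - i} (w_{t - i})$ . Taking the union over all admissible i gives the formula. For instance, for $k = 1$ and $t \geq 1$ , $B_{1} (w_{t}) = D_{1} (w_{t}) \cup D_{0} (w_{t - 1}) = D_{1} (w_{t}) \cup \{w_{t - 1}\} ,$ that is, $w_{t}$ together with its children and its parent.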

Proposition 4. Let S and T be two 1-labeled trees with n nodes and k ≥ 0. Then, $d_{k - R F} (S, T)$ can be computed in $O (k n^{2})$ time.

Proof. We first consider the rooted tree case. Let S and T be two 1-labeled rooted trees with n nodes. Without loss of generality, we may assume that S and T are labeled on the same set L, with $|L| = n$ . (Otherwise, we can consider them labeled on $L = L (S) \cup L (T)$ , with $n \leq |L| \leq 2 n$ .) By Lemma 3 and Lemma 4, we can compute $P_{X} (e, k)$ for all $e \in E (X)$ in $O (k n)$ set operations for X = S,T. Each set operation on n-bit strings takes $O (n)$ time, so this step takes $O (k n^{2})$ time. Since each edge induces an ordered pair of label subsets and we represent each label subset using an n-bit string, we consider $P_{X} (e, k)$ as a 2n-bit string. In this way, we sort all the edge-induced pairs of label subsets for each tree in $O (n^{2})$ time by radix sort (i.e., indexing) and then compute the symmetric difference of the two sets of edge-induced pairs in $O (n^{2})$ time. This concludes the proof for the rooted case.

In the unrooted case, we first root the trees at a leaf. In this way, we can compute all the edge-induced pairs of label subsets in the derived rooted trees in $O (k n^{2})$ time. Since the edges induce unordered pairs of label subsets in the original trees, we rearrange the two label subsets obtained for an edge in such a way that the smallest label in the first subset is smaller than every label in the second one. After the rearrangement, we can radix-sort the edge-induced pairs and compute the k-RF score in $O (n^{2})$ time.
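
The encoding and comparison steps of the proof can be illustrated with a minimal Python sketch. The function names are hypothetical, bit 0 is assumed to correspond to the smallest label, and Python's built-in comparison sort is used in place of the radix sort mentioned in the proof, so the sketch illustrates the logic rather than the stated running time.

```python
def canonical(a, b):
    """Order two disjoint nonempty bitmasks so that the one containing
    the smallest label (lowest set bit) comes first."""
    return (a, b) if (a & -a) < (b & -b) else (b, a)

def encode(pair, n):
    """Concatenate an ordered pair of n-bit masks into one 2n-bit integer."""
    first, second = pair
    return (first << n) | second

def sym_diff_size(xs, ys):
    """Size of the symmetric difference of two collections of distinct
    integers, computed by sorting and merging."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = common = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            common += 1
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return len(xs) + len(ys) - 2 * common

# The subsets {1, 2} and {0, 3} of four labels: the second holds the smallest label.
pair = canonical(0b0110, 0b1001)
print(pair == (0b1001, 0b0110))            # True
print(encode(pair, 4))                     # 150, i.e. 0b10010110
print(sym_diff_size([1, 2, 3], [2, 3, 4, 5]))  # 3
```

Canonical ordering makes two equal unordered pairs produce the same 2n-bit code, so the unrooted score reduces to the same sorted comparison as in the rooted case.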

4.2. Distribution of k-RF scores

We examined the distribution of the k-RF dissimilarity scores for 1-labeled unrooted and rooted trees with the same label set and with different label sets.

The frequency distributions of the pairwise k-RF scores in the space of n-node 1-labeled unrooted and rooted trees for n from 4 to 7 are presented in Figures 3 to 6, respectively. For each n, it suffices to consider $k = 0, \dots, n - 2$ . Recall that (n − 2)-RF is actually the RF distance. The frequency distribution for the RF distance in the space of phylogenetic trees is known to be Poisson (Steel and Penny, 1993). It also appears that the pairwise 0-RF and (n − 2)-RF scores have a Poisson distribution in the space of n-node 1-labeled unrooted and rooted trees. However, the distribution of the pairwise k-RF scores is unlikely to be Poisson when $k = 1, 2, 3$ and $k \neq n - 2$ .

FIG. 3.

The frequency distributions of all pairwise k-RF scores in the space of 1-labeled unrooted (top row) and rooted (bottom row) 4-node trees for $k = 0, 1, 2$ . In the bar-charts, the x-axis represents k-RF scores and the y-axis represents the number of tree pairs with a specific k-RF score.

FIG. 4.

The frequency distributions of all pairwise k-RF scores in the space of 1-labeled unrooted (top row) and rooted (bottom row) 5-node trees, where k ≤ 3. In the bar-charts, the x-axis represents k-RF scores and the y-axis represents the number of tree pairs with a specific k-RF score.

FIG. 5.

The frequency distributions of all pairwise k-RF scores in the space of 1-labeled unrooted (top row) and rooted (bottom row) 6-node trees for k ≤ 4. In each bar-chart, the x-axis represents k-RF scores and the y-axis represents the number of tree pairs whose k-RF equals a given score.

FIG. 6.

The frequency distributions of all pairwise k-RF scores in the space of 1-labeled unrooted (top row) and rooted (bottom row) 7-node trees for k ≤ 5. In each bar-chart, the x-axis represents k-RF scores and the y-axis represents the number of tree pairs whose k-RF equals a given score.

We examined 1,679,616 (respectively, 60,466,176) pairs of 6-node 1-labeled unrooted (respectively, rooted) trees such that the trees in each pair have c common labels, with $c = 3, 4, 5$ . Table 1 shows that the majority of pairs have the largest possible dissimilarity score, 10.

Table 1.

Number of Pairs of 1-Labeled 6-Node Unrooted (Top) and Rooted (Bottom) Trees That Have c Labels in Common and Have 1-Robinson–Foulds Score d for $c = 3, 4, 5$ and $d = 2, 4, 6, 8, 10$

Unrooted trees
c \ 1-RF score d	4	6	8	10
3	0	0	3072	1,676,544
4	0	432	16,800	1,662,384
5	340	3720	53,100	1,622,456

Rooted trees
c \ 1-RF score d	4	6	8	10
3	0	0	79,872	60,386,304
4	0	7776	419,136	60,039,264
5	4080	65,760	1,310,880	59,085,456

RF, Robinson–Foulds.

5. A GENERALIZATION TO MULTISET-LABELED TREES

In this section, we extend the measures introduced in Section 3 to multiset-labeled unrooted and rooted trees.

5.1. Multisets and their operations

A multiset is a collection of elements in which an element x can occur one or more times (Jürgensen, 2020). The set of all distinct elements appearing in a multiset A is denoted by Supp (A), and the number of occurrences of x in A is denoted by $m_{A} (x)$ . In this study, we simply represent A by the monomial $x_{1}^{m_{A} (x_{1})} \dots x_{n}^{m_{A} (x_{n})}$ if $\mathrm{Supp} (A) = \{x_{1}, x_{2}, \dots, x_{n}\}$ , where $x_{i}^{1}$ is simplified to $x_{i}$ for each i.

Let A and B be two multisets. We say A is a sub-multiset of B, denoted by $A \subseteq_{m} B$ , if for every $x \in \mathrm{Supp} (A)$ , $m_{A} (x) \leq m_{B} (x)$ . In addition, we say that $A = B$ if $A \subseteq_{m} B$ and $B \subseteq_{m} A$ . Furthermore, the union, sum, intersection, difference, and symmetric difference of A and B are, respectively, defined as follows:

$A \cup_{m} B = \{x^{m a x \{m_{A} (x), m_{B} (x)\}} | x \in S u p p (A) \cup S u p p (B)\}$ ;

$A ⨄_{m} B = \{x^{m_{A} (x) + m_{B} (x)} | x \in S u p p (A) \cup S u p p (B)\}$ ;

$A \cap_{m} B = \{x^{m i n \{m_{A} (x), m_{B} (x)\}} | x \in S u p p (A) \cap S u p p (B)\}$ ;

$A ∖_{m} B = \{x^{m_{A} (x) - m_{B} (x)} | x \in S u p p (A) : m_{A} (x) > m_{B} (x)\}$ ;

$A Δ_{m} B = (A \cup_{m} B) ∖_{m} (A \cap_{m} B)$ ;

where $m_{X} (x)$ is defined as 0 if $x \notin S u p p (X)$ for $X = A, B$ .
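As an informal illustration of these five operations, the following Python sketch uses the standard-library `collections.Counter` as a stand-in for a multiset. The helper names `multiset_ops` and `is_submultiset` and the two example multisets are ours, chosen only for this illustration.

```python
from collections import Counter

def is_submultiset(a, b):
    """A is a sub-multiset of B if every multiplicity in A is at most that in B."""
    return all(b[x] >= m for x, m in a.items())

def multiset_ops(a, b):
    """Return the five multiset operations defined above, keyed by name."""
    return {
        "union": a | b,                      # max of multiplicities
        "sum": a + b,                        # sum of multiplicities
        "intersection": a & b,               # min of multiplicities
        "difference": a - b,                 # positive part of m_A - m_B
        "symmetric_difference": (a | b) - (a & b),
    }

A = Counter({"a": 2, "b": 1})            # the monomial a^2 b
B = Counter({"a": 1, "b": 1, "c": 1})    # the monomial a b c

ops = multiset_ops(A, B)
print(dict(ops["union"]))                 # {'a': 2, 'b': 1, 'c': 1}
print(dict(ops["sum"]))                   # {'a': 3, 'b': 2, 'c': 1}
print(dict(ops["intersection"]))          # {'a': 1, 'b': 1}
print(dict(ops["difference"]))            # {'a': 1}
print(dict(ops["symmetric_difference"]))  # {'a': 1, 'c': 1}
```

Here the symmetric difference of $a^{2} b$ and $a b c$ is $a c$, so its size, counted with multiplicity, is 2.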

Let L be a set and $P_{m}(L)$ be the set of all sub-multisets on L. A tree T is labeled with the sub-multisets of L if T is equipped with a function $ℓ : V(T) \to P_{m}(L)$ such that $\bigcup_{v \in V(T)} \mathrm{Supp}(ℓ(v)) = L$ and $ℓ(v) \neq \emptyset$ for every $v \in V(T)$. We call such a tree a multiset-labeled tree. For $C \subseteq V(T)$ and $x \in L$, we define $L_{m}(C)$ and $m_{T}(x)$ as follows: $L_{m}(C) = ⨄_{v \in C} ℓ(v);$ (14)

5.2. The k-RF for multiset-labeled trees

Let T be a multiset-labeled tree. Then, each edge $e = (u, v)$ of T induces a pair of multisets $P_{T} (e) = \{L_{m} (B_{e} (u)), L_{m} (B_{e} (v))\},$ (16)

where $L_{m} ()$ is defined in Equation (14), and $B_{e} (u)$ in Equation (2). Note that Equation (16) is obtained from Equation (1) by replacing $L ()$ with $L_{m} ()$ .

Remark 2. In a multiset-labeled tree T, two edges may induce the same multiset pair, as shown in Figure 7. Hence, $P(T)$ in Equation (3) is a multiset in general.

FIG. 7.

Two multiset-labeled trees used to show that different edges can give the same label multi-subset pair. Here, $P_{T} (e_{2}) = P_{T} (e_{3}) = \{a b c, a^{2} b^{2} c\}$ .

We use Equations (16), (3), and (4) to define the RF-distance for multiset-labeled trees by replacing $Δ$ with $Δ_{m}$ in Equation (4).

Let k ≥ 0. We use Equations (5), (6), and (7) to define the k-RF for multiset-labeled trees by replacing $L ()$ with $L_{m} ()$ in Equation (5) and replacing $Δ$ with $Δ_{m}$ in Equation (7).

Example 4. Consider the multiset-labeled trees S, $S'$ , and T in Figure 8 . $P_{k} (T), P_{k} (S)$ and $P_{k} (S')$ for $k = 0, 1, \infty$ are summarized in Table 2 . We obtain:

FIG. 8.

Three multiset-labeled trees in Example 4.

Table 2.

Edge-Induced Unordered Pairs of Multisets in the Three Trees in Figure 8 for $k = 0, 1, \infty$

Tree	$P_{0} ()$	$P_{1} ()$	$P_{\infty} ()$
T	$\{c^{2}, e^{2}\}$	$\{c^{2}, c e^{2}\}$	$\{a^{2} b^{2} c^{3} d^{2} e^{2}, c^{2}\}$
	$\{c, e^{2}\}$	$\{a b^{2} c d^{2}, a c^{2}\}$	$\{a^{2} b^{2} c^{3} d^{2}, c^{2} e^{2}\}$
	$\{a c, c\}$	$\{a c^{2}, c^{2} e^{2}\}$	$\{a^{2} b^{2} c^{2} d^{2}, c^{3} e^{2}\}$
	$\{a c, d\}$	$\{a b^{2}, a c^{2} d^{2}\}$	$\{a b^{2} c d^{2}, a c^{4} e^{2}\}$
	$\{a b^{2}, d\}$	$\{a c d, c e^{2}\}$	$\{a b^{2}, a c^{5} d^{2} e^{2}\}$
	$\{c d, d\}$	$\{a^{2} b^{2} c d, c d\}$	$\{a^{2} b^{2} c^{4} d e^{2}, c d\}$
S	$\{c^{2}, e\}$	$\{a c^{2} e^{2}, c^{2}\}$	$\{a^{2} b^{2} c^{3} d^{2} e^{2}, c^{2}\}$
	$\{c e, e\}$	$\{a^{2} b c^{2} d, b d\}$	$\{a^{2} b^{2} c^{2} d^{2}, c^{3} e^{2}\}$
	$\{a c, e\}$	$\{a b^{2} c d^{2}, a c e\}$	$\{a b^{2} c d^{2}, a c^{4} e^{2}\}$
	$\{a c, d\}$	$\{a c^{3} e, c e\}$	$\{a^{2} b^{2} c^{4} d^{2} e, c e\}$
	$\{a b c, d\}$	$\{a c d, c^{3} e^{2}\}$	$\{a b c, a b c^{4} d^{2} e^{2}\}$
	$\{b d, d\}$	$\{a b c, a b c d^{2}\}$	$\{a^{2} b c^{5} d e^{2}, b d\}$
$S'$	$\{c^{2}, e^{2}\}$	$\{c^{2}, c e^{2}\}$	$\{a b^{3} c^{3} d^{2} e^{2}, c^{2}\}$
	$\{c, e^{2}\}$	$\{a c^{2}, c^{2} e^{2}\}$	$\{a b^{3} c^{3} d^{2}, c^{2} e^{2}\}$
	$\{a c, c\}$	$\{a c d, c e^{2}\}$	$\{a b^{3} c^{2} d^{2}, c^{3} e^{2}\}$
	$\{a c, d\}$	$\{a c^{2} d^{2}, b^{3}\}$	$\{a c^{4} e^{2}, b^{3} c d^{2}\}$
	$\{b^{3}, d\}$	$\{a c^{2}, b^{3} c d^{2}\}$	$\{a c^{5} e^{2} d^{2}, b^{3}\}$
	$\{c d, d\}$	$\{a b^{3} c d, c d\}$	$\{a b^{3} c^{4} e^{2} d, c d\}$

$\begin{matrix} d_{0\text{-RF}}(T, S') = 2; & d_{1\text{-RF}}(T, S') = 6; & d_{RF}(T, S') = 12; \\ d_{0\text{-RF}}(S, S') = 10; & d_{1\text{-RF}}(S, S') = 12; & d_{RF}(S, S') = 12. \end{matrix}$

It is not hard to see that both $d_{0\text{-RF}}(T, S')$ and $d_{1\text{-RF}}(T, S')$ reflect the local similarity of the two multiset-labeled trees better than $d_{RF}(T, S')$.
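The scores in Example 4 can be reproduced with a short Python sketch, given here only as an illustration and not as the authors' implementation. It rests on two assumptions. First, the tree shapes are inferred from the $P_{0}()$ column of Table 2, whose pairs are the labels of the two endpoints of each edge. Second, for an edge $e = (u, v)$, each side of the induced pair is taken to be the multiset sum of the labels of the nodes within distance k of the endpoint after e is removed, which agrees with the $P_{1}()$ and $P_{\infty}()$ columns of Table 2. The node identifiers 1 to 7 and all function names are ours.

```python
from collections import Counter, deque

def freeze(multiset):
    """Hashable canonical form of a Counter, e.g. Counter('abb') -> (('a', 1), ('b', 2))."""
    return tuple(sorted(multiset.items()))

def side(adj, labels, u, v, k):
    """Multiset sum of the labels of nodes within distance k of u once edge (u, v) is removed."""
    total, seen, queue = Counter(), {u, v}, deque([(u, 0)])
    while queue:
        node, dist = queue.popleft()
        total += labels[node]
        if dist < k:
            for w in adj[node]:
                if w not in seen:
                    seen.add(w)
                    queue.append((w, dist + 1))
    return freeze(total)

def edge_pairs(tree, k):
    """Multiset P_k of unordered pairs induced by the edges of tree = (edges, labels)."""
    edges, labels = tree
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return Counter(frozenset((side(adj, labels, u, v, k), side(adj, labels, v, u, k)))
                   for u, v in edges)

def k_rf(tree1, tree2, k):
    """Size of the multiset symmetric difference of the two P_k multisets."""
    p, q = edge_pairs(tree1, k), edge_pairs(tree2, k)
    return sum(((p | q) - (p & q)).values())

def make_tree(edges, node_labels):
    """Build (edges, labels); each label is written as a string such as 'abb' for a b^2."""
    return edges, {v: Counter(s) for v, s in node_labels.items()}

# Shapes inferred from the P_0 column of Table 2 (an assumption, see the text above).
SHAPE_T = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (5, 7)]
SHAPE_S = [(1, 2), (2, 3), (2, 4), (4, 5), (5, 6), (5, 7)]
T = make_tree(SHAPE_T, {1: "cc", 2: "ee", 3: "c", 4: "ac", 5: "d", 6: "abb", 7: "cd"})
S_PRIME = make_tree(SHAPE_T, {1: "cc", 2: "ee", 3: "c", 4: "ac", 5: "d", 6: "bbb", 7: "cd"})
S = make_tree(SHAPE_S, {1: "cc", 2: "e", 3: "ce", 4: "ac", 5: "d", 6: "abc", 7: "bd"})

INF = 7  # any k at least the number of nodes behaves like k = infinity

print([k_rf(T, S_PRIME, k) for k in (0, 1, INF)])  # [2, 6, 12]
print([k_rf(S, S_PRIME, k) for k in (0, 1, INF)])  # [10, 12, 12]
```

The pairs are stored in a `Counter` rather than a set because, as noted in Remark 2, two edges of a multiset-labeled tree may induce the same pair.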

5.3. k-RF for multiset-labeled rooted trees

Let k ≥ 0 be an integer. We use Equations (10), (11), and (12) to define k-RF for multiset-labeled rooted trees by replacing $L ()$ with $L_{m} ()$ in Equation (10) and replacing $Δ$ with $Δ_{m}$ in Equation (12).

Proposition 5. Let k ≥ 0 be an integer. The k-RF satisfies the non-negativity, symmetry, and triangle inequality conditions. Hence, k-RF is a pseudometric for each k in the space of multiset-labeled (rooted) trees.

Proof. The non-negativity and symmetry conditions follow from the definition of the k-RF. The triangle inequality condition is proved as follows.

Let T₁, T₂, and T₃ be three multiset-labeled trees. We need to show: $d_{k\text{-RF}}(T_{1}, T_{2}) \leq d_{k\text{-RF}}(T_{1}, T_{3}) + d_{k\text{-RF}}(T_{3}, T_{2}).$

Note that $P_{k}(X)$ denotes the multiset of edge-induced pairs of sub-multisets in X for $X = T_{1}, T_{2}, T_{3}$.

If $x^{m(x)} \in P_{k}(T_{1}) Δ_{m} P_{k}(T_{2})$, we have either $x^{m(x)} \in P_{k}(T_{1}) ∖_{m} P_{k}(T_{2})$ or $x^{m(x)} \in P_{k}(T_{2}) ∖_{m} P_{k}(T_{1})$. Assume $x^{m(x)} \in P_{k}(T_{1}) ∖_{m} P_{k}(T_{2})$. Then, $m_{P_{k}(T_{1})}(x) > m_{P_{k}(T_{2})}(x)$. If $x \notin \mathrm{Supp}(P_{k}(T_{3}) ∖_{m} P_{k}(T_{2}))$, we have $m_{P_{k}(T_{1})}(x) > m_{P_{k}(T_{2})}(x) \geq m_{P_{k}(T_{3})}(x)$. This implies that $x \in \mathrm{Supp}(P_{k}(T_{1}) ∖_{m} P_{k}(T_{3}))$ and $m_{P_{k}(T_{1}) ∖_{m} P_{k}(T_{3})}(x) = m_{P_{k}(T_{1})}(x) - m_{P_{k}(T_{3})}(x) \geq m_{P_{k}(T_{1})}(x) - m_{P_{k}(T_{2})}(x) = m(x).$ Thus, $m(x) \leq m_{P_{k}(T_{1}) Δ_{m} P_{k}(T_{3})}(x) + m_{P_{k}(T_{3}) Δ_{m} P_{k}(T_{2})}(x).$

On the other hand, if $x \in \mathrm{Supp}(P_{k}(T_{3}) ∖_{m} P_{k}(T_{2}))$ and $m_{P_{k}(T_{3})}(x) \geq m_{P_{k}(T_{1})}(x)$, we have: $\begin{matrix} m_{P_{k}(T_{3}) ∖_{m} P_{k}(T_{2})}(x) = m_{P_{k}(T_{3})}(x) - m_{P_{k}(T_{2})}(x) \\ \geq m_{P_{k}(T_{1})}(x) - m_{P_{k}(T_{2})}(x) = m(x). \end{matrix}$

If $x \in \mathrm{Supp}(P_{k}(T_{3}) ∖_{m} P_{k}(T_{2}))$ and $m_{P_{k}(T_{3})}(x) < m_{P_{k}(T_{1})}(x)$, we have $m_{P_{k}(T_{1})}(x) > m_{P_{k}(T_{3})}(x) > m_{P_{k}(T_{2})}(x)$, implying $x \in \mathrm{Supp}(P_{k}(T_{1}) ∖_{m} P_{k}(T_{3}))$. Thus, we have: $\begin{matrix} m(x) = m_{P_{k}(T_{1}) ∖_{m} P_{k}(T_{3})}(x) + m_{P_{k}(T_{3}) ∖_{m} P_{k}(T_{2})}(x) \\ \leq m_{P_{k}(T_{1}) Δ_{m} P_{k}(T_{3})}(x) + m_{P_{k}(T_{3}) Δ_{m} P_{k}(T_{2})}(x). \end{matrix}$

Finally, if $x^{m(x)} \in P_{k}(T_{2}) ∖_{m} P_{k}(T_{1})$, we can obtain the same result. In summary, we have: $\mathrm{Supp}(P_{k}(T_{1}) Δ_{m} P_{k}(T_{2})) \subseteq \mathrm{Supp}(P_{k}(T_{1}) Δ_{m} P_{k}(T_{3})) \cup \mathrm{Supp}(P_{k}(T_{3}) Δ_{m} P_{k}(T_{2})).$

In addition, for each $x \in \mathrm{Supp}(P_{k}(T_{1}) Δ_{m} P_{k}(T_{2}))$, we have: $m_{P_{k}(T_{1}) Δ_{m} P_{k}(T_{2})}(x) \leq m_{P_{k}(T_{1}) Δ_{m} P_{k}(T_{3})}(x) + m_{P_{k}(T_{3}) Δ_{m} P_{k}(T_{2})}(x).$

Therefore, we have: $|P_{k} (T_{1}) Δ_{m} P_{k} (T_{2}) |\leq| P_{k} (T_{1}) Δ_{m} P_{k} (T_{3}) |+| P_{k} (T_{3}) Δ_{m} P_{k} (T_{2})|,$

that is, the triangle inequality holds.

For multiset-labeled rooted trees, the proof is similar and hence omitted.

Remark 3. For multiset-labeled trees, $d_{k\text{-RF}}(S, T) = 0$ does not imply that S and T are identical, as illustrated in Figure 9.

FIG. 9.

Two distinct multiset-labeled trees S and T that satisfy $P_{2}(S) = P_{2}(T) = \{\{a^{2} d^{2}, b\}, \{a b d, a d\}, \{a, a b d^{2}\}\}$, showing that the 2-RF score can be 0 even for distinct trees.

Proposition 6. Let k ≥ 0 and S and T be two (rooted) trees whose nodes are labeled by $L(S)$ and $L(T)$, respectively. Then, $d_{k\text{-RF}}(S, T)$ can be computed in time, where B is the maximum multiplicity of a label appearing in $\{P_{T}(e, k) | e \in V(T)\} \cup \{P_{S}(e, k) | e \in V(S)\}$ and $D = |\mathrm{Supp}(L(S)) \cup \mathrm{Supp}(L(T))|$.

Proof. An algorithm for the 1-labeled case can be modified as follows for computing k-RF on multiset-labeled rooted and unrooted trees:

Represent each label multiset as a D-dimensional vector, in which the integer at position j is the multiplicity of the j-th label. Computing all edge-induced pairs in both trees takes $O (k (|E (S)| + |E (T)|))$ set operations. Each set operation takes D integer operations.

Radix-sort all the edge-induced pairs for S and T in $O (D (|E (S)| + B))$ and $O (D (|E (T)| + B))$ integer operations, respectively.

Compute the symmetric difference of the multisets of edge-induced pairs in the two input trees in $|E(S)| + |E(T)|$ set operations. Each set operation takes D integer operations.
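The second and third steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the edge-induced pairs are taken as already computed, each label multiset is given as a dictionary from labels to multiplicities, and Python's built-in comparison sort stands in for the radix sort named above. All function names and the small input are hypothetical.

```python
def to_vector(multiset, labels):
    """Encode a label multiset (dict label -> multiplicity) as a D-dimensional tuple."""
    return tuple(multiset.get(x, 0) for x in labels)

def sorted_pairs(pairs, labels):
    """Encode each unordered pair canonically (smaller vector first) and sort the list."""
    return sorted(
        tuple(sorted((to_vector(a, labels), to_vector(b, labels)))) for a, b in pairs
    )

def symmetric_difference_size(p, q):
    """Merge two sorted lists, counting entries that have no one-to-one match."""
    i = j = unmatched = 0
    while i < len(p) and j < len(q):
        if p[i] == q[j]:
            i += 1
            j += 1
        elif p[i] < q[j]:
            i += 1
            unmatched += 1
        else:
            j += 1
            unmatched += 1
    return unmatched + (len(p) - i) + (len(q) - j)

# Hypothetical edge-induced pairs over the label set {a, b}; the first list
# contains a repeated pair, so multiplicities matter.
LABELS = ["a", "b"]
pairs_s = [({"a": 1}, {"a": 1, "b": 1}), ({"a": 1}, {"a": 1, "b": 1}), ({"b": 2}, {"a": 1})]
pairs_t = [({"a": 1, "b": 1}, {"a": 1}), ({"b": 2}, {"a": 1}), ({"b": 1}, {"a": 2})]

score = symmetric_difference_size(sorted_pairs(pairs_s, LABELS), sorted_pairs(pairs_t, LABELS))
print(score)  # 2
```

Because both lists are sorted, a single linear merge suffices, and each comparison of two encoded pairs costs a number of integer comparisons proportional to D.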

In summary, one can compute $d_{k - R F} (S, T)$ using integer operations, as $|E (S)| = |V (S)| - 1$ .
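For reference, the per-step costs stated in the proof can be tallied as follows. This summation is ours, using only the three step costs given above; the closed form stated by the authors may be arranged differently.

```latex
\begin{aligned}
\text{Step 1: } & O\big(kD\,(|E(S)| + |E(T)|)\big) \\
\text{Step 2: } & O\big(D\,(|E(S)| + B)\big) + O\big(D\,(|E(T)| + B)\big) \\
\text{Step 3: } & O\big(D\,(|E(S)| + |E(T)|)\big) \\
\text{Total: }  & O\big(D\,((k + 1)(|E(S)| + |E(T)|) + B)\big)
                  = O\big(D\,((k + 1)(|V(S)| + |V(T)|) + B)\big)
\end{aligned}
```

The last equality uses $|E(S)| = |V(S)| - 1$ and $|E(T)| = |V(T)| - 1$.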

5.4. Correlation of the k-RF and the other measures

Let T and S be two 1-labeled rooted trees with the same label set X. Again, we identify the nodes with their labels in the two trees. For any two subsets X′ and X″ of X, we use $d_{J}(X', X'')$ to denote their Jaccard distance. The CASet $\cap$ distance between T and S is defined to be the average of $d_{J}(A_{T}(i) \cap A_{T}(j), A_{S}(i) \cap A_{S}(j))$ over all pairs of nodes i and j, whereas the DISC $\cap$ distance between T and S is the average of $d_{J}(A_{T}(i) ∖ A_{T}(j), A_{S}(i) ∖ A_{S}(j))$ over all ordered pairs (i, j) of nodes (DiNardo et al., 2020).
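The two averages can be illustrated with a small Python sketch. It is a simplified reading of the definitions above, not the reference implementation of DiNardo et al. (2020), and it makes three assumptions that should be checked against that work: $A_{T}(i)$ is taken to contain i itself together with all its ancestors, the Jaccard distance between two empty sets is taken to be 0, and the averages run over pairs of distinct nodes. Trees are given as hypothetical parent maps.

```python
from itertools import combinations, permutations

def ancestors(parent, v):
    """Ancestor set of v, including v itself (an assumption, see the text above)."""
    out = {v}
    while parent[v] is not None:
        v = parent[v]
        out.add(v)
    return out

def jaccard(x, y):
    """Jaccard distance; two empty sets are taken to be at distance 0."""
    union = x | y
    return 0.0 if not union else 1.0 - len(x & y) / len(union)

def caset(parent_t, parent_s):
    """Average Jaccard distance between common-ancestor sets over unordered pairs."""
    at = {v: ancestors(parent_t, v) for v in parent_t}
    as_ = {v: ancestors(parent_s, v) for v in parent_s}
    pairs = list(combinations(sorted(parent_t), 2))
    return sum(jaccard(at[i] & at[j], as_[i] & as_[j]) for i, j in pairs) / len(pairs)

def disc(parent_t, parent_s):
    """Average Jaccard distance between distinctly inherited sets over ordered pairs."""
    at = {v: ancestors(parent_t, v) for v in parent_t}
    as_ = {v: ancestors(parent_s, v) for v in parent_s}
    pairs = list(permutations(sorted(parent_t), 2))
    return sum(jaccard(at[i] - at[j], as_[i] - as_[j]) for i, j in pairs) / len(pairs)

# Two 3-node rooted trees on labels {0, 1, 2}: a path 0 -> 1 -> 2 and a star rooted at 0.
path = {0: None, 1: 0, 2: 1}
star = {0: None, 1: 0, 2: 0}

print(round(caset(path, star), 4))  # 0.1667
print(round(disc(path, star), 4))   # 0.25
```

In this toy case only the pair (1, 2) changes its common-ancestor set, from {0, 1} in the path to {0} in the star, which accounts for the whole CASet $\cap$ value.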

Using the Pearson correlation (PC), we compared the k-RF with CASet $\cap$ , DISC $\cap$ , and GRF (Llabrés et al., 2020) in the space of set-labeled trees for different k from 0 to 28.

First, we conducted the correlation analysis in the space of mutation trees with the same label set. Using a method reported by Jahn et al. (2021), we generated a simulated dataset containing 5000 rooted trees in which the root was labeled with 0 and the other nodes were labeled by the disjoint subsets of $\{1, 2, \dots, 29\}$, where the trees might have different numbers of nodes. Using all $\binom{5000}{2}$ pairwise scores for CASet $\cap$, DISC $\cap$, GRF, and k-RF, we conducted the PC analysis of k-RF with the other three (Fig. 10, left panel).
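For concreteness, the PC between two lists of pairwise scores can be computed as in the sketch below. The function is the textbook Pearson formula, and the two score lists are made-up values used only to show the call; they are not data from this study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores of the same four tree pairs under two measures.
k_rf_scores = [2, 6, 10, 12]
other_scores = [0.10, 0.35, 0.50, 0.80]

print(round(pearson(k_rf_scores, other_scores), 3))
```

Each entry of the two lists must refer to the same pair of trees, so that the coefficient compares how the two measures rank and scale identical comparisons.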

FIG. 10.

PC of the k-RF with CASet $\cap$, DISC $\cap$, and GRF. The analyses were conducted on random rooted trees with the same label set (left) and with different but overlapping label sets (right) that were generated as reported in Jahn et al. (2021). The PC became constant for k ≥ 19, the range in which the k-RF becomes the RF. CASet $\cap$, Common Ancestor Set distance; DISC $\cap$, Distinctly Inherited Set Comparison distance (DiNardo et al., 2020); GRF, Generalized RF distance (Llabrés et al., 2021).

Our results show that CASet $\cap$ , DISC $\cap$ and GRF were all positively correlated with k-RF. We observed the following facts:

The GRF and k-RF are highly correlated for each k < 8.

The DISC $\cap$ and k-RF are highly correlated for each k ≥ 8.

The 5-RF and 6-RF were less correlated to CASet $\cap$ , DISC $\cap$ and GRF than other k-RF.

The PC between k-RF and CASet $\cap$ (respectively, DISC $\cap$ ) increased when k went from 6 to 15.

Next, we performed PC analysis on trees characterized by distinct yet intersecting label sets. The dataset was created using the identical methodology, comprising a union of five sets of rooted trees, each encompassing 200 trees and sharing the same label set. Dissimilarity scores were calculated for each tree within the initial group and each tree in the remaining groups. Subsequently, we computed the PC between different dissimilarity values. Once more, all dissimilarity measures exhibited a positive correlation, although less pronounced than in the initial scenario [refer to Figure 10 (right)]. This could be because a difference in the label sets of two trees makes their k-RF score at least k + 1. However, the difference does not strongly contribute to the other distances because DISC $\cap$ and CASet $\cap$ consider the intersection of label sets (DiNardo et al., 2020), and GRF considers the intersection of clusters.

The right dotplot of Figure 10 shows that the k-RF and DISC $\cap$ had the largest PC for k from 1 to 9, and the k-RF and the CASet $\cap$ had the largest PC for k ≥ 10. Moreover, all the PCs decreased when k changed from 1 to 15. This trend was not observed in the first case. The decrease could arise because differences in label sets contribute more to the k-RF as k increases.

6. CLUSTERING TREES WITH THE k-RF

A test was designed to compare the k-RF, CASet $\cap$ , DISC $\cap$ , and GRF in terms of clustering labeled trees.

We randomly generated 5 tree families, each containing 50 trees, using the program reported by Jahn et al. (2021). In the trees of each family, the nodes were labeled by subsets of a set of size 30. The label sets of any two different families differed in only one label. We imposed this restriction because, in each tree, distinct nodes were labeled by disjoint subsets; hence, each label that differs between the label sets of two trees induces d pairs that belong only to the tree containing that label, where d is the degree of the node carrying the label. Therefore, the more the label sets differ, the more distinguishable the trees could be by the k-RF.

We computed the pairwise dissimilarity scores for all 250 trees in the 5 families via each measure; we then clustered the 250 trees into c clusters using the K-means clustering method, where c ranges from 2 to 57. The clustering results were assessed using the Silhouette score (Kaufman and Rousseeuw, 2009).
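To make the assessment step concrete, the sketch below computes the mean Silhouette score directly from a precomputed dissimilarity matrix and a cluster assignment. It covers only the scoring, not the K-means step, and it is a plain restatement of the standard Silhouette definition rather than the code used in this study. The 4-by-4 matrix and the two assignments are hypothetical.

```python
def silhouette_score(dist, labels):
    """Mean Silhouette value given a symmetric dissimilarity matrix and cluster labels.

    Assumes at least two clusters. Singleton clusters contribute 0, as is conventional.
    """
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    total = 0.0
    for i, c in enumerate(labels):
        own = clusters[c]
        if len(own) == 1:
            continue  # s(i) = 0 for a singleton cluster
        # a: mean dissimilarity to the other members of i's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items() if other != c
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)

# Hypothetical dissimilarities: trees 0 and 1 are close, trees 2 and 3 are close.
D = [
    [0, 1, 10, 10],
    [1, 0, 10, 10],
    [10, 10, 0, 1],
    [10, 10, 1, 0],
]

print(round(silhouette_score(D, [0, 0, 1, 1]), 2))  # 0.9
print(round(silhouette_score(D, [0, 1, 0, 1]), 2))  # -0.45
```

A score near 1 indicates that the assignment groups mutually similar trees, while a negative score indicates that trees sit closer to another cluster than to their own.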

As Figure 11 shows, the correct number of tree families was not recognized by any of the CASet $\cap$, DISC $\cap$, and GRF distances. However, among these three measures and the k-RF measures for k ≤ 12, the Silhouette score of CASet $\cap$ was the highest when the number of clusters was 5. Furthermore, the figure illustrates that the exact number of families was recognized by the k-RF when k ranges from 12 to 19. Moreover, the Silhouette score of the k-RF increased when k increased from 8 to 19. This interesting observation may stem from the fact that as k increases, the number of pairs of trees achieving the highest possible k-RF score also increases, thereby enhancing the recognizability of families. It is worth noting that such pairs are guaranteed to exist when k is larger than the minimum diameter of the trees, which is 8 in our case.

FIG. 11.

Silhouette scores of clustering 250 rooted trees with k-RF for $0 \leq k \leq 11$ (left) and $12 \leq k \leq 19$ (middle) and with CASet $\cap$, DISC $\cap$, and GRF (right).

7. CONCLUSIONS

The development of an efficient and robust measure for the comparison of labeled trees is important. In this study, we have proposed a novel variant of dissimilarity metrics, namely the k-RF, tailored for labeled trees. The k-RF facilitates the analysis of local structures in labeled trees, accommodating nodes labeled with (not necessarily the same) multisets. Significantly, these metrics find practical applicability in mutation trees used in cancer research.

The RF distance is succinctly expressed as (n − 1)-RF within the space of labeled trees with n nodes. By setting k to a value smaller than n − 1, the k-RF metric can capture analogous local regions in two labeled trees. Of note, for every k, the k-RF is a pseudometric for multiset-labeled trees and becomes a metric in the space of 1-labeled trees. However, the distribution of pairwise k-RF scores in the space of 1-labeled unrooted (or rooted) trees conforms to a Poisson distribution specifically for k = n − 2, and is unlikely to follow the same trend for other values of k ≥ 1.

We verified the k-RF measures through a comprehensive comparison with CASet, DISC (DiNardo et al., 2020) and GRF (Llabrés et al., 2021) on randomly labeled trees generated by an in-house program (Jahn et al., 2021). Our findings revealed a consistent positive correlation between k-RF and each of the other three measures for every value of k. Of note, the correlation values exhibited a tendency to be higher when the measures were applied to assess mutation trees with identical label sets. Furthermore, our study underscored the superior clustering capabilities of k-RF compared with the three mentioned measures.

We would like to emphasize that selecting an appropriate k-RF in practical applications lacks a universal rule of thumb, primarily owing to a shortage of experience in this domain. Perhaps a judicious approach involves choosing a suitable k-RF by carefully considering the topological similarity among the trees under consideration.

Future work includes applying the k-RF to the design of tree inference algorithms such as GraPhyC (Govek et al., 2018) and inferring the exact frequency distribution of the k-RF for each k ≥ 1. It would also be interesting to investigate the generalization of the RF distance for clonal trees (Llabrés et al., 2020).

The code for computing the pairwise k-RF scores of a group of multiset-labeled trees can be downloaded from https://github.com/Elahe-khayatian/k-RF-measures.git.

ACKNOWLEDGMENTS

This work is an extended version of a conference article that was presented at RECOMB-CG 2023, held in Istanbul, Turkey. The authors thank the anonymous reviewer for providing helpful suggestions and comments on our first submission of the work.

AUTHOR DISCLOSURE STATEMENT

The authors declare they have no conflicting financial interests.

FUNDING INFORMATION

This research was partially supported by the Ministerio de Ciencia e Innovación (MCI), the Agencia Estatal de Investigación (AEI) and the European Regional Development Funds (ERDF) through project METACIRCLE PID2021-126114NB-C44, also supported by the European Regional Development Fund (FEDER), by the Agency for Management of University and Research Grants (AGAUR) through grant 2017-SGR-786 (ALBCOM), and by Singapore MOE Tier 1 grant R-146-000-318-114.

References

1. Briand, Dessimoz, El-Mabrouk, et al. A generalized Robinson–Foulds distance for labeled trees. BMC Genomics, 2020; 21(Suppl. 10):779.

2. Briand, Dessimoz, El-Mabrouk, et al. A linear time solution to the labeled Robinson–Foulds distance problem. Syst Biol, 2022; 71(6):1391–1403.

3. Camin, Sokal. A method for deducing branching sequences in phylogeny. Evolution, 1965; 19(3):311–326.

4. Ciccolella, Bernardini, Denti, et al. Triplet-based similarity score for fully multilabeled trees with poly-occurring labels. Bioinformatics, 2021; 37(2):178–184.

5. DiNardo, Tomlinson, Ritz, et al. Distance measures for tumor evolutionary trees. Bioinformatics, 2020; 36(7):2090–2097.

6. Estabrook, McMorris, Meacham. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Syst Zool, 1985; 34(2):193–200.

7. Farris JS. Phylogenetic analysis under Dollo's law. Syst Biol, 1977; 26(1):77–88.

8. Govek, Sikes, Oesper. A consensus approach to infer tumor evolutionary histories. In: Proceedings of 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (BCB’18). ACM Press: New York, NY, USA; 2018; pp. 63–72.

9. Jahn, Beerenwinkel, Zhang. The Bourque distances for mutation trees of cancers. Algorithms Mol Biol, 2021; 16:9.

10. Jürgensen. Multisets, heaps, bags, families: What is a multiset? Math Struct Comput Sci, 2020; 30(2):139–158.

11. Karpov, Malikic, Rahman, et al. A multi-labeled tree dissimilarity measure for comparing clonal trees of tumor progression. Algorithms Mol Biol, 2019; 14:17.

12. Kaufman, Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons: New York, NY, USA; 2009.

13. Li, Tromp, Zhang. On the nearest neighbour interchange distance between evolutionary trees. J Theoret Biol, 1996; 182(4):463–467.

14. Llabrés, Rosselló, Valiente. A generalized Robinson–Foulds distance for clonal trees, mutation trees, and phylogenetic trees and networks. In: Proceedings of 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. ACM Press: New York, NY, USA; 2020; pp. 13:1–13:10.

15. Llabrés, Rosselló, Valiente. The generalized Robinson–Foulds distance for phylogenetic trees. J Comput Biol, 2021; 28(12):1–15.

16. Robinson DF. Comparison of labeled trees with valency three. J Combin Theory, 1971; 11(2):105–119.

17. Robinson, Foulds. Comparison of phylogenetic trees. Math Biosci, 1981; 53(1–2):131–147.

18. Schwartz, Schäffer. The evolution of tumour phylogenetics: Principles and practice. Nat Rev Genet, 2017; 18(4):213–229.

19. Steel, Penny. Distributions of tree comparison metrics: Some new results. Syst Biol, 1993; 42(2):126–141.

20. Williams, Clifford. On the comparison of two classifications of the same set of elements. Taxon, 1971; 20(4):519–522.

The k-Robinson–Foulds Dissimilarity Measures for Comparison of Labeled Trees