Abstract

In phylogenetics, diversity measures such as UniFrac, weighted UniFrac, and normalized weighted UniFrac are used to estimate the closeness between two samples of genetic material sequences. These measures are widely used in microbiology to compare microbial communities, and when the sample size is large enough they have given very good experimental results. However, some authors advise against using them when the sample size is very small. Recently, it has been observed that the weighted UniFrac measure can be seen as the Kantorovich-Rubinstein metric between the empirical distributions of the corresponding samples of genetic material. It is well known that the Kantorovich-Rubinstein metric satisfies the definition of a metric; however, one of the main assumptions behind this identification is that the sample size is large enough. The goal of this article is to prove that the UniFrac diversity measures are not metrics when the sample size is very small, which justifies why they should not be used in that case, whereas the Kantorovich-Rubinstein metric can be.

Keywords

phylogenetic, phylogenetic tree, pseudometric, semimetric

Introduction

Phylogenetics is a field of biology that studies how organisms are related through evolution. The basic principle is that the members of a set of organisms that descend from the same ancestor share an evolutionary history. One problem in phylogenetic analysis is to determine similarities and differences between genetic material sequences, for example, the degree of difference between two samples $A$ and $B$ of genetic material sequences. For this purpose, the diversity measures UniFrac,¹ weighted UniFrac, and normalized weighted UniFrac² have been used.

These diversity measures are used by several authors in microbiology to compare samples of genetic material. For example, Frank et al³ used the UniFrac diversity measure to check whether samples from patients with inflammatory bowel disease come from different microbial communities than samples from patients without the disease. Costello et al⁴ used the weighted and normalized weighted UniFrac measures to better understand the structure of the microbial community in skin sites and other body habitats across individuals and over time, and suggested that these trends may reveal how changes in the microbial community cause or prevent disease. Charlson et al⁵ used these measures to compare the bacterial populations of the lungs with those of the upper respiratory tract in healthy individuals. Finally, Ley et al⁶ used the diversity measure to quantify the difference between bacterial communities in mouse intestines, in order to test the effects of kinship and genotype.

Moreover, from a theoretical point of view, the UniFrac diversity measures have given rise to other measures. For example, Chang et al⁷ proposed a new weighting scheme that assumes the sequences are randomly distributed; this scheme is called variance adjusted weighted UniFrac (VAW-UniFrac) and is proposed as an improvement of weighted UniFrac. They also compare VAW-UniFrac with UniFrac and weighted UniFrac to determine which is more efficient. Chen et al⁸ gave a generalization of the UniFrac diversity measures that is more useful than UniFrac for detecting a range of biologically relevant changes.

However, despite their practical use in microbiology, Schloss⁹ mentions that ‘A recent simulation study concluded that UniFrac is unsuitable as a distance metric and should not be used for multivariate analysis’; that is, it is not appropriate to treat the UniFrac diversity measures as metrics or to use them in multivariate analysis.

Recently, Evans and Matsen¹⁰ observed that the weighted UniFrac measure is the classical Kantorovich-Rubinstein metric¹¹⁻¹⁴ or Earth Mover's Distance¹⁵ between the empirical distributions of the corresponding samples of genetic material on a phylogenetic tree, under the assumption that the sample size is large enough. Along these lines, McClelland and Koslicki¹⁶ proposed the earth mover's distance UniFrac (EMDUniFrac) and an algorithm to compute it.

In this article, we prove that the original versions of the UniFrac diversity measures are not metrics but pseudosemimetrics; that is, they satisfy the following definition.

Definition 1

Let $X$ be a set. A function $d : X \times X \to [0, + \infty)$ is a pseudosemimetric on $X$ if, for all $x, y \in X$, the function $d$ satisfies

1. If $x = y$, then $d (x, y) = 0$.

2. $d (x, y) = d (y, x)$.

This explains why the UniFrac measures can behave unexpectedly for small samples in multivariate data analysis, whereas this is not the case when the sample size is large enough. Thus, when the sample size is very small, we recommend using the EMDUniFrac metric.
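To make the gap between Definition 1 and a full metric concrete, the following is a minimal Python sketch that lists which metric axioms a dissimilarity violates on a finite set. The function name `violated_axioms`, the labels, and the toy values in `table` are hypothetical and only illustrate the idea; they are not taken from any of the cited works.

```python
from itertools import product

def violated_axioms(points, d, tol=1e-12):
    """Return the names of the metric axioms that d violates on `points`."""
    violated = set()
    for x, y in product(points, repeat=2):
        if x == y and abs(d(x, y)) > tol:
            violated.add("d(x, x) = 0")
        if x != y and abs(d(x, y)) <= tol:
            violated.add("d(x, y) = 0 implies x = y")
        if abs(d(x, y) - d(y, x)) > tol:
            violated.add("symmetry")
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            violated.add("triangle inequality")
    return violated

# Hypothetical toy dissimilarity between three samples labeled A, B, C.
table = {("A", "B"): 0.6, ("A", "C"): 0.0, ("C", "B"): 0.5}

def d(x, y):
    if x == y:
        return 0.0
    return table.get((x, y), table.get((y, x)))

# Both conditions of Definition 1 hold, yet two metric axioms fail.
print(sorted(violated_axioms(["A", "B", "C"], d)))
```

A function that passes the first and third checks is a pseudosemimetric in the sense of Definition 1; a metric must pass all four.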

The rest of the article is organized as follows. Section “Rooted phylogenetic trees” gives the concepts needed to define the diversity measures. Section “Diversity measures UniFrac” defines the three versions (UniFrac, weighted UniFrac, and normalized weighted UniFrac) and shows that they are pseudosemimetrics; we also prove that they are not metrics and show how they are affected by small samples. In section “EMDUniFrac,” the UniFrac measures are computed for some examples and compared with the EMDUniFrac metric. Finally, some conclusions are presented in section “Conclusions.”

Rooted Phylogenetic Trees

The diversity measures are calculated on a given phylogenetic tree. In this section, the concepts related to trees will be defined. They will be useful to address diversity measures UniFrac.

Basic definitions

Warnow¹⁷ defines a tree as a connected graph without cycles. A rooted tree $T$ is a tree in which a vertex $r$ is designated as the root. In phylogenetics, the root represents the common ancestor of the species represented in the tree $T$. The vertices represent the characteristics that allow similarities between different species to be established; these characteristics are given by genetic material sequences.

If $w$ and $v$ are vertices of the rooted tree $T$ such that $v \to w$, then $w$ is the parent of $v$ and $v$ is a child of $w$. Moreover, a vertex $l$ is a leaf if it has no children, and $T$ is a binary tree if every vertex has at most two children.

In a tree $T$, a path from vertex $x$ to vertex $y$, denoted by $[x, y]$, is a sequence of vertices starting at $x$ and ending at $y$ in which consecutive vertices are joined by an edge. Branch $i$ is the set of vertices and edges on the path from the leaf $l_{i}$ to the root $r$. The leaf set of $T$, denoted by $L (T)$, is the set of distinct labels assigned to the leaves of $T$. A clade of $T$ is a subset of $L (T)$ that is the leaf set $L (T_{v})$ of the subtree $T_{v}$ rooted at some vertex $v \in T$, and $C (T)$ is the set of all clades $L (T_{v})$ with $v \in T$. The set $C (T)$ contains all the singleton sets of leaves, the set of all leaves, and one clade for each remaining vertex of $T$.

Moreover, Warnow¹⁷ associates a parameter $p (e)$ with each edge $e \in T$, where $p (e)$ denotes the probability of a change of state and $0 < p (e) < 0.5$. A Cavender-Farris-Neyman (CFN) model tree is a pair $(T, θ)$, where $T$ is a binary rooted tree with leaf set $\{l_{1}, \dots, l_{n}\}$ and $θ$ gives the values $p (e)$ for all edges $e \in T$. Under the CFN model, the number of changes on an edge is modeled by a Poisson random variable with expected value $λ (e)$. Hence, instead of the substitution probability $p (e)$ on each edge, we use $λ (e)$, with the condition $λ (e) > 0$ for all $e$.

Thus, the length of branch $i$, denoted by $d (l_{i})$, is a positive number that represents the amount of change between the root $r$ and the leaf $l_{i}$; that is,

d (l_{i}) = \sum_{e \in [r, l_{i}]} λ (e)

Let $λ_{[i j]}$ be the expected number of changes on the path $[l_{i}, l_{j}]$ in the tree $T$; it follows that

λ_{[i j]} = \sum_{e \in [l_{i}, l_{j}]} λ (e)

By definition, $λ$ is the path-distance matrix of the tree: the path distance between two leaves is the sum of the lengths of the edges on the path joining them, and all edge lengths are positive. The matrix $λ$ is an additive matrix, which is defined as follows.

Definition 2

A matrix $M_{n \times n}$ is additive if there is a tree $T$ with leaf set $\{l_{1}, \dots, l_{n}\}$ and non-negative edge lengths such that the length of the path $[l_{i}, l_{j}]$ in $T$ is equal to $M_{[l_{i}, l_{j}]}$.
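The root-to-leaf lengths $d (l_{i})$ and the path distances $λ_{[i j]}$ defined in this subsection can be sketched in Python as follows. The tree, its vertex names, and its edge values are hypothetical and chosen only for illustration; the `parent` dictionary encoding (child mapped to its parent and the value $λ (e)$ of the connecting edge) is an assumption of this sketch, not notation from Warnow.

```python
def path_to_root(v, parent):
    """Vertices on the path from v up to the root, inclusive."""
    path = [v]
    while path[-1] in parent:
        path.append(parent[path[-1]][0])
    return path

def root_distance(v, parent):
    """d(v): sum of lambda(e) over the edges between v and the root."""
    total = 0.0
    while v in parent:
        v, lam = parent[v]
        total += lam
    return total

def leaf_distance(x, y, parent):
    """lambda_[xy]: sum of lambda(e) over the path joining x and y."""
    ancestors_x = path_to_root(x, parent)
    # The first vertex on y's root path that is also above x is their
    # lowest common ancestor.
    lca = next(v for v in path_to_root(y, parent) if v in ancestors_x)
    return (root_distance(x, parent) + root_distance(y, parent)
            - 2 * root_distance(lca, parent))

# Hypothetical tree: root r with children u and l3; u has children l1, l2.
parent = {"u": ("r", 1.0), "l3": ("r", 2.5),
          "l1": ("u", 0.5), "l2": ("u", 1.5)}

leaves = ["l1", "l2", "l3"]
print([root_distance(l, parent) for l in leaves])
print([[leaf_distance(x, y, parent) for y in leaves] for x in leaves])
```

The nested list printed last is the matrix $λ$ for this toy tree, which is additive in the sense of Definition 2 because it is realized by the tree itself.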

Phylogenetic tree construction

To construct a binary rooted phylogenetic tree using two samples $A$ and $B$ , it is necessary to consider the partial order definition.

Definition 3

A partial order is a binary relation $R$ on a set $S$ that, for any $a, b, c \in S$, satisfies

Transitivity: $〈 a, b 〉 \in R$ and $〈 b, c 〉 \in R$ imply that $〈 a, c 〉 \in R$ .

Reflexivity: $〈 a, a 〉 \in R$ for all $a \in S$ .

Antisymmetry: $〈 a, b 〉 \in R$ and $〈 b, a 〉 \in R$ imply that $a = b$ .

Two elements $a$ and $b$ are compatible if $〈 a, b 〉 \in R$ or $〈 b, a 〉 \in R$ .

A Hasse diagram is a graphical representation of a partially ordered set. To construct the Hasse diagram of a set $S$, a vertex is created for each element of $S$, together with a directed edge $x \to y$ if $〈 x, y 〉 \in R$ and $x \neq y$. The vertices are arranged from bottom to top, so that the directed edges point upward. A directed edge $x \to y$ is removed if there is a third vertex $z$ such that $〈 x, z 〉 \in R$ and $〈 z, y 〉 \in R$.

Let $T$ be a rooted phylogenetic tree with clade set $C (T)$. Two elements $A$ and $B$ of $C (T)$ are related, $〈 A, B 〉 \in R$, if and only if $A \subset B$. The relation $R$ is a partial order.

Now we construct the Hasse diagram of the set $C (T)$. A graph is made by assigning a vertex to each element of $C (T)$ and a directed edge from vertex $A$ to a different vertex $B$ if $A \subset B$. Since containment is transitive, $A \subset B$ and $B \subset C$ imply $A \subset C$; therefore, if there are directed edges from $A$ to $B$ and from $B$ to $C$, there is also a directed edge from $A$ to $C$, which can be removed without losing information. In practice, for each $A$ we find the smallest set $B$ with $A \subset B$ and put a directed edge from $A$ to $B$.

The next theorem, proved by Warnow,¹⁷ says that a binary rooted tree $T$ is isomorphic to the Hasse diagram built from $C (T)$.

Theorem 1

Let $T$ be a rooted tree in which each internal node has two children. Then the Hasse diagram built by $C (T)$ is isomorphic to $T$ . In this way, we can get the binary rooted tree $T$ from Hasse diagram built using the set $C (T)$ .
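The construction of the Hasse diagram from a clade set can be sketched in a few lines of Python. The clade set below belongs to a hypothetical three-leaf tree $((l_{1}, l_{2}), l_{3})$, and the function name `hasse_edges` is introduced only for this illustration.

```python
def hasse_edges(clades):
    """Directed edges A -> B of the Hasse diagram of `clades` under strict
    inclusion, keeping only edges with no clade strictly in between."""
    clades = [frozenset(c) for c in clades]
    edges = []
    for a in clades:
        for b in clades:
            # a < b is strict inclusion; drop the edge if some c sits between.
            if a < b and not any(a < c < b for c in clades):
                edges.append((a, b))
    return edges

# Clade set C(T) of the hypothetical rooted binary tree ((l1, l2), l3).
clades = [{"l1"}, {"l2"}, {"l3"}, {"l1", "l2"}, {"l1", "l2", "l3"}]

for child, parent in hasse_edges(clades):
    print(sorted(child), "->", sorted(parent))
```

Each remaining edge points from a clade to the smallest clade containing it, so reading the edges as child-to-parent links recovers the tree, in line with Theorem 1.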

In the next section, the diversity measures are addressed in their three versions, UniFrac, weighted UniFrac, and normalized weighted UniFrac. Also, we will show that they satisfy the pseudosemimetric definition and we will give examples where diversity measures do not satisfy the metric definition.

Diversity Measures UniFrac

To define the UniFrac diversity measures, we consider a binary rooted phylogenetic tree $T$ for two samples $A$ and $B$ of genetic material sequences, where sample $A$ has $A_{t}$ sequences and sample $B$ has $B_{t}$ sequences. The sequences are not necessarily different: $A \cap B \neq \emptyset$ may occur, and a sample may contain two or more equal sequences. Furthermore, each sample may or may not contain the root (the common ancestor of the species in $A$ and $B$). Let $T$ have $n$ branches and let $d (l_{i})$ be the length of branch $i$, for $i = 1, \dots, n$; it coincides with the distance from the root $r$ to the sequence $l_{i}$ at the leaf of the $i$th branch.

Let $A_{i}$ be the number of vertices in $A$ (counted with multiplicity) that are in branch $i$ and, analogously, $B_{i}$ the number of vertices in $B$ that are in branch $i$. We define

P_{i}^{A} = \frac{A_{i}}{A_{t}} \quad \text{and} \quad P_{i}^{B} = \frac{B_{i}}{B_{t}}

Note that these are the proportions of descendant sequences of samples $A$ and $B$ in branch $i$, respectively.

Example 1

Consider the rooted tree in Figure 1. It is constructed using samples $A = \{r, a_{1}, a_{1}, a_{2}\}$ and $B = \{r, b_{1}, b_{2}\}$, where $A_{t} = 4$ and $B_{t} = 3$. The first branch has the sequences $r$ and $b_{1}$, where $r \in A$ and $\{r, b_{1}\} \subset B$, so the proportions of descendant sequences of samples $A$ and $B$ on branch 1 are

P_{1}^{A} = \frac{1}{4} \quad \text{and} \quad P_{1}^{B} = \frac{2}{3}

(1)

respectively. The second branch has the sequences $r, a_{1}, b_{2}$, where $\{r, a_{1}, a_{1}\} \subset A$ and $\{r, b_{2}\} \subset B$; in this way,

P_{2}^{A} = \frac{3}{4} \quad \text{and} \quad P_{2}^{B} = \frac{2}{3}

Analogously, the proportions of descendant sequences in the third branch are

P_{3}^{A} = \frac{4}{4} \quad \text{and} \quad P_{3}^{B} = \frac{1}{3}

Figure 1.

(a) Tree for samples $A$ and $B$ . (b) Tree for samples $A$ and $B$ with the label leaves for $l_{i}$ with $i = 1, 2, 3$ ( $b_{1} = l_{1}, b_{2} = l_{2}, a_{2} = l_{3}$ ).

In later examples, the sequence at the leaf of the $i$th branch will be denoted by $l_{i}$ (see Figure 1) in order to follow the given notation. This is because, to define the diversity measures, we need the branch length $d (l_{i})$, whose notation refers to the sequence at the $i$th leaf.
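The proportions of Example 1 can be reproduced with a short Python sketch. Since the figure is not reproduced here, the membership of each branch is read off from the proportions stated in the example (in particular, $a_{1}$ is taken to lie on both branch 2 and branch 3); the function name `branch_proportions` and the list encoding are assumptions made for this illustration.

```python
from collections import Counter
from fractions import Fraction

def branch_proportions(sample, branches):
    """Return [P_1, ..., P_n]: for each branch, the fraction of `sample`
    (a multiset of sequence labels) whose label lies on that branch."""
    counts = Counter(sample)
    total = len(sample)
    return [Fraction(sum(counts[v] for v in branch), total)
            for branch in branches]

# Branches of the tree in Example 1, each listed from the root to its leaf.
branches = [["r", "b1"], ["r", "a1", "b2"], ["r", "a1", "a2"]]
A = ["r", "a1", "a1", "a2"]   # A_t = 4
B = ["r", "b1", "b2"]         # B_t = 3

P_A = branch_proportions(A, branches)
P_B = branch_proportions(B, branches)
print(P_A)  # proportions 1/4, 3/4, 1
print(P_B)  # proportions 2/3, 2/3, 1/3
```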

UniFrac

The diversity measure UniFrac was proposed by Lozupone and Knight¹ and it is defined as

d^{u} (A, B) = \frac{\sum_{i = 1}^{n} d (l_{i}) | I (P_{i}^{A} > 0) - I (P_{i}^{B} > 0) |}{\sum_{i = 1}^{n} d (l_{i})}

(2)

where $I (\cdot)$ is the indicator function. The absolute value is either $0$ or $1$: it is $1$ when the $i$th branch contains sequences from exactly one of the samples $A$ and $B$, and it is $0$ when it contains sequences from both samples.

Example 2

Consider the tree constructed in Example 1, with $i = 1, 2, 3$. The proportions of descendant sequences in $A$ and $B$ are greater than $0$ on every branch (see expression (1)), so

I (P_{i}^{A} > 0) = 1 \quad \text{and} \quad I (P_{i}^{B} > 0) = 1

thus,

| I (P_{i}^{A} > 0) - I (P_{i}^{B} > 0) | = 0

The UniFrac diversity measure ignores the abundance information of the sequences; it only considers their presence or absence in each branch.
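Expression (2) translates directly into code. The sketch below evaluates it on the tree of Example 1; the branch lengths and the second sample `B_no_root` are hypothetical, and the branch membership is inferred from the proportions given in Example 1.

```python
from fractions import Fraction

def unifrac(A, B, branches, lengths):
    """Unweighted UniFrac as in expression (2)."""
    numerator = 0
    for branch, d in zip(branches, lengths):
        in_A = any(seq in branch for seq in A)  # I(P_i^A > 0)
        in_B = any(seq in branch for seq in B)  # I(P_i^B > 0)
        numerator += d * abs(int(in_A) - int(in_B))
    return Fraction(numerator) / Fraction(sum(lengths))

branches = [["r", "b1"], ["r", "a1", "b2"], ["r", "a1", "a2"]]
lengths = [2, 3, 4]  # hypothetical values of d(l_1), d(l_2), d(l_3)

A = ["r", "a1", "a1", "a2"]
B = ["r", "b1", "b2"]
print(unifrac(A, B, branches, lengths))          # 0, as in Example 2

# Hypothetical sample without the root: branches 2 and 3 now count.
B_no_root = ["b1"]
print(unifrac(A, B_no_root, branches, lengths))  # 7/9
```

Because the root lies on every branch, any two samples that both contain the root get distance $0$ regardless of their other sequences, which is the behaviour exploited in the counterexamples of Proposition 1.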

Proposition 1

The diversity measure UniFrac is a pseudosemimetric.

Proof

We will prove that the diversity measure UniFrac satisfies Definition 1. Moreover, we will give an example where it does not satisfy the metric definition.

1. If $A = B$, then $A_{i} = B_{i}$ for all $i$ and $A_{t} = B_{t}$, so

I (P_{i}^{A} > 0) = I (P_{i}^{B} > 0)

for all $i$ , thus,

| I (P_{i}^{A} > 0) - I (P_{i}^{B} > 0) | = 0

therefore,

d^{u} (A, B) = 0

2. To prove symmetry, we consider

\begin{matrix} d^{u} (A, B) = \frac{\sum_{i = 1}^{n} d (l_{i}) | I (P_{i}^{A} > 0) - I (P_{i}^{B} > 0) |}{\sum_{i = 1}^{n} d (l_{i})} \\ = \frac{\sum_{i = 1}^{n} d (l_{i}) | I (P_{i}^{B} > 0) - I (P_{i}^{A} > 0) |}{\sum_{i = 1}^{n} d (l_{i})} \\ = d^{u} (B, A) \end{matrix}

Then the UniFrac diversity measure satisfies Definition 1. We now give examples in which it does not satisfy the definition of a metric.

1. Let $T_{A B}$ be the tree built from two different samples:

A = \{r, l_{1}\} \quad \text{and} \quad B = \{r, l_{2}\}

where $r$ is the root and $l_{1} \neq l_{2}$ (see Figure 2). Branch $1$ is the path from $r$ to $l_{1}$ and branch $2$ is the path from $r$ to $l_{2}$; we have

d^{u} (A, B) = \frac{d (l_{1}) | 1 - 1 | + d (l_{2}) | 1 - 1 |}{d (l_{1}) + d (l_{2})} = 0

However, we assumed that $l_{1} \neq l_{2}$. Hence, $d^{u} (A, B) = 0$ does not imply that $A = B$.

2. We consider the samples

A = {r, a}, B = {b} and C = {r, c}

and the trees $T_{A B}$ , $T_{A C}$ , and $T_{C B}$ built for samples $A$ and $B$ , $A$ and $C$ , and $C$ and $B$ , respectively (see Figure 3), where

l_{1} = {\hat{l}}_{1} = a, l_{2} = {\bar{l}}_{2} = b and {\hat{l}}_{2} = {\bar{l}}_{1} = c

Moreover, suppose that

d ({\bar{l}}_{1}) < d (l_{1})

(3)

Figure 2.

Tree for samples $A$ and $B$ .

Figure 3.

(a) Tree for samples $A$ and $B$ . (b) Tree for samples $A$ and $C$ . (c) Tree for samples $C$ and $B$ .

If we estimate $d^{u} (A, B)$ , $d^{u} (A, C)$ , and $d^{u} (C, B)$ we have the following:

\begin{matrix} d^{u} (A, B) = \frac{d (l_{1}) | 1 - 0 | + d (l_{2}) | 1 - 1 |}{d (l_{1}) + d (l_{2})} \\ = \frac{d (l_{1})}{d (l_{1}) + d (l_{2})} \end{matrix}

(4)

\begin{matrix} d^{u} (A, C) = \frac{d ({\hat{l}}_{1}) | 1 - 1 | + d ({\hat{l}}_{2}) | 1 - 1 |}{d ({\hat{l}}_{1}) + d ({\hat{l}}_{2})} \\ = \frac{0}{d ({\hat{l}}_{1}) + d ({\hat{l}}_{2})} = 0 \end{matrix}

(5)

\begin{matrix} d^{u} (C, B) = \frac{d ({\bar{l}}_{1}) | 1 - 0 | + d ({\bar{l}}_{2}) | 1 - 1 |}{d ({\bar{l}}_{1}) + d ({\bar{l}}_{2})} \\ = \frac{d ({\bar{l}}_{1})}{d ({\bar{l}}_{1}) + d ({\bar{l}}_{2})} \end{matrix}

(6)

If the triangle inequality were satisfied, then by expressions (4) to (6) we would have

\frac{d (l_{1})}{d (l_{1}) + d (l_{2})} \leq 0 + \frac{d ({\bar{l}}_{1})}{d ({\bar{l}}_{1}) + d ({\bar{l}}_{2})}

which, since $l_{2} = {\bar{l}}_{2}$ gives $d (l_{2}) = d ({\bar{l}}_{2})$, necessarily implies

d (l_{1}) \leq d ({\bar{l}}_{1})

This contradicts assumption (3). Hence, the triangle inequality is not satisfied.
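A quick numerical check of this counterexample, as a Python sketch: the branch lengths below are hypothetical and were chosen only so that assumption (3) holds; the three values are computed from the closed forms (4) to (6).

```python
from fractions import Fraction

# Hypothetical branch lengths for the sequences a, b, c.
# d_c < d_a is assumption (3), since bar l_1 = c and l_1 = a.
d_a, d_b, d_c = Fraction(2), Fraction(1), Fraction(1)

d_AB = d_a / (d_a + d_b)   # expression (4): l_1 = a, l_2 = b
d_AC = Fraction(0)         # expression (5)
d_CB = d_c / (d_c + d_b)   # expression (6): bar l_1 = c, bar l_2 = b

print(d_AB, d_AC + d_CB)       # 2/3 1/2
print(d_AB <= d_AC + d_CB)     # False: the triangle inequality fails
```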

Weighted UniFrac

The weighted UniFrac was proposed by Lozupone et al² and is defined by

d^{w} (A, B) = \sum_{i = 1}^{n} d (l_{i}) | P_{i}^{A} - P_{i}^{B} |

(7)

It uses information about the abundance of the genetic material sequences. A long branch indicates fast evolution, and it can influence $d^{w} (A, B)$ more than the other branches.

Proposition 2

The weighted UniFrac is a pseudosemimetric.

Proof

We will prove that the weighted UniFrac satisfies Definition 1.

1. If we suppose that $A = B$, we have $A_{i} = B_{i}$ for all $i$ and $A_{t} = B_{t}$, so

P_{i}^{A} = P_{i}^{B}, for all i

thus,

| P_{i}^{A} - P_{i}^{B} | = 0, for all i

Therefore,

d^{w} (A, B) = 0

2. To prove symmetry, we consider

\begin{matrix} d^{w} (A, B) = \sum_{i = 1}^{n} d (l_{i}) | P_{i}^{A} - P_{i}^{B} | \\ = \sum_{i = 1}^{n} d (l_{i}) | P_{i}^{B} - P_{i}^{A} | \\ = d^{w} (B, A) \end{matrix}

Then, the weighted UniFrac satisfies Definition 1, but it does not satisfy the definition of a metric, as the following examples show.

1. Consider the different samples

A = \{r, a, l_{1}\} \quad \text{and} \quad B = \{r, b, l_{2}\}

and the tree $T_{A B}$ built for samples $A$ and $B$ (see Figure 4) that satisfies the following conditions:

A_{t} = B_{t} = 3 \quad \text{and} \quad A_{1} = B_{1} = A_{2} = B_{2} = 2

Therefore,

\begin{matrix} d^{w} (A, B) = d (l_{1}) | \frac{2}{3} - \frac{2}{3} | + d (l_{2}) | \frac{2}{3} - \frac{2}{3} | \\ = 0 \end{matrix}

with $A \neq B$.

2. Consider the samples

\begin{array}{l} A = {r, a_{1}, a_{2}}, B = {r, b_{1}, b_{2}, b_{2}, b_{2}} \\ and C = {r, r, c_{1}, c_{1}, c_{1}, c_{1}, c_{1}, c_{1}, c_{2}, c_{2}} \end{array}

and the trees $T_{A B}$ , $T_{A C}$ , and $T_{C B}$ built for samples $A$ and $B$ , $A$ and $C$ , and $C$ and $B$ , respectively (see Figure 5), where

l_{2} = {\hat{l}}_{2} = a_{1}, l_{3} = {\hat{l}}_{3} = a_{2}, l_{1} = {\bar{l}}_{1} = b_{2}, {\bar{l}}_{2} = b_{1} and {\hat{l}}_{1} = c_{1}

(8)

and also, we assume

d (l_{2}) = d (l_{3})

(9)

d (l_{1}) > d ({\hat{l}}_{1})

(10)

Figure 4.

Tree for samples $A$ and $B$ .

Figure 5.

(a) Tree for samples $A$ and $B$ . (b) Tree for samples $A$ and $C$ . (c) Tree for samples $C$ and $B$ .

From equations (8) and (9), we have that

d ({\hat{l}}_{2}) = d ({\hat{l}}_{3}) = d (l_{2}) = d (l_{3})

If we estimate $d^{w} (A, B)$ , $d^{w} (A, C)$ , and $d^{w} (C, B)$ , we have the following:

\begin{matrix} d^{w} (A, B) = d (l_{1}) | \frac{1}{3} - \frac{4}{5} | + d (l_{2}) | \frac{2}{3} - \frac{2}{5} | \\ + d (l_{3}) | \frac{2}{3} - \frac{2}{5} | \\ = d (l_{1}) \frac{7}{15} + d (l_{2}) \frac{8}{15} \end{matrix}

(11)

\begin{matrix} d^{w} (A, C) = d ({\hat{l}}_{1}) | \frac{1}{3} - \frac{4}{5} | + d ({\hat{l}}_{2}) | \frac{2}{3} - \frac{2}{5} | \\ + d ({\hat{l}}_{3}) | \frac{2}{3} - \frac{2}{5} | \\ = d ({\hat{l}}_{1}) \frac{7}{15} + d ({\hat{l}}_{2}) \frac{8}{15} \end{matrix}

(12)

\begin{array}{l} d^{w} (C, B) = d ({\bar{l}}_{1}) | \frac{4}{5} - \frac{4}{5} | \\ + d ({\bar{l}}_{2}) | \frac{2}{5} - \frac{2}{5} | = 0 \end{array}

(13)

If the triangle inequality were satisfied, then by equalities (11) to (13) we would have

d (l_{1}) \frac{7}{15} + d (l_{2}) \frac{8}{15} \leq d ({\hat{l}}_{1}) \frac{7}{15} + d ({\hat{l}}_{2}) \frac{8}{15} + 0

from which, since $d ({\hat{l}}_{2}) = d (l_{2})$, we get

d (l_{1}) \leq d ({\hat{l}}_{1})

This contradicts assumption (10), so the weighted UniFrac does not satisfy the triangle inequality.

We proved that the weighted UniFrac satisfies Definition 1; however, it is not a metric.
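The second counterexample can be checked numerically from the raw samples with the Python sketch below. Because Figure 5 is not reproduced here, the branch memberships are reconstructed from the proportions appearing in (11) to (13) and should be read as an assumption; the branch lengths are hypothetical values satisfying (9) and (10).

```python
from collections import Counter
from fractions import Fraction

def proportions(sample, branches):
    """P_i for each branch: share of `sample` lying on that branch."""
    counts, total = Counter(sample), len(sample)
    return [Fraction(sum(counts[v] for v in br), total) for br in branches]

def weighted_unifrac(A, B, branches, lengths):
    """Weighted UniFrac as in expression (7)."""
    pa, pb = proportions(A, branches), proportions(B, branches)
    return sum(d * abs(x - y) for d, x, y in zip(lengths, pa, pb))

A = ["r", "a1", "a2"]
B = ["r", "b1", "b2", "b2", "b2"]
C = ["r", "r"] + ["c1"] * 6 + ["c2"] * 2

# Branch memberships inferred from the proportions in (11)-(13).
T_AB = [["r", "b2"], ["r", "b1", "a1"], ["r", "b1", "a2"]]
T_AC = [["r", "c1"], ["r", "c2", "a1"], ["r", "c2", "a2"]]
T_CB = [["r", "c1", "b2"], ["r", "c2", "b1"]]

# Hypothetical lengths: d(l_2) = d(l_3) = 1 (9), d(l_1) = 3 > d(hat l_1) = 1 (10).
d_AB = weighted_unifrac(A, B, T_AB, [3, 1, 1])
d_AC = weighted_unifrac(A, C, T_AC, [1, 1, 1])
d_CB = weighted_unifrac(C, B, T_CB, [3, 1])

print(d_AB, d_AC, d_CB)        # 29/15 1 0
print(d_AB <= d_AC + d_CB)     # False: the triangle inequality fails
```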

Normalized weighted UniFrac

The normalized weighted UniFrac was proposed by Lozupone et al² and it is given by

d_{n}^{w} (A, B) = \frac{\sum_{i = 1}^{n} d (l_{i}) | P_{i}^{A} - P_{i}^{B} |}{D}

(14)

where the normalizing factor is

D = \sum_{j = 1}^{m} d (j) (Q_{j}^{A} + Q_{j}^{B})

(15)

with $m$ the number of different sequences in $A \cup B$ and $d (j)$ the distance from the root to the sequence $j \in (A \cup B)$ ; furthermore,

Q_{j}^{A} = \frac{α_{j}}{A_{t}} and Q_{j}^{B} = \frac{β_{j}}{B_{t}}

(16)

where $α_{j}$ and $β_{j}$ are the number of times that the sequence $j$ is observed in samples $A$ and $B$ , respectively.

Example 3

In Example 1, $A \cup B = \{a_{1}, a_{2}, b_{1}, b_{2}\}$, where the sequence proportions in sample $A$ are

Q_{a_{1}}^{A} = \frac{2}{4}, Q_{a_{2}}^{A} = \frac{1}{4}, Q_{b_{1}}^{A} = 0, Q_{b_{2}}^{A} = 0

and the sequence proportions in sample $B$ are

Q_{a_{1}}^{B} = 0, Q_{a_{2}}^{B} = 0, Q_{b_{1}}^{B} = \frac{1}{3}, Q_{b_{2}}^{B} = \frac{1}{3}

The normalized weighted UniFrac is less sensitive to branches with a long length and is determined by branches with different proportions.

Proposition 3

The normalized weighted UniFrac is a pseudosemimetric.

Proof

We will prove that the normalized weighted diversity measure UniFrac satisfies Definition 1.

1. Analogous to item 1 of Proposition 2, if $A = B$ we have

d_{n}^{w} (A, B) = 0

2. To prove symmetry, we consider

\begin{matrix} d_{n}^{w} (A, B) = \frac{\sum_{i = 1}^{n} d (l_{i}) | P_{i}^{A} - P_{i}^{B} |}{\sum_{j = 1}^{m} d (j) (Q_{j}^{A} + Q_{j}^{B})} \\ = \frac{\sum_{i = 1}^{n} d (l_{i}) | P_{i}^{B} - P_{i}^{A} |}{\sum_{j = 1}^{m} d (j) (Q_{j}^{B} + Q_{j}^{A})} \\ = d_{n}^{w} (B, A) \end{matrix}

Thus, the normalized weighted UniFrac diversity measure satisfies Definition 1. We now give examples in which it does not satisfy the definition of a metric:

1. We consider the first counterexample from Proposition 2, where $A \neq B$. Therefore,

d_{n}^{w} (A, B) = 0

2. Consider the samples

\begin{array}{l} A = {r, a_{1}, a_{2}}, B = {r, b_{1}, b_{1}, b_{1}, b_{2}} \\ and C = {r, r, c_{1}, c_{1}, c_{1}, c_{1}, c_{1}, c_{1}, c_{2}, c_{2}} \end{array}

and the trees $T_{A B}$ , $T_{A C}$ , and $T_{C B}$ built for samples $A$ and $B$ , $A$ and $C$ , and $C$ and $B$ , respectively (see Figure 6), where

l_{1} = b_{1}, l_{2} = {\hat{l}}_{2} = a_{1}, l_{3} = {\hat{l}}_{3} = a_{2}, {\bar{l}}_{1} = {\hat{l}}_{1} = c_{1}, and {\bar{l}}_{2} = c_{2}

(17)

and additionally assume

d (l_{3}) = d (l_{2})

(18)

d (l_{1}) = d (b_{2})

d ({\bar{l}}_{2}) = d ({\bar{l}}_{1})

d (l_{1}) < d ({\hat{l}}_{1})

(19)

Figure 6.

(a) Tree by samples $A$ and $B$ . (b) Tree by samples $A$ and $C$ . (c) Tree by samples $C$ and $B$ .

Note that equations (17) and (18) imply that

d (l_{3}) = d (l_{2}) = d ({\hat{l}}_{2}) = d ({\hat{l}}_{3})

Thus,

\begin{array}{l} d_{n}^{w} (A, B) \\ = \frac{d (l_{1}) | \frac{1}{3} - \frac{4}{5} | + d (l_{2}) | \frac{2}{3} - \frac{2}{5} | + d (l_{3}) | \frac{2}{3} - \frac{2}{5} |}{d (l_{1}) \frac{3}{5} + d (b_{2}) \frac{1}{5} + d (l_{3}) \frac{1}{3} + d (l_{2}) \frac{1}{3}} \\ = \frac{d (l_{1}) \frac{7}{15} + d (l_{2}) \frac{8}{15}}{d (l_{1}) \frac{4}{5} + d (l_{2}) \frac{2}{3}} \end{array}

(20)

\begin{array}{l} d_{n}^{w} (A, C) \\ = \frac{d ({\hat{l}}_{1}) | \frac{1}{3} - \frac{4}{5} | + d ({\hat{l}}_{2}) | \frac{2}{3} - \frac{2}{5} | + d ({\hat{l}}_{3}) | \frac{2}{3} - \frac{2}{5} |}{d (c_{2}) \frac{1}{5} + d ({\hat{l}}_{1}) \frac{3}{5} + d ({\hat{l}}_{2}) \frac{1}{3} + d ({\hat{l}}_{3}) \frac{1}{3}} \\ = \frac{d ({\hat{l}}_{1}) \frac{7}{15} + d ({\hat{l}}_{2}) \frac{8}{15}}{d ({\hat{l}}_{1}) \frac{4}{5} + d ({\hat{l}}_{2}) \frac{2}{3}} \end{array}

(21)

\begin{array}{l} d_{n}^{w} (C, B) \\ = \frac{d ({\bar{l}}_{1}) | \frac{4}{5} - \frac{4}{5} | + d ({\bar{l}}_{2}) | \frac{2}{5} - \frac{2}{5} |}{\tilde{D}} = 0 \end{array}

(22)

with $\tilde{D}$ the respective normalizing factor. If the triangle inequality were satisfied, then by equalities (20)-(22) we would have

\frac{d (l_{1}) \frac{7}{15} + d (l_{2}) \frac{8}{15}}{d (l_{1}) \frac{4}{5} + d (l_{2}) \frac{2}{3}} \leq \frac{d ({\hat{l}}_{1}) \frac{7}{15} + d ({\hat{l}}_{2}) \frac{8}{15}}{d ({\hat{l}}_{1}) \frac{4}{5} + d ({\hat{l}}_{2}) \frac{2}{3}} + 0

from which, since $d ({\hat{l}}_{2}) = d (l_{2})$ and each side is a decreasing function of the length of its first branch, we get

d ({\hat{l}}_{1}) \leq d (l_{1})

This contradicts assumption (19). Therefore, the normalized weighted UniFrac diversity measure does not satisfy the triangle inequality.
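As a numerical illustration of this last step, the Python sketch below evaluates the closed forms (20) to (22) for hypothetical branch lengths that satisfy assumption (19). The helper name `normalized_value` and the chosen numbers are assumptions made for this example only.

```python
from fractions import Fraction as F

def normalized_value(d1, d2):
    """Final expression in (20) and (21): d1 is the length of branch 1 and
    d2 the common length of branches 2 and 3."""
    return (d1 * F(7, 15) + d2 * F(8, 15)) / (d1 * F(4, 5) + d2 * F(2, 3))

# Hypothetical lengths with d(l_1) < d(hat l_1), as in assumption (19),
# and d(l_2) = d(hat l_2).
d_l1, d_hat_l1, d_l2 = F(1), F(3, 2), F(2)

d_AB = normalized_value(d_l1, d_l2)       # expression (20)
d_AC = normalized_value(d_hat_l1, d_l2)   # expression (21)
d_CB = F(0)                               # expression (22)

print(d_AB, d_AC)              # 23/32 53/76
print(d_AB <= d_AC + d_CB)     # False: the triangle inequality fails
```

The value decreases as the first branch gets longer because the normalizing factor grows faster than the numerator, which is why the shorter branch $l_{1}$ yields the larger distance.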

We proved that the normalized weighted UniFrac diversity measure is a pseudosemimetric. Next, we calculate the UniFrac diversity measures on a tree illustrated by McClelland and Koslicki¹⁶ and compare them with EMDUniFrac.

EMDUniFrac

Based on Evans and Matsen,¹⁰ EMDUniFrac was proposed by McClelland and Koslicki.¹⁶ Given two samples $A$ and $B$ of genetic material and their associated abundances, we can estimate two probability distributions $P$ and $Q$ on their phylogenetic tree $T$ that represent the fraction of each sample that appears at each node of $T$. Let $D$ be the matrix of all pairwise distances between nodes in $T$ and let $Γ (P, Q)$ describe the set of all ways in which one community can be transformed into the other. The $(i, j)$th entry $M_{i, j}$ of $M \in Γ (P, Q)$ indicates the total abundance that is moved from node $i$ in sample $P$ to node $j$ in sample $Q$. In this way, EMDUniFrac is given by

\mathrm{EMDUniFrac} (P, Q) = \min_{M \in Γ (P, Q)} \sum_{i, j \in T} D_{i, j} M_{i, j}

It represents the minimum amount of ‘work’ required to transform the distribution $P$ into the distribution $Q$ along the phylogenetic tree. It has previously been shown that $\mathrm{EMDUniFrac} (P, Q)$ is equivalent to the weighted UniFrac distance when the sample size is large enough.¹⁰ However, we will give examples where the EMDUniFrac distance and the UniFrac diversity measures differ.
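On a tree, this minimization does not have to be solved as a general transport problem. The Python sketch below uses a standard closed form for optimal transport on a tree (each edge contributes its length times the absolute net mass that must cross it), accumulated from the leaves toward the root. It is an illustrative sketch under that assumption, not the published EMDUniFrac implementation; the tree, the distributions, and the name `tree_emd` are hypothetical.

```python
from fractions import Fraction as F

def tree_emd(parent, P, Q):
    """Earth mover's distance between node distributions P and Q on a
    rooted tree. `parent` maps each non-root node to (parent, edge_length)."""
    nodes = set(parent) | {p for p, _ in parent.values()}
    diff = {v: P.get(v, 0) - Q.get(v, 0) for v in nodes}  # net mass per node

    def depth(v):
        n = 0
        while v in parent:
            v = parent[v][0]
            n += 1
        return n

    total = 0
    # Deepest nodes first, so a subtree's net mass is complete before it is
    # pushed across the edge to its parent.
    for v in sorted(parent, key=depth, reverse=True):
        p, length = parent[v]
        total += length * abs(diff[v])
        diff[p] += diff[v]
    return total

# Hypothetical tree: root r with children u and l3; u has children l1, l2.
parent = {"u": ("r", 1), "l1": ("u", 1), "l2": ("u", 1), "l3": ("r", 2)}
P = {"l1": F(1, 2), "l2": F(1, 2)}
Q = {"l3": F(1)}
print(tree_emd(parent, P, Q))  # 4: every unit of mass travels distance 4
```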

Consider the tree $T$ in Figure 1(b) of McClelland and Koslicki,¹⁶ where $\mathrm{EMDUniFrac} (P, Q)$ is $0.2333$. We calculate the UniFrac diversity measure on $T$:

\begin{matrix} d^{u} (A, B) = \frac{d (l_{1}) (1) + d (l_{2}) (1) + d (l_{3}) (0) + d (l_{4}) (1)}{\frac{6}{5}} \\ = \frac{\frac{3}{10} (3)}{\frac{6}{5}} = \frac{3}{4} = 0.75 \end{matrix}

Thus,

\mathrm{EMDUniFrac} (P, Q) \neq d^{u} (A, B)

both on $T$.

It is important to note that the sample size is very small.

The weighted diversity measure UniFrac (see expression 7) on the tree $T$ is

\begin{array}{l} d^{w} (A, B) = d (l_{1}) | P_{1}^{A} - P_{1}^{B} | + d (l_{2}) | P_{2}^{A} - P_{2}^{B} | \\ + d (l_{3}) | P_{3}^{A} - P_{3}^{B} | + d (l_{4}) | P_{4}^{A} - P_{4}^{B} | \\ = \frac{3}{10} | 0 - \frac{1}{3} | + \frac{3}{10} | \frac{1}{2} - 0 | + \frac{3}{10} | \frac{1}{2} - \frac{1}{3} | \\ + \frac{3}{10} | 0 - \frac{2}{3} | \\ = \frac{3}{10} (\frac{1}{3} + \frac{1}{2} + \frac{1}{6} + \frac{2}{3}) = \frac{1}{2} = 0.5 \end{array}

thus, we can see that

\mathrm{EMDUniFrac} (P, Q) \neq d^{w} (A, B)

Now, we obtain the normalized weighted UniFrac value as

\begin{array}{l} d_{n}^{w} (A, B) \\ = \frac{\frac{1}{2}}{d (1) \frac{1}{3} + d (2) \frac{1}{2} + d (3) \frac{1}{2} + d (4) \frac{1}{3} + d (5) (0) + d (6) \frac{1}{3}} \\ = \frac{\frac{1}{2}}{\frac{3}{10} \frac{1}{3} + \frac{3}{10} \frac{1}{2} + \frac{3}{10} \frac{1}{2} + \frac{3}{10} \frac{1}{3} + \frac{1}{5} (0) + \frac{1}{5} \frac{1}{3}} \\ = \frac{\frac{1}{2}}{\frac{1}{10} + \frac{3}{20} + \frac{3}{20} + \frac{1}{10} + \frac{1}{15}} = \frac{\frac{1}{2}}{\frac{1}{2} + \frac{1}{15}} = \frac{15}{17} = 0.8823 \end{array}

Thus,

\mathrm{EMDUniFrac} (P, Q) \neq d_{n}^{w} (A, B)

Therefore, the UniFrac diversity measures and EMDUniFrac differ. Hence, the UniFrac diversity measures are not equal to $\mathrm{EMDUniFrac} (P, Q)$ if the sample size is not large enough.
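The three values computed in this section can be re-derived with exact rational arithmetic. The Python sketch below only replays the numbers that appear in the displayed calculations (branch lengths, indicator differences, proportions, and normalizing terms); the value $0.2333$ is the EMDUniFrac figure quoted from McClelland and Koslicki and is not recomputed here.

```python
from fractions import Fraction as F

lengths = [F(3, 10)] * 4              # d(l_1), ..., d(l_4)
indicator_diff = [1, 1, 0, 1]         # |I(P_i^A > 0) - I(P_i^B > 0)|
P_A = [F(0), F(1, 2), F(1, 2), F(0)]
P_B = [F(1, 3), F(0), F(1, 3), F(2, 3)]

unweighted = sum(d * i for d, i in zip(lengths, indicator_diff)) / sum(lengths)
weighted = sum(d * abs(a - b) for d, a, b in zip(lengths, P_A, P_B))

# Normalizing factor, term by term as in the displayed calculation.
D = (F(3, 10) * F(1, 3) + F(3, 10) * F(1, 2) + F(3, 10) * F(1, 2)
     + F(3, 10) * F(1, 3) + F(1, 5) * 0 + F(1, 5) * F(1, 3))
normalized = weighted / D

emd_reported = 0.2333                 # value quoted in the text
for name, value in [("UniFrac", unweighted), ("weighted", weighted),
                    ("normalized weighted", normalized)]:
    print(name, value, float(value) != emd_reported)
```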

On the other hand, consider again the tree $T$ in Figure 1(b) of McClelland and Koslicki,¹⁶ now built for the different samples $A = \{3, 4, 5, 7\}$ and $B = \{1, 2, 6, 7\}$. We calculate the weighted UniFrac measure on $T$:

\begin{array}{l} d^{w} (A, B) = d (l_{1}) | P_{1}^{A} - P_{1}^{B} | + d (l_{2}) | P_{2}^{A} - P_{2}^{B} | \\ + d (l_{3}) | P_{3}^{A} - P_{3}^{B} | + d (l_{4}) | P_{4}^{A} - P_{4}^{B} | \\ = \frac{3}{10} | \frac{2}{4} - \frac{2}{4} | + \frac{3}{10} | \frac{2}{4} - \frac{2}{4} | + \frac{3}{10} | \frac{2}{4} - \frac{2}{4} | \\ + \frac{3}{10} | \frac{2}{4} - \frac{2}{4} | \\ = \frac{3}{10} (0) = 0 \end{array}

However, samples $A$ and $B$ are different. Hence, $A \neq B$ does not imply that $d^{w} (A, B) \neq 0$.

Conclusions

In this article, we proved that the diversity measures UniFrac, weighted UniFrac, and normalized weighted UniFrac satisfy non-negativity, symmetry, and the implication that if the samples are equal then the diversity measure is zero. On the other hand, we presented examples in which these diversity measures do not satisfy the definition of a metric. Hence, the diversity measures satisfy the definition of a pseudosemimetric but are not metrics.

Although the UniFrac measures are used in microbiology to measure the proximity between sufficiently large samples of genetic material and perform well there, as reported in the literature,³⁻⁷ they are not appropriate for that purpose when the sample size is small. This is due to the lack of the properties mentioned above, as Schloss pointed out. In section “EMDUniFrac,” we saw examples where the UniFrac diversity measures and EMDUniFrac differ; thus, the UniFrac diversity measures are not equivalent to EMDUniFrac if the sample size is not large enough. Furthermore, the weighted UniFrac of two different small samples can be zero. An alternative to the UniFrac diversity measures is therefore the Kantorovich-Rubinstein metric¹⁰ or EMDUniFrac metric.¹⁸⁻²⁰

Footnotes

Acknowledgements

This article was developed under Project PRODEP UV-PTC-779 of the Mexican Government.

Funding:

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests:

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Author Contributions

JARA performed the analytic calculations and MLAG supervised the project. Both authors, JARA and MLAG, contributed to the final version of the manuscript.

ORCID iD

Martha Lorena Avendaño Garrido

References

1. Lozupone, Knight. UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol. 2005;71:8228–8235.

2. Lozupone, Hamady, Kelley, Knight. Quantitative and qualitative β diversity measures lead to different insights into factors that structure microbial communities. Appl Environ Microbiol. 2007;73:1576–1585.

3. Frank, Amand ALS, Feldman, Boedeker, Harpaz, Pace NR. Molecular-phylogenetic characterization of microbial community imbalances in human inflammatory bowel diseases. Proc Natl Acad Sci U S A. 2007;104:13780–13785.

4. Costello, Lauber, Hamady, Fierer, Gordon, Knight. Bacterial community variation in human body habitats across space and time. Science. 2009;326:1694–1697.

5. Charlson, Bittinger, Haas, et al. Topographical continuity of bacterial populations in the healthy human respiratory tract. Am J Respir Crit Care Med. 2011;184:957–963.

6. Ley, Bäckhed, Turnbaugh, Lozupone, Knight, Gordon JI. Obesity alters gut microbial ecology. Proc Natl Acad Sci U S A. 2005;102:11070–11075.

7. Chang, Luan, Sun. Variance adjusted weighted UniFrac: a powerful beta diversity measure for comparing communities based on phylogeny. BMC Bioinformatics. 2011;12:118.

8. Chen, Bittinger, Charlson, et al. Associating microbiome composition with environmental covariates using generalized UniFrac distances. Bioinformatics. 2012;28:2106–2113.

9. Schloss PD. Evaluating different approaches that test whether microbial communities have the same structure. ISME J. 2008;2:265.

10. Evans, Matsen FA. The phylogenetic Kantorovich-Rubinstein metric for environmental sequence samples. J R Stat Soc Series B Stat Methodol. 2012;74:569–592.

11. Rachev ST. Probability Metrics and the Stability of Stochastic Models, Volume 269. Hoboken, NJ: John Wiley & Sons Ltd; 1991.

12. Rachev, Rüschendorf. Mass Transportation Problems, Volume I: Probability and Its Applications. New York, NY: Springer; 1998.

13. Villani. Topics in Optimal Transportation. Providence, RI: American Mathematical Society; 2003.

14. Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Berlin, Germany: Springer; 2008.

15. Levina, Bickel. The earth mover’s distance is the Mallows distance: some insights from statistics. In: Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, vol. 2. New York, NY: IEEE; 2001:251–256.

16. McClelland, Koslicki. EMDUnifrac: exact linear time computation of the Unifrac metric and identification of differentially abundant organisms. J Math Biol. 2018;77:935–949.

17. Warnow. Computational Phylogenetics. An Introduction to Designing Methods for Phylogeny Estimation. Cambridge, UK: Cambridge University Press; 2017.

18. Srinivasan, Hoffman, Morgan, et al. Bacterial communities in women with bacterial vaginosis: high resolution phylogenetic analyses reveal relationships of microbiota to clinical criteria. PLoS ONE. 2012;7:e37818.

19. Smith, McAndrew, Chen, et al. The cervical microbiome over 7 years and a comparison of methodologies for its characterization. PLoS ONE. 2012;7:e40425.

20. Livermore, Mattes TE. Phylogenetic detection of novel cryptomycota in an Iowa (United States) aquifer and from previously collected marine and freshwater targeted high-throughput sequencing sets. Environ Microbiol. 2013;15:2333–2341.

A Commentary on Diversity Measures UniFrac in Very Small Sample Size
