Sage Journals: Discover world-class research

Abstract

Topological summary tools such as persistence landscapes have greatly enhanced the practical use of topological data analysis for large-scale, noisy, and complex datasets. A central element of persistence landscape usage involves computing the top- $k$ landscapes. This article presents a novel output-sensitive plane sweep algorithm for computing the top- $k$ persistence landscapes in optimal time and space, significantly outperforming previous algorithms. Our algorithm can determine in optimal $O (n \log n)$ time whether a given birth-death pair appears in the top- $k$ landscapes. The runtime performance of the approach is evaluated on a botnet dataset and several synthetically generated point cloud topologies, showing that the algorithm can achieve significant speedups for these datasets due to its better algorithmic design. Without filtering, the runtime ranges from slightly worse (in some extreme examples) to equal compared to previous works while returning exactly the same output; with filtering, the algorithm is significantly faster (15x when removing 75% of birth-death pairs). Filtering is shown to maintain machine learning performance on both synthetically generated and real-world datasets while providing orders of magnitude speedup depending on how intensive the filtering is. Due to the introduced algorithm’s design, the speedup is greater when filtering using the introduced birth-death filtering algorithm. The software is freely provided online in Rust with Python bindings.

Keywords

Mining methods and algorithms, data mining, topological data analysis, computational geometry, computational topology

Introduction

Topological data analysis (TDA) is the application of topological functions and methods to datasets in order to gain greater insight into their properties. It is based on work in algebraic topology and promises to provide more robust methods than those based on geometric metrics due to its focus on properties that persist under continuous deformation. It has been shown to be an effective way of gaining new insight into datasets.^1–3 Because TDA has produced promising initial results in fields that traditionally struggle with noisy, high-dimensional, and incomplete datasets, research interest in it is increasing.

TDA has been shown in some cases to give data scientists a new way to understand and explore the underlying structure of their data by allowing them to compute the topology of the data. Since its initial discovery, different data structures and techniques with varying properties have been presented, driving rising interest in their application in data science and machine learning. However, how best to incorporate this information into machine learning remains an open problem.

The discovery of persistent homology allowed the ideas from algebraic topology to be applied to the point clouds seen in data science and machine learning.^4,5 Persistent homology can be represented in many different forms. Bubenik et al.⁶ introduced a representation that holds desirable statistical properties: the persistence landscape (PL). Among these properties is stability under perturbation, which is useful in machine learning. Persistent homology allowed researchers to apply TDA techniques to datasets and use existing statistical machine learning algorithms. As topology captures the invariants of a dataset under continuous deformation, these properties have been observed to generalize better than geometric properties in some problem domains.^1,7

Machine learning pipelines using PL have performed competitively in some time series classification for networking datasets,⁷ and time series prediction in financial markets.^2,3 Unfortunately, the algorithms for computing PL are not well suited for larger datasets. The best algorithm for computing the top- $k$ PLs requires tracking every birth-death pair during its lifespan. There is room for improvement as many short-lived birth-death pairs exist with no chance of making it to the top- $k$ landscapes. This is especially true when noise is present in a dataset: one of the domains where the technique found application.

This article introduces novel algorithms to address this issue. The presented algorithms have a lower time complexity than the previous state-of-the-art. They perform only a constant number of extraneous operations for each returned landscape segment, making them output sensitive. When an algorithm’s output size can vary greatly, its time complexity is better expressed in terms of both its input and output size; such algorithms are called output-sensitive algorithms and are common in computational geometry. Thus, the algorithm sees the greatest relative speedup when the top- $k$ landscapes returned are significantly smaller than the total number of landscapes (e.g. noisy datasets), but even in unfavorable conditions it performs on par with the previous best.

This output-sensitive algorithm is made possible by taking advantage of a new relationship between persistence pairs and PLs. Due to this relationship, we provide an algorithm that, given $n$ birth-death pairs, can determine in optimal $O (n \log n)$ time whether a given birth-death pair appears in the top- $k$ landscapes.

This article’s novel algorithm for PL generation from a set of birth-death pairs is the new state-of-the-art in terms of asymptotic complexity when computing the top- $k$ landscapes. Our algorithms are validated on synthetic and real-world datasets. A comparison with the previous best for PL generation is provided with a varying size of top- $k$ landscapes returned.

The work presented here provides the following novel contributions:

Improving the computation of PL by making only a single pass over the data with optimal time complexity (this is of crucial concern when the entire dataset cannot fit in memory).

Filtering the persistence pairs to remove noise (not discussed by Bubenik et al.⁸), which is important for PL’s use in noisy datasets, an area where PL has been applied successfully.

Quantifying the speedups that can be expected when limiting the number of landscapes returned (using synthetic and real-world datasets).

Providing an implementation of the algorithms in Rust with Python bindings.

This article proceeds as follows: the “Related work” section discusses related work in TDA, the “Background” section provides a background of the elements of TDA, the “Algorithms” section presents the novel algorithm for PL generation and its runtime analysis, the “Persistence pairs filtering” section presents the persistence pairs filtering algorithm and its runtime analysis, the “Experiments” section contains experiments comparing this algorithm against the previous best on synthetic and real-world datasets, and the “Conclusion” section provides a conclusion and summary of the results.

Related work

TDA is concerned with calculating topological invariants of datasets. Topological invariants from algebraic topology are properties of spaces that are unchanged under continuous deformation. The use of global structures contrasts with geometric properties that are concerned with local structures in the data. This enables TDA techniques to be resilient to high noise levels in the data. The most common invariant to track in TDA is persistent homology. The seminal work on persistent homology^4,5,9 enabled the application of algebraic topology to point cloud data. Persistent homology gives a multiscale interpretation of the data, allowing its topology to be computed and tracked at varying levels of connectedness. Other topological structures have built on persistent homology to add other properties of interest. One of these extensions, the PL,⁶ is stable under perturbation and lives in a Banach or Hilbert space, allowing it to be used in statistical machine learning models. Bubenik et al.⁶ also introduced the $λ_{k}$ function, the $k$th highest PL, when originally presenting PLs. The top- $k$ PLs are defined as the set $\{λ_{i} \mid i \leq k\}$ . Using the top- $k$ PLs gives greater robustness to noise in the dataset by ignoring short-lived topological properties (see “Persistence landscapes” section for more detail).

PLs have found applications in many domains including financial market prediction,^2,3 improving deep learning,^1,10 brain artery trees,¹¹ music recognition,¹² similarity of pore-geometry in nanoporous materials,¹³ and functional networks.¹⁴

There are other TDA methods that capture persistent homology, including persistence images,¹⁵ which have seen success in time series forecasting,^16,17 3D shape segmentation,¹⁸ and studying point clouds.¹⁹ Persistence images embed topological properties in matrices, which allows machine learning techniques from computer vision to be applied.

Bubenik et al.,⁸ who originally introduced PLs,⁶ presented a method for generating PLs from a set of birth-death pairs. They accomplish this by performing a sort on the set of birth-death pairs, followed by a single scan of the ordered set of pairs for each landscape to be generated. This results in a time of $O (n \log n + K n)$ , where $n$ is the number of birth-death pairs and $K$ is the number of PLs to compute. They also present an approximate, grid-based algorithm for the generation of the PLs. The grid-based algorithm lays an evenly spaced grid over the set of birth-death pairs and computes the PL values when they intersect the grid. This gives an approximate solution but has a better time than the exact solution. The approximate solution has time $O (M n \log n)$ , where $M + 1$ is the grid size and $n$ is the number of birth-death pairs.

Although the exact algorithm from Bubenik et al.⁸ has optimal worst-case time with respect to the input size of the problem, this comes at the expense of the more likely case that there are fewer than $O (n^{2})$ intersections in the output, where $n$ is the number of birth-death pairs. An algorithm that is tuned to the more likely case could perform faster on average. This is done by taking inspiration from fields such as computational geometry, where some algorithms are output-sensitive. Output-sensitive algorithms explore the idea that when the output size of a problem can vary greatly, even when the input size is held constant, it can be helpful to compare algorithms in terms of their output size and input size instead of their input size alone. This is the case when generating PLs, as the output size depends on the number of intersections between landscapes, a term that can grow as large as $O (n^{2})$ and quickly dominate the other variables in the analysis, where $n$ is the number of birth-death pairs.

There are also two other areas where unnecessary work can be avoided. First, the original algorithm considers every birth-death pair during the PL calculation. If only the top- $k$ landscapes are returned from the algorithm, where $k \ll n$ and $n$ is the number of birth-death pairs, then not every birth-death pair will appear in the PLs. When this is the case, filtering out the persistence pairs that can never appear in the top- $k$ landscapes saves time.

Additionally, the original, exact algorithm makes $K$ passes over the data. This amplifies the negative impact of the birth-death pairs that do not appear in the top- $k$ landscapes. There is also the case where a birth-death pair is in the top- $k$ landscapes but only appears in the $k$th landscape. In this case, the algorithm will scan over the pair $k - 1$ times before it is used. This gives a worst-case time of $O (n k)$ if $k$ landscapes are returned, where $n$ is the number of birth-death pairs. Bubenik et al.⁸ do briefly state that it is possible to modify their algorithm to compute only the top- $k$ landscapes, but this comment relates only to stopping their algorithm early. Their released package does not provide an API to implement this functionality.

Background

TDA is a group of techniques that take advantage of a dataset’s topology to gain further insight into the data. It has been used with datasets that are noisy, incomplete, and have high dimensionality. Additionally, algorithms for efficiently computing meaningful representations for point clouds have been introduced for statistical learning. A central topic of TDA and the technique used by PL is persistent homology. Persistent homology captures the geometry of a point cloud by quantifying the shape and lifetime of the various n-dimensional “holes” within the point cloud. For a comprehensive self-contained mathematical introduction to TDA see Postol.²⁰

Simplices and simplicial complex

The starting point in TDA is the analysis of simplicial complexes. A simplicial complex is the geometric counterpart of an abstract simplicial complex. An abstract simplicial complex is a family of sets that is closed under taking subsets (i.e. every subset of a member is also a member of the family). Informally, simplices generalize the notion of a line or triangle to $n$ dimensions. More formally, an $n$-simplex is a polytope of $n$ dimensions defined as the convex hull of its $n + 1$ vertices. A simplicial complex is similar to a graph where each subgraph of $n$ points that is a complete graph creates an $(n - 1)$-dimensional simplex. For each complete graph $G$ in the simplicial complex $C$ , each subgraph of $G$ is also a complete graph that forms its own simplex, which is also a part of $C$ . Simplices in a simplicial complex are connected along their shared lower-dimensional simplices. If no shared lower-dimensional simplices exist between two simplices, then they are not connected.
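The closure property of an abstract simplicial complex can be checked mechanically. The following is a minimal sketch, assuming simplices are given as tuples of vertex labels and that only non-empty faces are required (a common convention, not stated in the text); the function name `is_abstract_simplicial_complex` is illustrative.

```python
from itertools import combinations


def is_abstract_simplicial_complex(family):
    """Return True if every non-empty proper face of every simplex is in the family."""
    simplices = {frozenset(s) for s in family}
    for simplex in simplices:
        # Check faces of every size from single vertices up to one less than the simplex.
        for size in range(1, len(simplex)):
            for face in combinations(simplex, size):
                if frozenset(face) not in simplices:
                    return False
    return True


# A filled triangle with all of its edges and vertices is closed under subsets.
triangle = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
# Dropping one edge breaks the closure property.
missing_edge = [s for s in triangle if s != (1, 2)]

print(is_abstract_simplicial_complex(triangle))
print(is_abstract_simplicial_complex(missing_edge))
```

The second family fails because the 2-simplex `(0, 1, 2)` is present while its face `(1, 2)` is not.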

Filtrations

A filtration is a way of building up a simplicial complex starting from the empty set. Additionally, filtrations are indexed and totally ordered: $\emptyset = K_{0} \subseteq K_{1} \subseteq \dots \subseteq K_{n - 1} \subseteq K_{n} = K$ , where each $K_{i}$ is a simplicial complex. Filtrations are used in computational topology to extract topological properties from a set of points.

One standard filtration used in TDA is the Vietoris-Rips filtration, commonly known as the Rips filtration. The Rips filtration uses a set’s geometric properties to derive topological properties. Given a collection of points $S$ , define a threshold $t$ and a distance function $d (x, y)$ . The Rips filtration begins with the threshold set to $0$ and increases it until the filtration is completed. For the current threshold, every subset $s \subseteq S$ with $d (x, y) < t$ for all $x, y \in s$ forms a simplex in the corresponding subcomplex. This results in an ordered set of subcomplexes that encode a point cloud’s geometry.
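As an illustration of the Rips construction, the sketch below enumerates the simplices present at a single threshold by brute force. It assumes Euclidean distance and caps the simplex dimension with a hypothetical `max_dim` parameter; it is exponential in general and is meant only to make the definition concrete, not to reflect how TDA libraries compute the filtration.

```python
from itertools import combinations
from math import dist


def rips_complex(points, t, max_dim=2):
    """Simplices (as index tuples) whose vertices are pairwise closer than t."""
    simplices = []
    for size in range(1, max_dim + 2):  # a simplex of dimension m has m + 1 vertices
        for subset in combinations(range(len(points)), size):
            if all(dist(points[i], points[j]) < t for i, j in combinations(subset, 2)):
                simplices.append(subset)
    return simplices


# Corners of the unit square: sides have length 1, diagonals about 1.414.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
for t in (0.5, 1.1, 1.5):
    print(t, len(rips_complex(square, t)))
```

Increasing `t` only ever adds simplices, so the complexes at successive thresholds are nested, which is the ordering a filtration requires.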

$k$ -Skeleton

The $k$ -skeleton is a subset of a simplicial complex, where only simplices up to dimension $k$ are kept. When choosing the $k$ -skeleton to compute persistence over, note that the $2$ -skeleton, when using the Rips filtration, captures holes caused by loops in the time series. This allows the model to determine how the time series’ cyclic structure changes over time and is used by Postol et al.⁷ to get their promising results. Others have had success using other filtration processes, such as the lower star filtration, with which Liu et al.¹² showed promising results when classifying the instruments found in an audio track.

Persistent homology and persistent cohomology

Persistent homology and cohomology are both used in TDA, and, in most cases, produce equivalent results when their results are represented in a persistence representation.²¹ from a practitioner’s

They are used to transform an ordered set of simplicial complexes from a filtration into birth-death pairs. These birth-death pairs represent the first and last threshold from the filtration that a topological property was observed during the filtration. Persistent homology and cohomology can be computed in various dimensions, where the dimension determines the type of topological objects being tracked.

This is typically visualized in lower dimensions as follows: 0-d persistent homology represents connected components, 1-d persistent homology represents “holes,” and 2-d persistent homology represents “voids.” Although this does not easily generalize to higher dimensions, some readers may find the intuition helpful for understanding homology at large.

It should be noted that some features may not have a numerical death value (represented in our framework as a death of $\infty$ ) if the feature still exists at the end of the filtration.

Persistence representations

There are multiple ways to visualize the birth-death pairs. Barcodes are made by creating a number line for the threshold value. Then, for each birth-death pair, a line is created starting at the birth and ending at the death. Usually, these lines are drawn not to intersect each other to create a more pleasing visualization. Another persistence representation is the persistence diagram, where the birth-death pairs are plotted in 2-d.

Persistence landscapes

To compute the PL, we take the birth-death pairs, and for each pair, we create two line segments. The first line segment will start at $(b, 0)$ , where $b$ is the birth of that pair, and end at $(((d - b) / 2) + b, (d - b) / 2)$ . The second line segment will begin at $(((d - b) / 2) + b, (d - b) / 2)$ and end at $(d, 0)$ . Equation (1) shows the piecewise formulation of this function for a given birth-death pair provided by Bubenik et al.⁸
$f_{(b, d)} (x) = \begin{cases} 0 & x \notin (b, d) \\ x - b & x \in (b, \frac{b + d}{2}] \\ - x + d & x \in (\frac{b + d}{2}, d) \end{cases}$
(1)

Suppose a birth-death pair has a greater “lifespan” (the difference between birth and death) than another. In that case, the pair with the longer lifespan in the PL will peak higher than the other. This quantifies a feature’s importance overall and at a specific threshold by observing the height. We have a PL once we create all of these line segments and plot them on the same graph. The $λ_{k} (x)$ landscape can be defined as the k-max values of the landscape. If we take the $λ_{1}$ landscape, this value can be thought of as the highest strength of any homology feature at every threshold. The $λ_{2}$ landscape would be the set of max points if all the points in $λ_{1}$ were removed. An example can be seen in Figure 1.

Figure 1.
Persistence landscape. $λ_{1}$ is dash-dotted, $λ_{2}$ is solid, and $λ_{3}$ is dashed.
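To make the definition concrete, the following sketch evaluates equation (1) and the $λ_{k}$ landscape at a single $x$ by sorting the tent heights. It is a direct, pointwise reading of the definition, not the algorithm introduced in this article; the function names `tent` and `landscape_value` are illustrative.

```python
def tent(b, d, x):
    """Equation (1): height at x of the tent for the birth-death pair (b, d)."""
    if x <= b or x >= d:
        return 0.0
    # Rising edge x - b up to the midpoint, falling edge d - x after it.
    return min(x - b, d - x)


def landscape_value(pairs, k, x):
    """lambda_k(x): the k-th largest tent height at x (k is 1-indexed)."""
    heights = sorted((tent(b, d, x) for b, d in pairs), reverse=True)
    return heights[k - 1] if k <= len(heights) else 0.0


# Three nested pairs sharing the midpoint x = 3.
pairs = [(0.0, 6.0), (1.0, 5.0), (2.0, 4.0)]
print([landscape_value(pairs, k, 3.0) for k in (1, 2, 3, 4)])
print([landscape_value(pairs, k, 0.5) for k in (1, 2)])
```

At $x = 3$ the three tents have heights 3, 2, and 1, so those are the values of $λ_{1}$, $λ_{2}$, and $λ_{3}$; any landscape beyond the number of pairs is zero.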

A key element of PLs and a main motivator for their creation and adoption in machine learning is that they are stable under perturbation. This stability allows them to be more easily compared in noisy environments than other topological representations, such as persistence barcodes and persistence diagrams. The goal is that this allows for statistical machine learning algorithms to take advantage of PLs with minimal modification. It should be noted that determining the optimal way to use topological information in machine learning models is still an open problem. Additionally, a PL exists in a Banach or Hilbert space, which makes the formulation of some statistical properties easier.

Algorithms

Two novel algorithms are presented in this section that, when combined, result in a significant speedup in the time needed to compute the top- $k$ PLs given a set of birth-death pairs. The first is an output-sensitive algorithm for computing the top- $k$ PLs. The second is a preprocessing step on the set of birth-death pairs that, in optimal $O (n \log n)$ time, removes all birth-death pairs that can never appear in the top- $k$ landscapes, where $n$ is the number of persistence pairs.

It should be noted that the persistence pair filtering step does not need to be done, as it is only for computational efficiency. Additionally, the persistence pairs filtering step can be used with any other algorithm for generating PLs from birth-death pairs. This is because it both takes in and outputs a set of birth-death pairs in the same format.

Plane-sweep landscape generation

This section provides an output-sensitive algorithm for computing PLs by taking advantage of the geometric structure of the problem. Bubenik et al.⁸ were the first to propose that this might be helpful.

First, it is helpful to understand the simplest manifestation of the problem and build up from there. Given a collection of every birth-death pair and every intersection between birth-death pairs (including which two pairs are involved in the intersection), it is trivial to determine the top- $k$ landscapes. All one has to do is perform a plane sweep of the data and keep track of the ordering of line segments based on the y coordinate. All that is needed is to track when a line segment entered and left a given k-max position. The disadvantage of this technique is that every intersection must be calculated by brute force beforehand, which upper bounds the algorithm by $O (n^{2})$ , where $n$ is the number of persistence pairs.
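The brute-force step described above can be sketched as follows. A falling edge $y = d_{1} - x$ and a rising edge $y = x - b_{2}$ meet at $((d_{1} + b_{2}) / 2, (d_{1} - b_{2}) / 2)$ , and that point lies on both edges exactly when $b_{1} \leq b_{2} \leq d_{1} \leq d_{2}$ . The sketch uses strict inequalities, an assumption made here to skip degenerate touches at shared endpoints; the function names are illustrative.

```python
from itertools import combinations


def crossing(first, second):
    """Point where first's falling edge meets second's rising edge, or None."""
    (b1, d1), (b2, d2) = first, second
    if b1 < b2 < d1 < d2:  # the intervals overlap without one containing the other
        return ((d1 + b2) / 2, (d1 - b2) / 2)
    return None


def all_crossings(pairs):
    """Brute-force O(n^2) list of (point, falling_pair, rising_pair), sorted by x."""
    found = []
    for p, q in combinations(pairs, 2):
        for falling, rising in ((p, q), (q, p)):
            point = crossing(falling, rising)
            if point is not None:
                found.append((point, falling, rising))
    return sorted(found)


pairs = [(0.0, 4.0), (2.0, 6.0), (1.0, 3.0)]
for point, falling, rising in all_crossings(pairs):
    print(point, falling, rising)
```

Nested pairs such as `(0.0, 4.0)` and `(1.0, 3.0)` never cross, which is why only two of the three pairings above produce an intersection.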

Fortunately, through careful modifications to the brute force algorithm, it is possible to perform only a minimal amount of additional work for each birth-death pair that appears in the top- $k$ landscapes, creating an output-sensitive algorithm. This is the algorithm built up in this section. The first stage is the definition of a preprocessing step that eliminates all birth-death pairs that do not appear in the top- $k$ landscapes. Pairs that do not appear in the top- $k$ landscapes can be safely removed without affecting the output of the algorithm, as shown in the “Persistence pairs filtering” section. Using the findings of the “Persistence pairs filtering” section, a simple plane-sweep algorithm can be run before we generate the PLs to filter out birth-death pairs that do not appear in the top- $k$ landscapes. This is done by keeping an ordering of the currently alive birth-death pairs for a given $x$ coordinate, ordered by birth. If a given birth-death pair never appears among the top $k$ positions for any $x$ coordinate, it can be removed.

Secondly, we define an algorithm that checks for intersections only with birth-death pairs currently a part of the top- $k$ landscapes. This means that all of those intersections must be part of the final diagram. The key to this algorithm comes when we add a new pair to the top- $k$ landscapes. If there are fewer than $k$ birth-death pairs alive, then the new pair is automatically added to the top- $k$ landscapes.

To add new birth-death pairs into the top- $k$ landscapes when there are already $k$ pairs alive, we check for an intersection between the new pair and the bottom persistence pair in the top- $k$ landscapes if that bottom pair has a negative slope at the status line. If this is the case, the two pairs must intersect by the results of the “Persistence pairs filtering” section.
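The effect of the filtering step can be illustrated with a quadratic brute-force stand-in. This is not the $O (n \log n)$ plane-sweep filter of the “Persistence pairs filtering” section. It relies on an observation that is an assumption of this sketch rather than a quotation from the article: every tent edge has slope $\pm 1$ , so a tent that is strictly higher at another tent’s peak is strictly higher across that tent’s whole support, and a pair therefore ranks best at its own peak. The function name `filter_pairs_brute_force` is illustrative.

```python
def tent(b, d, x):
    """Height at x of the tent for the birth-death pair (b, d)."""
    if x <= b or x >= d:
        return 0.0
    return min(x - b, d - x)


def filter_pairs_brute_force(pairs, k):
    """Keep the pairs that have fewer than k other tents strictly above their peak."""
    kept = []
    for i, (b, d) in enumerate(pairs):
        peak_x, peak_y = (b + d) / 2, (d - b) / 2
        above = sum(
            1
            for j, (b2, d2) in enumerate(pairs)
            if j != i and tent(b2, d2, peak_x) > peak_y
        )
        if above < k:
            kept.append((b, d))
    return kept


pairs = [(0.0, 10.0), (1.0, 9.0), (2.0, 8.0), (3.0, 7.0), (11.0, 12.0)]
kept = filter_pairs_brute_force(pairs, k=2)
print(kept)
```

With $k = 2$ , the two innermost nested pairs are dropped because two larger tents cover them everywhere, while the short isolated pair `(11.0, 12.0)` is kept because nothing else is alive at its peak.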

This algorithm is inspired by the basic line segment intersection algorithm taught in computational geometry.²² When observing the PL problem from an abstract level, the two problems are similar.

Intersection discovery order

When we add a line segment to our working set, it is simple to calculate all the intersections of this new segment with all the previously existing segments. This can be done in $O (n)$ time, where $n$ is the number of persistence pairs, and by doing so for each segment, we know all the intersections. The question then becomes: is the order in which we discover these intersections enough to immediately order them into $λ_{k}$ landscapes, or must we order them after we have all of them? It turns out that the order in which these intersections are discovered is the correct order if we add line segments based on their minimum point (from smallest to largest). The following inductive argument, whose inductive step is argued by contradiction, explains the logic.

Base case: The first line we add is correct and belongs to the top landscape, as there are no other segments in the working group. It is therefore in a valid state.

Induction: Add the next line and the resulting intersections. Without loss of generality, consider a single one of these new intersections. The intersection can only be added to one landscape and must be the next intersection for the landscape. If it were not, there would be a line segment that intersects the line segment formed by the current last point in the landscape and our new proposed point.

Case 1: If this segment has a positive slope, then we are saying we are missing a negative slope segment, $n$ . In this case, we are adding a negative slope segment (in order to intersect with the previous positive slope segment of the landscape). This segment must start in the correct landscape because it is picking up where the positive slope segment it is paired with left off. So for the intersection to be in the wrong place, $n$ would need to be a negative slope that is above our segment (for this to be the case, it would have had to start before our positive segment). This cannot be the case, so we have included all previous negative slope segments.

Case 2: If this segment has a negative slope, then we are saying we are missing a positive slope segment, $p$ . $p$ must start at the same $y$ coordinate as all other segments and must have the same slope as all other positive slope segments. Because the segment just added was a positive slope segment (in order to intersect with the previous negative slope segment of the landscape), this would mean our missing segment must have a starting point less than the current segment. This cannot be the case (as it would have been processed first), so we have included all previous positive slope segments.

Conclusion: After each line segment and intersection, we maintain the correctness of the working group. Once all line segments and intersections are added to the working group, the landscapes of the working group are the correct landscapes overall.

This shows that intersections are added when they are discovered and do not need to be sorted or tracked (other than appending them to the output list for each $λ_{k}$ ).

Algorithm 1
Compute top- $k$ persistence landscapes

Require: $pairs$ : list of all birth-death pairs

Require: $k$ : number of landscapes to calculate

1: $events \leftarrow generateEventPoints (pairs)$

2: for $e \in events$ do

3: $pos \leftarrow addToStatus (e)$

4: if $pos < k$ then

5: $λ [pos] . append (e)$

6: end if

7: if $isBirthPoint (e)$ then

8: $addToStatus (e)$

9: end if

10: if $isPeakPoint (e)$ or $isBirthPoint (e)$ then

11: $i \leftarrow e$

12: while $i \leftarrow newIntersection (i, λ, k)$ do

13: $handle\_intersections (i, λ, k)$

14: end while

15: end if

16: if $isDeathPoint (e)$ then

17: $removeFromStatus (e)$

18: end if

19: end for

20: return $λ$

Algorithm 2
Handle intersections

Require: $i$ : intersections to be handled

Require: $λ$ : working copy of the landscapes

Require: $k$ : number of landscapes to calculate

1: $(s1, s2) \leftarrow next (i)$

2: $swap\_pos (s1, s2)$

3: if $s1.pos < k$ then

4: $λ [s1.pos] . append (intersection (s1, s2))$

5: end if

6: if $s2.pos < k$ then

7: $λ [s2.pos] . append (intersection (s1, s2))$

8: end if
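The swap-and-log step of Algorithm 2 can be sketched in a few lines. This is a simplified illustration, not the released Rust implementation: it assumes the status structure is a plain Python list ordered from highest to lowest (index 0 is the top), that the two intersecting pairs are neighbors at indices `upper` and `upper + 1`, and that the intersection point has already been computed. All names are hypothetical.

```python
def handle_intersection(status, landscapes, upper, point, k):
    """Swap two neighboring pairs in the status list and log the crossing point.

    status[upper] and status[upper + 1] are the pairs that intersect at `point`.
    The point is appended to every affected landscape with a position below k.
    """
    lower = upper + 1
    status[upper], status[lower] = status[lower], status[upper]
    for pos in (upper, lower):
        if pos < k:
            landscapes[pos].append(point)


# (0, 4) is falling and (2, 6) is rising; they cross at (3, 1).
status = [(0.0, 4.0), (2.0, 6.0)]
landscapes = [[], []]
handle_intersection(status, landscapes, upper=0, point=(3.0, 1.0), k=1)
print(status)
print(landscapes)
```

With $k = 1$ only the top landscape records the crossing, even though both pairs change position in the status structure.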

Implementation details

Algorithm 1 contains the critical points of the algorithm but does leave out some bookkeeping. The rest of this section aims to fill in those gaps.

The algorithm is inspired by the plane-sweep line segment intersection.²² Our algorithm differs from the original algorithm in the following key ways. We still have an ordering, but now our geometric objects consist of two line segments, defined by three points: birth, peak, and death. One line segment goes from birth to peak, and the other line segment goes from peak to death. Our status structure is the ordering of birth-death pairs, where having a higher y-coordinate at the sweep line’s current position denotes a higher position in the status structure. We should also note that our sweep line always starts on the far left of our collection of objects perpendicular to the x-axis. Our event points are the birth, peak, and death points that define each persistence pair in the PL. We check for intersection points the same way as in the line segment intersection algorithm: only checking for intersections between two pairs if they are neighbors in the status structure.

The algorithm does the following, given a set of birth-death pairs. First, it generates the initial event points in line 1. The initial event points correspond to the following for each birth-death pair: the birth, the peak, and the death of a given topological feature.
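A minimal sketch of this first step is shown below. It assumes finite death values and sorts events by their $x$ coordinate only; how ties between events at the same $x$ are broken is left unspecified here and is an implementation detail not covered by this sketch. The `Event` type and the function name `generate_event_points` are illustrative.

```python
from typing import NamedTuple, Tuple


class Event(NamedTuple):
    x: float                   # position of the sweep line when the event fires
    y: float                   # height of the tent at that position
    kind: str                  # "birth", "peak", or "death"
    pair: Tuple[float, float]  # the (birth, death) pair the event belongs to


def generate_event_points(pairs):
    """Create the birth, peak, and death events for every pair, sorted by x."""
    events = []
    for b, d in pairs:
        events.append(Event(b, 0.0, "birth", (b, d)))
        events.append(Event((b + d) / 2, (d - b) / 2, "peak", (b, d)))
        events.append(Event(d, 0.0, "death", (b, d)))
    return sorted(events, key=lambda e: e.x)


events = generate_event_points([(0.0, 4.0), (1.0, 3.0)])
for e in events:
    print(e.kind, e.x, e.y)
```

Each pair contributes exactly three events, so the sort over $3n$ events is the $O (n \log n)$ term in the runtime analysis.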

We need to take different actions at each event point. At a birth point, we must add the corresponding persistence pair to the status structure in line 3. From the definition of birth-death pairs, we can guarantee that when a persistence pair first appears, it must be lower than any other persistence pair in the status structure. As a result, we do not have to search the status structure to determine where it should be inserted to maintain the ordering. This fact allows us to get away with a simple status structure with constant insert and access times, such as a linked list. We must also check to see if this new persistence pair intersects with its neighbors in line 10, but because it is at the bottom of the status structure, it can only have one neighbor: the pair above it. We only check for this intersection if the neighbor above is past its peak. If it is still rising, it could switch positions with persistence pairs above it before intersecting with our new pair. If our new pair intersects with its neighbor, we handle the intersection by swapping positions and logging to the output if needed, and then recursively check for any more intersections that may now exist. Finally, if our new pair has a position in the status structure in the top- $k$ , that means it is a part of one of the top- $k$ landscapes. As such, we keep track of it to report later. This is done in lines 4 and 5. In our implementation, we use $k$ linked lists, one per landscape, which are returned to the user at the end of the algorithm; the point is appended to the respective linked list.

For peak points, we check our bookkeeping to see whether the peak point belongs to a persistence pair that is part of the top- $k$ landscapes in lines 4 and 5. If it does, we track it in the way previously mentioned: adding it to the end of the respective linked list. We also check for intersections, as our persistence pair has changed slope and could now intersect with the persistence pair below it, if and only if the persistence pair below it is still rising. If we find any intersections, we handle them right away. This is done in Algorithm 2.

Lastly, for a death point, we can guarantee that the corresponding persistence pair must be in the bottom position in the status structure. If this were not the case, then there would be at least one other pair, the one below it, that would have to end earlier or intersect with our persistence pair before our pair could die. The reason is that all persistence pairs must end at the same $y$ coordinate, they all have the same slope after their peak, and each persistence pair consists of only two line segments in the PL. As a result, we can delete the bottom persistence pair from the status structure in line 16 and be confident that it is the one the death point is referencing. Before we do this, we check to see if it is currently part of the top- $k$ landscapes. If it is, we save the point to its respective linked list in lines 4 and 5.

Runtime

The algorithm’s runtime is relatively complex to analyze, especially when combined with the results from the “Persistence pairs filtering” section and the “Intersection discovery order” subsection. First, for every event point, we only do constant work, not including handling intersections. This is because the three possible actions for an event are as follows: add to status (birth), check for intersection (birth, peak), add to a linked list (all), and remove from the status structure (death). For all of these actions, we maintain pointers between all the structures and, with proper bookkeeping, only incur an addition of $O (1)$ runtime per event. For each intersection, we incur $O (1)$ by swapping, appending to the output structure, and only performing one unproductive intersection check per valid intersection and valid event point. This is because all event points put through the algorithm must appear at some point in the output as a result of the “Persistence pairs filtering” section. In the entire structure, there can only be $O (n^{2})$ intersections, where $n$ is the number of persistence pairs. Thus, our complexity from the initial sort/creation of event points ( $O (n \log (n)$ ), processing all event points ( $O (n)$ ) and processing all intersections ( $O (n^{2})$ ) gives a runtime of $O (n^{2})$ . This is optimal and is the same as the original algorithm from Bubenik et al.⁸

As the number of intersections per given structure of the same input size can vary greatly, it is crucial to consider the output-sensitive runtime of the algorithm. An algorithm that always performs as if every birth-death pair intersects would be leaving significant performance on the table if the number of intersections, $I$ , satisfies $I \ll n$ in a given application domain, where $n$ is the number of persistence pairs. When performing output-sensitive complexity analysis on our algorithm, we see that it performs at $O (n \log (n))$ , where $n$ is the number of persistence pairs. This results from never checking for more than a single intersection per valid point included in the output.

Additionally, due to being able to perform in-place sorting and only storing valid output results, the memory complexity of the algorithm is also optimal.

In summary, the algorithm runs in $O (n^{2})$ time in the worst case, output-sensitive $O (n \log (n))$ time when the input is unsorted, and output-sensitive $O (n)$ time when the input is sorted. Its memory complexity is $O (n^{2})$ in the worst case and $O (n)$ when compared to the output size, where $n$ is the number of persistence pairs. As a result, it is optimal in big-O, output sensitivity analysis, and memory complexity.

Unlike most output-sensitive algorithms, this algorithm performs competitively with the approach from Bubenik et al.⁸ even in the worst case, and there are many cases where it performs much better when birth-death filtering is present. Additionally, the approach from Bubenik et al.⁸ must make multiple passes over the data, which prevents its use with large datasets if the entire dataset does not fit in memory. The “Plane sweep algorithm analysis validation” section quantifies the speedup in practice compared with the approach from Bubenik et al.⁸ Taken together, these properties open the door for TDA-PL to be applied to larger datasets and for existing datasets to be filtered for significant speedups. This is shown experimentally in the “Experiments” section, with minimal impact on model performance in some domains.

Integration

It is sometimes helpful to take the $L^{2}$ norm of the PLs. This is a trivial task because each PL is a piecewise-linear function given by a discrete set of points, and no external library is needed. One simply integrates in closed form over each line segment down to the x-axis for every PL. This is accomplished easily in $O (n)$ , where $n$ is the number of points in all the PLs.
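As an illustration of the segment-by-segment integration, the following minimal sketch computes both the area under a single landscape and the square root of the integral of its square (the usual definition of the $L^{2}$ norm). The input format, a list of `(x, y)` vertices sorted by `x`, and the function name `landscape_norms` are assumptions made for this example and are not taken from the fast-pl package.

```python
import math


def landscape_norms(points):
    """Return (area, l2_norm) of one piecewise-linear landscape.

    `points` is a list of (x, y) vertices sorted by x with y >= 0.
    This input format is a hypothetical choice for illustration.
    """
    area = 0.0
    squared_integral = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        width = x1 - x0
        # The trapezoid formula is exact for a linear segment.
        area += width * (y0 + y1) / 2.0
        # Exact integral of the square of a linear segment.
        squared_integral += width * (y0 * y0 + y0 * y1 + y1 * y1) / 3.0
    return area, math.sqrt(squared_integral)


# A single tent with birth 0, death 2, and peak (1, 1).
area, l2_norm = landscape_norms([(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)])
```

Both quantities are accumulated in one pass over the vertices, so the cost is linear in the number of points.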

Persistence pairs filtering

The main variables that affect the time it takes to generate PLs are the number of birth-death pairs and intersections between birth-death pairs. During real-world applications of the PL model, only the top- $k$ landscapes are typically kept in the final representation. However, there has been no time-efficient method for determining if a given birth-death pair is in the top- $k$ landscapes. This section presents a new property of persistence pairs, which leads to an optimal $O (n \log (n))$ algorithm for determining this desirable property, where $n$ is the number of persistence pairs.

Overview

Given a set of birth-death pairs, create a diagram where all persistence pairs are parallel lines, and pairs with an earlier birth begin at a higher y-coordinate than those with later births.

Perform a line sweep scan of the pairs, maintaining an ordering on the lines where higher y-coordinates are higher in the status structure. Keep track of all pairs that appear in the top- $k$ positions in the status structure. This process can be thought of as an observer looking straight down on the pairs who can see through $k - 1$ pairs. If the observer can see a persistence pair, then that pair appears in the top- $k$ landscapes. An algorithm to accomplish this in $O (n)$ time if the lines are sorted, or $O (n \log (n))$ if sorting is required, follows, where $n$ is the number of persistence pairs.
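The observer idea can be sketched in a few lines. Under the ordering described above, only pairs with an earlier birth can sit above a given pair in the status structure, and those pairs can only leave while the given pair is alive, so its best position is the one it holds at its own death. The sketch below uses that observation and is not the authors' implementation: the function name `top_k_filter`, the Fenwick tree standing in for the status structure, the skipping of zero-length pairs, and the tie handling (equal births or equal deaths are not counted as one pair covering the other) are all assumptions made for illustration.

```python
def top_k_filter(pairs, k):
    """Keep the (birth, death) pairs that reach a top-k status position."""
    # Assumption: zero-length pairs contribute nothing and are skipped.
    live = [i for i, (b, d) in enumerate(pairs) if d > b]
    # Status position: an earlier birth sits higher (smaller position).
    order = sorted(live, key=lambda i: pairs[i])
    pos_of = {i: pos for pos, i in enumerate(order)}
    n = len(order)
    tree = [0] * (n + 1)  # Fenwick tree marking the active positions

    def update(pos, delta):
        pos += 1
        while pos <= n:
            tree[pos] += delta
            pos += pos & -pos

    def count_above(pos):
        # Number of active pairs at positions [0, pos).
        total = 0
        while pos > 0:
            total += tree[pos]
            pos -= pos & -pos
        return total

    events = []
    for i in live:
        b, d = pairs[i]
        events.append((b, 1, pos_of[i], i))  # kind 1: birth
        events.append((d, 0, pos_of[i], i))  # kind 0: death, sorts first on ties
    events.sort()

    keep = set()
    for _, kind, pos, i in events:
        if kind == 1:
            update(pos, 1)
        else:
            # The pair is visible if fewer than k active pairs lie above it.
            if count_above(pos) < k:
                keep.add(i)
            update(pos, -1)
    return [pairs[i] for i in sorted(keep)]


pairs = [(0, 10), (1, 9), (2, 8), (3, 4), (5, 12)]
visible = top_k_filter(pairs, 2)  # [(0, 10), (1, 9), (5, 12)]
```

Each event costs one logarithmic-time tree operation, so after sorting the sweep in this sketch runs in $O (n \log (n))$ ; the linear bound stated for sorted input is not reproduced by this simplified version.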

Theorem 5 with Theorem 1 shows that a persistence pair appears in the top- $k$ pairs seen by the observer during the algorithm if and only if it appears in the top- $k$ PLs. To prove this, it is shown that if the observer can see a persistence pair, then all persistence pairs that have overlapping lifespans are either subsets of the given pair or fall into one of the following two cases, which will intersect in such a way that the persistence pair is in the top- $k$ . If the birth of another persistence pair is less than our given pair’s birth, it is proven that they must either cross before our given pair’s corresponding PL’s peak or die before our given pair’s birth. The other case is for persistence pairs that are born after our given pair’s birth and are not subsets of our given persistence pair. For these, it is proven that they must cross after our given pair’s peak or have their ranges not intersect. These facts are enough to prove that the ordering of the lines from the line sweep algorithm stated above corresponds to the maximum position of the birth-death pairs in the PL. When discussing the birth and death of a persistence pair $p$ , the birth of $p$ is shown as $p^{b}$ and the death of $p$ is shown as $p^{d}$ . Subscripts are used to refer to different persistence pairs in the set of all pairs for the problem.

Proof
Theorem 1
Given a line, $l$ , perpendicular to a set of persistence pairs arranged as a diagram where persistence pairs with lower births have higher y coordinates, the line segment with the highest y-coordinate that intersects $l$ will be a part of the highest PL.
Proof.
If there is only one birth-death pair intersecting the line $l$ , then it is evident that there exists an x-coordinate when our birth-death pair, $p$ , is a part of the highest PL (the current x-coordinate where we are, which contains no other birth-death pairs).

If multiple birth-death pairs are intersecting $l$ , there exists a point where the birth-death pair that has the highest y-coordinate, $p$ , is higher than all the previous birth-death pairs (which are not currently intersecting $l$ ) while being above all birth-death pairs below it that are currently intersecting $l$ .

We prove that all previous lines must have crossed $p$ before $p$ peaked, and all other birth-death pairs either cross $p$ after $p$ ’s peak, or do not cross $p$ at all.
Previous persistence pairs. As a result of Lemma 4, all birth-death pairs must peak before they can cross any birth-death pairs below them. Additionally, for a pair of birth-death pairs to cross, one must be traveling to its peak while the other is traveling to its death. It follows that all previous birth-death pairs intersect $p$ before its peak, or do not intersect at all. If we can observe a given birth-death pair from our observer, there is no birth-death pair that is a superset of $p$ , such as in Lemma 2. Therefore, the only way for a previous birth-death pair to not intersect our given birth-death pair would be for the range of $p$ to not overlap with the other birth-death pair’s range. So if we can see $p$ from the observer, any birth-death pair whose birth is less than $p^{b}$ must either not intersect or intersect $p$ before $p$ ’s peak.

Other intersecting lines. Refer to Lemma 4.

Lemma 2
If a persistence pair $p_{1}$ is a subset of another persistence pair $p_{2}$ (equivalently, $p_{2}$ is a superset of $p_{1}$ ), then their corresponding line segments do not cross.
Proof.
All birth-death pairs in the PL have the same slope to their peak and the same slope after their respective peak, which is always halfway between the birth and death points. As a result, if one persistence pair has a birth $p_{1}^{b}$ and a death $p_{1}^{d}$ and another persistence pair has a birth $p_{2}^{b}$ and death $p_{2}^{d}$ such that $p_{1}^{b} > p_{2}^{b}$ and $p_{1}^{d} < p_{2}^{d}$ , the line segments corresponding to the $p_{1}$ persistence pair will never cross $p_{2}$ (as shown in Figure 2). Additionally, because all persistence pairs begin at $y = 0$ , the persistence pair $p_{1}$ exists inside $p_{2}$ when plotted and, therefore, does not affect if $p_{2}$ appears in the top- $k$ landscapes.
Figure 2.
Persistence pairs superset example (Lemma 2).
Lemma 3
Given two birth-death pairs, $p_{1}$ and $p_{2}$ , if $p_{1}^{b} < p_{2}^{b}$ and $p_{1}^{d} < p_{2}^{d}$ then $p_{1}$ and $p_{2}$ must intersect (case in Figure 3).
Figure 3.
Persistence pairs intersection example.

Proof.
The initial ordering of the two lines based off of y-coordinate right after $p_{2}$ is born is ( $p_{1}$ and $p_{2}$ ) and the ordering right before $p_{1}$ dies is ( $p_{2}$ and $p_{1}$ ). If this were not the case, then $p_{2}$ would end before $p_{1}$ because they share the same slope right before they die; once their slope changes to $- 1$ , it cannot change, and they die at the same y-coordinate in the PL. As a result, $p_{1}$ and $p_{2}$ must intersect. This is needed for Lemma 4.
Lemma 4
Given two birth-death pairs, $p_{1}$ and $p_{2}$ , if $p_{1}^{b} < p_{2}^{b}$ and $p_{1}^{d} \leq p_{2}^{d}$ , then $p_{1}$ must reach its peak before $p_{1}$ and $p_{2}$ cross.
Proof.
All persistence pairs in the PL have the same slope to their respective peaks and after the peak to their death. As a result, two birth-death pairs cannot cross while both of them are either on their way to their respective peaks or deaths; one must be on its way to a peak while the other is on its way to its death when they cross.

If $p_{2}$ peaked before $p_{1}$ , then they would not cross. The reason is that the slope after the peak is the same for both. Because $p_{2}$ is already inside $p_{1}$ in the PL, if $p_{2}$ peaked before $p_{1}$ , it will end before $p_{1}$ , and we will have Lemma 2. Therefore, $p_{2}$ cannot peak before $p_{1}$ because $p_{1}^{d} \leq p_{2}^{d}$ .

If $p_{2}$ and $p_{1}$ shared a death point, then that means the line segments they define after their peaks are collinear because they have an equal slope, by definition, and share a point, their death. In order for them to share this line, $p_{1}$ must peak before $p_{2}$ because $p_{1}$ has an earlier birth and, therefore, would peak higher than $p_{2}$ . It must, as a result, begin its descent to its death before $p_{2}$ begins its descent for them to meet up on the same line. Therefore, $p_{1}$ peaks before $p_{2}$ if they share a death point, and they cross after $p_{1}$ ’s peak.

If $p_{1}^{d} < p_{2}^{d}$ , then Lemma 3 shows they intersect in the PL. This crossing happens when one pair travels to its peak while the other travels to its death. The pair that is traveling to its peak at the point of intersection ends after the one traveling to its death. The reason is that after the intersection, the pair traveling to its peak will be higher than the other pair. Additionally, once a pair has begun its descent, it cannot change slopes. As a result, the pair that is on its descent at the intersection point must die first. This proves that, because $p_{2}$ ends after $p_{1}$ , $p_{1}$ peaked before the intersection, or else $p_{1}$ would end after $p_{2}$ .
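A small numerical check can make Lemmas 3 and 4 concrete. The sketch below assumes the usual tent shape for each pair in the PL (slope $+1$ up to the midpoint of birth and death, slope $-1$ afterwards) and assumes that the two lifespans overlap, that is, $p_{2}^{b} < p_{1}^{d}$ ; this overlap condition and the helper names `tent` and `crossing` are introduced here for illustration only. Equating the descending leg of $p_{1}$ , $p_{1}^{d} - t$ , with the ascending leg of $p_{2}$ , $t - p_{2}^{b}$ , gives the crossing at $t = (p_{1}^{d} + p_{2}^{b}) / 2$ .

```python
def tent(pair, t):
    """Height at t of the tent function of a (birth, death) pair."""
    b, d = pair
    return max(0.0, min(t - b, d - t))


def crossing(p1, p2):
    """Crossing of p1's descending leg with p2's ascending leg, or None.

    Only the case b1 < b2 < d1 < d2 is handled (hypothetical helper).
    """
    (b1, d1), (b2, d2) = p1, p2
    if not (b1 < b2 < d1 < d2):
        return None
    t = (d1 + b2) / 2.0
    return t, (d1 - b2) / 2.0


p1, p2 = (0.0, 4.0), (2.0, 8.0)
t, y = crossing(p1, p2)            # crossing at t = 3.0 with height 1.0
peak_of_p1 = (p1[0] + p1[1]) / 2   # 2.0, which is before the crossing
```

Because $p_{1}^{b} < p_{2}^{b}$ , the peak of $p_{1}$ at $(p_{1}^{b} + p_{1}^{d}) / 2$ always lies before the crossing, which matches the statement of Lemma 4 for this case.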
Theorem 5
If a birth-death pair is the $n$ -th closest to the observer, then it is a part of the $n$ -th PL.
Proof.
Applying the logic from Lemma 6 recursively leads immediately to this conclusion.
Lemma 6
If a birth-death pair is the second closest to the observer, then it is a part of the second PL.
Proof.
If a birth-death pair is second closest to the observer, that means that there is only one birth-death pair that is closer than our given birth-death pair to the observer. Imagine removing this closest birth-death pair. The result of this operation would be that the second-closest birth-death pair is now the closest birth-death pair. This is a result of adding in a birth-death pair, which can only move other birth-death pairs down in rank by a maximum of one position in the ordering. This shows that our given birth-death pair must be a part of the second PL.

Importance

The result of this new property of persistence pairs is that we can now determine in $O (n)$ if sorted, or $O (n \log (n))$ if unsorted, if a given birth-death pair appears in the top- $k$ landscapes, where $n$ is the number of persistence pairs. This limits the number of intersections in the “Plane-Sweep landscape generation” subsection to only be the intersections that appear in the final output.

Experiments

Experimental validation of the methods was conducted on both synthetic and real-world datasets. An analysis and summary of the results of these experiments follow.

Synthetic dataset

To experimentally validate the performance of the new plane sweep algorithm against the original exact algorithm from Bubenik et al.⁸ as the number of points in the topology grows, point clouds of various topologies were generated. For this experimental evaluation, the proposed work is implemented using a Rust package created by the authors named “fast-pl.” This was done to compare the algorithm’s time bound while eliminating as many variables as possible. It allows us to change any variable in the dataset, including noise levels. This gives us the ability to quantify if any of these variables affect the relative run time between the two algorithms. We included different topologies to see if this would have any effect on the runtimes as well. Point clouds for the torus, Swiss roll, d-sphere, and infinity sign were generated. Additionally, a dataset of uniformly distributed birth-death pairs was generated to mimic how the original algorithm’s time-bound growth was quantified by Bubenik et al.⁸

For each type of topology, datasets were generated from 100 to 10,000 points in steps of size 100. Some datasets stopped before 10,000 points because the compute resources available did not have enough memory to generate the dataset properly. When generating datasets of uniform persistence pairs, the number of persistence pairs used goes from 100 to 10,000 with steps of 100. For each set of persistence pairs, 30 samples were generated, for a total of 3000 datasets per type. The TaDAsets package²³ was used to generate the datasets. The parameters used to generate each type of topology can be seen in Table 1.

Table 1.
TaDAsets parameters.

Topology c a r d Ambient Noise

Torus 2 1 – – 200 0.2

Swiss roll – – 4 – 200 0.2

d-Sphere – – 4 10 200 0.2

Infinity sign – – – – – 0.2

The same datasets were run through our implementation as well as the PL Toolbox implementation⁸ (the original and standard method used to compute PL). We ran our code, fast-pl, allowing for a variable number of top- $k$ landscapes. Percentages refer to the maximum number of landscapes that will be kept with respect to the number of persistence pairs present (i.e., fast-pl 25% will keep the top 1000 landscapes when given 4000 persistence pairs). Datasets are generated using the TaDAsets Python package, which is used to evaluate TDA algorithms. The runtime data shown in Figures 4 to 8 and Table 2 show that the presented algorithm is equal to or faster than PL Toolbox.⁸ Table 2 shows the comparison between PL Toolbox⁸ and the proposed algorithm, fast-pl, with the average of 30 datasets, each containing 2000 points. The output of PL Toolbox⁸ and fast-pl 100% is exactly the same PL data. When birth-death filtering is done, as in fast-pl 50% and fast-pl 25%, some information is lost. Table 2 agrees with Figures 4 to 8 in showing that fast-pl is faster than PL Toolbox⁸ in these circumstances. Although the authors did find cases where fast-pl did not perform better than PL Toolbox,⁸ they were unable to find a generalizable situation where fast-pl performed significantly worse than PL Toolbox.⁸ Some of these edge cases can be seen in Figure 4. Even in these cases, fast-pl performed on par with PL Toolbox⁸ and was never seen to perform significantly or consistently worse. Additionally, when birth-death pair filtering was used, fast-pl always performed significantly faster than PL Toolbox.⁸ The only situation in which PL Toolbox would be preferred is when having source code in C++ is more beneficial to one’s workflow than Rust/Python (the languages fast-pl is written in). The runtime advantage of fast-pl only increases as the number of intersections removed by filtering increases.
A basic machine learning platform was used to test the effect of birth-death filtering on the synthetic datasets shown in Table 2. This consisted of taking all the generated classes as a single dataset. The $L^{2}$ norm of each PL was taken for each sample in the dataset, and the resulting vector was used to train a logistic regression model. As can be seen in Table 3, filtering even a large percentage of birth-death pairs did not have any impact on the performance of the model while providing significant time savings both in terms of generating the PLs (up to 15x faster in this case) and when training the model due to the dataset being smaller. These results were also seen when using a real-world dataset in the “Botnet experiments” subsection.

Figure 4.
Runtime comparison on the synthetic torus dataset.

Figure 5.
Runtime comparison on the synthetic Swiss roll dataset.

Figure 6.
Runtime comparison on the synthetic d-sphere dataset.

Figure 7.
Runtime comparison on the synthetic infinity sign dataset.

Figure 8.
Runtime comparison of uniformly distributed persistence pairs.

Table 2.
Average synthetic experiment runtime results in seconds of 30 datasets each with 2000 points (lower is better).

Topology PL Toolbox⁸ fast-pl 100% fast-pl 50% fast-pl 25%

Torus 0.1130 0.0971 0.0255 0.0077

Swiss roll 0.1130 0.0971 0.0242 0.0084

d-Sphere 0.1136 0.0942 0.0252 0.0081

Infinity sign 0.1129 0.0990 0.0245 0.0084

Uniform 0.0350 0.0369 0.0149 0.0049

Table 3.
Classification performance between algorithms.

Metric PL Toolbox⁸ fast-pl 100% fast-pl 50% fast-pl 25%

Accuracy 1.0 1.0 1.0 1.0

Macro-weighted precision 1.0 1.0 1.0 1.0

Macro-weighted recall 1.0 1.0 1.0 1.0

F1-score 1.0 1.0 1.0 1.0

Botnet experiments

We tested our algorithm against the previous best using a real-world botnet dataset. This was chosen because this problem is an active field of research in cybersecurity and TDA-PL has shown promise in being able to classify network traffic.⁷ Therefore, as computer network analysis is a field that has shown interest in using TDA and botnet detection is an open problem in that field, botnet detection was chosen to confirm the algorithms in a real-world practical setting.²⁴ There are many ways to apply TDA to real-world problems for computer network analysis. This work showcases one way to use the provided algorithms to perform data analysis on an applicable, complex, real-world problem. It is not the goal of this work to provide state-of-the-art results in botnet detection but to show the potential performance gains of the newly introduced algorithms in a real-world, practical setting. The data pipeline and machine learning metrics provided are in line with standard practices in the field of computer network analysis.

Botnet detection uses a more complex pipeline built on PLs, one that has found success in financial time series classification and prediction.^2,3 By using a real-world dataset, we can validate whether the success of the new algorithm in the synthetic experiments holds for use cases where it would be used in practice. It should be noted that the previous best approximation algorithm was rewritten in Python in these experiments to work with existing processing pipelines.

Preprocessing and feature engineering are critical steps for all time series efforts. For these experiments, Python 3.8 with NumPy and pandas was used. Ripser was the TDA package used for the Vietoris-Rips filtration, as it was shown by Otter et al.²⁵ to be the best for the task and has a Python implementation using the work of Tralie et al.²⁶ We used representative pattern matching (RPM) for the classification of the resulting data from the TDA stage.

Using the datasets described below, we treat each pcap packet capture as a set of points. Given these ordered sequences of points, the pipeline does the following to transform a multi-attribute time series into a univariate time series. First, it separates the time series into ordered sets of equal size determined by a user-defined window size, $w$ . It creates the first of these by looking at the first $w$ points in the time series. Creating the next ordered set will shift this window over by $s$ points, where $s$ is the number of skip points defined by the user. This results in the points $s$ to $w + s$ being the next window. It continues this process until it cannot skip over $s$ points and still maintain a window size of $w$ . For each of these windows, it will compute the PL using the Vietoris–Rips filtration.
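The windowing step can be sketched as follows. The function name `sliding_windows` and the use of a plain Python list for the ordered points are assumptions made for this example; the window size $w$ and skip $s$ are the user-defined parameters described above.

```python
def sliding_windows(points, w, s):
    """Split an ordered sequence into windows of size w, shifted by s."""
    if w <= 0 or s <= 0:
        raise ValueError("w and s must be positive")
    # The last start index is the largest one that still leaves w points.
    return [points[start:start + w]
            for start in range(0, len(points) - w + 1, s)]


windows = sliding_windows(list(range(10)), w=4, s=3)
# windows == [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each resulting window would then be passed to the Vietoris–Rips and PL stages; a sequence shorter than $w$ produces no windows in this sketch.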

We must convert each of these landscapes into a single value that can represent the overall homology of the window and can be compared to other windows. This is accomplished using the $L^{2}$ norm of each landscape. The $L^{p}$ norm is a good choice, as Bubenik et al.⁶ showed it provides a Banach space structure. The $L^{2}$ norms form a natural ordering based on the window order. This ordering on the $L^{2}$ norms forms a univariate time series. From this, one could use any univariate time series analysis technique for classification or anomaly detection.

The experiments used two different machines: a high-end desktop (HEDT) and a multiuser server. All other users, apart from the user running the experiments, were logged off the multiuser system for all experiments containing relative speedup results. This was done to minimize the experimental noise introduced into the results from other users’ activities. Times were captured by taking the wall time difference before and after the TDA step only (dimensionality reduction stage). This was done to demonstrate the time savings from changing the algorithms used. The exact specifications of each machine are given in Table 4.

Table 4.
Hardware specifications.

HEDT Server

Memory 48 GB 125 GB

CPU Xeon Silver 4114 2x Xeon E5-2640v4

OS Windows 10 VMware Sphere

User-level OS Windows 10 Ubuntu 18.04LTS

Multiuser system No Yes

HEDT: high-end desktop; CPU: central processing unit; OS: operating system.

We examined two botnet datasets. The first is ISCX Botnet 2014 from the University of New Brunswick Canadian Institute for Cybersecurity.²⁷ This dataset has the three main attributes of a good dataset: generality, realism, and representativeness. They accomplished this by including many types and manifestations of botnets from multiple data collection points. The data format is multiple pcap files. Having access to the raw pcap files greatly increases the different types of techniques that can be evaluated on the dataset as pcap is the industry standard for network capture and replay. The result is a greater number of people who can use the dataset to evaluate their algorithm, making it easier for the dataset to become a standard for evaluation.

The training dataset contains seven different botnets, each of which is either IRC, HTTP, or P2P, with 43.92% of the traffic being malicious. The testing set contains 16 different botnets. Each of these botnets is one of the previously seen types from the training dataset, with 44.97% being malicious. All the original botnet types appear in the testing dataset. This enables one to test the previously mentioned different types of generalizations for the model in question.

Preprocessing

The following preprocessing was done on the dataset:

pcap to CSV and feature selection using Argus

sample labeling in Python

feature engineering in Python

test/train split

CSV to JSON in Python

Feature selection using Argus

The ISCX dataset comes as pcap files, which means that feature selection is needed. Feature selection is done using openargus. The features are given in Table 5. These features are chosen out of the possible 145 because not all the available features are thought to be helpful in the botnet detection problem.²⁸ These are very similar to the features that are used by Homayoun et al.²⁸ The reason for the difference is that some of the features never changed in value in the ISCX botnet dataset and thus added no information. If these features in the dataset were to change, it might be worthwhile to add them back. More feature extraction could be done to the dataset than what was done here.

Table 5.
Features extracted from the ISCX 2014 Botnet dataset.

Feature Argus argument

Source IP address saddr

Destination IP address daddr

Direction of transaction dir

Protocol proto

Record total duration dur

Total transaction packet count pkts

Packet count from source to destination spkts

Packet count from destination to source dpkts

Source interpacket arrival time (ms) sintpkt

Source idle interpacket arrival time (ms) sintpktidl

Destination interpacket arrival time (ms) dintpkt

Destination idle interpacket arrival time (ms) dintpktidl

Total transaction bytes bytes

Source to destination transaction bytes sbytes

Destination to source transaction bytes dbytes

Source active interpacket arrival time (ms) sintpktact

Destination active interpacket arrival time (ms) dintpktact

Mean of the flow packet size transmitted by the src (initiator) smeansz

Mean of the flow packet size transmitted by the dst (target) dmeansz

Minimum packet size for traffic transmitted by the src sminsz

Minimum packet size for traffic transmitted by the dst dminsz

Maximum packet size for traffic transmitted by the src smaxsz

Maximum packet size for traffic transmitted by the dst dmaxsz

Bits per second load

Source bits per second sload

Destination bits per second dload

Packets per second rate

Source packets per second srate

Destination packets per second drate

The output from open Argus is customizable to most formats; CSV is used here. The source IP, destination IP, and port were not used during classification to enable better generalization to new networks.

Time-series split

The input to RPM needs to be formatted as a collection of time series. Thus, we must decide how to group the data and where to split the different samples. We decided to order all the data sequentially according to time and then split the data into 500-sample chunks with no overlap between any chunks. This number was chosen as it was the minimum to allow for patterns to be found. This results in having the maximum number of samples possible for testing and training and ensures that botnets are detected as soon as possible if deployed in a real-time system.

Data labeling

Because the ISCX botnet data is provided as pcap files, the data is unlabeled. ISCX provides a list of the malicious IPs seen in the network capture. The data is labeled so that if there is a bad connection in the sample chunk, then the entire chunk is labeled as malicious. If there is no bad connection in the chunk, then the chunk is labeled as benign.
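The chunking and labeling rules can be sketched together. This is an illustrative sketch rather than the authors' pipeline: the function name `chunk_and_label`, the representation of each flow as a dictionary keyed by the Argus field names `saddr` and `daddr` from Table 5, the interpretation of a "bad connection" as a flow with either endpoint in the malicious IP list, and the dropping of a trailing partial chunk are all assumptions.

```python
def chunk_and_label(flows, malicious_ips, chunk_size=500):
    """Split time-ordered flows into non-overlapping chunks and label each.

    `flows` is assumed to be sorted by time already; each flow is a dict
    with "saddr" and "daddr" keys (hypothetical record format).
    """
    labeled = []
    # Assumption: a trailing chunk with fewer than chunk_size flows is dropped.
    for start in range(0, len(flows) - chunk_size + 1, chunk_size):
        chunk = flows[start:start + chunk_size]
        is_malicious = any(
            flow["saddr"] in malicious_ips or flow["daddr"] in malicious_ips
            for flow in chunk
        )
        labeled.append((chunk, "malicious" if is_malicious else "benign"))
    return labeled


flows = [
    {"saddr": "10.0.0.1", "daddr": "10.0.0.2"},
    {"saddr": "10.0.0.2", "daddr": "10.0.0.3"},
    {"saddr": "10.0.0.1", "daddr": "10.0.0.9"},
    {"saddr": "10.0.0.4", "daddr": "10.0.0.5"},
    {"saddr": "10.0.0.6", "daddr": "10.0.0.7"},
]
labels = [label for _, label in chunk_and_label(flows, {"10.0.0.9"}, chunk_size=2)]
# labels == ["benign", "malicious"]
```

A single flagged flow is enough to mark the whole chunk, which mirrors the labeling rule described above.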

Feature engineering

Extra feature engineering was done on top of the initial feature selection from open Argus. The data was read into Python as a CSV, then min-max normalization and one-hot encoding were done using pandas and NumPy. Then, features with all NaN values were removed to shrink the dataset’s size, as these features provided no additional information for the model. Lastly, the data was exported as a CSV.

Test/train split

The testing and training datasets are manipulated in a subset of the experiments. If there is manipulation of the testing and training data in an experiment, the following is done. A random value between 0 and 1 is assigned to each sample chunk. If the value is greater than a user-defined threshold value, then that chunk is assigned to the training set. If it is less than the threshold, it is assigned to the testing set. This was only done when the original training and testing datasets were not used, and the original training dataset was split into new training and testing sets.
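The random assignment can be sketched as below. The function name `threshold_split`, the optional `seed` argument, and sending a value exactly equal to the threshold to the testing set are assumptions for this example; the source does not specify the random number generator used.

```python
import random


def threshold_split(chunks, threshold, seed=None):
    """Assign each chunk to train or test by comparing a random draw."""
    rng = random.Random(seed)  # hypothetical seeding for reproducibility
    train, test = [], []
    for chunk in chunks:
        value = rng.random()  # uniform draw in [0, 1)
        if value > threshold:
            train.append(chunk)
        else:
            # Assumption: a draw equal to the threshold goes to testing.
            test.append(chunk)
    return train, test


train, test = threshold_split(list(range(100)), threshold=0.3, seed=0)
```

With a threshold of 0.3, roughly 70% of the chunks are expected to land in the training set, which corresponds to the 70/30 split used later in the “Generalization” subsection.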

CSV to JSON

The final stage of the data preprocessing pipeline is a conversion from CSV to JSON. This is done for compatibility reasons to use our implementation of RPM. The conversion is done using pandas in Python and is provided along with the other source code on http://github.com/tph5595/.

Plane sweep algorithm analysis validation

The purpose of these experiments is to see if the plane-sweep algorithm from the “Plane-Sweep landscape generation” subsection performed better than the original approximation algorithm.⁸

Method

A grid search is done using the ISCX Botnet 2014 dataset on both algorithms to test this. The grid search is performed on the window size (5–40 with a step size of 5) and the max Rips radius (1–5 with a step size of 1). Experiments were run both keeping every birth-death pair and, with the persistence pair filtering step, keeping only the top $20^{2}$ pairs. This is done to determine what is providing the speedup. The runtime results can be found in Table 6. There is a speedup from both the persistence pair filtering step and the plane-sweep PL algorithm. Filtering out persistence pairs to this level did not affect the weighted F1 or weighted MCC at the 95% confidence level (Table 7), but did have a meaningful effect on the runtime (Table 6). Further experiments were done to see what percentage of birth-death pairs are needed before there is a significant drop-off in the model’s quality. With a 95% confidence level, there is no difference in the weighted F1 or MCC when only keeping the top PL versus keeping all the PLs when using the ISCX Botnet 2014 dataset. Also, with a 95% confidence level, the plane-sweep algorithm performed better than the algorithm from Bubenik et al.⁸
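The parameter grid described above contains $8 \times 5 = 40$ combinations. A minimal sketch of such a search follows; the callback `evaluate`, which is assumed to run the full pipeline for one window size and maximum Rips radius and return a score, is hypothetical and stands in for the experiment code.

```python
from itertools import product

WINDOW_SIZES = range(5, 45, 5)  # 5, 10, ..., 40
MAX_RADII = range(1, 6)         # 1, 2, ..., 5


def grid_search(evaluate):
    """Return (best_score, window_size, max_radius) over the grid.

    `evaluate(window_size, max_radius)` is a hypothetical callback that
    runs the pipeline once and returns a score where higher is better.
    """
    return max(
        (evaluate(w, r), w, r) for w, r in product(WINDOW_SIZES, MAX_RADII)
    )


# Toy scoring function, used here only to exercise the search.
best = grid_search(lambda w, r: -abs(w - 20) - abs(r - 4))
```

In the experiments, the same grid would be evaluated once per algorithm and once per filtering setting so that the source of any speedup can be isolated.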

Table 6.
Algorithm runtime (seconds) at the 95% confidence level.

Algorithm Mean Variance

PL Toolbox⁸ (approximation method) 758.2–866.2 20370.3–50050.9

fast_pl (exact) 700.2–709.7 143.7–353.2

fast_pl Filtered (approximation) 399.2–430.7 1739.1–4273.0

Table 7.
Mean weighted MCC and weighted F1.

Algorithm Weighted MCC Weighted F1

PL Toolbox⁸ (approximation method) 0.0452 0.023

fast_pl (exact) 0.504 0.093

fast_pl Filtered (approximation) 0.51 0.0999

Generalization

The initial findings on the ISCX Botnet 2014 dataset were underwhelming compared to the success achieved using TDA on other datasets.² The result was a weighted F1 of $0.595$ and a weighted MCC of $0.246$ when using the provided testing and training datasets. In order to determine why the model was performing poorly on this dataset compared to other models, such as Homayoun et al.,²⁸ the provided training dataset is split into testing (30%) and training (70%) sets. This helped diagnose the problem because it removes the variable of new botnet types in the original testing dataset, as can be seen in Table 8.

Table 8.
Max metrics on different datasets.

Original split Training split Test/train equal

Weighted MCC 0.246 0.727 0.993

Weighted F1 0.595 0.867 0.997

Training the model on this new dataset increased the weighted F1 to $0.867$ and the weighted MCC to $0.727$ . This confirms that the model was having trouble generalizing to new types of botnets. The problem is left to future work but could result from not having enough training data, which is known to affect performance, as shown by Postol et al.⁷ This shows that this TDA method can reduce the dimensionality of the data to a single dimension while keeping the data separable.

Experiments were conducted to quantify the effect of the loss of information caused by using TDA for time series dimensionality reduction for botnet classification. The model was trained and tested on the ISCX training dataset. The result was that the model was able to achieve a weighted F1 of 0.997 and a weighted MCC of 0.993 (only one misclassified example). This can be seen in Table 8, which shows TDA’s ability to reduce the dimensionality of the botnet time series data from 26 dimensions to one with minimal loss of information. This instills confidence that univariate time series analysis techniques can be used on multivariate data. Because univariate time series analysis is better understood than multivariate time series analysis, this opens the door for these techniques to be used on multivariate time series data. Further work should be done with TDA with other time series analysis models, techniques, and features to determine the best combination.

It should be noted that the best parameters for all the different experiments in this section were the same: a window size of 20 and a maximum Rips radius of 4. There is no a priori reason this should have been the case, as different training sets were created for each experiment. It is therefore reasonable to hypothesize that some attribute inherent to the network is causing this. This helps with training various TDA implementations on the same network, as it seems only a single search for the correct parameters might be needed.
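The parameter search described above can be pictured as an exhaustive grid search over the two parameters. The sketch below is a generic illustration. The candidate values, the function name `grid_search`, and `toy_score` are hypothetical. In practice the scoring function would train and evaluate the full pipeline for each parameter pair, which is not reproduced here.

```python
from itertools import product


def grid_search(score_fn, window_sizes, max_radii):
    """Return ((window, radius), score) for the highest-scoring pair."""
    best_score, best_params = float("-inf"), None
    for window, radius in product(window_sizes, max_radii):
        score = score_fn(window, radius)
        if score > best_score:
            best_score, best_params = score, (window, radius)
    return best_params, best_score


def toy_score(window, radius):
    """Stand-in scorer, NOT real results: built to peak at (20, 4)."""
    return -abs(window - 20) - abs(radius - 4)


best_params, best_score = grid_search(
    toy_score,
    window_sizes=[5, 10, 20, 40],  # hypothetical candidate window sizes
    max_radii=[1, 2, 4, 8],        # hypothetical candidate Rips radii
)
```

If the best pair really is a property of the network rather than of the training set, the result of one such search could be cached and reused across experiments on that network.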

Software details

The software used for the experiments in this article is written in Rust and is available on GitHub at http://github.com/tph5595/fast-pl. The project is also available as a pip package on the Python Package Index at https://pypi.org/project/fast-pl-py/. It can be installed using Python 3.8 and pip under the name fast-pl-py. This allows the presented algorithms to be easily imported and used with existing data analysis programs written in these languages. Additionally, examples that use the software in real-world settings are included at http://github.com/tph5595.

Conclusion

We have shown that PLs can be computed significantly faster than previously possible. This was done by gaining two independent insights on the geometric structure of the problem. First, a preprocessing step filters out birth-death pairs that cannot appear in the top-$k$ landscapes from the PL calculation. Second, all landscapes can be computed at the same time without the need for sorting. This was implemented using an output-sensitive landscape generation algorithm, further lowering the asymptotic time bound for PL generation. The approach has been validated on synthetic and real-world datasets and pipelines. As a result, PLs can be used on significantly larger datasets, which expands their use in general. Additionally, a modular pipeline for using PLs on network multivariate time series is presented and shown to be an effective dimensionality reduction technique. An implementation has been released to the public on GitHub in Rust and as a Python pip package.
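The filtering insight can be illustrated with a naive quadratic-time check. This is not the article's plane sweep or its filtering algorithm. It uses the standard definition of the landscape, where $\lambda_k(t)$ is the $k$-th largest value of $\max(0, \min(t - b, d - t))$ over all pairs $(b, d)$, together with one sufficient condition for removal: a pair whose interval is contained in the intervals of at least $k$ other pairs can never contribute to the top-$k$ landscapes. The function names are hypothetical.

```python
def tent(b, d, t):
    """Height at t of the tent function for birth-death pair (b, d)."""
    return max(0.0, min(t - b, d - t))


def landscape_value(pairs, k, t):
    """lambda_k(t): the k-th largest tent height at t (k is 1-indexed)."""
    heights = sorted((tent(b, d, t) for b, d in pairs), reverse=True)
    return heights[k - 1] if k <= len(heights) else 0.0


def filter_dominated(pairs, k):
    """Keep pairs whose interval lies inside fewer than k other intervals.

    Naive O(n^2) illustration. Identical pairs are tie-broken by index
    so that duplicates do not eliminate each other.
    """
    kept = []
    for i, (b, d) in enumerate(pairs):
        dominators = sum(
            1
            for j, (b2, d2) in enumerate(pairs)
            if j != i and b2 <= b and d <= d2 and ((b2, d2) != (b, d) or j < i)
        )
        if dominators < k:
            kept.append((b, d))
    return kept


pairs = [(0, 10), (1, 9), (2, 8), (3, 4), (11, 12)]
kept = filter_dominated(pairs, k=2)  # [(0, 10), (1, 9), (11, 12)]
```

In this toy input, (2, 8) and (3, 4) are each nested inside at least two other intervals, so dropping them leaves $\lambda_1$ and $\lambda_2$ unchanged at every $t$ while shrinking the input to the landscape computation.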

Future work should be done to optimize the code, and rewriting it in a lower-level language may be useful in some circumstances. Additionally, work still needs to be done with PLs to ensure that they can be used to their fullest potential in the face of noise and when integrated into complex processing pipelines like those seen in time series analysis.

Topology	c	a	r	d	Ambient	Noise
Torus	2	1	–	–	200	0.2
Swiss roll	–	–	4	–	200	0.2
d-Sphere	–	–	4	10	200	0.2
Infinity sign	–	–	–	–	–	0.2

Metric	PL Toolbox⁸	fast-pl 100%	fast-pl 50%	fast-pl 25%
Accuracy	1.0	1.0	1.0	1.0
Macro-weighted precision	1.0	1.0	1.0	1.0
Macro-weighted recall	1.0	1.0	1.0	1.0
F1-score	1.0	1.0	1.0	1.0

Topology	PL Toolbox⁸	fast-pl 100%	fast-pl 50%	fast-pl 25%
Torus	0.1130	0.0971	0.0255	0.0077
Swiss roll	0.1130	0.0971	0.0242	0.0084
d-Sphere	0.1136	0.0942	0.0252	0.0081
Infinity sign	0.1129	0.0990	0.0245	0.0084
Uniform	0.0350	0.0369	0.0149	0.0049

	HEDT	Server
Memory	48 GB	125 GB
CPU	Xeon Silver 4114	2x Xeon E5-2640v4
OS	Windows 10	VMware Sphere
User-level OS	Windows 10	Ubuntu 18.04LTS
Multiuser system	No	Yes

Feature	Argus argument
Source IP address	saddr
Destination IP address	daddr
Direction of transaction	dir
Protocol	proto
Record total duration	dur
Total transaction packet count	pkts
Packet count from source to destination	spkts
Packet count from destination to source	dpkts
Source interpacket arrival time (ms)	sintpkt
Source idle interpacket arrival time (ms)	sintpktidl
Destination interpacket arrival time (ms)	dintpkt
Destination idle interpacket arrival time (ms)	dintpktidl
Total transaction bytes	bytes
Source to destination transaction bytes	sbytes
Destination to source transaction bytes	dbytes
Source active interpacket arrival time (ms)	sintpktact
Destination active interpacket arrival time (ms)	dintpktact
Mean of the flow packet size transmitted by the src (initiator)	smeansz
Mean of the flow packet size transmitted by the dst (target)	dmeansz
Minimum packet size for traffic transmitted by the src	sminsz
Minimum packet size for traffic transmitted by the dst	dminsz
Maximum packet size for traffic transmitted by the src	smaxsz
Maximum packet size for traffic transmitted by the dst	dmaxsz
Bits per second	load
Source bits per second	sload
Destination bits per second	dload
Packets per second	rate
Source packets per second	srate
Destination packets per second	drate

Algorithm	Mean	Variance
PL Toolbox⁸ (approximation method)	758.2–866.2	20370.3–50050.9
fast_pl (exact)	700.2–709.7	143.7–353.2
fast_pl Filtered (approximation)	399.2–430.7	1739.1–4273.0

Algorithm	Weighted MCC	Weighted F1
PL Toolbox⁸ (approximation method)	0.0452	0.023
fast_pl (exact)	0.504	0.093
fast_pl Filtered (approximation)	0.51	0.0999


Footnotes

ORCID iD

Taylor Henderson

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Appendix

References

1. Gabrielsson, Nelson, Dwaraknath, et al. A topology layer for machine learning, 2020. https://proceedings.mlr.press/v108/gabrielsson20a.html.
2. Gidea, Katz. Topological data analysis of financial time series: landscapes of crashes. Phys A: Stat Mech Appl 2017; 491: 820–834.
3. Gidea, Goldsmith, Katz, et al. Topological recognition of critical transitions in time series of cryptocurrencies, 2020. https://doi.org/10.1016/j.physa.2019.123843.
4. Edelsbrunner, Letscher, Zomorodian. Topological persistence and simplification. Discrete Comput Geom 2002; 28: 511–533.
5. Zomorodian, Carlsson. Computing persistent homology. Discrete Comput Geom 2005; 33: 249–274.
6. Bubenik. Statistical topological data analysis using persistence landscapes. J Mach Learn Res 2015; 16: 77–102.
7. Postol, Diaz, Simon, et al. Time-series data analysis for classification of noisy and incomplete internet-of-things datasets. In: 2019 18th IEEE international conference on machine learning and applications (ICMLA). pp.1543–1550.
8. Bubenik, Dłotko. A persistence landscapes toolbox for topological statistics. J Symb Comput 2017; 78: 91–114.
9. Singh, Memoli, Carlsson. Topological methods for the analysis of high dimensional data sets and 3D object recognition. In: Botsch M, Pajarola R, Chen B et al. (eds.) Eurographics symposium on point-based graphics. The Eurographics Association. ISBN 978-3-905673-51-7. DOI: 10.2312/SPBG/SPBG07/091-100.
10. Kim, Zaheer, et al. Pllay: efficient topological layer based on persistence landscapes. In: Proceedings of the 34th international conference on neural information processing systems. NIPS'20, Red Hook, NY, USA: Curran Associates Inc. ISBN 9781713829546.
11. Bendich, Marron, Miller, et al. Persistent homology analysis of brain artery trees, 2016. DOI: 10.1214/15-AOAS886.
12. Liu, Jeng, Yang. Applying topological persistence in convolutional neural network for music audio signals. CoRR 2016; abs/1608.07373.
13. Lee, Barthel, Dłotko, et al. Quantifying similarity of pore-geometry in nanoporous materials. Nat Commun 2017; 8: 15396.
14. Stolz, Harrington, Porter. Persistent homology of time-dependent functional networks constructed from coupled time series. Chaos: Interdiscipl J Nonl Sci 2017; 27: 047410.
15. Adams, Emerson, Kirby, et al. Persistence images: a stable vector representation of persistent homology. J Mach Learn Res 2017; 18: 1–35.
16. Zeng, Graf, Hofer, et al. Topological attention for time series forecasting. In: Ranzato M, Beygelzimer A, Dauphin Y et al. (eds.) Advances in neural information processing systems, Vol. 34, pp.24871–24882. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2021/file/d062f3e278a1fbba2303ff5a22e8c75e-Paper.pdf.
17. Chen, Segovia, Gel. Z-gcnets: time zigzags at graph convolutional networks for time series forecasting. In: Meila M and Zhang T (eds.) Proceedings of the 38th international conference on machine learning, Proceedings of Machine Learning Research, volume 139, pp.1684–1694. PMLR. https://proceedings.mlr.press/v139/chen21o.html.
18. Wong, Vong. Persistent homology based graph convolution network for fine-grained 3D shape segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV). pp.7098–7107.
19. Zhou, Dong, Lin. Learning persistent homology of 3D point clouds. Comput Graph 2022; 102: 269–279.
20. Postol. Algebraic topology for data scientists, 2023. 2308.10825.
21. de Silva, Morozov, Vejdemo-Johansson. Dualities in persistent (co)homology. Inverse Probl 2011; 27: 124003.
22. Berg, Cheong, Kreveld, et al. Computational geometry: algorithms and applications. 3rd ed. Santa Clara, CA, USA: Springer-Verlag TELOS, 2008. ISBN 3540779736.
23. Saul, Tralie. Scikit-tda: topological data analysis for Python, 2019. https://doi.org/10.5281/zenodo.2533369.
24. Alauthman, Aslam, Al-kasassbeh, et al. An efficient reinforcement learning-based botnet detection approach. J Netw Comput Appl 2020; 150: 102479.
25. Otter, Porter, Tillmann, et al. A roadmap for the computation of persistent homology. EPJ Data Sci 2017; 6: 17.
26. Tralie, Saul, Bar-On. Ripser.py: a lean persistent homology library for python. J Open Source Softw 2018; 3: 925.
27. Biglar Beigi, Hadian Jazi, Stakhanova, et al. Towards effective feature selection in machine learning-based botnet detection approaches. In: 2014 IEEE conference on communications and network security. pp.247–255.
28. Homayoun, Ahmadzadeh, Hashemi, et al. BoTShark: a deep learning approach for botnet traffic detection. Cham: Springer International Publishing, 2018. ISBN 978-3-319-73951-9, pp. 137–153. DOI: 10.1007/978-3-319-73951-9_7.

Efficient persistence landscape generation
