Abstract
High performance computing (HPC) applications that work with redundant sequences of data can benefit from their deduplication. We study this problem in the context of the symmetry-adapted no-core shell model (SA-NCSM), where redundant sequences of different kinds naturally emerge in the representation of the basis of the Hilbert space physically relevant to a modeled nucleus. For a fast solution of this problem on multicore architectures, we propose and present three multithreaded algorithms, which employ either concurrent hash tables or parallel sorting methods. Furthermore, we present an evaluation and comparison of these algorithms based on experiments performed with real-world SA-NCSM calculations. The results indicate that the fastest option is to use a concurrent hash table, provided that it supports sequences of data as a type of table keys. If such a hash table is not available, the algorithm based on parallel sorting is a viable alternative.
1. Introduction
This work deals with the problem of deduplication of sequences of computationally processed data, which can be of any nature. Let len(X) denote the length of a sequence X. We say that two sequences X and Y are equal, and write X = Y, if they are both of the same length and all their corresponding elements (elements with the same index) are equal.
For the sake of consistency with algorithm pseudocode and computer code, we stick to zero-based indexing even in the mathematical notation throughout this text. Let us consider a programming problem of mapping indices 0 ≤ i ≤ n − 1 to variable-length sequences S_i, where n denotes the number of these sequences.

Example 1. Let n = 5, S_0 = (A, C, D), S_1 = (B, D), S_2 = (B, D), S_3 = (A), and S_4 = (A, C, D). Then, S_0 = S_4 and S_1 = S_2. The number of unique sequences is thus N = 3, and one possible way to enumerate them is U_0 = (A, C, D), U_1 = (B, D), and U_2 = (A).

Let X ⋅ Y denote the operation of concatenating the elements of two sequences X and Y. In Example 1, storing all n = 5 sequences in computer memory would require an array of length 11:

elems = S_0 ⋅ S_1 ⋅ S_2 ⋅ S_3 ⋅ S_4 = (A, C, D, B, D, B, D, A, A, C, D).

The mapping of indices to sequences can then be implemented by another array:

first = (0, 3, 5, 7, 8, 11).

Here,
1. first[i] is the index of the first element of S_i in elems,
2. first[n] equals the length of elems, and
3. the length of S_i can be derived as len(S_i) = first[i + 1] − first[i].

Consequently, S_i = (elems[first[i]], …, elems[first[i + 1] − 1]).

Storing the N = 3 unique sequences would require an array of length 6:

Elems = U_0 ⋅ U_1 ⋅ U_2 = (A, C, D, B, D, A).

Similarly to first, we can construct the corresponding array First for the unique sequences as follows:

First = (0, 3, 5, 6).

This array is now related to the unique sequences: U_I = (Elems[First[I]], …, Elems[First[I + 1] − 1]). Therefore, we additionally need to map the indices i of S_i to the indices I of U_I, which can be implemented with another array:

Map = (0, 1, 1, 2, 0), where Map[i] is the index I such that S_i = U_I.

Let m and M denote the total number of elements of the sequences S_i and of the unique sequences U_I, respectively. The lengths of the arrays elems and first, which represent the sequences S_i, are m and n + 1, respectively. The lengths of the arrays Elems, First, and Map, which represent the unique sequences U_I, are M, N + 1, and n, respectively. The memory requirements for storing the sequences S_i and the unique sequences U_I are thus O(m + n) and O(M + N + n), respectively. In our example, m + n = 16 and M + N + n = 14, so it does not seem that introducing an extra level of indirection with unique sequences (Map) would be of much benefit. Our work addresses applications where n and m are very large numbers and where N ≪ n and M ≪ m. In such applications, the memory savings can be significant; a minimal code sketch of this representation is given at the end of this section. However, reducing all the sequences S_i into a set of unique sequences U_I, that is, their deduplication, represents some application overhead. To reduce this overhead as much as possible, we need an efficient sequence deduplication algorithm. This work focuses on HPC applications and multicore runtime environments (such as supercomputer nodes), and proposes and evaluates three parallel multithreaded algorithms for deduplication of data sequences.

This article is an extended version of our conference paper Langr and Dytrych (2023), where we introduced a parallel multithreaded algorithm for identification of unique sequences based on parallel sorting. Here, we additionally
• present two alternative parallel algorithms based on concurrent hash tables,
• experimentally evaluate and compare all three proposed algorithms, and
• provide a more thorough analysis and discussion of all the algorithms in relation to the characteristics of input data.
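As referenced above, the introduced representation can be illustrated with a minimal C++ sketch populated with the data of Example 1 (char is our stand-in for the element type, which can be arbitrary):

```cpp
#include <cstddef>
#include <vector>

// Arrays from Example 1; 'char' stands in for an arbitrary element type.
std::vector<char>        elems = {'A','C','D','B','D','B','D','A','A','C','D'};
std::vector<std::size_t> first = {0, 3, 5, 7, 8, 11};  // first[n] == elems.size()

// The N = 3 unique sequences and the index mapping:
std::vector<char>        Elems = {'A','C','D','B','D','A'};
std::vector<std::size_t> First = {0, 3, 5, 6};
std::vector<std::size_t> Map   = {0, 1, 1, 2, 0};      // S_i == U_{Map[i]}

// Accessing S_i then amounts to reading the range
// Elems[First[Map[i]]] .. Elems[First[Map[i] + 1] - 1].
```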
2. Related work
The defined problem may be considered as a specific instance of the generic problem domain referred to as data deduplication. Data deduplication describes methods that are mostly used to reduce the amount of data stored in a storage system or transferred over a network by eliminating redundant parts; see, for example, Meister (2013); Xia et al. (2016); Zhang and Deng (2017). Our problem represents data deduplication for particular types of data (sequences). Problems of this type are relatively rarely addressed in the literature. We have found some relevant papers applying deduplication methods in order to compress the memory footprints of sparse matrices; see Karakasis et al. (2013); Kourtis et al. (2008); Willcock and Lumsdaine (2006). However, they focus on the data reduction results and acceleration of subsequent matrix operations. In contrast to our work, they do not propose any efficient parallel deduplication algorithms.
3. Application
Our need for fast parallel multithreaded deduplication of data sequences comes from our nuclear structure calculations that leverage the symmetry-adapted no-core shell model (SA-NCSM); see, for example, Barrett et al. (2013); Launey et al. (2016). With SA-NCSM calculations, we try to model atomic nuclei, that is, to obtain their nuclear wave functions. The SA-NCSM first forms a basis for the Hilbert space physically relevant to a given nucleus. Since this basis is infinite, only a finite subset of basis states/functions is taken into consideration. This subset is selected by a single even number, the so-called basis cutoff parameter Nmax (it represents the maximum allowed number of harmonic oscillator quanta above the minimum for a given nucleus). The higher Nmax, the more basis functions are selected. The subspace of the Hilbert space that is spanned by the selected basis functions is called a model space. For each SA-NCSM calculation, the model space is defined by three main input parameters: the number of protons, the number of neutrons, and Nmax.
Once the basis for a model space is formed, a Hamiltonian matrix is constructed such that its rows and columns are spanned by the basis functions. Finally, the resulting wave functions are obtained in the form of linear combinations of basis functions. The coefficients of these linear combinations are elements of the eigenvectors of the Hamiltonian matrix. The entire SA-NCSM workflow for a given nucleus consists of a series of calculations with increasing Nmax. This process is stopped when the eigenvalues converge for all the energy states in which one is interested.
Our implementation of the SA-NCSM is the software framework called LSU3shell; see Dytrych et al. (2016). It is written in the C++ programming language, and its parallelization is based on the hybrid MPI + OpenMP programming model, which makes it applicable to practically any contemporary large-scale supercomputer; see Message Passing Interface Forum (2021). The SA-NCSM employs a sophisticated mathematical background provided by group theory and representation theory, and the group of particular interest is the special unitary group SU(3); see Heine (1960).
Within LSU3shell, a basis is organized into so-called IpIn blocks. Each IpIn block is related to a particular distribution of protons and neutrons in the harmonic oscillator shells. A combination of two IpIn blocks forms a submatrix of a Hamiltonian matrix. Internally, an IpIn block consists of multiple SU(3) proton-neutron irreducible representations (PN irreps). Each PN irrep is mapped to a contiguous chunk of matrix rows/columns, whose number is given by the multiplicity factor of the PN irrep in the IpIn block. A particular combination of PN irreps thus forms a matrix block in the submatrix given by two IpIn blocks. For more details on the construction of the basis and its organization, see Langr et al. (2018, 2019).
The total number of PN irreps for all IpIn blocks is typically large, but the total number of distinct PN irreps is much smaller. LSU3shell, therefore, stores the information about PN irreps in a standalone database. Then, for each IpIn block, its PN irreps are represented by a sequence of PN irreps database indices (which are integer numbers).
To process submatrix blocks independently (for instance, to enable finer-level parallelization), the following information is required for each PN irrep of given IpIn blocks:
1. Its index in the PN irreps database.
2. Its multiplicity factor.
3. The first matrix row/column of the corresponding matrix block.
For each IpIn block, its PN irrep indices, their multiplicity factors, and the block rows/columns represent three sequences of data (all of them integer numbers); a hypothetical sketch of this per-block data is shown below. Fast and independent processing of submatrix blocks would require having these sequences stored in memory for all IpIn blocks. However, this is infeasible in practice, since the total number of elements of these sequences is typically too large. For illustration, the SA-NCSM basis for the 12C, Nmax = 12 model space has n = 73,676,583 IpIn blocks, and the total number of their PN irreps is m = 1,214,960,841. Storing the sequences of PN irrep indices, multiplicity factors, and block rows/columns for all IpIn blocks would thus require tens of gigabytes of memory.
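Conceptually, the per-block data could look as follows; this is purely an illustration with assumed names and integer types, not the actual LSU3shell data structures:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-IpIn-block sequences; names and types are illustrative.
struct IpInBlockSequences {
    std::vector<std::uint32_t> irrep_indices;   // II: PN irreps database indices
    std::vector<std::uint32_t> multiplicities;  // MF: multiplicity factors
    std::vector<std::uint64_t> first_rows;      // RCO: first rows/columns of blocks
};
```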
Counts of all and unique sequences and their elements for the 12C nucleus and the Nmax = 12 SA-NCSM basis.
The same analysis for multiple different model spaces produced similar results. This means that, in general, if we deduplicate the sequences, then we can store them in memory with affordable memory footprints in all three cases, that is, for the sequences of PN irrep indices (II), multiplicity factors (MF), and block row/column offsets (RCO). This conclusion brought us to the problem of how to quickly identify and remove duplicated sequences and reduce them into sets of sequences that are unique (i.e., deduplicate them). Generally, in SA-NCSM calculations, this problem is solved by each MPI process independently (for its mapped parts of the basis). Therefore, we focused on developing algorithms for fast and scalable parallel multithreaded deduplication of data sequences.
4. Algorithms
Let us start with a single-threaded (ST) solution of the sequence deduplication problem. Such a solution can be found relatively easily, provided that a hash table that can use variable-length sequences as table keys is available. In this table, we perform a lookup for each input sequence. If the lookup fails, a new unique sequence has been identified. Then, we need to:
1. Insert the sequence into the hash table and map it to the next available unique sequence index.
2. Append the position of the elements of the identified unique sequence to the array First.
3. Append the elements of the identified unique sequence to the array Elems.
Additionally, the index of the unique sequence (whether newly identified or found by the lookup) is mapped to the index of the input sequence via the array Map. The pseudocode of the described method is presented as Algorithm 1. In this pseudocode, as well as in the following ones, we assume that the output arrays are implicitly defined and are initially empty.
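The following C++ sketch illustrates Algorithm 1; it is our illustrative reimplementation, not LSU3shell code, and it assumes std::vector<int> sequences and the Boost-style hash combination described in Section 5:

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// Hash functor for sequence keys (Boost-style combination; see Section 5).
struct SeqHash {
    std::size_t operator()(const std::vector<int>& s) const {
        std::size_t h = 0;
        for (int e : s)
            h ^= std::hash<int>{}(e) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};

// Single-threaded deduplication: fills Elems, First, and Map as defined in
// the introduction. The output arrays are assumed to be initially empty.
void deduplicate_st(const std::vector<std::vector<int>>& S,
                    std::vector<int>& Elems,
                    std::vector<std::size_t>& First,
                    std::vector<std::size_t>& Map)
{
    std::unordered_map<std::vector<int>, std::size_t, SeqHash> table;
    First.push_back(0);
    Map.resize(S.size());
    for (std::size_t i = 0; i < S.size(); ++i) {
        // Lookup and insertion in one step; the mapped value is the next
        // available unique sequence index (table.size() before insertion).
        auto [it, inserted] = table.try_emplace(S[i], table.size());
        if (inserted) {  // a new unique sequence has been identified
            Elems.insert(Elems.end(), S[i].begin(), S[i].end());
            First.push_back(Elems.size());
        }
        Map[i] = it->second;
    }
}
```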
This algorithm cannot be directly parallelized by splitting the loop iterations between multiple threads. First, the loop is inherently sequential due to appending data to the Elems and First arrays, whose lengths are initially unknown. Moreover, the implementations of hash tables provided by the standard libraries of performance-aware programming languages (such as C++) generally do not support safe concurrent access from multiple threads.
Below, we present three parallel multithreaded algorithms for deduplication of data sequences. The first two are based on parallelization of Algorithm 1, where a concurrent hash table is employed. The third algorithm is an alternative that does not require a concurrent hash table but is based on parallel sorting.
4.1. Parallel algorithm MTSK
Our first parallel algorithm is based on a concurrent hash table that, similarly to the single-threaded solution, can use variable-length sequences as table keys; we call this algorithm MTSK (multithreaded; sequence keys). The pseudocode of this algorithm is presented as Algorithm 2.
The main loop is parallelized such that its iterations are split between all the threads. In contrast to the single-threaded algorithm, we now cannot perform lookup and insertion as two subsequent operations, since another thread could perform the insertion in between. Instead, we try to insert each input sequence and check whether this operation was successful. If it was, then a unique sequence has been identified, and we can evaluate its index by atomically incrementing a shared counter. This index is then set in the inserted table record. If the insertion failed, then we need to check whether the unique sequence index has already been set and, if not, wait for its update.
In the parallelized loop, we cannot set the content of the Elems array, since its length is initially unknown. This also implies that we cannot set the content of the First array. Instead, we construct both of these arrays after the main loop is finished. To find the values of the elements of the First array, an auxiliary array with the lengths of the unique sequences is built and the prefix sum operation is applied to it.
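A possible C++ realization of this algorithm is sketched below. We assume the tbb::concurrent_hash_map container from the oneTBB library (the concurrent hash table our measurements in Section 6 refer to); note that, in this particular sketch, the exclusive lock held by the accessor makes the explicit waiting for the index update unnecessary:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <vector>
#include <tbb/concurrent_hash_map.h>

// HashCompare policy for sequence keys (Boost-style hashing; see Section 5).
struct SeqHashCompare {
    static std::size_t hash(const std::vector<int>& s) {
        std::size_t h = 0;
        for (int e : s)
            h ^= std::hash<int>{}(e) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
    static bool equal(const std::vector<int>& a, const std::vector<int>& b) {
        return a == b;
    }
};

using CHT = tbb::concurrent_hash_map<std::vector<int>, std::size_t,
                                     SeqHashCompare>;

void deduplicate_mtsk(const std::vector<std::vector<int>>& S,
                      std::vector<int>& Elems,
                      std::vector<std::size_t>& First,
                      std::vector<std::size_t>& Map)
{
    CHT cht;
    std::atomic<std::size_t> counter{0};
    Map.resize(S.size());

    // Parallel main loop: each successful insertion identifies a unique
    // sequence, whose index is drawn from the shared atomic counter.
    #pragma omp parallel for
    for (std::size_t i = 0; i < S.size(); ++i) {
        CHT::accessor acc;              // holds an exclusive lock on the record
        if (cht.insert(acc, S[i]))      // true => newly inserted
            acc->second = counter.fetch_add(1);
        Map[i] = acc->second;
    }

    // Construct First by a prefix sum over the unique sequence lengths.
    const std::size_t N = counter.load();
    std::vector<std::size_t> len(N);
    for (const auto& rec : cht) len[rec.second] = rec.first.size();
    First.assign(N + 1, 0);
    for (std::size_t I = 0; I < N; ++I) First[I + 1] = First[I] + len[I];

    // Scatter the elements of each unique sequence into Elems.
    Elems.resize(First[N]);
    for (const auto& rec : cht)
        std::copy(rec.first.begin(), rec.first.end(),
                  Elems.begin() + First[rec.second]);
}
```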
4.2. Parallel algorithm MTIK
Our second parallel algorithm is also based on a concurrent hash table, but now we do not use sequences as table keys. Instead, we use the indices of the input sequences as keys and consider two keys i and j equal when S_i = S_j; we call this algorithm MTIK (multithreaded; index keys). The advantage of this algorithm is that it can be implemented even if no concurrent hash table that supports variable-length sequences as table keys is available. Instead, the hash table needs to support a user-provided operation for evaluating the equality of two keys (correspondingly, the hash of a key i must be computed from S_i, not from the index itself, so that equal keys have equal hashes).
The pseudocode of this algorithm is not presented, since it is almost the same as Algorithm 2. The main differences are as follows:
1. The concurrent hash table CHT is “indexed” by i instead of by S_i.
2. The records of CHT are i → I instead of U_I → I, and U_I is represented by S_i.
The downside of this algorithm is that it generally requires multiple accesses to the sequences S_i. This may be less efficient, especially in cases where these sequences are not stored in memory but generated on demand. In our SA-NCSM calculations, this is the case for the MF and RCO sequences.
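The index-keys trick can be sketched as follows, reusing SeqHashCompare from the previous sketch and again assuming oneTBB (including a oneTBB version whose constructor accepts a HashCompare instance):

```cpp
// Index keys: hashing and equality are delegated to the referenced sequences,
// so two indices compare equal exactly when their sequences are equal.
struct IndexHashCompare {
    const std::vector<std::vector<int>>* S = nullptr;  // shared input view
    std::size_t hash(std::size_t i) const {
        return SeqHashCompare::hash((*S)[i]);
    }
    bool equal(std::size_t i, std::size_t j) const {
        return (*S)[i] == (*S)[j];
    }
};

using CHTI = tbb::concurrent_hash_map<std::size_t, std::size_t,
                                      IndexHashCompare>;

// Usage mirrors MTSK: construct CHTI cht(IndexHashCompare{&S}); inside the
// parallel loop, cht.insert(acc, i) identifies unique sequences, and each
// U_I is afterwards materialized from the sequence S_i of the first
// successfully inserted index i.
```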
4.3. Parallel algorithm MTPS
In contrast to the first two parallel algorithms, our third parallel algorithm does not require a concurrent hash table. This algorithm was originally proposed in our conference paper, see Langr and Dytrych (2023), and we call it here MTPS (multithreaded; parallel sorting). It is based on calculating the hashes of all the input sequences S_i and their subsequent logical grouping with respect to the same hashes. This then enables independent processing of each group of sequences by a single thread, since equal sequences (having the same elements) belong to a single group. Moreover, if a suitable hash function is used, then all the sequences in the same group are likely to be equal.
The pseudocode of the proposed algorithm is presented as Algorithm 3, Algorithm 4, and Algorithm 5. First, an auxiliary array H is constructed, where each element contains information about a sequence index and its hash. Then, this array is sorted considering the hashes as sorting keys, which effectively places the information about sequences with the same hashes next to each other. Each part of H with the same hash defines a single group of sequences. In the next step, these groups are divided into T segments, where T denotes the number of threads. Now, each thread can operate independently on its segment. Since each unique sequence corresponds to a single group, each thread can process its local groups to identify its unique sequences. The information about them is encoded in the thread-local arrays Elems_j and First_j. When thread-local unique sequences are identified, the sequences within a group are compared for equality; this is necessary since two sequences with different elements may have the same hash. Information about the index mapping is stored directly in the Map array, but this mapping at first applies only to thread-local unique sequences. Finally, the thread-local unique sequences are merged into global unique sequences, that is, into the resulting arrays Elems and First. This requires the indices of the elements and the indices in Map to be incremented accordingly. The increments are computed as prefix sums (scans) on the lengths of the thread-local arrays Elems_j and First_j.
Almost all the steps of the algorithm are executed in parallel. There are a few single-threaded operations, but they are computationally trivial (the prefix sums are performed on T values only). The advantage of this algorithm is that it does not require a concurrent hash table. However, the price for this benefit is that the input sequences need to be accessed twice: once when their hashes are calculated and a second time when the unique sequences in groups are identified.
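A condensed C++ sketch of the idea follows. For brevity, the per-thread segmentation and the final merge via prefix sums (Algorithms 4 and 5) are folded into a single serial group scan, so only the hashing and the sorting are parallel here; hash_sequence is the Boost-style combiner shown in Section 5:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

void deduplicate_mtps(const std::vector<std::vector<int>>& S,
                      std::vector<int>& Elems,
                      std::vector<std::size_t>& First,
                      std::vector<std::size_t>& Map)
{
    // Step 1: hash all input sequences in parallel; H holds (hash, index).
    std::vector<std::pair<std::size_t, std::size_t>> H(S.size());
    #pragma omp parallel for
    for (std::size_t i = 0; i < S.size(); ++i)
        H[i] = {hash_sequence(S[i]), i};

    // Step 2: sort by hash (ideally with a parallel sort; see Section 5),
    // which places equal sequences next to each other.
    std::sort(H.begin(), H.end());

    // Step 3: scan each group of equal hashes; sequences are compared for
    // equality, since distinct sequences may collide on the same hash.
    Map.resize(S.size());
    First.push_back(0);
    std::size_t g = 0;
    while (g < H.size()) {
        std::size_t e = g;                 // find the end of the group
        while (e < H.size() && H[e].first == H[g].first) ++e;
        std::vector<std::size_t> reps;     // inputs already mapped in group
        for (std::size_t k = g; k < e; ++k) {
            const std::vector<int>& s = S[H[k].second];
            bool found = false;
            for (std::size_t r : reps)
                if (S[r] == s) {           // duplicates an identified sequence
                    Map[H[k].second] = Map[r];
                    found = true;
                    break;
                }
            if (!found) {                  // a new unique sequence
                Map[H[k].second] = First.size() - 1;
                Elems.insert(Elems.end(), s.begin(), s.end());
                First.push_back(Elems.size());
                reps.push_back(H[k].second);
            }
        }
        g = e;
    }
}
```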
5. Implementation issues
To evaluate all the proposed algorithms, we implemented them in C++ with OpenMP threading. This allowed us to use these implementations on the data generated in SA-NCSM calculations by the LSU3shell program. For the single-threaded algorithm (ST), we used the hash table provided by the C++ standard library in the form of the std::unordered_map container. For the algorithms based on a concurrent hash table (MTSK and MTIK), we used the implementation provided by the oneTBB library.
For the implementation of the parallel sort-based algorithm (MTPS), we chose the parallel multi-way mergesort, in particular, its OpenMP implementation provided by the libstdc++ parallel mode; see Singler and Kosnik (2008). Initially, we also tried a few variants of parallel quicksort. However, it turned out that all of them were significantly slower than the mentioned mergesort. We attribute this behavior to the fact that quicksort is generally much more data-sensitive and does not deal efficiently with cases where there are only relatively few distinct sorting keys in the sorted data.
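Invoking this mergesort explicitly could look as follows (a sketch; H is the array of (hash, index) pairs from the MTPS sketch above, and the code must be compiled with -fopenmp):

```cpp
#include <parallel/algorithm>  // libstdc++ parallel mode
#include <cstddef>
#include <utility>
#include <vector>

void sort_by_hash(std::vector<std::pair<std::size_t, std::size_t>>& H)
{
    // Select the multi-way mergesort variant via its dispatch tag;
    // (hash, index) pairs compare lexicographically, i.e., by hash first.
    __gnu_parallel::sort(H.begin(), H.end(),
                         __gnu_parallel::multiway_mergesort_tag());
}
```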
For the calculation of the hashes of the sequences S_i, we adopted the method provided by the C++ Boost library. This method combines the hashes of the individual sequence elements in the following way: starting with h = 0, each element e of a sequence updates its hash as

h ← h ⊕ (hash(e) + 0x9e3779b9 + (h ≪ 6) + (h ≫ 2)),

where hash(e) denotes the hash of the element e.
The ⊕, ≪, and ≫ operations denote the bitwise exclusive OR (XOR), the left bit shift, and the right bit shift, respectively. The hashes of the individual input sequence elements were calculated in our case by the C++ standard library std::hash function objects.
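In C++, this combination step can be written as follows (a sketch of the classic boost::hash_combine formula; the magic constant is its 32-bit variant):

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Boost-style combination of the element hashes of a sequence.
std::size_t hash_sequence(const std::vector<int>& s)
{
    std::size_t h = 0;
    for (int e : s)
        h ^= std::hash<int>{}(e) + 0x9e3779b9 + (h << 6) + (h >> 2);
    return h;
}
```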
6. Experiments and results
Relevant characteristics of the model spaces and their sequences used for experiments.
As a testbed for performance evaluation, we used nodes of the Karolina supercomputer operated by IT4Innovations in Ostrava, Czech Republic. A single Karolina node consisted of two AMD EPYC 7H12 CPUs, each having 64 computational cores. Each CPU was additionally configured to contain four NUMA nodes, each with 16 CPU cores.
On this system, we measured the sequence deduplication runtime for all the combinations of the following experiment parameters:
1. Different SA-NCSM model spaces: 10B, 12C, 12B, 16O, 18F, and 22Na.
2. Different types of sequences: II, MF, and RCO.
3. Different deduplication algorithms: ST, MTSK, MTIK, and MTPS.
4. Different numbers of threads mapped to specific parts of the computational node: 1 thread (T = 1) mapped to a single NUMA node, 16 threads (T = 16) mapped to a single NUMA node, 64 threads (T = 64) mapped to a single CPU (single socket), and 128 threads (T = 128) mapped to the entire computational node (both its CPUs/sockets).
Deduplication runtime for different SA-NCSM model spaces, different types of sequences, and different algorithms measured for the single NUMA node setup.

Figure 1. Normalized deduplication runtime for different SA-NCSM model spaces, different types of sequences, and different algorithms measured for the single NUMA node setup. (a) II, (b) MF, and (c) RCO.
Next, in Figure 2, we show the results of the same measurements performed for single-threaded runs of the algorithms (T = 1). These results allow us to evaluate the overhead of each parallel algorithm. Evidently, the MTSK algorithm provides runtime practically equivalent to the ST algorithm; sometimes, it is even faster. This indicates that the implementation of the concurrent hash table provided by the oneTBB library is very efficient. The single-threaded overhead of the MTIK and MTPS algorithms is much higher. Specifically, for the on-the-fly generated sequences (MF and RCO), the measured times are around twice those of the ST algorithm.

Figure 2. Normalized deduplication runtime for different SA-NCSM model spaces, different types of sequences, and different algorithms measured for single-threaded runs of the algorithms. (a) II, (b) MF, and (c) RCO.
Next, we evaluated the strong scalability (fixed problem size) of all the parallel algorithms with respect to different numbers of threads mapped to a single NUMA node, a single CPU (socket), and the entire computational node (both CPUs). The results obtained for the 10B, Nmax = 10 model space are shown on a logarithmic scale in Figure 3. These results indicate that the MTSK and MTIK algorithms scale well with the increasing number of threads, but the scalability of the MTPS algorithm is worse, particularly for the entire-node setup. This algorithm seems to be more memory-bound; therefore, it suffers in configurations where the number of cores greatly exceeds the number of memory controllers.

Figure 3. Normalized deduplication runtime for different numbers of threads, different types of sequences, and different algorithms measured for the 10B, Nmax = 10 model space. (a) II, (b) MF, and (c) RCO.
Finally, we compared the proposed parallel algorithms among themselves for different numbers of used threads. The results obtained for the 10B, Nmax = 10 model space are shown on a logarithmic scale in Figure 4. As already noted, the MTSK algorithm was the fastest in all cases. For the single NUMA node runs, the MTPS algorithm gave similar performance for the II sequences, and it was only slightly slower for the MF and RCO sequences. In the single-CPU cases, the MTIK and MTPS algorithms provided comparable runtimes, but for the entire-node cases, the MTPS algorithm was mostly the slowest.

Figure 4. Normalized deduplication runtime for different algorithms, different types of sequences, and different numbers of threads measured for the 10B, Nmax = 10 model space.
7. Discussion
Our measurements indicate that all the proposed parallel multithreaded algorithms considerably reduce the time needed to solve the sequence deduplication problem in SA-NCSM applications. The highest speedup was achieved by the MTSK algorithm, which appears to be generally superior. The only drawback of this algorithm is that it requires a concurrent hash table that can use variable-length sequences as table keys. In the case of the C++ programming language, such a hash table is not part of the standard library; therefore, the implementation of this algorithm needs some third-party solution.
In contrast, the MTPS algorithm is hash table-free. Instead, it requires a parallel sorting method, which, since the C++17 standard, is a part of the standard library itself; see ISO/IEC (2017). For some applications, this may be a significant advantage, since it may provide better portability of application code. The drawback of this algorithm is the dual access of each input sequence. First, the sequences are accessed in order of their increasing indices (when their hashes are calculated). However, in the second pass (when the unique sequences in groups are identified), the sequences are accessed in an irregular order. This second access is generally slower due to less efficient prefetching and the related lower utilization of caches. These effects are significant in setups where the number of used cores exceeds the number of memory controllers. The MTPS algorithm is thus best suited for single NUMA node runs, where its performance and efficiency are similar to those of the MTSK algorithm.
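For illustration, the C++17 way of sorting the (hash, index) pairs from the MTPS sketch could look as follows (whether the sort actually runs in parallel depends on the standard library implementation and its backend):

```cpp
#include <algorithm>
#include <cstddef>
#include <execution>   // C++17 parallel algorithms; see ISO/IEC (2017)
#include <utility>
#include <vector>

void sort_by_hash_std(std::vector<std::pair<std::size_t, std::size_t>>& H)
{
    // The parallel execution policy replaces the third-party parallel sort.
    std::sort(std::execution::par, H.begin(), H.end());
}
```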
Out of all the multithreaded algorithms, the MTIK algorithm was mostly the least efficient one in our measurements. Its only advantage in comparison with MTSK is that the unique sequences do not need to be stored in memory twice (once in the hash table and a second time in the resulting arrays). In our application, where the number of unique sequences was very low, this effect was practically negligible. However, in applications where the number of unique sequences (N) is not that low compared to the number of input sequences (n), the difference might be noticeable.
8. Conclusions
The contribution of this article is the proposal of three parallel multithreaded algorithms for the solution of the problem of deduplication of sequences of data. In addition, we presented an experimental evaluation and comparison of these algorithms using data from a real-world HPC application: nuclear structure calculations based on the SA-NCSM. In these calculations, redundant sequences of various kinds naturally emerge in the computer representation of model space bases. We demonstrated that, for multiple different model spaces, the proposed algorithms could significantly reduce the runtime required to deduplicate the sequences of basis data, which, in practice, resulted in significant memory savings for each MPI process. Finally, we compared the proposed algorithms and discussed their advantages and drawbacks, as well as the application use cases for which they are best suited.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Czech Science Foundation under Grant No. 22-14497S, by the Ministry of Education, Youth and Sports of the Czech Republic under Grant No. CZ.02.1.01/0.0/0.0/16_019/0000765, and by the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:90140).
Author biographies
Daniel Langr is a research assistant professor at the Department of Computer Systems, Faculty of Information Technology, Czech Technical University in Prague. His research interests include sparse matrix computations, parallel sorting algorithms, efficient implementations of parallel algorithms, and tuning codes for large supercomputers.
Tomáš Dytrych is a researcher at the Nuclear Physics Institute of the Czech Academy of Sciences. His research interests include the physics of atomic nuclei, computational group theory, and the application of modern programming techniques for high performance computing of nuclear structure.
