Abstract
Modern supercomputers feature an ever-increasing degree of parallelism, particularly in the number of cores per node. These high core counts motivate our flexible implementation of allreduce, designed specifically with shared-memory communication in mind. At a high level, our algorithm consists of a reduce_scatter stage followed by an allgather stage, or a reduce stage followed by a broadcast stage, and allows different factors (aka multi-radix) to be applied at each. The reduce and broadcast operations are also considered as standalone functions. Where barriers are required, they are integrated into the algorithm using counters to track progress. To accommodate the complexity of this approach, our implementation is split into a setup phase and an execution phase. The setup phase occurs only once for a given set of parameters and is responsible for determining the algorithm that will be run each time the allreduce is called within the execution phase. We present two interfaces: a persistent collective interface (an MPI 4.0 feature), which inherently aligns with our setup and execution phases, and a blocking interface, where the setup is performed the first time a specific algorithm is required. Using these methods, we achieve speedups of half an order of magnitude compared to the persistent and blocking allreduce implementations of MPICH and Open MPI on a dual-socket node with AMD EPYC processors, and almost an order of magnitude on a four-socket node with NVIDIA Grace-Hopper processors. Reductions are performed on vectors residing in both CPU and GPU memory. Our implementation also achieves good performance on multiple nodes. A standard benchmark of the application CP2K is sped up by 2.5%. Notably, for long messages, our implementation achieves the same performance as NCCL.
1. Introduction
As core counts per node continue to grow on both modern supercomputing architectures and personal computers, intra-node collective communication plays an increasing role in enabling efficient use of the available resources. In this contribution, the essential operations reduce, broadcast, reduce_scatter, allgather, and allreduce are optimised both for standalone use and for incorporation into inter-node allreduce communication. We describe a performant intra-node (shared memory) implementation of these operations, building upon our previous work on inter-node allreduce (Jocksch et al., 2021).
For the intra-node implementation, three algorithms are improved (cyclic shift, recursive exchange, and tree), and one is introduced: cyclic copy-in reduction. We utilise the first three with flexible factors (multi-radix) instead of the classical constant radix of two, and make further technical improvements to these three algorithms in our study. Algorithmic schemes and benchmarks are presented. The cyclic copy-in algorithm is presented for completeness, without benchmarks. We discuss and benchmark these algorithms mostly for the complex allreduce collective communication. The reduce_scatter and allgather collectives are part of the allreduce and are therefore not evaluated individually. The on-node reduce and broadcast are also considered as standalone operations.
The inter-node communication we present is also an extension of our previous work (Jocksch et al., 2021). In contrast to our earlier work which primarily elaborates on the cyclic shift algorithm, the focus of this contribution is the recursive exchange. An algorithmic scheme for non-trivial factors is presented. Again, only the complex allreduce algorithm itself is benchmarked, not its components.
The optimisations considered here are well suited for applications where communication patterns with exactly the same parameters are repeated often. Particularly, repeated calls of allreduce are a part of many applications, such as Neko (Karp et al., 2022) and distributed training for artificial intelligence. This suggests that there may be an advantage in performing a complex algorithmic setup once in the initialisation phase of the communication; the optimised algorithm may then be run as many times as needed, amortising the initial overhead. This approach is called persistent collective communication (Holmes et al., 2021); it became a feature of the Message Passing Interface (MPI) standard in version 4.0, and implementations are available in libraries such as MPICH (Gropp et al., 1999), Open MPI (Gabriel et al., 2004), and MPC (Bouhrour et al., 2022).
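As a reference for this pattern, a minimal sketch of the MPI 4.0 persistent interface is shown below; the buffer size and iteration count are illustrative.

#include <mpi.h>

/* Persistent allreduce: the setup cost of MPI_Allreduce_init is paid once;
   MPI_Start / MPI_Wait then execute the prepared operation repeatedly. */
int main(int argc, char **argv)
{
    double send[1024], recv[1024];
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Allreduce_init(send, recv, 1024, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);
    for (int iter = 0; iter < 1000; ++iter) {
        /* ... fill send[] ... */
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        /* ... use recv[] ... */
    }
    MPI_Request_free(&req);
    MPI_Finalize();
    return 0;
}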
Alternatively, for applications such as CP2K (Kühne et al., 2020), classical blocking collective communication is supported. When using this interface, the initialisation phase of the algorithms is deferred until the first call to the collectives, known as lazy initialisation. Successive calls to the blocking collectives execute the algorithms with minor modifications. The setup of the algorithms in memory is discarded when the MPI communicator is released. For further speedups, the initialised algorithms are cached in ‘wisdom’ files for future reuse.
Key features of our intra-node communication algorithms include the use of shared memory, a combination of reduction operations and synchronisation barriers (Mohamed El Maarouf et al., 2023), and the ability to use flexible factors, as already mentioned. Of the four algorithms discussed for the shared memory implementation (recursive exchange, cyclic shift, tree, and cyclic copy-in reduction), the recursive exchange and tree algorithms support non-commutative operations (i.e., operations for which the order of operands affects the result), while cyclic shift and cyclic copy-in reduction are used only for commutative operations. Allreduce based on recursive exchange and cyclic shift consists of a reduce_scatter phase followed by an allgather phase. There is also the option to apply recursive exchange and cyclic shift to allreduce recursively; the recursive allreduce is then performed after the reduce_scatter phase and before the allgather phase. In CPU memory, our approach to MPI parallelism is to use shared send and receive buffers made available through the Cross Partition Memory (XPMEM) library (Hashmi et al., 2018) for reductions. GPU memory is shared analogously using CUDA inter-process communication.
Our inter-node allreduce recursive exchange algorithm is implemented in three phases: reduce_scatter followed by recursive allreduce and allgather, as in the cyclic shift version originally introduced by Bruck and Ho (1993). In principle, the inter-node allreduce can be executed on top of reduce_scatter and allgather intra-node communication. Then the execution is: (1) intra-node reduce_scatter, (2) inter-node allreduce, and (3) intra-node allgather. For performance reasons, the inter-node allreduce is implemented alternatively on top of reduce and broadcast intra-node communication by using the tree algorithm for intra-node communication: (1) intra-node reduce, (2) inter-node allreduce, and (3) intra-node broadcast. Many modern architectures have multiple sockets per node or other non-uniform memory access (NUMA). This has been taken into account for communication algorithms in the literature and is also considered in this contribution, enabling our intra-node communication algorithms to work with multiple sockets without modification.
In the following, we describe the recursive exchange algorithm for both intra-node and inter-node communication, as well as the cyclic shift and tree algorithms for intra-node communication. We also discuss the interplay between intra-node and inter-node communication and the parametrisation of these algorithms. Finally, we explore our implementations on multiple platforms, including both multi-core CPU- and GPU-based architectures, and analyse their performance. This includes single- and multi-node benchmarks with comparisons to MPICH, Open MPI, and NCCL (NVIDIA, 2025), as well as an evaluation of our algorithms’ performance with the application CP2K using our optimised allreduce and reduce_scatter_block. For a description of the allreduce algorithms based on reduce_scatter and allgather, demonstrated with pure CPU benchmarks, see our contribution (Jocksch et al., 2024).
2. Newly optimised algorithms
This section introduces four algorithms designed to optimise collective communication operations, particularly for modern high-performance computing (HPC) architectures with high core counts and multi-socket nodes. The algorithms presented are tailored to leverage shared-memory communication, flexible factorisations, and synchronisation barriers for efficiency in both intra-node and inter-node communications.
2.1. Recursive exchange
The recursive exchange algorithm is a kind of butterfly algorithm suited for non-commutative operators, with flexible factors as a generalisation of the radix of two. It maps naturally onto multi-socket nodes if the tasks are placed such that minimal data transfer occurs between sockets. Figure 1 shows the data flow between the processes for eight MPI tasks. The order of the tasks is chosen for optimal performance for a possible application on a two-socket node, with tasks t0–t3 and t4–t7 on sockets zero and one, respectively. Our recursive exchange algorithm for allreduce consists of a reduce_scatter stage and an allgather stage. In our example, the reduce_scatter stage applies a factor of four in step one and a factor of two in step two. The allgather stage comprises steps three and four, with factors of two and four, respectively.
Figure 1: Recursive exchange on a node with eight tasks (t0–t7).
2.1.1. Intra-node communication
The recursive exchange algorithm consists of a reduce_scatter stage and an allgather stage, both of which are repeated for each factor in the factorisation of the number of tasks in the exchange. We found the optional allreduce stage between reduce_scatter and allgather not to be advantageous for intra-node communication. As in our inter-node allreduce implemented in Jocksch et al. (2021), the factors are flexible. Barriers are integrated into the algorithm as counters, as demonstrated for OpenMP in Mohamed El Maarouf et al. (2023), meaning that flags are set and read individually in a way that guarantees the data needed by the different steps of the algorithm is valid when processed. For an example of a standalone barrier, see Section 3.1.
Figure 2 shows the recursive exchange algorithm for the same parameters as chosen in Figure 1. The messages are first reduced (reduce_scatter) with a factor of four (meaning four tasks cooperate) and then with a factor of two, before being redistributed (via allgather) with factors of two and four.
Figure 2: Recursive exchange on a node with eight tasks: (a) the initial state; (b) after reduce_scatter with a factor of four; (c) after reduce_scatter with a factor of two; (d) after allgather with a factor of two; (e) the final state, after allgather with a factor of four.
The vector of the messages to be reduced is split into separate parts, indicated in Figure 2 as different rows with different colours. The parts do not need to have the same size. Currently, two message partitioning heuristics have been implemented: splitting the message into smaller messages of as equal size as possible, and using a fixed size (typically the cache line size) for the parts of the split message, except for the last pieces. For the latter heuristic, one part being non-zero and all other parts being zero is a possible solution; in this case, our algorithm becomes identical to a tree reduce and broadcast with a flexible number of leaves.
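The two heuristics can be sketched as follows; this is a simplified sketch in which the 64-byte cache line matches the benchmark setting of Section 5.1, and the function names are illustrative.

#include <stddef.h>

#define CACHE_LINE 64  /* typical part size in bytes */

/* Heuristic 1: split nbytes into nparts parts of as equal size as
   possible (the first nbytes % nparts parts receive one extra byte). */
static size_t equal_part(size_t nbytes, int nparts, int i)
{
    return nbytes / nparts + ((size_t)i < nbytes % nparts ? 1 : 0);
}

/* Heuristic 2: fixed, cache-line-sized parts; the last pieces carry the
   remainder and may be empty. */
static size_t fixed_part(size_t nbytes, int nparts, int i)
{
    size_t used = (size_t)i * CACHE_LINE;
    if (used >= nbytes) return 0;                       /* empty part */
    size_t rest = nbytes - used;
    return (i < nparts - 1 && rest > CACHE_LINE) ? CACHE_LINE : rest;
}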
If allreduce is called between reduce_scatter and allgather, the algorithm is applied recursively. Since recursive exchange is already recursive by itself, this seems to add no new algorithmic option. However, in our implementation, different message partitioning heuristics can be applied at different recursion levels.
The recursive exchange algorithm is straightforward to apply: the number of tasks is factorised into its prime factors, and these prime factors (or combinations of them) determine the size of the exchange at each step of the algorithm. If these factorisations do not deliver an efficient variant of the algorithm (e.g., if the number of tasks is a large prime number) and the cyclic shift algorithm (Section 2.2) cannot be applied, the MPICH / Open MPI implementation is called instead. Internally, both of these libraries follow the building-blocks algorithm of Rabenseifner (Thakur et al., 2005), which arranges the reductions into groups of tasks of size 2^n.
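The factorisation step itself is simple; a minimal sketch with an illustrative helper name:

/* Factorise the number of tasks into its prime factors, e.g. 8 -> {2, 2, 2}
   and 12 -> {2, 2, 3}; these factors (or combinations of them) then give
   the per-step radices. Returns the number of factors written to f. */
static int prime_factors(int n, int *f)
{
    int k = 0;
    for (int p = 2; p * p <= n; ++p)
        while (n % p == 0) { f[k++] = p; n /= p; }
    if (n > 1) f[k++] = n;  /* n itself prime: the inefficient case */
    return k;
}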
The speedup compared to the libraries mentioned is due not only to the integrated barriers but also to optimal cache usage for optimal factors (Section 2.1.3). As an example, take the allgather step, where the data originally owned by one process is distributed by this process to the memory of the other processes in multiple steps. When the data is written to the first other process’s memory, it is loaded into the cache and reused when written to further processes. This contrasts with a popular, simpler implementation known from the literature, in which every process collects all data from the other processes’ memory; cached values cannot then be reused.
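The difference between the two access patterns can be sketched as follows, where bufs[t] is assumed to be task t’s shared buffer and off / len hold the per-task block offsets and lengths; all names are illustrative.

#include <stddef.h>
#include <string.h>

/* Push variant (ours): the owner distributes its block; the source block
   stays in the owner's cache across iterations. */
void allgather_push(char **bufs, const size_t *off, const size_t *len,
                    int me, int ntasks)
{
    for (int p = 1; p < ntasks; ++p) {
        int dst = (me + p) % ntasks;
        memcpy(bufs[dst] + off[me], bufs[me] + off[me], len[me]);
    }
}

/* Pull variant (popular in the literature): every task collects all
   blocks; each source is read cold, so cached values are not reused. */
void allgather_pull(char **bufs, const size_t *off, const size_t *len,
                    int me, int ntasks)
{
    for (int p = 1; p < ntasks; ++p) {
        int src = (me + p) % ntasks;
        memcpy(bufs[me] + off[src], bufs[src] + off[src], len[src]);
    }
}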
2.1.2. Inter-node communication
For our implementation of inter-node communication, we assume a fully connected network with potentially multiple communication ports per node, or a network which behaves in approximately the same manner. Similar to our intra-node communication algorithms, our inter-node allreduce is implemented in three phases: a reduce_scatter, followed by a call to a nested allreduce, and then an allgather, as shown in Jocksch and Piccinali (2023a). Although we only evaluate benchmarks with multiple MPI processes per node (Section 5), we demonstrate recursive exchange with only one process per node for illustration purposes.
Figure 3 shows the recursive exchange’s algorithmic scheme for four nodes with one task per node, using a factorisation of {−2, 2, 2}.
Figure 3: Algorithmic scheme depicting allreduce on four nodes with one MPI task per node; left: buffers, middle: nodes n0 to n3, right: script for node n0.
In Figure 3, each message is two boxes in length. In each column, the top two boxes represent each task’s send buffer (sendbuf), the middle two boxes the receive buffer (recvbuf), and the bottom two boxes a temporary buffer (tempbuf). On the right of the figure, the scheme is shown as a script for node zero, which is very similar to the script generated internally by our implementation in the algorithm setup phase. Each line corresponds to its box in the n0 column (to its left). There are three steps in the inter-node exchange, each indicated by the keyword STEP. The keyword FRAC describes the memory chunk being processed: the first two lines 0–1 correspond to the send buffer, the second two lines 0–1 to the receive buffer, and the third two lines 0–1 to the temporary buffer. The keywords RECVFROM and SENDTO describe a non-blocking receive and send operation, respectively, with the following number being the partner’s MPI rank. For every step, sends and receives are scheduled and followed by a waitall. The step is finalised with a possible reduction operation, keyword REDUCEFROM. The numbers after this keyword specify from which lines the data is read: the first number is the line of the data before the reduction operand, and the second is the line of the data after the reduction operand. The lines are counted from zero to five within the data vector (box).
We reiterate the three steps of our algorithm and discuss the message length for a data vector, assuming the overall size is split into two equal-sized parts. The incomplete reduce_scatter step reduces the portion of the message managed by each process by a factor of two, corresponding to the factor {−2} chosen in the setup phase of the algorithm. The following allreduce step (factor {2}) is implemented as an allgather with reduction: the messages sent contain the unmodified portion and are reduced with redundant computation on the nodes. The final allgather step (factor {2}) then restores the message to its full size. Our implementation calls MPI_Irecv, MPI_Isend, and MPI_Waitall of the underlying MPI library.
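In simplified form, each STEP of the script executes the pattern sketched below; the descriptor structure, the fixed-size request array, and the hard-coded sum reduction are illustrative simplifications.

#include <mpi.h>

struct xfer { double *buf; int count; int partner; };  /* illustrative */

/* One STEP: post all non-blocking receives and sends, wait for completion
   (WAITALL), then apply the reduction where the script demands it
   (REDUCEFROM). Assumes nrx + ntx <= 64. */
static void run_step(const struct xfer *rx, int nrx,
                     const struct xfer *tx, int ntx,
                     double *acc, const double *in, int count, int reduce,
                     MPI_Comm comm)
{
    MPI_Request req[64];
    int r = 0;

    for (int i = 0; i < nrx; ++i)
        MPI_Irecv(rx[i].buf, rx[i].count, MPI_DOUBLE, rx[i].partner,
                  0, comm, &req[r++]);
    for (int i = 0; i < ntx; ++i)
        MPI_Isend(tx[i].buf, tx[i].count, MPI_DOUBLE, tx[i].partner,
                  0, comm, &req[r++]);
    MPI_Waitall(r, req, MPI_STATUSES_IGNORE);

    if (reduce)
        for (int i = 0; i < count; ++i)
            acc[i] += in[i];  /* REDUCEFROM with a sum operator */
}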
For multiple tasks per node, the algorithm is split between the intra- and inter-node algorithms. The resulting algorithm has the following steps: a shared memory reduce_scatter step (Section 2.1.1), followed by a reduce_scatter step between nodes, an allgather step between nodes, and finally an allgather step in shared memory. The buffers we indicated as sendbuf and recvbuf in Figure 3 become temporary buffers between intra- and inter-node steps, as the memory layout between shared memory steps and inter-node steps differs and requires the data to be rearranged. As a consequence, the cyclic shift of the message vector for each task, required by the cyclic shift algorithm elaborated in Bruck and Ho (1993) and Jocksch et al. (2021), can be done at the same time, and recursive exchange holds no advantage over cyclic shift in this respect. If a node has multiple sockets or non-uniform memory, inter-node communication is done in parallel between the sockets of the node, and the shared memory portions of the algorithm allocate one shared memory segment per socket.
2.1.3. Parametrisation
Table: Combinatorial number of factorisations for the reduce_scatter / allgather algorithm.
When trying to find optimal factor sets, additional restrictions apply. For intra-node communication, short messages should not be split into pieces smaller than the cache line size, as this would lead to cache-line invalidation when different tasks write into the buffers. Additionally, the size of the message produced at each step of the algorithm should not exceed the cache size, to avoid cache misses. The algorithmic complexity of the integrated barriers should also be considered (Section 2.1.1).
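These restrictions translate into a simple feasibility check on a candidate factor set; the sketch below ignores rounding, and the cache parameters are illustrative.

#include <stddef.h>

#define CACHE_LINE 64           /* bytes, illustrative */
#define CACHE_SIZE (32 * 1024)  /* bytes, illustrative */

/* Check an intra-node factor set for a reduce_scatter of msg bytes per
   task: each step must produce pieces of at least a cache line and
   messages that still fit into the cache. */
static int factors_feasible(const int *factors, int nfactors, size_t msg)
{
    size_t piece = msg;
    for (int i = 0; i < nfactors; ++i) {
        piece /= (size_t)factors[i];
        if (piece < CACHE_LINE) return 0;  /* false sharing of lines */
        if (piece > CACHE_SIZE) return 0;  /* step exceeds the cache */
    }
    return 1;
}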
For inter-node communication, the number of communication ports must be taken into account in the factorisation of the number of tasks, where each factor is the number of ports plus one. The network might behave such that the number of communication ports used efficiently depends on the message size. In general, the order of tasks matters for the overall execution time of the algorithm if the network is inhomogeneous with respect to bandwidth or latency.
2.1.4. In-place versus out-of-place algorithms
For intra-node communication, we utilise the same algorithms for out-of-place reductions as for the in-place versions described. The only difference is that the very first reduction operation copies the message from the send buffer to the receive buffer; otherwise, all read operations are performed on the send buffer and all write operations on the receive buffer. If the intra-node allreduce algorithms are used in combination with inter-node communication, the intra-node reduce_scatter and allgather phases remain the same as for pure intra-node communication. The only distinction is that the data of the receive buffer is copied to the temporary buffer for inter-node communication, rearranged, and copied back after the inter-node communication. For performance reasons, another algorithm might be used for inter-node communication (Section 2.3).
2.2. Cyclic shift
The cyclic shift algorithm only works for commutative operators, due to its cyclical movement of data between tasks. It has been implemented by Bruck and Ho (1993) for inter-node communication, and by Li et al. (2014) for intra-node communication in a variant called the dissemination algorithm. Our existing inter-node cyclic shift is described in Jocksch et al. (2021). We note that the recursive exchange algorithm for one MPI task per socket (node) requires fewer memory copies than the cyclic shift algorithm.
Figure 4 shows the data flow between the processes for the cyclic shift algorithm with seven MPI tasks.
Figure 4: Cyclic shift on a node with seven tasks (t0–t6).
2.2.1. Intra-node communication
Figure 5 shows the algorithmic scheme of our cyclic shift algorithm for intra-node communication. In contrast to the implementation of Bruck and Ho (1993) for inter-node communication, we do not apply any cyclic shift of the message vectors before and after the actual communication. This is possible since there is no strong need to coalesce messages: for transfer in memory, messages can be split into many pieces, as long as no piece is smaller than the cache line size. The key feature of the algorithm is that the product of the factors for reduce_scatter or allgather can be larger than the number of tasks, such as {2, 2, 2} for seven tasks, as represented in Figure 5. As a consequence, the number of rows is not halved or doubled in the first reduce_scatter step (a → b) or the last allgather step (f → g), respectively.
Figure 5: Cyclic shift on a node with seven tasks, all with factors of two: (a) the initial state; (b) 4-way reduce_scatter; (c) 2-way reduce_scatter; (d) after reduce_scatter steps; (e) 2-way allgather; (f) 4-way allgather; (g) the final state, after allgather steps.
To reiterate, the algorithm is implemented here as a reduce_scatter step followed by an allgather step, allowing for the factors to be flexible and the barriers integrated. When there is no data to reduce, our reduce_scatter or allgather operation becomes a dissemination barrier (Hoefler et al., 2006; Nanjegowda et al., 2009). Thus, the reduction operation is implemented similarly to its OpenMP counterpart (Mohamed El Maarouf et al., 2023; Speziale et al., 2011) or any shared memory reduction (Lubachevsky, 1990). The allgather operation is implemented as the reversal of the reduce_scatter operation, where each task distributes its message to all other tasks.
The recursive application feature of our allreduce implementation is applied for multi-socket nodes and small messages. For multi-socket nodes, cyclic reduce_scatter operations are performed independently on the sockets first, and then the reduce_scatter is executed cyclically between sockets. Allgather operations are then applied between sockets and on-socket as the third and fourth steps, respectively. For short messages, the outer cyclic shift uses fractions of the message size that are at least the size of a cache line. The inner cyclic shift is applied to the smallest messages and resolved before continuing with the outer part of the algorithm.
2.2.2. Parametrisation
The cyclic shift algorithm can be set up with many different factorisations. For eight tasks there are ten options: {2, 2, 2}, {2, 3, 2}, {2, 4}, {3, 2, 2}, {3, 3}, {4, 2}, {5, 2}, {6, 2}, {7, 2}, and {8}. They include the options for recursive exchange. For cyclic shift, the original order of the factor sets has to be reversed for the reduce_scatter step, but is used in forward order for allgather. For more examples of cyclic shift factors, see Table 1.
In contrast to recursive exchange, we do not see any obvious correlation between the number of tasks and the combinatorial number of algorithmic options. If we consider the allreduce operation as constructed from reduce_scatter and allgather without any allreduce part in the middle, the total number of options for the complete allreduce is the square of the numbers in Table 1, since reduce_scatter and allgather can be executed with different factors. The allreduce version (with a possible incomplete reduce_scatter, allreduce, and allgather), discussed for inter-node communication (Section 2.1.2), provides additional options.
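For illustration, the ten options for eight tasks can be reproduced by the enumeration sketched below. The rule it encodes is inferred from the listed example rather than taken from our implementation: every intermediate product stays below the number of tasks, and the final factor is the smallest one whose product reaches at least that number.

#include <stdio.h>

/* Enumerate cyclic shift factor sets for n tasks under the assumed rule;
   for n = 8 this prints exactly the ten options given in the text. */
static void enumerate(int n, int *set, int depth, int prod)
{
    set[depth] = (n + prod - 1) / prod;    /* minimal closing factor */
    printf("{");
    for (int i = 0; i <= depth; ++i)
        printf("%d%s", set[i], i < depth ? ", " : "");
    printf("}\n");

    for (int f = 2; prod * f < n; ++f) {   /* extend the prefix */
        set[depth] = f;
        enumerate(n, set, depth + 1, prod * f);
    }
}

int main(void)
{
    int set[32];
    enumerate(8, set, 0, 1);
    return 0;
}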
2.3. Tree algorithm
The recursive exchange algorithm with the data layout shown is good for intra-node communication, but necessitates an additional memory copy step before and after any inter-node communication. Therefore, we consider the on-node allreduce operation as a sequence of reduce and broadcast operations. The shared memory algorithm has the same number of computational operations. Figure 6 shows the alternative data layout, which sacrifices data locality but avoids the additional memory copy steps.
Figure 6: Tree reduction in shared memory; colours (and numbers 0–7) indicate the different MPI tasks, vertical data vectors of tasks t0–t7; a → b represents the first step, c → d represents the second step.
Figure 7: Last step of tree reduction in shared memory on two sockets per node; colours (and numbers 0–7) indicate the different MPI tasks, vertical data vectors of tasks t0–t7; for the first step see Figure 6.

2.4. Cyclic copy-in reduction
If a copy-in phase is required and the message size is large, a combined copy-in and reduction is efficient (Figure 8).
Figure 8: Copy-in (top) and reduction operations (middle and bottom); + denotes the reduction operator; colours (and numbers 0–3) indicate the different MPI tasks, horizontal data vectors. Adapted from Jocksch and Piccinali (2023b).
2.5. Other collective operations
The shared memory allreduce algorithms can be applied straightforwardly to other collective communications, such as reduce_scatter and allgatherv (the different phases of the allreduce algorithm). The banded structure of the data in Figures 2 and 5 requires a data copy operation to and from the reduce_scatter’s and allgatherv’s receive and send buffers, respectively. Furthermore, for the reduce_scatter operation, a temporary buffer of the size of the send buffer is required. In the special case of a radix equal to the number of tasks, the reduce_scatter algorithm does not require this temporary buffer. Instead, all data is copied and reduced in the receive buffer of every task directly. The broadcast and reduce collective communications can be obtained by replacing the reduce_scatter step with a scatter operation, and the allgather step with a gather operation, respectively. An alternative is the tree allreduce algorithm (Section 2.3) which consists of a reduce and a broadcast phase already.
3. Hardware mapping
The algorithmic schemes are mapped onto (potentially hierarchical) multi-socket hardware with the aim of facilitating maximum data locality. If required, reductions are performed on GPU memory, where the number of kernel invocations on the GPU is considered and minimised.
3.1. Barrier implementation
Every task’s shared memory segment is used to store the barrier flag counters, which coordinate and facilitate synchronisation. The implementation details of the dissemination barrier are highlighted below in Listing 1.
The array shmem contains pointers to the flags of the different tasks, where indices are assigned cyclically, so that index 0 is always the task’s own flag. This flag is incremented in every step of the barrier and is then compared with the flags of the other processes. The instructions memory_fence_store() and memory_fence_load() are memory barriers, where necessary for the parallel architecture.
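A sketch of such a counter-based dissemination barrier is given below, using the layout just described; the fence macros are illustrative stand-ins for memory_fence_store() and memory_fence_load().

/* shmem[d] points to the flag counter of the task at cyclic distance d,
   so shmem[0] is the calling task's own flag. */
#define memory_fence_store() __atomic_thread_fence(__ATOMIC_RELEASE)
#define memory_fence_load()  __atomic_thread_fence(__ATOMIC_ACQUIRE)

static void dissemination_barrier(volatile long **shmem, int ntasks)
{
    for (int dist = 1; dist < ntasks; dist *= 2) {
        long step = ++(*shmem[0]);   /* announce arrival at this round  */
        memory_fence_store();
        while (*shmem[dist] < step)  /* partner at cyclic distance dist */
            ;                        /* spin on the partner's counter   */
        memory_fence_load();
    }
}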
3.2. Translation of the algorithm
Table: Colour code of the assembler listings.
For illustration, we have chosen {−2, −2, 2, 2} as factors for the on-node allreduce with the reduce_scatter / allgather approach. Between nodes, the factors are also {−2, −2, 2, 2}. For our example, 1024 elements of eight bytes each were selected; more performant factorisations with more complicated code exist. The assembler code is written for task zero (i.e., node zero, and task zero on the node). The instructions MEMORY_FENCE_STORE and MEMORY_FENCE_LOAD are memory fences for parallel execution on multi-core nodes.
Serial memory copies and reductions are specified with SMEMCPY and SREDUCE, respectively, where SENDBUF, RECVBUF, and SHMEM are the send buffers, receive buffers, and temporary shared buffers of the node. The numbers in brackets are the task numbers of the memory, the numbers added are the offsets in the buffer, and the last number is the size of the memory operation. The barrier flag of the task itself is incremented with SET_NODE_BARRIER, while the test for the barrier flags of other tasks on the node is done with WAIT_NODE_BARRIER 1, where “1” stands for task one. MPI point-to-point communication with MPI_Isend, MPI_Irecv, and MPI_Waitall is represented by ISEND, IRECV, and WAITALL, where LOCMEM represents a local memory buffer used for the MPI ‘request’ handle. The REDUCE operations are reduction operations executed in parallel; the keyword indicates that other tasks on the node simultaneously execute reduction operations on neighbouring memory chunks. Therefore, explicit memory barriers SOCKET_BARRIER and SOCKET_BSMALL are required. In our case both do the same thing: a dissemination barrier on the whole node. Typically, the former acts between all tasks on the socket, and the latter on all tasks on the sockets which participate in parallel memory copies or reductions. The end of the execution is indicated with a RETURN statement. The assembler code is translated into bytecode, which is later executed.
Part of this bytecode reflects memory copy and reduction operations, the incrementing and reading of barrier flags, MPI point-to-point communication with MPI_Isend, MPI_Irecv, and MPI_Waitall, as well as CUDA kernel calls. For the kernel calls, the parameters are stored in GPU memory and passed as a pointer to the kernel, which is one single copy / reduce kernel in the whole implementation. Consecutive operations are fused into a batch of GPU operations for a single kernel invocation. Any barrier, send, or receive operation limits the fusion of copy and reduce operations.
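The batching can be sketched as one generic CUDA kernel that interprets a descriptor array resident in GPU memory; the descriptor layout and names are assumptions for illustration, and fusion is only valid for independent operations (Section 3.3).

#include <stddef.h>

struct op_desc {
    const double *src;
    double       *dst;
    size_t        n;
    int           reduce;  /* 0: copy, 1: dst += src */
};

/* One launch executes a whole fused batch; ops resides in GPU memory and
   is passed by pointer. Each operation is embarrassingly parallel, so no
   synchronisation between operations is needed within the batch. */
__global__ void batch_copy_reduce(const struct op_desc *ops, int nops)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (int k = 0; k < nops; ++k) {
        struct op_desc op = ops[k];
        for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
             i < op.n; i += stride) {
            if (op.reduce) op.dst[i] += op.src[i];
            else           op.dst[i]  = op.src[i];
        }
    }
}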
Thus, the assembler code for maximum performance on CPU or GPU differs not only in the factors for the algorithm and possible recursion or pipelining, but also in device-specific handling of the barrier flags. For the CPU, the check for barrier flags is done immediately before the corresponding memory copy or reduction operations. For the GPU, all flags are checked together (possibly with a dissemination algorithm) before a batch of operations is executed on the GPU. Listing 3 shows the assembler code of an on-node allreduce for task 0 with factors {−4, 4}. All memory copy and reduction operations are executed within one CUDA kernel. Memory fence operations are not necessary for the GPU allreduce.
3.3. Mapping on the GPU
In a typical application in an MPI library, the send and receive buffers of the different tasks are not visible to the other tasks by default. They can be made visible with libraries (e.g., XPMEM for the CPU) or the CUDA runtime for the GPU. However, the execution of these library functions introduces a certain time overhead; it is therefore more efficient to copy the data to / from a temporary buffer for small messages.
The memory copy, reduce, and send / receive operations can be executed on either the CPU or the GPU. The architecture supports CPU access to GPU memory and GPU access to CPU memory; we do not discuss this feature in depth here since it is particular to the Grace-Hopper architecture (Fusco et al., 2024). Both CPU and GPU execution have advantages: the CPU code has lower latency, while the GPU code has higher bandwidth. The CPU and GPU may be more physically separated or more closely integrated, depending on the architecture.
The translator tries to put consecutive memory copies and reductions into a single kernel call. If this is not possible, because our kernels need to be embarrassingly parallel with respect to CUDA threads, the fallback translation is separate kernels. Send and receive operations utilise the GPU-aware point-to-point communication of the underlying MPI library.
4. Persistent versus blocking interface
Our algorithms are part of a library that sits on top of any MPI 4.0 library and overwrites functions using the MPI profiling hook. Thus, for the persistent interface, the standard MPI_Allreduce_init call is overwritten with our fast implementation, with the original routine used as a fallback for unimplemented features. The initialisation of our persistent collective communication is relatively expensive (Section 5). To reduce the number of initialisation calls, an alternative blocking interface is provided.
The blocking interface is most suitable for applications which require different message sizes in every allreduce call, but always use the same processes. The algorithms are lazily initialised on their first call with their best-fit parameters, and are slightly adapted before execution within the MPI_Allreduce call. The message size is then a multiple of that of the algorithmic setup, plus padding. After initialisation and use, the algorithms are freed upon communicator destruction. To avoid part of the expensive setup when a simulation is restarted, we store all bytecodes to disc and read them back upon re-initialisation of the algorithm.
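The interception pattern can be sketched as follows; lookup_algorithm, setup_algorithm, and run_algorithm are hypothetical placeholders, while the PMPI fallback is the standard MPI profiling interface.

#include <mpi.h>

typedef struct algo algo_t;  /* opaque handle to a set-up algorithm */
algo_t *lookup_algorithm(int count, MPI_Datatype dt, MPI_Op op, MPI_Comm c);
algo_t *setup_algorithm(int count, MPI_Datatype dt, MPI_Op op, MPI_Comm c);
int     run_algorithm(algo_t *a, const void *sendbuf, void *recvbuf);

/* Shadow the blocking call: lazily set up (or load from a wisdom file) on
   first use, reuse the cached algorithm afterwards, and fall back to the
   underlying MPI library for unsupported cases. */
int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
{
    algo_t *a = lookup_algorithm(count, datatype, op, comm);
    if (a == NULL)
        a = setup_algorithm(count, datatype, op, comm);  /* lazy init */
    if (a == NULL)  /* unsupported parameters */
        return PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    return run_algorithm(a, sendbuf, recvbuf);
}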
For the intra-node communication algorithms introduced, the blocking interface is limited to message sizes that require no padding; otherwise, an additional memory copy is required when they are incorporated into inter-node communication, since padding is, in general, not possible for the send and receive buffers passed to the MPI_Allreduce function. For intra-node communication, the send and receive buffers of the different processes need to be made accessible to the other processes on the node. This is done with the XPMEM library. To be efficient, the pointers are cached using a tree data structure, as in the MPICH library.
5. Benchmarks
Benchmarks were performed on a HPE Cray EX system with dual-socket nodes with 64-core AMD EPYC 7003 processors, and on a HPE EX system with nodes equipped with four NVIDIA Grace-Hopper chips (72 CPU cores each). On both architectures, the MPI libraries MPICH 4.2.0 and Open MPI 5.0.2 were used for comparison. In all tests, algorithm parameters were chosen empirically.
We perform one-node allreduce measurements for potential best- and worst-case numbers of MPI tasks per node. The best case corresponds to the number of tasks equalling the number of cores per CPU or node, while the worst case is represented by the largest prime number fitting the number of cores. The choice of a prime number is also motivated by the case where one or more cores are occupied for other purposes.
We measure the performance of allreduce, reduce, and broadcast on one node, and of allreduce between nodes. Benchmarks on the GPU are also included for comparison with NCCL 2.20.3-1 with the AWS plugin. The benchmarks are concluded with an evaluation of our algorithms on the CP2K application.
5.1. Standalone communication
For all benchmarks, the messages were split inside the algorithm into equal parts, with a minimum size of 64 bytes (the cache line size). For messages of 64 bytes or less, the whole message was treated as one segment (Section 2.1.1). The pipelining and recursive call features are not exploited here, for a more direct comparison with other implementations. For maximum performance, a suitable and possibly expensive individual parameter selection step in the initialisation phase would be necessary.
Parameters of the allreduce algorithm for different hardware configurations.
Figure 9 shows the performance on the AMD EPYC processor.
Figure 9: Allreduce on 1 node with 128 MPI tasks per node (top) and 127 MPI tasks per node (bottom), AMD EPYC.
NVIDIA Grace CPU benchmarks are done on a single CPU (Figure 10), a single node (Figures 11 and 12), and on multiple nodes (Figure 13).
Figure 10: Allreduce on 1 NVIDIA Grace CPU, 72 tasks (top) and 71 tasks (bottom).
Figure 11: Allreduce on 1 node with 4 NVIDIA Grace CPUs, 288 tasks (top) and 283 tasks (bottom).
Figure 12: Allreduce on 1 node with 4 NVIDIA Grace CPUs, in-place operation, 288 tasks.
Figure 13: Allreduce on 4 nodes; top: 128 MPI tasks per node, AMD EPYC; bottom: 288 MPI tasks per node, NVIDIA Grace.



Figure 13 shows the performance comparison for communication on four nodes on both architectures, with 128 tasks per node on the AMD EPYC processor and 288 MPI tasks per node on the NVIDIA Grace processor. For both tests in Figure 13, the network used is Slingshot with libfabric 1.15.2.0. Additionally, we use the vendor communication library, HPE MPI (based on MPICH), instead of the Open MPI and MPICH of the earlier tests. The factorisation of the intra-node part of allreduce is identical to that of the one-node benchmarks; however, the inter-node part uses parameters for latency-optimal communication: allreduce based on allgather with a subsequent reduction.
On AMD EPYC and NVIDIA Grace, two and four sockets were considered, respectively. For each socket, the inter-node communication algorithm is set up with a factor of four. The blocking allreduce of HPE MPI (a persistent version is not present in this release) underperforms relative to our implementation on both architectures; an exception is very small messages, where our implementation is only equally fast. In principle, we achieve even higher performance for long messages with bandwidth-optimal intra-node communication, using a complete reduce_scatter followed by allgather, but the difference is marginal (not shown). Note that on NVIDIA Grace, the vendor’s MPI was used with CMA rather than XPMEM, due to current limitations.
Figure 14 shows the cost of the allreduce initialisation on the Grace-Hopper system.
Figure 14: Allreduce initialisation on 1 socket (72 tasks) and 1 node (288 tasks), NVIDIA Grace.
Figures 15–17 show benchmarks for reduce and broadcast on the different architectures.
Figure 15: Reduce (top) and bcast (bottom) on 1 node with 2 AMD EPYC CPUs, 128 tasks.
Figure 16: Reduce (top) and bcast (bottom) on 1 node with 4 Grace CPUs, 288 tasks.
Figure 17: Reduce (top) and bcast (bottom) on 1 socket with 1 Grace CPU, 72 tasks.


Figure 18 (top) shows the timings on the GPU part of a single Grace-Hopper node using one MPI task for each of the four GPUs.
Figure 18: Allreduce on 1 node, 4 Hopper GPUs, 4 tasks (top) and 16 tasks (bottom).
Figure 19: Allreduce on 4 nodes, 4 Hopper GPUs per node, 4 tasks per node (top) and 16 tasks per node (bottom); high uncertainty in the NCCL timings for message sizes from 10^3 bytes to 10^7 bytes.

5.2. Application
Algorithmic parameters of MPI_Allreduce communication in CP2K.
Algorithmic parameters of MPI_Reduce_scatter_block communication in CP2K.
Average performance of MPI collective communication in CP2K and overall runtime, ±values indicate the corrected sample standard deviation s for 15 samples.
Real applications will execute far more time steps than these short benchmark runs. Therefore, the relative time spent in the initialisation of the collective communication will be smaller, and moderately higher speedups can be expected. We note that the timings in Table 6 were measured for a restart run, using the aforementioned wisdom files. However, in this particular case with only 16 MPI tasks per node, the generation of wisdom files contributes only marginally to the timings.
6. Related work
A state-of-the-art summary and unification of collective communication algorithms is given by Träff et al. (2023). This work introduces a uniform framework for designing and analysing collective operations, with a particular focus on reductions. The authors derive these algorithms from a uniform, circulant-graph communication pattern.
One possible approach to MPI parallelism is to use shared memory segments allocated with shmget (Peng et al., 2023). This has the advantage that send and receive buffers do not need to be shared, but an additional copy operation is required to bring the data into a shared memory segment. Other approaches include providing shared memory buffers from the application (Li et al., 2014), or using Cross Memory Attach (CMA) (Vienne, 2014) to access the memory of other processes directly. A prototype of shared memory collectives has been implemented by Li et al. (2014). The established MPI libraries MPICH, MVAPICH, and Open MPI follow this approach for intra-node communication based on shared memory, utilising XPMEM.
Collective communication algorithms with a radix higher than two have been utilised recently (Fan et al., 2024; Wilkins et al., 2023), demonstrating their potential to reduce the number of communication steps and improve scalability in HPC systems. Earlier works have discussed algorithms with non-constant radix (Jocksch et al., 2021; Rüfenacht et al., 2017); that is, where the radix may vary dynamically based on message size, network topology, or hardware configuration.
Special cases of our generalised recursive exchange are the algorithms of Rüfenacht et al. (2017) for short messages and Thakur et al. (2005) for long messages. The former introduces a method for recursive multiplying that achieves a reduced execution time when compared to recursive doubling. The latter reference discusses optimisation techniques tailored specifically to long messages and situations with a number of processes not equal to a power of two.
We optimise allreduce with high data locality in a similar manner to Loch and Koslovski (2021). The referenced work introduces the Stripe Parallel Binomial Trees (Sparbit) algorithm. Sparbit achieves reduced latency and improved bandwidth utilisation by maintaining a more localised communication pattern.
Intra-node allreduce on GPUs has been implemented by Faraji and Afsahi (2014). The authors made use of CUDA inter-process communication for efficient data exchange. They propose two design approaches: a GPU-aware binomial reduce-broadcast algorithm, and a GPU shared-buffer aware design.
The idea of autotuning MPI collectives is well established in the literature. For an automated selection of MPI algorithms, see the contribution of Hunold and Carpen-Amarie (2018) and references therein. In that contribution, machine learning techniques were used to find the best algorithmic solution, and the default configurations of Intel MPI and Open MPI were outperformed.
Today’s supercomputer architectures allow for alternative implementations of collective reductions. For short-message allreduce performed on the network routers, see De Sensi et al. (2021). Their approach has the advantage of shorter paths: from the compute node to the network router and back only, instead of from compute node to compute node and back.
In the context of distributed deep learning, sparse data vectors to be reduced are a common occurrence. Several authors, such as Li and Hoefler (2022), developed allreduce algorithms combined with data compression to exploit the sparsity. The particular contribution of Li and Hoefler (2022) matches the performance of established dense allreduce while achieving favourable speedups compared to known algorithms for sparse data. An alternative approach to the significant use case of sparse data, including sparse matrices, is described in Geyko et al. (2024), which introduces algorithms for dynamic, locality-aware data exchange via an MPI extension library. Their results demonstrate strong scalability and high performance.
There are preexisting solutions to allreduce with flexible parameters. A hierarchical application of reduction algorithms has been investigated by Hasanov and Lastovetsky (2017), which can be interpreted as making the algorithmic parameters flexible. They obtained relevant speedups for short and medium-sized messages, even with only two hierarchy levels of known reduction algorithms.
While our contribution discusses only the frequently used intra-group communication, other relevant communication patterns exist. The general case of inter-group allgather communication has been considered by Kang et al. (2019), where data from m MPI tasks is redistributed to n tasks with the algorithms introduced.
The case of multiple sockets per node is included in this contribution. From a more theoretical point of view, these multiple sockets are multiple parallel network lanes. Such multiple lanes also exist in hardware on some computer architectures. Algorithmic developments for different collective communication patterns on multiple parallel network lanes have been made by Träff and Hunold (2020) and Tran et al. (2024).
There are other communication libraries outside of the MPI standard, such as NCCL, a commercial non-MPI communication library for allreduce and related collective patterns on the GPU. This library utilises the ring algorithm and is therefore particularly suitable for long messages.
7. Conclusions
We have introduced an implementation of persistent allreduce that utilises the persistent interface’s design to effectively separate the derivation of the complex factorisations required for efficient allreduce from the allreduce operation itself. It allows for the application of more complex algorithms than those employed in pure execution without initialisation, as found in many implementations of the blocking communication interface. This can be highly performant in applications which use the same allreduce repeatedly. We also provide a blocking communication interface with partial initialisation of the algorithms; at execution time, the algorithms are only modified slightly. Our contribution not only demonstrates the benefits of these approaches, partially known from the literature for inter-node communication, but also shows their usefulness for shared-memory communication. For the algorithm selection at initialisation, it is desirable to know the number of subsequent calls, since the initialisation time may not be negligible and can differ significantly between algorithms.
Additionally, the recursive exchange and cyclic shift algorithms previously introduced for inter-node communication in the literature were extended to shared memory intra-node communication and mixed intra- and inter-node communication, including support for multi-socket nodes. The tree algorithm has been utilised for the on-node component of inter-node allreduce, and additionally for intra-node broadcast and reduce. The pure intra-node implementation outperforms recent versions of MPICH and Open MPI by up to almost an order of magnitude for medium-size and large messages on AMD EPYC and NVIDIA Grace-Hopper processors, for both CPU and GPU. For combined intra-node and inter-node communication, we obtain similar speedups.
The comparison of the GPU timings with the commercial NCCL library shows comparable performance. It should be noted that the inter-node communication algorithms of the libraries do not coincide, at least not in their formal descriptions: the NCCL library uses the ring algorithm, while our implementation does not.
The reduce_scatter_block collective is covered by the algorithms for allreduce, and is applied together with allreduce to the application CP2K. When our blocking interface is applied to CP2K, it shows an overall speedup of 2.5% for a standard benchmark.
8. Future work and opportunities
The current heuristic relies on empirical parameter selection. Continuing research efforts seek to replicate or improve upon the results of this parameter selection with a machine-learning-guided heuristic to aid in extensibility and performance portability, complementing the compiled approach of the algorithms presented here.
This work may be meaningfully extended to optimise for sparse data, which is common in many deep learning applications. This may be accomplished, for instance, by utilising data compression, such as that found within the contribution from Li and Hoefler (2022), or with novel locality-aware algorithms, such as those found in Geyko et al. (2024).
Acknowledgements
Benchmarks were performed on the Alps supercomputer of CSCS. The authors would like to thank Jean-Guillaume Piccinali, Augustin Bussy, and Sebastian Keller (CSCS) for helpful discussions. This contribution is a thoroughly expanded version of the authors’ article “Flexible algorithms for persistent MPI allreduce communication”, first published in Lecture Notes in Computer Science (LNCS), Vol. 15579, 273–286, 2024, by Springer Nature.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was performed with partial support from the National Science Foundation under Grants Nos. 2405142 and 2412182. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. Partial support from the Tennessee Technological University’s Department of Computer Science, CEROC, and the College of Engineering is acknowledged.
Author biographies
Andreas Jocksch is a senior research software engineer at the Swiss National Supercomputing Centre (CSCS). His past areas of activity include parallel computing and fluid dynamics.
C. Nicole Avans is a research assistant at Tennessee Technological University and an intern at Sandia National Laboratories.
Riley Shipley is a HPC researcher and lecturer at Tennessee Tech whose work includes multiple MPI profiling tools and methods as well as in-depth study on accelerator-centred communication models.
Anthony Skjellum (Dr. Anthony “Tony” Skjellum) studied at Caltech (BS, MS, PhD). His PhD work emphasised portable, parallel software for large-scale dynamic simulation, with a specific emphasis on message-passing systems, parallel nonlinear and linear solvers, and massive parallelism. From 1990 to 1993, he was a computer scientist at LLNL focusing on performance-portable message passing and portable parallel math libraries. From 1993 to 2003, he was on the faculty in Computer Science at Mississippi State University, where his group co-invented the MPICH implementation of the Message Passing Interface (MPI) together with colleagues at Argonne National Laboratory. He has also been active in the MPI Forum since its inception in 1992. From 2003 to 2013, he was professor and chair at the University of Alabama at Birmingham, Dept. of Computer and Information Sciences. In 2014, he joined Auburn University as Lead Cyber Scientist and led R&D in cyber and high-performance computing for over 3 years. From August 2017 to March 2023, he was at the University of Tennessee at Chattanooga as Professor of Computer Science, Chair of Excellence, and Director, SimCenter, where he continues work in HPC (emphasising MPI, scalable libraries, and heterogeneous computing). In August 2023, he joined the Dept. of Computer Science at Tennessee Technological University, and is the prospective director of the soon-to-be-approved ASCEND Center for High Performance Computing, Extreme Networks & Data at TN Tech. He is also co-PI of the DOE/NNSA PSAAP III center “Center for Understandable, Performant Exascale Communication Systems” led by the University of New Mexico. He is a senior member of ACM and IEEE, and a member of ASEE and AIChE. He has over 17,200 citations on Google Scholar, with an h-index of 40 and an i10-index of 140.
