Abstract
Many physics simulations today rely on computing particle interactions, where particles can represent point masses (Molecular Dynamics), rigid bodies (Discrete Element Method) or even massive bodies such as planets. Computing the required particle interactions in a simulation is computationally expensive, hence suitable algorithms and proper optimizations that exploit the parallelism available in the target hardware are important to reach good performance. However, it is difficult to maintain flexible implementations while keeping state-of-the-art performance, as most packages are developed individually and have their own hard-coded, fine-tuned implementations. To combine flexibility and optimal performance, we introduce P4IRS, an intermediate representation and compiler for particle simulations that aims at generating high-performing code. We describe P4IRS and its features, provide usage examples for the MD and DEM fields, and discuss the benefits obtained with code generation. Finally, we evaluate the performance and scalability of the code generated by P4IRS on modern processors, accelerators and supercomputers.
Introduction
The computation of particle-pair interactions is widely used in physics simulations, especially in the fields of Molecular Dynamics and Discrete Element Method. To efficiently compute the inter-particle contributions, different strategies can be used to increase efficiency and exploit the parallelism offered by the target hardware. Most simulation packages, however, have their own hard-coded implementations, which are tightly coupled to their data structures and programming artifacts. Moreover, several kernel implementations of the same potential or force field often coexist in a single package to account for different hardware and optimization strategies, leading to higher maintenance costs and higher portability efforts.
To tackle these limitations, we developed P4IRS (Parallel and Performance-Portable Particles Intermediate Representation and Simulator), an intermediate representation and compiler tool where particle interactions can be simply expressed as Python methods, and efficient, parallel code for the interaction description can be generated for multi-CPU and multi-GPU targets. P4IRS provides its own intermediate representation, which allows it to perform compiler analyses and transformations, and then generate code for the appropriate backend, such as C++ or CUDA with OpenMP/MPI.
P4IRS can generate several kernel flavours of the same potential, such as Linked Cells (Quentrec B and Brot C, 1973) and Verlet Lists (Verlet L, 1967), where data layouts for properties and specific arrays (such as the neighbor-lists) can be switched to the best-performing variant. A runtime interface in C++ is also available and can be used to integrate P4IRS with other simulation tools, which is particularly important for integrating P4IRS simulations with our multi-physics framework waLBerla (Godenschwager C et al., 2013; Schornbaum F and Rüde U, 2016).
Most of the P4IRS internal routines such as building Linked Cells and Verlet Lists are implemented in Python by constructing the intermediate representation for them with the assistance of operator overloading (see Backend and Code Generation section). In this way, we leverage P4IRS analysis and transformations to achieve performance portability for these routines as well, without requiring multiple tailored implementations.
The goals of this paper are summarized as follows:
• We present our code generator tool P4IRS, describing how it works and how it can be used to generate efficient particle simulation codes.
• We discuss the advantages of using P4IRS to achieve performance portability, also comparing it with previous approaches such as tinyMD (Machado RRL et al., 2021) and MESA-PD (Eibl and Rüde, 2018; Eibl and Rüde, 2019a).
• We describe the P4IRS interfaces that allow its generated code to be integrated with other simulation tools, specifically in the context of domain partitioning and communication.
• We provide performance results on Intel Ice Lake and AMD Milan CPUs, as well as NVIDIA A40 and A100 accelerators, to demonstrate the state-of-the-art performance of the applications generated by P4IRS in comparison to MD-Bench and MESA-PD.
• We present weak-scaling results on the Fritz and JUWELS Booster clusters to demonstrate the weak-scaling capabilities of P4IRS in MD and DEM simulations on modern supercomputers.
This paper is structured as follows: In the Related Work section we list related work for MD and DEM simulation packages, code generation tools for particle simulations and performance evaluation of such simulations. In the Background and Theory section we explain the basic theory for particle simulations focusing on the MD and DEM methods, also covering optimization strategies and choices. In the P4IRS section we present P4IRS with its current features and application design. In the Evaluation section we evaluate the performance and scalability of P4IRS applications using MD and DEM experiments. Finally, the Conclusion and Outlook section presents our conclusions and outlook for P4IRS, also covering our future work.
Related work
Simulation packages
Diverse simulation fields can be described by updating particle trajectories based on their interactions with other particles. In this paper, we focus mainly on MD and DEM, where particles represent either point masses (i.e. atoms) or rigid bodies with specific shapes (i.e. spheres, half-spaces, ellipses), respectively. Most available packages use similar techniques to compute trajectories, but each has its own specific algorithms, parallelism strategies and hand-tuned implementations for specific hardware. With P4IRS, we focus on using a single description of the problem, and then leverage its implemented routines and code transformations to produce the backend implementation for the specific target, which currently can be C++ with OpenMP and CUDA, also supporting MPI for distributed-memory parallel clusters.
In MD, there exist many packages with distinct strategies and simulation fields. LAMMPS (Plimpton S, 1995; Brown W et al., 2012) is a molecular package for the material-modeling field which uses the standard Verlet Lists algorithm to optimize the non-bonded short-range force calculations. It offers many inter-atomic potentials, each with different kernel implementations for a specific target such as CPU, GPU or Kokkos (Carter Edwards H et al., 2014). GROMACS (Páll S and Hess B, 2013; Páll S et al., 2015) is another MD package, with a focus on life sciences simulations, which uses its own Cluster Pair algorithm to optimize the computation of short-range forces. It does not support as many potentials as LAMMPS, but its algorithm tends to be significantly faster than the Verlet Lists since it employs SIMD parallelism more efficiently. Furthermore, the optimal GROMACS kernel implementations rely on SIMD intrinsics to attain higher performance, which makes them harder to maintain and port to other architectures. Also, the GPU algorithm in GROMACS differs from the CPU one, since it relies on a strategy named “super-clustering” to reduce the amount of redundant computation caused by adjusting the cluster sizes to fit the SIMD width on the GPU. Other MD packages such as HOOMD-blue (Anderson JA et al., 2020), NAMD (Phillips JC et al., 2005), AMBER (Case DA et al., 2023) and CHARMM (Brooks BR et al., 2009) also contain dedicated, hand-tuned implementations in general-purpose languages for each supported backend. Each version is tested, maintained, debugged and optimized independently.
Packages that support DEM simulations are in general more flexible, since simulations require more particle properties such as rotations and shapes. However, they usually do not employ the same optimization techniques as MD simulation packages, most likely because combining this flexibility with sophisticated optimization algorithms becomes infeasible as the complexity of the code grows considerably. Examples of such packages are LIGGGHTS (Kloss C et al., 2012) (based on LAMMPS), MercuryDPM (Weinhart T et al., 2020) and YADE (Šmilauer et al., 2021). To mitigate these limitations, a few code generation approaches have been developed for both MD and DEM.
Code generation frameworks
Solutions based on code generation are also available for specific fields of particle simulations; these aim at providing a higher level of flexibility and extensibility while striving for efficient performance on modern accelerators.
OpenMM (Eastman P et al., 2017) is an MD package that uses Python as a front-end for simulation protocols and file I/O; it contains a C++ API for forces and integrators, as well as optimized kernel implementations for CUDA, C++ and OpenCL. Integrators and forces can be implemented through mathematical descriptions in OpenMM, but its use cases are restricted to MD simulations, and several features required for a DEM simulation or other types of particle simulations (such as contact history and different shapes) are not available.
MESA-PD (Eibl and Rüde, 2019) is a general particle dynamics framework built into the waLBerla framework for the simulation of particles and fluids. Its design principle of separating code and data makes it easy to introduce arbitrary interaction models, which also makes it useful for MD. Through its code generation approach based on Jinja templates, it can be adapted very effectively to any simulation scenario. As the successor of the “physics engine” (pe) rigid particle dynamics framework, it inherits its scalability and advanced load balancing functionality. One limitation of MESA-PD is that it does not support GPU code generation; moreover, since its code generation approach is based on Jinja templates, the effort of implementing new functionality may not be lower than in simulation packages without code generation techniques (for instance, most of its force kernels are not generated but rather implemented directly in C++).
In previous work we developed tinyMD, a proxy-app (based on miniMD) created to evaluate the portability of MD and DEM applications with the AnyDSL (Leißa R et al., 2018, 2015) framework. TinyMD uses higher-order functions to abstract device iteration loops, data layouts and communication strategies, providing a domain-specific library to implement pair-wise interaction kernels that execute efficiently on multi-CPU and multi-GPU targets. One of the limitations of tinyMD is that its flexibility and optimization settings are restricted to what can be achieved through higher-order functions and the AnyDSL compilation framework, which ties the backend to the LLVM toolchain (since it ultimately generates LLVM IR). This means that the compiler optimizations from GCC and ICC cannot be applied to AnyDSL implementations (in contrast to C/C++ implementations), and one of the reasons we developed P4IRS is to overcome this restriction.
Performance evaluation
Most performance-oriented research for MD systems uses existing packages as black boxes and adjusts simulation parameters to analyze their impact on overall performance. In some cases, however, in-depth performance studies have been carried out to gain a broader understanding of efficient ways to achieve SIMD vectorization with SSE/AVX instructions (Pennycook J et al., 2013). Most in-depth studies are performed using mini-apps or proxy-apps. One example is miniMD, a proxy-app that provides a simple copper face-centered-cubic lattice simulation where atom forces are computed with kernels extracted from LAMMPS for either the Lennard-Jones (LJ) or embedded atom method (EAM) potentials.
The Exascale Computing Project (ECP) Co-design Center for Particle Applications (CoPA) (Mniszewski SM et al., 2021) focuses on co-designing particle applications to prepare them for exascale computing. They provide several proxy-apps for particle simulations such as CabanaMD and ExaMiniMD, which are designed in a modular fashion to account for different optimization strategies and particle data layouts, and are built on top of Kokkos to be able to execute on many-core CPU and GPU targets.
The performance-focused prototyping harness MD-Bench (Ravedutti Lucio Machado R et al., 2023a, b) is a toolbox developed for performance-engineering MD non-bonded short-range force calculation algorithms, with focus on in-core execution on CPUs and GPUs. It currently supports the Verlet Lists and Cluster Pair algorithms from LAMMPS and GROMACS, respectively. One of our goals is to use the knowledge obtained with MD-Bench and integrate it into P4IRS to achieve state-of-the-art performance and flexibility altogether in particle simulations.
To the best of our knowledge, there are few in-depth performance studies focusing on SIMD and in-core execution for kernels used in DEM simulations; most existing studies evaluate distributed-memory parallelism in weak-scaling settings. With P4IRS, we intend to bridge the gap between performance strategies across different simulation domains and use the most efficient algorithm whenever possible, without requiring significant effort such as reimplementing kernels or redesigning applications with tailored data structures.
To evaluate performance on a single CPU core or GPU in this paper, we use MD-Bench and miniMD as performance baselines, as they are well-known implementations that reach state-of-the-art performance for MD simulations. For the DEM cases, we use MESA-PD as the comparison target, since our focus is to show that we can achieve performance portability with the flexibility that MESA-PD provides.
Background and theory
Particle simulations consist of solving N-body problems and have a wide range of applications, as many fields can be represented as a system of interacting particles. In this sense, particles can be atoms, rigid bodies, or even planets, stars and galaxies. Interaction contributions can come from inter-atomic van der Waals forces based on the LJ potential, or from gravitational forces generated by super-massive particles. To simulate such systems, however, we need a well-defined and closed system of particles with known boundary conditions. In the scope of this paper, particle contributions are computed only for particles that lie within a range, or “cutoff radius”. Strategies to approximate contributions from particles beyond this range (also known as long-range contributions) also exist, but are not discussed here. To provide distinct examples of simulations and to demonstrate a certain level of flexibility in P4IRS, we focus on MD and DEM simulations in this work.
Molecular dynamics
MD simulations are widely used today to study the behavior of microscopic structures. These simulations reproduce the interactions between atoms in these structures on a microscopic scale, while allowing us to observe the evolution of the particle system on a time-scale that is simpler to analyze.
Different areas resort to MD simulations: materials science to study the evolution of specific materials, chemistry to analyze the evolution of chemical processes, and biology to reproduce the behavior of certain proteins and bio-molecules.
An MD system to be simulated consists of a number of atoms, the initial state (such as the atoms’ positions and velocities), and the boundary conditions of the system. Here, we use periodic boundary conditions (PBC) in all directions; hence, when an atom crosses a domain boundary, it reappears on the opposite side with the same velocity. Fundamentally, atom trajectories in classical MD systems are computed by integrating the Newton-Euler equations for every atom i:
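In a standard form (our notation, which may differ from the symbols used elsewhere in this paper), these equations read:

```latex
m_i \, \frac{\mathrm{d}^2 \mathbf{x}_i}{\mathrm{d}t^2} \;=\; \mathbf{F}_i \;=\; \sum_{j \ne i} \mathbf{F}_{ij},
```

where $m_i$ and $\mathbf{x}_i$ are the mass and position of atom $i$, and $\mathbf{F}_{ij}$ is the pair force on atom $i$ due to atom $j$ (e.g., the LJ force in the short-range cases considered here).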
Discrete element method
Besides computing trajectories for zero-dimensional point masses as done in MD, we can model particles as objects with specific shapes and spatial extension using the Discrete Element Method (DEM). This requires support for more degrees of freedom per particle, since spatial properties (such as rotation) need to be taken into consideration. In this work, we consider particles with homogeneous density, and use the center of mass of each particle as its position, which coincides with its geometrical center due to this density assumption. The equations of motion are extended with the following terms for the torque and angular velocity:
With the torque vector
In this work, we use the penalty-based Linear Spring-Dashpot model to calculate forces between particles for the DEM cases. The collision force is split into two components, one representing the force in the normal direction and one representing the force in the tangential direction.
The relative contact velocity
Then, the force components in the normal direction
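As a point of reference, a common way to write the linear spring-dashpot force components (our notation; the exact symbols used here may differ) employs a normal stiffness $k_n$, damping coefficient $d_n$, overlap depth $\delta$ and contact normal $\mathbf{n}$:

```latex
\mathbf{F}^{n} \;=\; k_n\,\delta\,\mathbf{n} \;-\; d_n\,\mathbf{v}^{n}_{\mathrm{rel}},
\qquad
\mathbf{F}^{t} \;=\; -k_t\,\boldsymbol{\xi} \;-\; d_t\,\mathbf{v}^{t}_{\mathrm{rel}},
```

where $\boldsymbol{\xi}$ is the tangential spring displacement (stored as a contact property, see the Contact History and Contact Properties section) and the tangential force is typically capped by a Coulomb friction limit $\lVert\mathbf{F}^{t}\rVert \le \mu\,\lVert\mathbf{F}^{n}\rVert$.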
Optimization strategies
To calculate the resulting force for a particle in a specific time-step, it is necessary to compute all partial force contributions from each of the other particles present in the system. In a naive strategy, this results in a quadratic complexity and severely affects the application performance, which limits the system size and the time-scale of the simulation.
When force contributions become negligible at long distances, or when forces act only on contact between particles, different algorithmic strategies can be employed to perform computations only for particles within a specific cut-off region. Two main data structures are commonly used for efficiently computing particle interactions in state-of-the-art simulation packages: Linked Cells and Verlet Lists.
The Linked Cells strategy (see Figure 1(a)) sorts particles spatially into cells (also called bins) according to their position. Cells have a fixed size, and mapping particles to cells can be done directly by computing which cell contains each particle. Forces can then be computed for each particle by traversing its current and neighbor cells, adding contributions only for particles within these cells.

Figure 1. Comparison between (a) Linked Cells and (b) Verlet Lists algorithms. In each case, the blue particles are the candidates for computing forces with the red particle. In Linked Cells, all particles in the neighbor cells are considered (assuming the cell size is greater than or equal to the cutoff radius). In Verlet Lists, a list of particles is created by evaluating which particle distances are smaller than the cutoff radius plus a “skin” value, which defines a sphere in the 3D case.
Linked Cells computes forces only for particles within a region discretized by the current and neighbor cells/bins, which can still result in a significant amount of redundant computation. To further reduce redundant computation, the Verlet Lists strategy can be used jointly with Linked Cells. Verlet Lists (see Figure 1(b)) constructs a list of neighbor particles for every particle; this list can be created using the Linked Cells structure, by evaluating the distances between particles in nearby bins and using the distance to decide whether each particle should be added to the list. In the Verlet Lists case, the neighbor-search region is a sphere whose center is the particle’s position and whose radius is the cutoff radius plus a “skin”. The skin avoids rebuilding the neighbor-lists at every time-step: it extends the neighbor search so that computations are not missed for particles that enter the cut-off region in subsequent time-steps in which the neighbor-lists are not reconstructed.
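The two data structures can be sketched in a few lines of plain Python (1D positions for brevity; the cell size is taken equal to the cutoff, and the search radius is the cutoff plus the skin, as described above):

```python
import math

def build_cells(positions, domain_max, cell_size):
    """Bin particles into fixed-size cells (Linked Cells)."""
    ncells = math.ceil(domain_max / cell_size)
    cells = [[] for _ in range(ncells)]
    for idx, x in enumerate(positions):
        cells[min(int(x / cell_size), ncells - 1)].append(idx)
    return cells

def build_verlet_lists(positions, cells, cutoff, skin):
    """For each particle, list neighbors within cutoff + skin,
    searching only the particle's own and adjacent cells."""
    rmax = cutoff + skin
    neighbors = [[] for _ in positions]
    for c, cell in enumerate(cells):
        for i in cell:
            for nc in (c - 1, c, c + 1):  # current and adjacent cells
                if 0 <= nc < len(cells):
                    for j in cells[nc]:
                        if j != i and abs(positions[i] - positions[j]) < rmax:
                            neighbors[i].append(j)
    return neighbors

pos = [0.1, 0.4, 1.1, 2.9]
cells = build_cells(pos, domain_max=3.0, cell_size=1.0)
nl = build_verlet_lists(pos, cells, cutoff=0.5, skin=0.3)
```

In a real 3D implementation the cell index is computed per dimension and 27 cells are visited, but the structure of the two passes is the same.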
Besides avoiding computations for distant particles, another common optimization is to take advantage of Newton’s third law and compute forces in only one direction. This means that only half of the particle-pairs are traversed when computing the forces, and the partial force is subtracted from the neighbor particle during the computation. Ideally, this should lead to a factor-of-two speedup, but the strategy has a few drawbacks: (a) parallelism is harmed, since updating the forces of neighbor particles requires atomic operations to avoid race conditions, and (b) when the code is vectorized with SIMD intrinsics (such as AVX2 and AVX-512), gather and scatter instructions are needed to build the SIMD registers with the neighbor particles’ forces, which results in scattered memory accesses and extra instructions. Ultimately, this strategy is likely to be beneficial for more arithmetic-intensive kernels, and the trade-off between parallelism and fewer arithmetic operations must be assessed to determine when it is advantageous.
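The half-neighbor idea can be illustrated in plain Python: each pair is visited once, and the computed partial force is added to one particle and subtracted from the other (the step that requires atomic updates when parallelized):

```python
def accumulate_half(positions, pair_force):
    """Visit each pair (i, j) with i < j exactly once and apply
    Newton's third law: f_i += f, f_j -= f."""
    n = len(positions)
    forces = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            f = pair_force(positions[i], positions[j])
            forces[i] += f  # action on particle i ...
            forces[j] -= f  # ... and equal, opposite reaction on j
    return forces

# Toy antisymmetric pair force as a stand-in for a real potential
forces = accumulate_half([0.0, 1.0, 3.0], lambda xi, xj: xi - xj)
```

A useful sanity check for any such implementation is that the total force over a closed system sums to zero.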
P4IRS
P4IRS is an open-source, stand-alone compiler and domain-specific language for particle simulations which aims at generating optimized code for different target hardware. It is released as a Python package and allows users to define kernels, integrators and other particle routines in a high-level, straightforward fashion without the need to implement any backend code.
Figure 2 displays an overview of P4IRS. On the left side, the P4IRS front-end library is shown with its available features; all of these are defined by the user through a Python interface to model the simulation and determine the optimization schedule. In the middle, the backend with the internal methods (deep-embedded in Python), compiler transformations and analyses is shown; these make use of the front-end representation to produce the final backend code. On the right side, the final generated code is integrated and compiled together with the runtime library and the functionalities for parallelism and access to external functions. The next subsections explore the different P4IRS features in more detail.

Figure 2. Overview of the P4IRS architecture. The Frontend is defined in high-level Python and describes the computation and optimization strategies for the simulation. The compiler uses the data defined in the Frontend, combines it with its internal kernel implementations and performs several analyses and transformations. Finally, it generates the backend code for the target hardware, which currently is either C++ or CUDA code with OpenMP and MPI support.
Frontend and features
Listing 1 shows a simple description in P4IRS to compute forces with the LJ potential. Kernels can be provided as methods (line 1), where the method parameters refer to different particles in the system. In this way, many-body potentials can also be described through methods with more than two parameters. Some keywords are provided by P4IRS for writing the kernels: in line 2 the squared_distance method is used to obtain the squared distance between atoms i and j, and in line 4 the delta method is used to obtain the delta vector. Also, in lines 3 and 4 it is possible to see accesses to feature properties called sigma6 and epsilon, which are defined during the simulation setup as properties that depend on the atom type feature. In line 4, the apply method is used to apply the computed term to the force property of the particles; this method allows the identification of particle-property reductions and is also used when generating the kernel variant with half neighbor-lists.
Listing 1: Lennard-Jones force description in P4IRS.
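The computation described above is the standard LJ pair force. For reference, a plain-Python sketch of that computation (independent of the P4IRS syntax, using the sigma6/epsilon parameterization mentioned in the text) is:

```python
def lennard_jones(xi, xj, sigma6, epsilon):
    """LJ pair force on particle i from particle j (3D positions)."""
    delta = [a - b for a, b in zip(xi, xj)]  # corresponds to delta(i, j)
    rsq = sum(d * d for d in delta)          # corresponds to squared_distance(i, j)
    sr2 = 1.0 / rsq
    sr6 = sr2 * sr2 * sr2 * sigma6
    f = 48.0 * epsilon * sr6 * (sr6 - 0.5) * sr2
    return [f * d for d in delta]            # term applied to force[i]

force = lennard_jones([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], sigma6=1.0, epsilon=1.0)
```

In P4IRS, the same arithmetic is written once inside the kernel method, and the generated code handles the pair traversal and force reduction.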
Functions that update each particle separately can also be described with a single parameter. Listing 2 displays how to implement a simple Euler integrator for MD (see line 1). In lines 2-3, there are property accesses for the particles’ linear velocities, forces, masses and positions. Note that vector operations are intrinsically supported in P4IRS, as are matrix and quaternion data structures. P4IRS methods also allow the specification of parameters such as the dt in line 3; these must be provided when setting up the simulation.
Listing 2: Euler integrator description in P4IRS.
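The integrator being described is the explicit Euler update; in plain Python (per particle, 3D), it amounts to:

```python
def euler(position, velocity, force, mass, dt):
    """One explicit Euler step: v += dt * f / m, then x += dt * v."""
    velocity = [v + dt * f / mass for v, f in zip(velocity, force)]
    position = [x + dt * v for x, v in zip(position, velocity)]
    return position, velocity

x, v = euler([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, -2.0, 0.0],
             mass=2.0, dt=0.5)
```

In the P4IRS version, the same two assignments are written with whole-vector operations on the particle properties, and the loop over particles is generated.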
Listing 3 shows the simulation setup for the MD case. In the first part (lines 6-10), a simulation instance is created, and its identifier, the list of particle shapes used, the number of time-steps and the floating-point precision are specified. Next, the properties needed for each particle in the simulation are specified (lines 13-16), together with their data types and (if needed) initial values. Some properties are also defined as volatile (in this case, the force at line 16), which means that they are reset after each time-step. To add feature properties, it is first required to add the corresponding feature as shown in line 19 (in this example, each atom has a feature named type), and then specify the feature name for these properties (see lines 20-21). Next, the simulation domain is specified (line 24), and the initial state of the system is defined by reading data from a file (lines 27-30); a list of properties must be given in the same order as they appear in the file, and the particle shapes must also be included. Different entries can be used to read particles with different shapes in the same simulation.
Listing 3: Simple example for MD simulation setup in P4IRS.
After the properties and simulation declarations, the optimization settings (or schedule) are specified (see lines 33-35). These can be, for instance, whether the Verlet Lists strategy is used, the particle reneighboring frequency, and whether half neighbor-lists must be used. Note that, ideally, optimization strategies should not change the semantics of the generated program nor the simulation results, but since particle simulations are sensitive to the order of computation performed and the floating-point precision, it is possible that simulations with different optimization strategies lead to different results. This is not an issue for particle simulations as long as it is still possible to validate the simulations through macroscopic properties such as the temperature or pressure (for MD simulations) or through analysis of the physical dynamics in the system (for DEM simulations).
The vtk_output directive (line 36) is used to write data to VTK files every 20 time-steps. Finally, the kernels to be computed must be specified using the compute method (lines 39-40). The cutoff radius must be specified for kernels with more than one parameter/particle, and other parameters used in such kernels (such as the dt in the Euler kernel) must also be provided as a Python dictionary. The final steps define the target and call the generate method to trigger the code generator. P4IRS provides targets with default settings via the target_cpu and target_gpu settings, but custom targets can be defined with their own options, such as the thread and block configurations.
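Putting these steps together, the tail of a setup script along the lines described above could look roughly as follows. This is a sketch only: the names compute, vtk_output, target_cpu and generate come from the text, but the object name psim and the parameter names are our assumptions.

```python
# Sketch only -- psim and the parameter names are assumptions,
# not the exact P4IRS API.
psim.compute(lennard_jones, cutoff_radius=2.5,
             parameters={'sigma6': ..., 'epsilon': ...})
psim.compute(euler, parameters={'dt': 0.005})
psim.vtk_output("output/md")        # VTK output every 20 time-steps
psim.target(pairs.target_cpu())     # or target_gpu(), or a custom target
psim.generate()                     # trigger the code generator
```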
Contact history and contact properties
When implementing DEM kernels such as the Linear Spring-Dashpot, properties computed while two particles are in contact must be carried across subsequent time-steps for as long as the contact persists. For this reason, P4IRS supports contact properties, which are properties tied to a contact in the simulation. These can simply be added through the add_contact_property method, as shown in Listing 4 for the sticking flag, the tangential spring displacement and the impact velocity magnitude:
Listing 4: Contact properties setup example.
These properties can be used directly in kernels in the same way as feature properties. Listing 5 shows an example of their usage in the Linear Spring-Dashpot kernel. Note that when such properties are not stored for the particle pair, their default values are used.
Listing 5: Contact properties usage example.
To provide fast and efficient access to contact properties when using neighbor-lists, P4IRS keeps the particles at the same indexes in both data structures. This avoids a lookup operation within the kernel to find the contact index, and also keeps data access contiguous for the contact history data structure. When Linked Cells are used without Verlet Lists, this optimization is not possible, which can significantly increase the cost of accessing contact properties in a kernel.
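The index alignment can be illustrated in plain Python: because entry k of particle i's neighbor list and entry k of its contact-property arrays refer to the same pair, the kernel indexes both with the same k instead of searching a pair-to-contact map (names here are illustrative, not the actual P4IRS data structures):

```python
# Aligned layout: the contact property lives at the same (i, k) slot
# as the k-th neighbor of particle i -- no lookup inside the kernel.
neighbors       = [[2, 5], [7]]         # neighbor list per particle
tangential_disp = [[0.1, 0.0], [0.3]]   # contact property, same shape

def kernel(i):
    out = []
    for k, j in enumerate(neighbors[i]):
        # O(1), contiguous access to the pair's contact history
        out.append((j, tangential_disp[i][k]))
    return out
```

Without this alignment (e.g., when iterating over Linked Cells directly), the kernel would need a per-pair search or hash lookup to locate the stored contact data.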
Intermediate representation
This and the following sections describe the architecture of the P4IRS compiler. Figure 3 depicts a general overview of the P4IRS architecture from the time the user input is given to the time the final binary is generated.

Figure 3. P4IRS architecture. The user specifies the particle computations to be performed, which are compiled to the Python AST and further mapped to the P4IRS IR. Optimization choices and simulation setup methods determine which internal kernels are utilized, and the P4IRS IR is constructed in-place for the internal particle optimization kernels based on the inputs. Many compilation passes transform the P4IRS IR before the code generator is triggered, which generates the optimized kernels, data structures and the PairsSimulation interface class. If whole-program generation is not used, a backend simulation code is required, which uses the PairsSimulation interface to call P4IRS-generated kernels. The final binary is then generated with any backend compiler of the user’s choice (e.g., GCC, Clang) and must be linked with the runtime library. The runtime library uses the P4IRS metadata structure to handle different particle data structures and to adapt to different simulations.
An important aspect of a domain-specific language (DSL) is the ability to use domain knowledge in its compiler transformations, thus enabling code optimizations that may not be detected by general-purpose compilers. To achieve this, P4IRS provides its own intermediate representation (IR), which can represent elements that are specific to particle computations. Note that in this paper we use the term IR to denote the internal data structure used to represent the parsed code. Our IR is not formally defined, as is usually the case for the IRs of general-purpose compilers, and it is not our intention here to provide a formal and rigorous definition for it. Instead, we present an informal and rather simple abstract syntax tree (AST) representation that provides elements associated with particle computations and data structures. These particle-specific elements in the AST can then be used in P4IRS compilation passes, and at some point during compilation they are lowered, resulting in the low-level AST that corresponds to the code implementation matching the strategies given by the user.
Listing 6: Neighbor-lists intermediate representation construction in P4IRS. Elements that are explicitly built for the AST (both high and low level) are shown in purple, and implicit elements such as expressions are constructed via operator-overloading and bound to a statement at some point in the construction.
The P4IRS IR can be constructed in two ways: (a) by mapping the Python IR generated with the compile() method on Python functions, and (b) by explicitly constructing the IR elements in-place. The former is used for custom particle interactions (such as the lennard_jones method from Listing 1) and the latter is normally used internally by P4IRS to construct the IR for internal routines. Listing 6 displays an example for constructing the intermediate representation that generates the code that builds the neighbor-lists for each particle in the system. The BuildNeighborLists class is a separate, high-level element from the AST in P4IRS. During the compilation process, P4IRS performs analyses and transformations to lower this element by using its lower method, and decorators such as the @pairs_device_block are used to determine the target hardware where the code must be executed.
Figure 4 depicts a schematic diagram of the constructed P4IRS IR for the BuildNeighborLists example. Note the domain-specific elements in the IR, such as ParticleInteraction and ParticleFor; as discussed earlier, these can be used during compilation passes to gain domain knowledge, and at some point they are lowered according to the user-specified strategies. For instance, the ParticleInteraction abstract element can be lowered to a code representation that traverses pairs of particles using either the Linked Cells or Verlet Lists data structures. In this case, the ParticleInteraction is lowered using the linked-cells data structure to build the neighbor-lists, as is done in traditional implementations. The interaction object also contains methods to access the current particle (i), the neighbor particle (j) and the shape of the neighbor particle (shape). The latter is particularly important for the shape-partitioning and multi-kernel code generation discussed in the Shape Partitioning and Multi-Kernel section.
Figure 4: P4IRS IR for the BuildNeighborLists example. Red boxes represent elements that are specific to particle simulations and are used for domain-specific transformations in the AST; at some point these elements are lowered to generic, primitive elements that should be straightforward to map to code in a general-purpose language or even an IR such as the LLVM IR. Green boxes represent some of these lower-level, generic elements. Blue boxes represent symbols from the simulation, and the dashed red lines show when these symbols come from a particle-specific element. Yellow boxes represent constants, and the two “s” elements represent patterns that are repeated in the constructed AST according to the referenced loop in their respective orange boxes.
Other primitive elements such as assignments (Assign) and array accesses, as well as subexpressions, can be observed in Listing 6. These are simple AST nodes which are not lowered during the compiler analyses and transformations, but can in some cases be optimized away. Note that Python constructs can help build the IR, such as the iteration loop at line 10 and the summation at line 18. These are not directly built into the P4IRS AST, but rather executed during the lowering process to construct, respectively, the list of assignments for all shapes and the summation of all array-access elements constructed in-place.
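To illustrate the mechanism described above, the following sketch shows how operator overloading can bind expressions to AST nodes implicitly. All class and method names here are hypothetical and only illustrate the general technique, not the actual P4IRS API: arithmetic on symbolic terms yields expression nodes instead of values, so a statement such as an assignment captures the whole subexpression tree as a side effect of evaluating ordinary Python code.

```python
# Hypothetical sketch of operator-overloading-based IR construction.
# Arithmetic on Expr subclasses builds BinOp nodes instead of computing values.

class Expr:
    def __add__(self, other):
        return BinOp("+", self, wrap(other))
    def __mul__(self, other):
        return BinOp("*", self, wrap(other))

class Lit(Expr):
    def __init__(self, value):
        self.value = value
    def __repr__(self):
        return str(self.value)

class Sym(Expr):
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return self.name

class BinOp(Expr):
    def __init__(self, op, lhs, rhs):
        self.op, self.lhs, self.rhs = op, lhs, rhs
    def __repr__(self):
        return f"({self.lhs} {self.op} {self.rhs})"

def wrap(x):
    # Promote plain Python values to literal AST nodes.
    return x if isinstance(x, Expr) else Lit(x)

class Assign:
    # Explicitly built statement node that binds an implicit expression tree.
    def __init__(self, dst, src):
        self.dst, self.src = dst, wrap(src)
    def __repr__(self):
        return f"{self.dst} = {self.src}"

# Writing "f = m * a" as ordinary Python constructs the tree implicitly:
stmt = Assign(Sym("f"), Sym("m") * Sym("a"))
```

In this scheme only statements are constructed explicitly, while every subexpression is recorded automatically, which matches the purple/implicit distinction described for Listing 6.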
Backend and code generation
The final step of the P4IRS compiler is the code generation, which converts the optimized IR into low-level hardware instructions or into another high-level language (source-to-source compilation). In its current state, P4IRS generates C++ and CUDA code with MPI for distributed-memory parallelism, but new backend languages, intermediate representations and technologies can be included without difficulty by extending the code generator. Although most of the code generation is straightforward for the IR at its lowest level, there are important aspects that need to be considered to handle different backends and achieve performance portability.
P4IRS creates “modules” for one or more routines when building its IR, and every module determines a point at which data must be synchronized for the target on which it will be executed. For example, the code generated from the IR constructed in the lower method of BuildNeighborLists is preceded by the appropriate copy calls to the GPU for all the arrays and properties referenced within the IR subtree. To avoid complex control-flow analysis to determine which properties and arrays actually have to be copied (since some of them may already be synchronized with the device), we provide bitmasks to represent each array and property defined in the simulation, and provide methods to check every array and property used within the intermediate representation. The copy method for an array or property is called only when its bit is not set for the device. Also, when properties or arrays are changed within the subtree, specific code is generated to clear the respective bits.
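The bitmask bookkeeping described above can be sketched as follows. The class and method names are hypothetical; the sketch only shows the idea that each array gets one bit, a device copy is emitted only when the device bit is unset, and a write on the device clears the corresponding host bit:

```python
# Hypothetical sketch of bitmask-based host/device synchronization tracking.

class SyncTracker:
    def __init__(self, arrays):
        self.bits = {name: i for i, name in enumerate(arrays)}
        self.device_mask = 0                      # bit set => device copy up to date
        self.host_mask = (1 << len(arrays)) - 1   # host starts authoritative
        self.copies = []                          # record of emitted copy calls

    def copy_to_device(self, name):
        bit = 1 << self.bits[name]
        if not (self.device_mask & bit):          # copy only when stale on device
            self.copies.append(f"copy_to_device({name})")
            self.device_mask |= bit

    def written_on_device(self, name):
        bit = 1 << self.bits[name]
        self.host_mask &= ~bit                    # host copy is now stale

tracker = SyncTracker(["position", "velocity", "force"])
tracker.copy_to_device("position")
tracker.copy_to_device("position")   # second call emits no copy
tracker.written_on_device("force")   # clears the host bit for "force"
```

The actual generated code performs these checks before each module rather than per call, but the set/clear logic is the same.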
In addition, the check_resize method (see line 7 in Listing 6) is used to specify which array sizes can grow during the module execution, and which variable or array holds their current capacity. In this case, neighbor_capacity can be increased based on the numneighs array, so when one particle contains more neighbors than neighbor_capacity, code is generated to reallocate the arrays that use this capacity as one of their dimension sizes. The compiler handles the reallocation cases when the target is an accelerator such as a GPU, which requires the kernel to be re-executed after arrays are reallocated since it is not possible to reallocate them directly on the accelerator. P4IRS also takes care of avoiding memory access violations in the current kernel execution by wrapping the accesses in branches that ensure the accessed indices are smaller than the current capacity of the arrays.
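The reallocate-and-relaunch pattern described above can be sketched in a few lines. The function names are made up for illustration: the "kernel" guards its writes and only reports the capacity it actually needed, and the host loop reallocates and relaunches until everything fits.

```python
# Hypothetical sketch of the check_resize / re-execution pattern.

def build_lists_kernel(num_neighbors, capacity):
    """Stand-in kernel: writes guarded by the current capacity; reports
    the capacity actually required instead of writing out of bounds."""
    return max(num_neighbors)

def run_with_resize(num_neighbors, capacity):
    launches = 0
    while True:
        launches += 1
        needed = build_lists_kernel(num_neighbors, capacity)
        if needed <= capacity:
            return capacity, launches
        capacity = needed  # host reallocates arrays sized by this capacity

# One particle needs 9 neighbor slots but the initial capacity is 4,
# so the kernel must be launched a second time after reallocation.
cap, launches = run_with_resize([3, 9, 5], capacity=4)
```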
When the IR is constructed for a specific routine, P4IRS then determines which loops will be scheduled to the GPU. As a general rule, every outer-most loop in a block that uses the @pairs_device_block decorator is a candidate to be replaced by a kernel launch to run in parallel. In Listing 6, both loops at lines 9 and 13 are candidates, and their body instructions are thus wrapped into separate GPU kernels. The loops are then replaced by kernel launch statements for the newly generated kernels. Note that some candidates may not be offloaded to the GPU, since P4IRS can use internal heuristics to determine whether a loop should or should not be executed on the GPU. Domain knowledge can also be used in the GPU offloading analyses, since elements like ParticleFor and ParticleInteraction can be inspected during the compiler passes before they are lowered.
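The candidate selection above can be sketched as a simple pass over a device block. The tuple-based node representation here is purely illustrative, not the P4IRS AST: each outer-most loop node is replaced by a launch statement, while nested loops stay inside the extracted kernel body.

```python
# Hypothetical sketch of outer-most-loop kernel extraction in a device block.

def extract_kernels(block):
    """Replace each top-level ('for', body) node with a kernel launch;
    everything else is kept as-is in the lowered block."""
    kernels, lowered = [], []
    for node in block:
        if node[0] == "for":                     # outer-most loop => candidate
            kernels.append(node[1])              # loop body becomes the kernel
            lowered.append(("launch", len(kernels) - 1))
        else:
            lowered.append(node)
    return kernels, lowered

device_block = [
    ("for", ["init_cells"]),                     # becomes kernel 0
    ("stmt", "barrier"),
    ("for", [("for", ["check_pair"])]),          # nested loop stays in kernel 1
]
kernels, lowered = extract_kernels(device_block)
```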
Runtime library
Apart from the kernel implementations and the functions to prepare kernel launches and move data between host and accelerator processors, particle simulations may also need to call external libraries and deal with filesystem I/O. Examples where this is required are writing particle data to a file in a visualization format like VTK, and reading the initial state of the particle system to set up the simulation. Since most of these procedures do not require extensive optimizations, as they are not compute-intensive and their bottleneck is an external I/O device (such as a disk), building up the intermediate representation for them and extending the code generator to support such external calls is simply not worthwhile.
To handle this, P4IRS provides a runtime library which contains such functionalities directly implemented in C++, and these make use of and synchronize (when necessary) the data structures from the generated simulation. To access the simulation data, an instance (referred to as PairsRuntime object) contains all the meta-data for the simulation, and can be used in these implemented routines.
Listing 7: P4IRS runtime routine example for writing VTK data into a file.
Listing 7 shows an example of the routine to write data into VTK files. Note that the first parameter is a PairsRuntime object, which can be used to check whether properties exist, check if they are synced with the host and (if not) copy the data to the host. The meta-data object is populated at the start of the simulation by the code generated by P4IRS, and it is therefore specific to each simulation. It is important to emphasize that accessing simulation data through the PairsRuntime instance is not as efficient as accessing it directly in the generated code, since the way the data structures are accessed (such as the property layouts) must be evaluated at runtime. Therefore, the use of such an instance should be avoided in performance-critical routines.
The runtime library must be linked to the generated binary, so it must be properly built for the target hardware and platform, and either installed on the target system or have its path included in the list of paths where dynamic libraries are searched for (for example, the LD_LIBRARY_PATH environment variable on Linux systems). Although it is compiled separately from the generated code and is ideally linked dynamically at runtime, it is possible to compile both together and statically link the runtime library code with the simulation code, thus generating a stand-alone, platform-specific binary that does not require further linking with the runtime code.
Shape partitioning and multi-kernel
When computing forces for DEM simulations that contain different shapes, different calculation formulas are needed for some of the contact elements such as the penetration depth, contact point and distance. From a computational point of view, this introduces branching in the inner-most loop, where each path leads to different arithmetic instructions being executed, and therefore inhibits SIMD vectorization. Algorithm 1 displays an example of this issue for a kernel with spheres and half-spaces. Note that when traversing the neighbors, their shapes are checked to determine which code path is taken during the interaction, hence the code is not SIMD-friendly.
To provide a simpler way to achieve vectorized code for such DEM kernels, it is possible to partition particles by shape in the acceleration data structures (Linked Cells and Verlet Lists), which allows distinguishing them during the force computation. Afterward, multiple loops can be used in the code to handle different pairs of shapes, and vectorizing these resulting loops should be easier since no branches are needed to calculate different contact elements for the particles being computed. Algorithm 2 displays the same kernel as Algorithm 1 without the branching issue. Note that the outer-most loop can also be split into two parts, which leads to two different kernels, one for each pair of shapes. For GPUs, this can be beneficial since different threads will be used for different pairs of shapes (parallelism is done at the level of the outer-most loop) and no threads will be disabled due to traversing a different code path.
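A toy illustration of this split follows. The shape codes and force formulas are made up; the point is only that partitioning neighbors by shape turns one branchy loop into two branch-free loops that produce the same total force.

```python
# Toy sketch: branchy inner loop vs. shape-partitioned branch-free loops.

SPHERE, HALFSPACE = 0, 1

def force_branchy(neighbors):
    total = 0.0
    for shape, x in neighbors:
        if shape == SPHERE:        # branch in the inner loop inhibits SIMD
            total += 2.0 * x       # sphere-sphere contact (toy formula)
        else:
            total += 0.5 * x       # sphere-halfspace contact (toy formula)
    return total

def force_partitioned(neighbors):
    # Neighbors pre-partitioned by shape, as the acceleration structure would do.
    spheres = [x for s, x in neighbors if s == SPHERE]
    halfspaces = [x for s, x in neighbors if s == HALFSPACE]
    total = 0.0
    for x in spheres:              # branch-free, SIMD-friendly loop
        total += 2.0 * x
    for x in halfspaces:           # second branch-free loop
        total += 0.5 * x
    return total

nbs = [(SPHERE, 1.0), (HALFSPACE, 4.0), (SPHERE, 3.0)]
```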
Note that there is no guarantee that the internal loop will be vectorized on the CPU, since there may still be branches and instructions in the force kernel that prevent it. Besides, the backend compiler must be able to properly analyze the code and apply the transformations to actually generate vectorized code using AVX/SSE instructions, which may not happen even when the code is amenable to vectorization.
P4IRS is able to generate this algorithmic optimization automatically for the shapes involved in the simulation. This is possible by using the domain knowledge provided in the AST, where the ParticleInteraction symbol can be lowered to several loops that handle different pairs of shapes in the simulation. Note that P4IRS does not require complex and sophisticated analyses to identify these opportunities for splitting the interaction loop because its IR already exposes the domain knowledge necessary for applying this strategy. As discussed, this is an important benefit of using a DSL because the IRs from general-purpose compilers need to work at a lower level in order to be able to represent any generic code.
Apart from splitting the inner-most loop, P4IRS makes it possible to split the outer-most loop and generate two distinct kernels, each one computing a different type of interaction (in this case, Sphere-Sphere and Sphere-Halfspace variants); we name this strategy the “Multi-Kernel partition”. This requires data from the left-side particles to be loaded again, which is not necessarily a significant problem since they are contiguous in memory. Besides exposing more SIMD opportunities, this can also lead to better memory accesses for the right-side particle properties when there are not many particles of a specific shape in the simulation (such as the half-space obstacles). The reason is that the properties for these obstacles are accessed separately from other particle properties and should be closer in the memory hierarchy in subsequent accesses, since there is no cache pollution from evaluating and computing interactions with other sphere neighbors in the same kernel.
Neighbor-lists per cell
When computing particle interactions with the Linked Cells algorithm, it is necessary to iterate over every surrounding cell and then over every particle that belongs to it. This introduces a nested loop, and SIMD execution becomes restricted to the inner-most loop (i.e. particles within the same neighbor cell). To mitigate this and provide more SIMD opportunities, it is possible to create a neighbor-list for each cell (instead of for each particle as in the Verlet Lists algorithm), and then merge the two loops into a single one that iterates over the neighbor list. In this way, the force for each particle is computed by traversing the neighbor list constructed for the cell it belongs to, without affecting the number of interactions computed.
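The loop merge above can be sketched as follows. The dictionary-based cell layout is purely illustrative: the per-cell list visits exactly the same particles as the nested iteration, but as one flat loop.

```python
# Sketch of neighbor-lists per cell: the nested cell/particle iteration is
# replaced by a single flat list built once per cell. Data layout is made up.

def nested_neighbors(cell, cells, stencil):
    """Baseline: iterate surrounding cells, then the particles in each."""
    out = []
    for c in stencil[cell]:        # loop over surrounding cells
        for p in cells[c]:         # short inner loop limits SIMD width
            out.append(p)
    return out

def build_cell_list(cell, cells, stencil):
    """New kernel: one flat neighbor list per cell, merging both loops."""
    return [p for c in stencil[cell] for p in cells[c]]

cells = {0: [10, 11], 1: [12], 2: [13, 14, 15]}
stencil = {0: [0, 1, 2]}           # cell 0 sees itself and cells 1, 2
flat = build_cell_list(0, cells, stencil)
```

Forces for every particle in cell 0 can then be computed by traversing `flat` in a single loop, which is what exposes the longer SIMD trip count.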
Note that this introduces a new kernel to build the neighbor-lists, and it is only beneficial when the improvement in force calculation performance compensates for the time spent in the new kernel. Besides, in cases where the code is not properly vectorized (especially on CPUs, where explicit instructions must be generated), the improvement may not be substantial. On GPUs, however, this should lead to better execution since threads are not halted waiting for the surrounding neighbor cells to be computed when the number of particles differs between them.
Domain partitioning and communication
To achieve distributed-memory parallelism and software flexibility while maintaining good performance, P4IRS provides a customizable and robust communication module that facilitates integrating new domain partitioning schemes into its generated simulations. This is done by combining its runtime functions and code generation techniques, and allows the usage of the waLBerla domain partitioning scheme based on a block-forest data structure.
Essentially, P4IRS communication is split into three routines:
• borders: defines particles within the halo region which must be sent to neighbor ranks as “ghost” particles; only non-volatile properties that are accessed on a neighbor particle within a kernel are sent.
• synchronize: uses the particles defined in the borders routine to exchange data; only non-volatile properties that are accessed on neighbor particles within a kernel and that change across time-steps are sent.
• exchange: sends particles that leave the current rank’s domain, so that each becomes a local particle in the neighbor rank; all properties except volatiles are sent; contact history data is also exchanged.
These methods make use of a domain partitioner class that can be implemented in the P4IRS runtime library to determine which particles must be sent. The partitioner is implemented in C++ and fills domain buffers that are used to evaluate, based on its position, whether a particle is outside the domain or within the border. This means that to include new domain partitioning schemes in P4IRS, it is only necessary to define a C++ runtime class, which can make use of external libraries and can therefore use the waLBerla block-forest domain partitioning. Note that the reason for using buffers instead of evaluating the particle positions directly in the runtime code is performance, since particle packing and unpacking kernels should be executed on the target processor to achieve parallelism and to avoid copying data back and forth between the accelerator and host processors.
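A minimal sketch of the buffer-based classification, assuming a simple axis-aligned subdomain with per-dimension lower/upper bounds and a ghost-layer width (the actual buffer layout in P4IRS may differ):

```python
# Hypothetical sketch: classify a particle against subdomain bounds filled
# by the runtime partitioner. lo/hi are the rank's domain bounds per
# dimension; ghost is the halo width.

def classify(pos, lo, hi, ghost):
    """Return 'exchange' if outside the rank's domain, 'border' if inside
    but within the ghost layer of a face, else 'local'."""
    for d in range(len(pos)):
        if pos[d] < lo[d] or pos[d] >= hi[d]:
            return "exchange"        # left the domain: move to neighbor rank
    for d in range(len(pos)):
        if pos[d] < lo[d] + ghost or pos[d] >= hi[d] - ghost:
            return "border"          # send as ghost particle
    return "local"

lo, hi, ghost = [0.0, 0.0, 0.0], [10.0, 10.0, 10.0], 1.0
```

Because only these bounds buffers are needed, the generated packing kernels can run this check on the accelerator without calling back into the runtime.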
Evaluation
Materials and methods
In this paper we conducted experiments to evaluate the base performance of P4IRS-generated codes on the following processors and accelerators:
• Ice Lake: Intel Xeon Platinum 8360Y “Ice Lake” processors (36 cores per chip) running at a base frequency of 2.4 GHz, with 54 MB of shared L3 cache per chip and 256 GB of DDR4 RAM.
• Milan: AMD EPYC 7713 “Milan” processors (64 cores per chip) running at 2.0 GHz, with 256 MB of shared L3 cache per chip.
• A40: Nvidia A40 GPU with 48 GB GDDR6 @ 696 GB/s; 37.42 TFlop/s in FP32.
• A100: Nvidia A100 GPU with 40 GB HBM2 @ 1555 GB/s; HGX board with NVLink; 9.7 TFlop/s in FP64 or 19.5 TFlop/s in FP32.
We also perform experiments to evaluate P4IRS weak-scalability on the following clusters:
• Fritz: Each compute node has two Ice Lake processors; 256 GB of DDR4 RAM; blocking HDR100 InfiniBand with up to 100 GBit/s bandwidth per link and direction.
• JUWELS Booster: Each compute node has two AMD EPYC 7402 “Rome” CPUs with 48 cores (simultaneous multi-threading) at 2.8 GHz; 512 GB of DDR4 memory at 3200 MHz; four Nvidia A100 GPUs with 40 GB HBM2e memory; four InfiniBand HDR (Connect-X6) interconnect channels.
Compiler and flag specifications used for the experiments in this paper.
Experiments for MD use the standard Copper FCC lattice (see Figure 5) test-case from miniMD/MD-Bench, and we use our results from MD-Bench as a performance reference since it is a well-optimized code for MD. The DEM test case is a “Settling Spheres” benchmark (see Figure 6), in which several spheres fall onto a “ground” (half-space), thus generating a “bed” of spherical particles. Due to its multiple shapes and dynamics that consist of many interactions between spheres and half-spaces, it should provide a good reference for DEM simulations. We compare our single-core results from P4IRS with MESA-PD using 2 MPI ranks, since it is not possible to run this setup in MESA-PD using a single rank. GPU results are not available for MESA-PD as it is not capable of generating GPU code, but we show GPU performance results for P4IRS comparing different code generation strategies.
Figure 5: The MD simulation example used in this paper, a Copper FCC lattice simulation with a homogeneous distribution. Forces are computed with the van der Waals potential, and PBC is used in all directions. On average, every atom has roughly 76 neighbors throughout the whole simulation (200 time-steps).
Figure 6: The DEM simulation example used in this paper, which can be called “Settling Spheres” or “Bed Generation”. Spherical particles fall onto the ground (represented by a half-space) to generate a bed at the end of the simulation. Forces are computed for gravity, buoyancy and the Linear Spring-Dashpot method to achieve the desired result, and PBC is used in the X and Y dimensions.

We also show weak-scaling results to demonstrate P4IRS scaling capabilities. For this we use regular domain partitioning, since these experiments do not require sophisticated load-balancing algorithms (the Copper lattice is a homogeneous test-case in all dimensions, and Settling Spheres can obtain acceptable weak-scaling by regularly partitioning the domain in the X and Y dimensions). We use double floating-point precision and display results with Newton’s 3rd Law optimization (half neighbor-lists) enabled and disabled. For MD simulations, we show results with the Verlet Lists optimization strategy since it is widely employed in the field, while for DEM we display results with the Linked Cells data structure, because particles represent rigid bodies that need to be in contact for their contributions to be computed. Since the number of contacts is typically far smaller than the number of atomic interactions, there is no significant advantage in using the Verlet Lists data structure.
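The half-neighbor-lists optimization mentioned above can be illustrated with a toy scalar force model (the force function here is made up, only required to be antisymmetric as Newton’s 3rd Law demands): with half lists each pair is evaluated once and the force is scattered to both particles, yielding the same totals as the full-list scheme with half the pair evaluations.

```python
# Toy illustration of full vs. half neighbor-lists (Newton's 3rd Law).

def forces_full(n, pair_force):
    f = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i != j:
                f[i] += pair_force(i, j)   # every pair visited twice
    return f

def forces_half(n, pair_force):
    f = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):          # each pair visited once
            fij = pair_force(i, j)
            f[i] += fij
            f[j] -= fij                    # reaction on j (3rd law)
    return f

# Made-up antisymmetric pair force: pf(j, i) == -pf(i, j).
pf = lambda i, j: (i - j) * 0.5
```

On shared-memory targets the scatter to particle j is what requires atomics, which is why half lists are not automatically faster on GPUs.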
Molecular dynamics
For the MD cases, we use the Verlet Lists algorithm as it is the standard one for such simulations. The simulation runs for 200 time-steps, and the neighbor-lists are rebuilt every 20 time-steps of the simulation (and at the start). We use the simple LJ potential (as shown in Listing 1) to calculate the force contributions between atoms.
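For reference, a plain-Python sketch of the pairwise 12-6 Lennard-Jones force used in this benchmark follows. The actual generated kernel operates on neighbor lists and the simulation's array layouts; the parameter values here are arbitrary:

```python
# Sketch of the pairwise 12-6 Lennard-Jones force, zero beyond the cutoff.
# U(r) = 4*eps*((sigma/r)^12 - (sigma/r)^6)
# F(r) = 24*eps/r^2 * (sigma/r)^6 * (2*(sigma/r)^6 - 1) * (dx, dy, dz)

def lj_force(dx, dy, dz, epsilon, sigma, cutoff):
    """Force on particle i due to j, given the displacement (dx, dy, dz)."""
    r2 = dx * dx + dy * dy + dz * dz
    if r2 >= cutoff * cutoff:
        return (0.0, 0.0, 0.0)     # outside cutoff: no contribution
    s2 = (sigma * sigma) / r2
    s6 = s2 * s2 * s2              # (sigma/r)^6
    f = 24.0 * epsilon * s6 * (2.0 * s6 - 1.0) / r2
    return (f * dx, f * dy, f * dz)

# At r == sigma the scalar prefactor reduces to 24*epsilon.
fx, fy, fz = lj_force(1.0, 0.0, 0.0, epsilon=1.0, sigma=1.0, cutoff=2.5)
```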
Figure 7 displays the runtime measurements for P4IRS and MD-Bench simulations of the Copper FCC lattice case with 32³ unit cells (4 atoms per unit-cell). It is noticeable that the kernels generated by P4IRS achieve competitive results in comparison to MD-Bench, a hand-tuned implementation for MD cases, thus demonstrating P4IRS capabilities to generate fast code. For Ice Lake, performance is slightly better for MD-Bench using full neighbor-lists (FN), with about a 10% speedup for the force calculation but 25% worse performance for the neighbor-lists creation. When using half neighbor-lists (HN), the force calculation time is roughly the same for both versions, but P4IRS still keeps the advantage when building the neighbor-lists, resulting in a 4% better overall runtime for P4IRS. In general, runtime differences for the whole simulation vary in the range of about 5-6% on Fritz, which is not very significant and may not hold when using different compilers and flags.
Figure 7: Execution time for P4IRS and MD-Bench simulations in seconds (lower is better) of a Copper FCC lattice system with 32³ unit-cells (4 atoms per unit-cell) during 200 time-steps on Ice Lake and Milan CPUs. Tests were performed on a single core (no parallelism) with fixed frequency. Results are shown for both full and half neighbor-lists (FN/HN).
On Milan, P4IRS is faster than MD-Bench for the FN case, and the advantage also comes from the neighbor-lists creation. For HN, results are similar, but it is possible to see that P4IRS still builds the neighbor-lists slightly faster, while MD-Bench has moderately better performance for the force computation. Overall, we can conclude that P4IRS is able to achieve good base performance when running on a single CPU core, while providing a more flexible and high-level description of the problem, which is not the case for the hand-tuned and hard-coded C implementation from MD-Bench.
Measurements from P4IRS and MD-Bench Lennard-Jones force kernels for both full and half neighbor-lists cases on a single Ice Lake CPU core.
The performance with HN is significantly worse, but this is compensated by the fact that it computes roughly half the interactions. Another important aspect to consider is the introduction of the Intel patch to mitigate the Downfall vulnerability (also known as Gather Data Sampling), which significantly degrades the performance of gather instructions on Intel CPUs and therefore affects the performance of the FN cases with AVX512 vectorization. In particular, the separation of concerns introduced by P4IRS together with code generation techniques can be a first step toward dealing with such issues by avoiding gather instructions. This can be done by producing explicit SIMD intrinsics that gather the data with separate instructions to load it and then organize it (through permutations, shuffles and swizzles) into the vector registers, or by using different algorithms such as the Cluster Pair algorithm from GROMACS, which keeps particle data in an array-of-struct-of-arrays (AoSoA) fashion, thus requiring only simple load instructions to build up the vector registers.
Figure 8 displays the runtime measurements for P4IRS and MD-Bench simulations on the A40 and A100 GPUs. MD-Bench does not yet have a GPU version with HN, so we display it only with FN. For the A40, MD-Bench shows slightly better performance (about 3%) for computing the forces, while P4IRS with FN also provides slightly better performance (about 7%) for building the neighbor-lists. In the overall performance, MD-Bench is faster only by about 2% in comparison to P4IRS for the FN case, which corroborates our claim that P4IRS is able to generate efficient GPU code. For the HN case, the performance gain is not even close to the factor of two one would expect, which can be attributed to the usage of atomics; this can be different for more arithmetic-intensive kernels (the LJ kernel used has significantly fewer arithmetic instructions compared to realistic cases) since the overall amount of operations saved should be larger. For the A100, however, the performance difference is larger, with MD-Bench about 14% faster in comparison to P4IRS. When using HN, however, we achieve performance very close to MD-Bench with FN, and it is noticeable that the speedup is far from the factor of two as well. Despite the observed performance divergence on the A100, it is possible to see that P4IRS code still has good performance in comparison to our hand-tuned CUDA implementation, supporting our claim of generating performance-portable kernels.
Figure 8: Execution time for P4IRS and MD-Bench simulations in seconds (lower is better) of a Copper FCC lattice system with 50³ unit-cells (4 atoms per unit-cell) during 200 time-steps on A40 and A100 GPUs. Results are shown for both full and half neighbor-lists (FN/HN) cases, and atomic operations affect the performance for the HN case.
Summary of performance results (in millions of particle updates per second) for each P4IRS and MD-Bench simulation run of the MD benchmark on a single core of the Ice Lake CPU, a single core of the Milan CPU, a single A40 GPU, and a single A100 GPU.
Figure 9 shows the results for P4IRS weak-scaling on the Fritz supercomputer. For each node, a system of 128³ unit cells is included, with roughly 8,388,608 atoms per node (or 116,508 atoms per CPU core). Since the multinode partition in Fritz has a maximum of 64 nodes (which results in 4608 CPU cores, no hyper-threading), this is the maximum size we scale our problem to. It is possible to see that we achieve perfect weak-scaling in all the settings we use, and the HN version is significantly faster than the FN version. Notice that atomic operations are not required because we only make use of parallelism through spatial decomposition and message communication with MPI; thus we use a distributed-memory parallelism strategy even within the same node and there is no risk of race conditions. Based on our results, it is clear that P4IRS is able to generate scalable code for modern CPU clusters, but problems that are less homogeneous may need proper mechanisms for balancing the workload between different MPI ranks. This is possible by using different domain partitioning methods in P4IRS, such as the waLBerla domain partitioning with block-forests, but for this example we can achieve perfect scaling even with regular domain partitioning.
Figure 9: Weak-scaling results for P4IRS (FN and HN) on Fritz; each node simulates 200 time-steps of a Copper FCC lattice system with 128³ unit cells, with approximately 116,508 atoms per CPU core.
Figure 10 shows the weak-scaling runtimes for P4IRS on the JUWELS Booster cluster. Each node simulates a system of 100³ unit cells, with roughly 500,000 atoms per GPU. We scale up to 128 nodes, which is equivalent to 512 A100 GPUs. Note that for the first node, the performance is slightly worse than for the following node counts, which can be explained by the fact that we only have 4 GPUs in one node and domain partitioning is only done in two dimensions. This can lead to worse performance because the surface area for communication is likely higher in the dimension where no partitioning is done, which culminates in more atoms to be communicated and therefore more time spent packing and transferring data. Apart from this, the scaling is also perfect for all the settings, and it is possible to observe that the FN version is slightly faster than HN. As already discussed, the usage of HN is not necessarily beneficial on GPUs due to atomics, and this can be worse especially for kernels which do not exhibit high arithmetic intensity. The results demonstrate that P4IRS is in fact capable of generating scalable code for modern GPU clusters.
Figure 10: Weak-scaling results for P4IRS (FN and HN) on the JUWELS Booster GPU cluster; each node simulates 200 time-steps of a Copper FCC lattice system with 100³ unit cells, with approximately 500,000 atoms per GPU.
Discrete element method
For the DEM case, we make use of the Linear Spring-Dashpot model to calculate the particle force contributions. We also use the Linked Cells algorithm, as Verlet Lists are not commonly used for contact detection in DEM, and also because we rebuild the Linked Cells at every time-step of the simulation, which means that building Verlet Lists does not provide any advantage.
Figure 11 displays the runtimes for the DEM kernels on a single core (P4IRS) and two cores/ranks (P4IRS/MESA-PD), including cases that build the neighbor-lists per cell (NPC) in P4IRS to evaluate its impact on CPU performance. It is noticeable that P4IRS outperforms MESA-PD even with a single rank in the overall runtime, and what draws the attention is the amount of time MESA-PD spends executing other procedures instead of the force calculation and building the neighbor-lists. This can be attributed to the overhead MESA-PD incurs to manage its domain partitioning with waLBerla (which cannot be switched off) and to call its communication routines for next-neighbor communication. A more in-depth analysis is required to actually understand why communicating data takes a significant fraction of the time for MESA-PD, but this is out of the scope of this paper.
Figure 11: Execution time for P4IRS and MESA-PD simulations in seconds (lower is better) of a settling-spheres benchmark with 18720 particles during 10000 time-steps on Ice Lake and Milan CPUs. Tests were performed with fixed frequency. Results are shown for standard linked cells, linked cells with neighbor-lists created per cell (NPC), and for single and two ranks (2R).
A more interesting insight from the results is the performance improvement for the force calculation when using P4IRS: with only a single rank it is only about 16-17% worse than the force calculation in MESA-PD, while with 2 ranks it outperforms MESA-PD by about 76-82% using FN on both Ice Lake and Milan architectures. The reason is that P4IRS is able to perform several specializations in the generated code and thus reduce the number of arithmetic instructions to be executed. For instance, the code from MESA-PD needs parameters such as stiffness and damping to be set at every iteration because they depend on the masses and radii of the particles, which not only incurs extra instructions to load and store such data in a lookup array, but also restricts specializations in the kernel since some parameters are not known at compile-time. In P4IRS, such parameters are calculated directly in the kernel, and we can make use of Just-In-Time (JIT) compilation to emit tailored versions of our kernels for the specific problem we want to solve, which leads to significant speedup. Some specialization can also be exposed by separating the interaction code between shapes, as discussed in the Shape Partitioning and Multi-Kernel section. For more complex kernels, there may be even more opportunities for specialization, which can result in even higher performance improvements compared to kernels with no specializations.
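The specialization idea can be illustrated with a toy sketch. Everything here is hypothetical (P4IRS generates C++/CUDA, not Python): a generic kernel looks its parameter up in a table on every call, while the "JIT" path emits source with the parameter baked in as a literal, so the backend compiler can fold it into the arithmetic.

```python
# Toy sketch of constant-folding specialization via generated source.

def generic_kernel(penetration, stiffness_table, pair_type):
    """Generic path: parameter looked up at every call, blocking folding."""
    stiffness = stiffness_table[pair_type]
    return stiffness * penetration

def specialize(stiffness):
    """Emit source with the parameter baked in as a literal, then compile.
    exec() stands in for the actual JIT/backend compilation step."""
    src = f"def kernel(penetration):\n    return {stiffness!r} * penetration\n"
    ns = {}
    exec(src, ns)
    return ns["kernel"]

kernel = specialize(2.5)   # tailored kernel for this stiffness value
```

Both paths compute the same result, but the specialized version carries no table lookup and exposes the constant to further optimization.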
Another point to evaluate is the impact of building the neighbor-lists per cell (NPC) on the CPU. Although there are considerable improvements in the force calculation on both CPUs (about 76% faster on Ice Lake, and about 46-49% on Milan), the significant amount of time spent building the neighbor-lists does not compensate for these improvements. In the end, it is also possible to use P4IRS to experiment with such different strategies, which can help find the most efficient version to solve a problem on specific hardware. It is also clear that P4IRS can generate more efficient DEM kernels for CPU in comparison to our previous solution, which reveals the advantages of using code generation for particle simulations.
Figure 12 displays the runtimes for the DEM test-case on both A40 and A100 GPUs, also evaluating the neighbor-lists per cell (see the Neighbor-lists per Cell section) and multi-kernel (see the Shape Partitioning and Multi-Kernel section) strategies. Most of the simulation time is spent on calculating the forces and executing other routines, the latter likely transferring data to/from the GPUs. Note that since particles contain significantly more properties (some of them with dynamic types such as vectors and matrices), the communication and/or PBC buffers are significantly larger, especially when particles need to be exchanged. It is also important to disclose that due to PBC, particles are still exchanged and produce ghost versions of themselves (with proper PBC shifts) even when running in a single MPI rank; therefore, the same routines for packing and unpacking data for communication are used.
Figure 12: Execution time for P4IRS simulations in seconds (lower is better) of a settling-spheres benchmark with 561,600 particles during 10000 time-steps on A40 and A100 GPUs. Results are shown for standard linked cells, linked cells with neighbor-lists created per cell (NPC), and for multi-kernel (MK) in both cases.
It is noticeable that in this case, building the neighbor-lists per cell actually increases overall performance, as the improved runtime of the force calculation compensates for the time spent building the neighbor-lists. The force calculation with neighbor-lists per cell is about 24% faster than the standard Linked Cells version on the A40, and about 45% faster on the A100, indicating that SIMD execution is improved as expected. In overall runtime, the NPC version is about 20% faster on the A40 and 12% faster on the A100, since the increase in time for building the neighbor-lists is relatively larger on the A100 GPU.
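The NPC idea discussed above can be condensed into a minimal sketch, assuming a simplified one-dimensional cell layout (the actual P4IRS implementation is three-dimensional and generated as C++/CUDA): instead of one neighbor list per particle, one candidate list is built per cell, so all particles of a cell traverse the same list and SIMD lanes stay uniform.

```python
from collections import defaultdict

def build_cell_lists(positions, cell_size):
    # Bin particle indices into cells of edge length cell_size
    # (one-dimensional positions for brevity).
    cells = defaultdict(list)
    for i, x in enumerate(positions):
        cells[int(x // cell_size)].append(i)
    return cells

def neighbor_lists_per_cell(cells):
    # One shared candidate list per cell: the particles of the cell
    # itself plus those of its adjacent cells. Every particle in the
    # cell then iterates over the same candidate list, which improves
    # SIMD uniformity compared to per-particle lists.
    npc = {}
    for c, members in cells.items():
        candidates = []
        for nc in (c - 1, c, c + 1):
            candidates.extend(cells.get(nc, []))
        npc[c] = candidates
    return npc
```

The cutoff check against each candidate still happens in the force kernel; the per-cell list only narrows the candidates while keeping the iteration pattern identical for all particles of a cell.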
For the Multi-Kernel approach, no significant improvements are observed on the A40 GPU, with a time difference of less than 1% in the force calculation kernels. The time spent in other sections is significantly higher, leading to worse overall performance with the Multi-Kernel version; this may be caused by the overhead of asynchronously launching more kernels on the GPU. On the A100, the same behavior applies for the standard Linked Cells, but when combining the NPC and MK approaches, the force calculation improves by 13% compared to using NPC alone, leading to an overall performance improvement of about 6%. This example also illustrates the use of P4IRS for experimentation, as these different kernel versions can be generated with P4IRS without the effort of implementing each of them for the specific interaction model used.
Summary of performance results (in millions of particle updates per second) for each P4IRS and MESA-PD simulation run of the DEM benchmark on a single core of an Ice Lake CPU, on a single core of a Milan CPU, on a single A40 GPU, and on a single A100 GPU.
Finally, Figure 13 shows the weak-scaling results for the DEM case on Fritz. The scaling degrades as the node count increases, with about 91% efficiency on 32 nodes (2304 CPU cores) and 87% efficiency on 64 nodes (4608 CPU cores). The reason is likely work imbalance in the settling-spheres example when the simulation domain and the number of particles are extended. Note that the example itself can become more heterogeneous when scaling up, as the generated spheres have random radii, which can affect the number of interactions computed as well as the overall distribution of particles across sub-domains over the course of the simulation. In practice, the presented weak-scaling results are still acceptable, but they demonstrate that such particle simulations may not necessarily scale well for more realistic and complex setups. The purpose of this work, however, is to demonstrate that we can achieve state-of-the-art performance and (currently, under certain conditions) good scalability with P4IRS while providing simple and flexible means of modeling particle simulations, which is achieved via code generation techniques.
Figure 13. Weak-scaling results for P4IRS (FN) on Fritz; each node simulates 10,000 time-steps of the settling-spheres simulation, with approximately 7,800 particles per CPU core.
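For reference, the weak-scaling efficiency quoted above follows the usual definition: under weak scaling the problem size grows with the node count, so the ideal runtime stays constant, and efficiency is the single-node runtime divided by the measured runtime on N nodes. A minimal sketch (the runtimes here are placeholders, not the measured values):

```python
def weak_scaling_efficiency(t_base, t_n):
    # Weak scaling: per-node work is fixed, so the ideal runtime on N
    # nodes equals the single-node runtime; any increase is overhead.
    return t_base / t_n

# Example with placeholder runtimes: a run that takes 10% longer than
# the single-node baseline has an efficiency of ~0.909.
eff = weak_scaling_efficiency(100.0, 110.0)
```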
Conclusion and outlook
This paper introduced P4IRS, an intermediate representation and compiler for particle simulations that uses a single problem description in Python to generate efficient parallel C++ and CUDA code with OpenMP and MPI for shared- and distributed-memory parallelism. We presented and described P4IRS features with a focus on MD and DEM, illustrating how to set up simulations by providing the potential and/or contact model specification, the particle properties, and even the contact properties required for more complex DEM models such as the Linear Spring-Dashpot. Apart from features to model particle simulations, we also presented optimization strategies enabled by P4IRS, which are made possible through code generation. P4IRS is currently able to generate Linked Cells and Verlet Lists variants for each specified kernel, and these can also use specific optimization strategies such as the neighbor-lists per cell and the multi-kernel approach to enhance SIMD parallelism on CPUs and GPUs.
Results for MD kernels generated by P4IRS demonstrate that we achieve state-of-the-art performance compared to our performance-focused prototyping harness MD-Bench, which contains hand-tuned implementations of MD kernels in C and CUDA. Results on the modern clusters Fritz and JUWELS Booster further show the scaling capabilities of P4IRS, which are important for particle simulations in general. For more complex and realistic cases, we also show that P4IRS can generate code to solve DEM simulations with the Linear Spring-Dashpot contact model. The performance results demonstrate that P4IRS significantly outperforms our previous framework, MESA-PD, which can be attributed to aggressive specializations in the force calculation kernels of the contact model. Moreover, we demonstrate how P4IRS can be used for experimentation, since contact models or potentials can be generated with different optimization strategies without the effort of implementing new kernels, highlighting the advantages of code generation techniques.
In the future, we intend to couple P4IRS particle simulations with fluid simulations, which can be solved through the waLBerla framework. For this, we still need to generate proper interfaces for the coupling and produce modular code for the particle simulations that can be seamlessly integrated into other applications. We also want to include further performance optimizations in P4IRS, such as the Cluster Pair algorithm from GROMACS to enhance SIMD performance, as well as the ability to emit explicit SIMD intrinsics in P4IRS-generated kernels, both to enforce SIMD vectorization and to choose between different strategies (e.g., gathering data manually instead of using gather instructions). Another future plan is to support long-range interactions in P4IRS simulations, which could be achieved either by implementing methods in P4IRS to account for the long-range computation steps or by coupling the generated short-range code with existing long-range implementations. Finally, the separation of concerns introduced by P4IRS, together with its intermediate representation embedded in Python, provides many opportunities to explore and experiment with performance on many types of particle simulations across distinct research fields.
Acknowledgments
The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). This project has received funding from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under project number 433735254. Additionally, this work has also received funding from the European High Performance Computing Joint Undertaking (JU) and Sweden, Germany, Spain, Greece, and Denmark under grant agreement number 101093393.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: this work was supported by the Erlangen National High Performance Computing Center, Deutsche Forschungsgemeinschaft; 433735254 and European High Performance Computing Joint Undertaking; 101093393.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
