Sage Journals: Discover world-class research

Abstract

Current supercomputers often have a heterogeneous architecture using both conventional Central Processing Units (CPUs) and Graphics Processing Units (GPUs). At the same time, numerical simulation tasks frequently involve multiphysics scenarios whose components run on different hardware due to multiple reasons, e.g., architectural requirements, pragmatism, etc. This leads naturally to a software design where different simulation modules are mapped to different subsystems of the heterogeneous architecture. We present a detailed performance analysis for such a hybrid four-way coupled simulation of a fully resolved particle-laden flow. The Eulerian representation of the flow utilizes GPUs, while the Lagrangian model for the particles runs on conventional CPUs. Two characteristic model situations involving dense and dilute particle systems are used as benchmark scenarios. First, a roofline model is employed to predict the node level performance and to show that the lattice-Boltzmann-based Eulerian fluid simulation reaches very good performance on a single GPU. Furthermore, the GPU-GPU communication for a large-scale Eulerian flow simulation results in only moderate slowdowns. This is due to the efficiency of the CUDA-aware MPI communication, combined with the use of communication hiding techniques. On 1024 A100 GPUs, an overall parallel efficiency of up to 71% is achieved. While the flow simulation has good performance characteristics, the integration of the stiff Lagrangian particle system requires frequent CPU-CPU communications that can become a bottleneck, especially when simulating the dense particle system. Additionally, special attention is paid to the CPU-GPU communication overhead since this is essential for coupling the particles to the flow simulation. However, thanks to our problem-aware co-partitioning, the CPU-GPU communication overhead is found to be negligible. As a lesson learned from this development, four criteria are postulated that a hybrid implementation must meet for the efficient use of heterogeneous supercomputers.

Keywords

Hybrid implementation high-performance computing particulate flow lattice Boltzmann method discrete element method

1. Introduction

The Top500 list reports the most powerful supercomputers worldwide. The number of heterogeneous supercomputers, i.e., systems with additional accelerators such as Graphics Processing Units (GPUs), in the Top500 list has steadily increased in the last years, accounting for roughly 39% in June 2024.¹ Each node of such a heterogeneous supercomputer typically consists of one or many Central Processing Units (CPUs) and GPUs (Kim et al., 2012). Numerical multiphysics simulations are a powerful technique for conducting in-depth investigations of complex physical phenomena by providing detailed data, which is challenging, if not impossible, to collect in experiments. A significant challenge with these simulations is that they are computationally costly, which is why they are often run on supercomputers. Especially supercomputers containing GPUs have become increasingly popular for numerical simulations in recent years (Oyarzun et al., 2017; Rohr et al., 2014; Shimokawabe et al., 2017) as they offer unprecedented computing power. A common approach in the literature for utilizing such heterogeneous supercomputers for multiphysics simulation is hybrid implementation (Feichtinger, 2012; Kotsalos et al., 2021; Xu et al., 2012), i.e., different simulation modules running on different hardware. Several reasons are predestinating or sometimes even forcing hybrid implementations in the context of multiphysics simulations. First, the various combined methodologies in such multiphysics simulations may exhibit distinctly contrasting computational properties, e.g., problem sizes, parallel and sequential portions, conditionals, and branching. Therefore, the best-suited hardware architecture can differ between the simulation modules. Second, if one simulation module dominates the overall run time of the simulation, accelerating only this part using GPUs is a straightforward, pragmatic, and development time-saving alternative to porting the whole code base to GPU. Third, practical limitations can challenge multiphysics simulations: not all coupled software frameworks and modules may support the same hardware, e.g., if parts of the simulation use commercial software. While hybrid implementations are commonly used in practice for the aforementioned reasons, they introduce inherent challenges in performance and scalability.

A prominent example of multiphysics simulations is coupled fluid-particle simulations with fully resolved particles (i.e., multiple fluid cells per particle diameter, see Rettinger, 2023). Such simulations have been used in the literature, among others, to understand the formation and dynamics of dunes in river beds (Kidanemariam and Uhlmann, 2014; Rettinger et al., 2017; Schwarzmeier et al., 2023), analyze the chimney fluidization in a granular medium (Ngoma et al., 2018), investigate the erosion kinetics of soil under an impinging jet (Benseghier et al., 2020) and to analyze mobile sediment beds (Rettinger et al., 2022; Vowinckel et al., 2014). One encounters some of the previously mentioned reasons for hybrid implementations in the context of coupled fluid-particle simulations with fully resolved particles. First, when coupling a particle simulation with a fluid simulation using a fully resolved approach, the number of particles is typically at least three orders of magnitude smaller than the number of fluid cells, leading to an imbalance in the workload, where the fluid simulation can overwhelmingly dominate the total run time of the simulation. Therefore, accelerating only the fluid simulation using GPUs is a pragmatic approach for this application. Second, while the fluid simulation employed here uses a structured, cartesian-grid-based methodology and is therefore ideally suited for GPU parallelization (Holzer et al., 2021; Kuznik et al., 2010), the situation is more complex for the particle simulation. One extreme is molecular dynamics simulations using millions of particles but relatively simple and uniform particle-particle interactions, which are well suited for GPUs (Machado et al., 2021). The other extreme is particle simulations consisting of few but large particles with complex shapes and sophisticated and diverse particle-particle interactions (Iglberger and Rüde, 2010), assumably better suited for CPUs. The particle methodology of this paper lies somewhere in between. While spherical particles are used, the number of particles is comparably small (due to being fully resolved), and the particle-particle interactions are more sophisticated (e.g., lubrication corrections) than the typical molecular dynamics or Discrete Element Method (DEM) simulation on GPUs in the literature. This effect will increase when introducing more complex particle shapes or additional particle-particle interactions (such as cohesion) in the future.

Numerous fluid-particle simulations in the literature use hybrid parallelization in one way or another, of which we present a representative selection in the following. One example is simulating deformable bodies (using the finite element method) in a fluid, i.e., blood cells in cellular blood flow (Kotsalos et al., 2021). Here, the focus is more on structural mechanics, as the finite element method dominates the overall run time and is, therefore, the accelerated part of the code. Hybrid approaches have also been used in the literature to couple commercial Computational Fluid Dynamics (CFD) solvers on the CPU such as Ansys (He et al., 2020; Sousani et al., 2019), or AVL fire (Jajcevic et al., 2013) with the DEM on the GPU. Furthermore, simulations have been accelerated using hybrid parallelization to gain a deeper understanding of fluidized beds (Norouzi et al., 2017; Xu et al., 2012), or the effects of porosity and tortuosity on soil permeability (Sheikh and Pak, 2015). Another variant is to use the GPU for both the fluid and the particles, but the CPU supports the particle dynamics through collision detection (Junior et al., 2010). Furthermore, we point the reader’s attention to a related work utilizing a heterogeneous-hybrid approach, i.e., CPUs for the particles, while the fluid simulation is distributed on both the CPUs and GPUs (Feichtinger, 2012).

We chose the Partially Saturated Cells Method (PSM) for the coupling between the fluid and solid phase as it has been used successfully in the context of GPUs (Benseghier et al., 2020; Fukumoto et al., 2021) due to its Single Instruction, Multiple Data (SIMD) nature.

In this work, we postulate several performance criteria that a hybrid implementation has to meet in our opinion for an efficient use of a heterogeneous supercomputer without misusing resources. First, the overhead introduced by the hybrid implementation (i.e., the CPU-GPU communication) must be negligible compared to the overall run time. Second, the performance-critical part (in our case, the fluid simulation) should show good performance on the GPU. Third, the share of the GPU run time of the total run time must be sufficiently high to justify using a heterogeneous cluster. Fourth, it has to show a satisfactory weak scaling performance for the efficient use of multiple supercomputer nodes. One of the main contributions is introducing a multi-CPU-multi-GPU implementation for fully resolved fluid-particle simulations, fulfilling the four criteria stated above. To evaluate the presented implementation, an in-depth performance analysis on a state-of-the-art heterogeneous supercomputer is provided, comparing two contrasting cases of a fluidized bed simulation, namely in dense and dilute states of the solid granular phase. The paper focuses on the experience with the performance and scalability of the presented hybrid implementation on supercomputers rather than on the exposition of physical results. Coupled fluid-particle implementations in the literature using a comparable particle methodology still are often limited to CPUs (Kidanemariam and Uhlmann, 2014; Rettinger et al., 2017; Vowinckel et al., 2014). With this work, we report and study the experience with the performance and scalability of the presented hybrid implementation on a supercomputer to allow other researchers to make an informed decision if such a ‘partial’ acceleration might be a suitable approach for their codes.

2. Numerical methods

Generally, fully resolved coupled fluid-particle simulations consist of three modules: fluid dynamics, particle physics, and fluid-particle coupling. In this section, we introduce the background of the methods based on the work of Rettinger and Rüde (2017, 2022). Figure 1 illustrates coupled fluid-particle simulations using the PSM, as it will be explained in the upcoming sections.

Figure 1.

Two-dimensional sketch of coupled fluid-particle simulations using the PSM.

2.1. Lattice Boltzmann method

We use the Lattice Boltzmann Method (LBM) with the D3Q19 lattice model for the hydrodynamics simulation, an alternative to conventional Navier-Stokes solvers (Krüger et al., 2017). We evolve 19 Particle Distribution Functions (PDFs) $f_{q}$ with $q \in {0, \dots, 18}$ for every cell of a three-dimensional cartesian lattice. Each $f_{q}$ is associated with a lattice velocity $c_{q}$ . The underlying update rule is based on the Boltzmann equation, typically split into the streaming and collision steps. The cell-local collision relaxes the PDFs towards a thermodynamic equilibrium. The streaming propagates the post-collision PDFs $\tilde{f_{q}}$ to neighboring cells. The collision step for the lattice cell $x$ at time step $t$ is defined as:

\tilde{f_{q}} (x, t) = f_{q} (x, t) + C_{q} (x, t) + F_{q} (x, t),

(1)with

C_{q}

being the collision operator and

F_{q}

the forcing operator. The streaming step

f_{q} (x + c_{q} Δ t, t + Δ t) = \tilde{f_{q}} (x, t),

(2)distributes the PDFs to neighboring cells.

Δ t

is the time step size, which is typically 1 in the context of the LBM. Although recent studies have used more elaborate collision operators in the context of the PSM (Wang et al., 2018), the Single Relaxation Time (SRT) model (also known as the BGK model (Qian et al., 1992)) is still commonly applied in the context of the PSM, which we will introduce in the following. It relaxes the PDFs towards their equilibrium using a single relaxation time

τ

C_{q}^{SRT} (x, t) = \frac{Δ t}{τ} (f_{q}^{eq} (ρ_{f}, U_{f}) - f_{q} (x, t)) .

(3)The relaxation time

τ

is linked to the kinematic fluid viscosity

ν

by:

ν = (τ - Δ t / 2) c_{s}^{2} .

(4)The equilibrium is defined as:

\begin{align} f_{q}^{eq} (ρ_{f}, U_{f}) = w_{q} (ρ_{f} + ρ_{0} (\frac{c_{q} \cdot U_{f}}{c_{s}^{2}} + \frac{{(c_{q} \cdot U_{f})}^{2}}{2 c_{s}^{4}} \\ - \frac{U_{f} \cdot U_{f}}{2 c_{s}^{2}})), \end{align}

(5)for incompressible flows (He and Luo, 1997) with

ρ_{0} = 1

, the lattice weights

w_{q}

(Qian et al., 1992), and the lattice speed of sound

c_{s} = 1 / \sqrt{3}

. The cell-local quantities

ρ_{f} (x, t) = \sum_{q} f_{q} (x, t),

(6)

U_{f} (x, t) = \frac{1}{ρ_{0}} \sum_{q} f_{q} (x, t) c_{q} + \frac{Δ t}{2 ρ_{0}} f^{ext},

(7)are calculated based on the moments of the PDFs. The forcing operator

F_{q} (x, t) = Δ t w_{q} [\frac{c_{q} - U_{f}}{c_{s}^{2}} + \frac{c_{q} \cdot U_{f}}{c_{s}^{4}} \cdot c_{q}] \cdot f^{ext},

(8)can incorporate external forces using the constant force density

f^{ext}

(Ladd and Verberg, 2001).

2.2. Particle dynamics

The behavior of the particles is modeled using the DEM (Cundall and Strack, 1979). The total force $F_{p, i}$ acting on a particle $i$ consists of the following modules:

F_{p, i} = F_{p, i}^{col} + F_{p, i}^{hyd} + F_{p, i}^{ext} .

(9)In addition to the hydrodynamic force

F_{p, i}^{hyd}

and external forces

F_{p, i}^{ext}

(e.g., gravity), the particle interactions exert forces

F_{p, i}^{col}

on each other due to collisions. The equations of motion have to be integrated to simulate the particle movements.

2.2.1. Particle interactions using the discrete element method

The collision between particle $i$ and $j$ is modeled using a linear spring-dashpot model. The collision force $F_{p, i}^{col}$ and torque $T_{p, i}^{col}$ on particle $i$ are computed as:

F_{p, i}^{col} = \sum_{j, j \neq i} (F_{i j, n}^{col} + F_{i j, t}^{col}),

(10)

T_{p, i}^{col} = \sum_{j, j \neq i} (x_{i j}^{cp} - x_{p, i}) \times F_{i j, t}^{col},

(11)where the normal part of the collision force

F_{i j, n}^{col}

acting on particle

i

with position

x_{p, i}

is computed as:

F_{i j, n}^{col} = - k_{n} δ_{i j, n} n_{i j} - d_{n} U_{i j, n}^{cp} .

(12)Here,

k_{n}

and

d_{n}

are the normal stiffness and damping coefficients,

n_{i j}

the normal vector,

δ_{i j, n}

is the penetration depth and

U_{i j, n}^{cp}

is the normal component of the relative velocity of the surface of the particle at the contact point

x_{i j}^{cp}

. The tangential part of the collision force

F_{i j, t}^{col} = - k_{t} δ_{i j, t} - d_{t} U_{i j, t}^{cp},

(13)uses the tangential stiffness and damping coefficients

k_{t}

and

d_{t}

and

U_{i j, t}^{cp}

is the tangential component of the relative velocity of the surface of the particle at the contact point.

δ_{i j, t} = \int_{t_{i}}^{t} U_{i j, t}^{cp} (t^{'}) d t^{'}

(14)is the accumulated relative tangential motion between two particles where

t_{i}

is the time step of the impact. For more details, see Rettinger and Rüde (2022).

2.2.2. Integration of the particle properties

We update the particle’s position and velocity by solving the Newton-Euler equations of motion using the Velocity Verlet integrator:

x_{p, i} (t + Δ t_{p}) = x_{p, i} (t) + Δ t_{p} U_{p, i} (t) + \frac{Δ t_{p}^{2}}{2 m_{p, i}} F_{p, i} (t),

(15)

\begin{align} U_{p, i} (t + Δ t_{p}) = U_{p, i} (t) + \frac{Δ t_{p}}{2 m_{p, i}} (F_{p, i} (t) \\ + F_{p, i} (t + Δ t_{p})), \end{align}

(16)where

m_{p, i}

is the mass of the particle

i

x_{p, i} (t + Δ t_{p})

is computed at the beginning of each particle time step using the old force. Then, the new force

F_{p, i} (t + Δ t_{p})

is computed using the updated position. At the end of the time step, the particle velocity

U_{p, i} (t + Δ t_{p})

is computed using the updated force. Updating the angular velocity is done analogously.

2.3. Fully resolved fluid-particle coupling method

The task of the coupling is to perform momentum exchange between the fluid and the solid phase. We use the PSM for the fully resolved fluid-particle coupling (Noble and Torczynski, 1998). It modifies the LBM collision step from equation (1) by introducing the solid volume fraction $B (x, t)$ resulting in:

\begin{align} \tilde{f_{q}} (x, t) & = f_{q} (x, t) + C_{q}^{PSM} (x, t) \\ + (1 - B (x, t)) F_{q} (x, t), \end{align}

(17)where

B (x, t)

is the fraction of the fluid cell

x

being (partly) covered by one or more particles. Section 2.3.2 explains this solid volume fraction computation in detail. The modified collision operator

C_{q}^{PSM}

used in equation (17) is defined as:

\begin{align} C_{q}^{PSM} (x, t) & = (1 - B (x, t)) C_{q}^{SRT} (x, t) \\ + \sum_{I} B_{i} (x, t) C_{q, i}^{solid} (x, t) . \end{align}

(18)

C_{q}^{SRT}

is the LBM collision operator described in equation (3).

B (x, t)

is the sum over the individual overlap fractions

B_{i} (x, t)

of all particles

I

. If

B (x, t) > 1

, it is normalized to 1. This situation can occur if colliding particles are allowed to overlap during contact. Then a single fluid cell can even be entirely covered by two particles, i.e.,

B (x, t) = 2

. The solid collision operator

\begin{align} C_{q, i}^{solid} (x, t) & = [f_{\bar{q}} (x, t) - f_{\bar{q}}^{eq} (ρ_{f}, U_{f})] - [f_{q} (x, t) \\ - f_{q}^{eq} (ρ_{f}, U_{p, i} (x, t))], \end{align}

(19)acts when particles intersect with a cell. There exist different variants of the solid collision operator.

f_{\bar{q}}

corresponds to the inverse lattice velocity of

f_{q}

U_{p, i} (x, t)

is the velocity of particle

i

evaluated at the cell center

x

and is computed as:

U_{p, i} (x_{i}, t) = U_{p, i} (t) + Ω_{p, i} (t) \times (x_{i} - x_{p, i} (t)),

(20)with the translational particle velocity

U_{p, i} (t)

, the rotational particle velocity

Ω_{p, i} (t)

and the particle center of gravity

x_{p, i} (t)

x_{i}

are the cell centers of all cells intersecting with the particle.

So far, we have only considered the influence of the particles on the fluid. However, the fluid also influences the particles through hydrodynamic forces. We compute the force $F_{p, i}^{fp} (t)$ and torque $T_{p, i}^{fp} (t)$ exerted by the fluid on particle $i$ as:

F_{p, i}^{fp} (t) = \frac{{(Δ x)}^{3}}{Δ t} \sum_{x_{i}} [B_{i} (x_{i}, t) \sum_{q} (C_{q, i}^{solid} (x_{i}, t) c_{\bar{q}})],

(21)

T_{p, i}^{fp} (t) = \frac{{(Δ x)}^{3}}{Δ t} \sum_{x_{i}} [B_{i} (x_{i}, t) (x_{i} - x_{p, i}) \times \sum_{q} (C_{q, i}^{solid} (x_{i}, t) c_{\bar{q}})] .

(22)

2.3.1. Lubrication correction

The lubrication force and torque act on two particles approaching each other. The two particles squeeze out the fluid inside the gap, which exerts a force in the opposite direction of the relative motion. However, this effect would only be covered correctly by the fluid-particle coupling for a very fine grid resolution, which is computationally too expensive. As the lubrication force has a significant influence, we compute lubrication correction force terms to compensate for the inability of the coupling method to represent these forces correctly. We compute lubrication correction terms due to normal- and tangential translations and rotations. Therefore, the total hydrodynamic force $F_{p, i}^{hyd}$ and torque $T_{p, i}^{hyd}$ is a sum of the force from the fully resolved fluid-particle coupling method (Section 2.3), and the lubrication correction:

F_{p, i}^{hyd} = F_{p, i}^{fp} + F_{p, i}^{lub,cor},

(23)

T_{p, i}^{hyd} = T_{p, i}^{fp} + T_{p, i}^{lub,cor} .

(24)For details on how to compute the lubrication correction, see Rettinger and Rüde (2022).

2.3.2. Particle mapping

A coupled fluid-particle simulation using the PSM requires the computation of the solid volume fraction $B_{i} (x, t)$ (Section 2.3), i.e., the fraction of a fluid cell $x$ being (partly) covered by a particle $i$ . We restrict ourselves to spherical particles. Jones and Williams (2017) tackle the problem that, in general, no unique analytical solution exists to compute this overlapping fraction for spheres and cells/cubes. They propose a linear approximation derived from the analytical solution for a specific cell orientation relative to the particle surface. Grid cells with the dimensionless edge size 1 (as it is typically the case for the LBM) are assumed in the following.

In Figure 2, the overlap fraction $ϵ$ between the upper blue lattice cell and the orange particle is computed as:

ϵ = V_{i} = V_{a} - V_{b} = V_{a} - (D + r - 1 / 2),

(25)where

V_{a}

is the union of

V_{i}

and

V_{b}

D

is the distance from the cell center to the sphere surface (negative if the cell center lies inside the sphere). There is a cell-particle overlap if

ϵ \in] 0, 1]

. We can reformulate this as:

ϵ = V_{a} - (D + r - 1 / 2) = - D + f (r),

(26)where

f (r) = V_{a} - r + 1 / 2

only depends on the particle radius

r

and therefore is constant for each particle respectively.

V_{a}

is computed as:

\begin{align} V_{a} = \int_{- 1 / 2}^{1 / 2} \int_{- 1 / 2}^{1 / 2} \sqrt{r^{2} - x^{2} - y^{2}} d x d y & = (1 / 12 - r^{2}) \tan^{- 1} (\frac{\frac{1}{2} \sqrt{r^{2} - 1 / 2}}{1 / 2 - r^{2}}) + \frac{1}{3} \sqrt{r^{2} - 1 / 2} \\ + (r^{2} - 1 / 12) \tan^{- 1} (\frac{1 / 2}{\sqrt{r^{2} - 1 / 2}}) \\ - \frac{4}{3} r^{3} \tan^{- 1} (\frac{1 / 4}{r \sqrt{r^{2} - 1 / 2}}) . \end{align}

(27)This approximation yields accurate results also for arbitrary cell orientations and is more computationally efficient than contemporary techniques like sub-division sampling (Jones and Williams, 2017).

Figure 2.

The linear approximation yields the analytical solution for the blue cells. The particle is represented by the orange circle. Note that the grid is coarsened for better clarity.

3. Implementation

We implemented our hybrid coupled fluid-particle simulation within the massively parallel multiphysics framework waLBerla (Bauer et al., 2021a) (https://www.walberla.net/). waLBerla supports highly efficient and scalable LBM simulations on both CPUs and GPUs (Holzer et al., 2021). The MesaPD module (Eibl and Rüde, 2019) enables waLBerla to perform particle simulations on CPUs using the DEM. Large-scale simulations require the usage of numerous nodes of a supercomputer. Each node of a heterogeneous supercomputer typically consists of one or many CPUs and GPUs. Several CPU cores belong to one GPU, i.e., they have a direct connection. We divide the simulation domain into multiple blocks and exclusively assign each block to a GPU. Figure 3 illustrates the domain partitioning and the respective hardware assignment.

Figure 3.

Partitioning of a 2D simulation domain into four blocks. The circles with ID 0 to ID 9 indicate the particles, and the blue cells are the fluid. One GPUx for updating the fluid cells is assigned to each block x, and the corresponding CPUx cores are responsible for the particle dynamics. CPUx represents the CPU cores having a direct connection/affinity to GPUx. MPI rank x is assigned to CPUx, distributes the particle computations among CPUx using OpenMP, and uses GPUx for the fluid dynamics.

We want to highlight that the particle simulation is not a standalone framework, but for the particles and the fluid that are physically close to each other (i.e., in the same block), one MPI process is responsible for both the particle and fluid dynamics and the CPU and GPU have a direct connection/affinity. The CPU cores belonging to the respective GPU are responsible for updating the particles whose center of mass lies inside that block (local particles). In Figure 3, CPU0 is responsible for the particles with ID 2, 5, and 7. Additionally, particles can overlap with a given block whose center of mass lies in another block (ghost particles). In Figure 3, the particle with ID 5 is local for block 0 and ghost for block 2. This overlapping causes the need for communication between the CPUs. The particle computations within a block are parallelized among the CPU cores using OpenMP. The communication between neighboring blocks is implemented using the CUDA-Aware Message Passing Interface. On clusters with multiple GPUs sharing a node and NVLinks between the GPUs, NVIDIA GPUDirect is used for direct GPU-GPU MPI communications between the GPUs on the node. Figure 4 illustrates the different modules of the simulation, on which hardware they are running, the workflow, and the necessary communication steps. We will explain the figure in detail in the following sections. Generally speaking, the GPU is responsible for all operations on fluid cells (i.e., the LBM and the coupling), whereas the CPU performs all computations on particles. The associated data structures are consequently located in the respective memories (fluid cells in GPU memory, particles in CPU memory).

Figure 4.

Flowchart of our hybrid CPU-GPU implementation from the perspective of a CPU and GPU responsible for the same block. The color coding indicates the communication types required within each step.

3.1. Fluid dynamics and coupling on the GPU

Performing an LBM update step requires the communication of boundary cells between neighboring GPUs. However, the first three kernels do not need neighboring information. Therefore, the communication is hidden behind those kernels by starting a non-blocking send before the particle mapping. A time step begins with the coupling from the particles to the fluid.

3.1.1. Coupling from the particles to the fluid

For the particle mapping, the GPU has to check overlaps for all cell-particle combinations, even though there is no overlap for most cell-particle combinations. This check quickly becomes very computationally expensive. Therefore, we reduce the computational effort by dividing each block into $k$ sub-blocks in each dimension. The CPU inserts every particle into all sub-blocks that overlap with this particle, similar to the Linked Cell Method. Using sub-blocks allows the GPU to consider only a tiny subset of particles when computing the overlaps for a particular grid cell, namely the particles overlapping with the sub-block the cell is located in. Our coupled fluid-particle simulation requires the communication of various data. Figure 5 gives an overview of the different types of communication from the perspective of a CPU and GPU responsible for the same block. We will explain the communication steps in the following. For every particle $i$ , the position $x_{p, i}$ , radius $r_{i}$ , and $f (r_{i})$ (Section 2.3.2) are transferred from the CPU to the GPU via the PCIe. In addition, the number of overlapping particles per sub-block and the corresponding particle IDs are transferred from the CPU to the GPU. Then, the GPU performs the particle mapping. For more details, see the description of the solid volume fraction computation (i.e., the particle mapping) in Section 2.3.2. In our simulation, a maximum of two particles can overlap with a given grid cell due to the geometrically resolved spherical particles and appropriate DEM parameters allowing only a small particle-particle penetration. Therefore, the grid that we use to store $B_{i} (x, t)$ , can store two fraction values per grid cell. Next, the linear and angular velocities $U_{p, i}$ and $Ω_{p, i}$ of the particles must be synchronized between the CPUs such that every CPU has not only the correct velocities for its local particles but also for the ghost particles. Next, those velocities are transferred from the CPU to the GPU so that the GPU can compute the velocities of the overlapping particles at the cell center for every cell (equation (20)).

Figure 5.

Overview of the different communication steps from the perspective of a CPU and GPU responsible for the same block.

3.1.2. Fluid simulation

Next, the PSM inner kernel is performed. The term ‘inner’ indicates that this kernel updates all cells except the outermost layer of cells. Skipping the outermost layer ensures that this routine can be called without waiting for the previously started GPU-GPU communication to finish. The PSM kernel creates the highest workload of the entire simulation. It is, therefore, performance-critical. We use the code generation framework lbmpy (Bauer et al., 2021b; Hennig et al., 2023) to obtain highly efficient and scalable LBM CUDA kernels. lbmpy allows the formulation of arbitrary LBM methods (such as the PSM) as a symbolic representation and generates optimized and parallel compute kernels. We integrate those generated compute kernels within the simulation in waLBerla. Next, we wait for the non-blocking GPU-GPU communication started at the beginning of the time step to finish. Depending on the available hardware, this may return instantly if the previous computations completely hide the communication. The next step is the LBM boundary handling, which enforces boundary conditions to the fluid simulation by correctly updating the fluid cells at the domain’s boundary. Since the neighboring values are now available, we then update the outermost layer of cells in the PSM outer kernel. The last step on the GPU is the coupling from the fluid to the particles.

3.1.3. Coupling from the fluid to the particles

Finally, the GPU reduces the forces and torques exerted by the fluid on the particles $F_{p, i}^{fp}$ and $T_{p, i}^{fp}$ (equation (21) and (22)). Then, $F_{p, i}^{fp}$ and $T_{p, i}^{fp}$ are transfered from the GPU to the CPU to be available for the upcoming DEM simulation on the CPU. A single particle may overlap with cells located on multiple blocks. Thus, multiple GPUs may have computed $F_{p, i}^{fp}$ and $T_{p, i}^{fp}$ for the same particle $i$ . Therefore, the corresponding CPUs have to reduce these force and torque contributions exerted by the coupling on the particles into a single variable (CPU-CPU communication). The time loop continues with the particle dynamics on the CPU.

3.2. Particle dynamics on the CPU

The first step of the Particle Dynamics (PD) simulation on the CPU is the pre-force integration of the velocities to update the particle positions (equation (15)). The latter particle movement requires synchronization between the CPUs to account for the position update, which potentially moves particles from one block to another, making other CPUs responsible for the particles. Computing particle-particle interactions by iterating over all particle pairs can quickly become very expensive due to its $O (n^{2})$ complexity. Therefore, we insert the particles into linked cells such that iterating over the particle pairs is limited to neighboring linked cells. The linked cell size limits the maximum distance for which correct particle-particle interactions can be ensured (collision, lubrication correction). The linked cells have a size of 1.01 times the particle diameter, which is close to the smallest size that still leads to a correct collision detection. Next, the lubrication correction routine is applied to all particle pairs with particles close to each other but not yet in contact (Section 2.3.1). The particle-particle interactions are modeled using the DEM collision kernel (linear spring dashpot), which exerts forces and torques on overlapping particles. The collision kernel needs history information from the previous time step, i.e., the accumulated tangential motion between the two colliding particles (equation (14)). Since different processes may have handled the previous collisions, reducing the collision histories between the CPUs is necessary, i.e., collecting all collision histories for a given particle in one process. Then, the hydrodynamic forces and torques exerted by the fluid on the particles and the gravitational force are added to the total force. Since different processes may have added forces and torques to a given particle, those contributions have to be collected in one process (another CPU-CPU communication). Then, the post-integration is applied to update the particle velocities (equation (16)). Communication can be omitted here because the velocities are unused until the subsequent communication in the upcoming sub-cycle. Typically, $j$ particle sub-cycles are performed per time step since the DEM requires a finer resolution in time than the LBM for an accurate contact representation (Rettinger and Rüde, 2022). After completing $j$ sub-cycles, the next time step starts with the fluid dynamics on the GPU.

Running the fluid dynamics and coupling on the GPU first and the particle dynamics on the CPU second seems to be a promising candidate for overlapping the CPU and GPU computations to gain some performance improvements. Therefore, we elaborate on this possibility in the remainder of this section. The following will refer to the different simulation modules as they are named in Figure 4.

Under the condition that the numerical error must not increase, only some parts of the CPU and GPU computations can overlap, others cannot due to the dependencies of the two-way coupling. There are two ways of overlapping the CPU and GPU parts: within a time step or between subsequent time steps. Only the post-force integration step of the particle simulation on the CPU in time step $n$ depends on the reduction of the hydrodynamic forces on the GPU in time step $n$ , and therefore, the steps from pre-force integration to the application of external forces can, in principle, be overlap with the GPU part. The particle mapping on the GPU at the beginning of time step $n + 1$ requires the updated particle positions of the pre-force integration in time step $n$ . Therefore, the steps from inserting the particles into linked cells until the post-force integration in time step $n$ can, in principle, overlap with the particle mapping on the GPU in time step $n + 1$ . However, the particle simulation typically consists of several sub-cycles (i.e., the CPU part in Figure 4 is executed several times per time step), and only the respective parts of the first and last sub-cycle can be overlapped. Therefore, the maximum achievable speedup due to overlapping depends on the number of sub-cycles. In accordance with the literature (Rettinger and Rüde, 2022), we use ten sub-cycles, which could potentially decrease the run time of the particle simulation on the CPU by 20% if the first and last sub-cycles could be entirely overlapped. However, this is an optimistic assumption because the interruption of the consecutive particle sub-cycles by the overlapping might negatively affect the possibility of caching particle data between consecutive sub-cycles. If one disregards the requirement that numerical error must not increase, one could use results from previous time steps, allowing both the fluid simulation on the GPU and the particle simulation on the CPU to run entirely in parallel. This might result in acceptable minor errors for systems with negligible particle movements, but this is not generally applicable and would require a detailed error analysis.

4. Performance analysis

We use the Juwels Booster cluster for the performance evaluation. Each GPGPU node consists of four Nvidia A100 40 GB GPUs, two AMD EPYC 7402 CPUs (24 cores per chip), and eight NUMA domains. Thus, each CPU is shared by two GPUs and is divided into four NUMA domains, with six cores belonging to one NUMA domain. Four out of the eight NUMA domains have independent PCIe lanes to the four GPUs on Juwels Booster, ensuring that CPU-GPU communications do not interfere between the GPUs. The following will refer to a GPU and an associated (i.e., directly connected via a PCIe lane) NUMA domain with its six cores as a CPU-GPU pair. All GPUs within a node are connected via NVLinks, allowing direct GPU-GPU communication. For communication between GPUs not sharing a node, PCIe lanes connect each GPU to its own Mellanox HDR200 InfiniBand ConnectX 6 adapter. A SCALE kernel (i.e., a 1:1 read/write ratio) yields a memory bandwidth of about 1400 GB/s for the A100 40 GB (Ernst et al., 2023). We use 20 cells per diameter to geometrically resolve the particles (Rettinger et al., 2022; Rettinger and Rüde, 2022; Biegert et al., 2017; Costa et al., 2015). The upcoming sections first introduce the computational properties of the simulated cases, followed by their performance results.

4.1. Simulation setups

In order to study the performance of the here introduced hybrid coupled fluid-particle implementation, we are using a fluidized bed simulation. We compare two cases: the dilute case and the dense case. They exhibit different characteristics regarding the number of particles per volume and the number of particle-particle interactions. We choose these two cases to investigate how different particle workloads on the CPUs influence the overall performance of the hybrid CPU-GPU implementation. We use 10 particle sub-cycles (Section 3) per time step. We discretize the domain using $500 \times 200 \times 800 = 80 \times 1 0^{6}$ fluid cells. The velocity (inflow) Boundary Condition (BC) on the bottom and pressure (outflow) BC on top of the domain govern the fluid dynamics. The remaining four boundaries are no-slip conditions ensuring a zero fluid velocity, i.e., the domain is not periodic. The particle Reynolds number is 1.0, the Galileo number is around 8.9, the gravitational acceleration is $9.81 m / s^{2}$ , and the particle fluid density ratio is 1.1. Planes surrounding the domain prevent the particles from leaving the domain by acting as walls that form a box. The dilute case contains 627 particles. Figure 6(a) illustrates the dilute setup. Due to the low particle concentration, the effort for computing the particle-particle interactions (collisions and lubrication corrections) is low in the dilute case. The dense case is generally the same setup as the dilute case, except that the particle concentration is significantly higher, limiting the fluidization (Figure 6(b)), resulting in 8073 particles, almost 13 times more than in the dilute case.

Figure 6.

Visualization of the consolidated fluidized bed setup running on one CPU-GPU pair. For the fluid field, only a two-dimensional slice is visualized.

4.2. Performance results

In this section, we present and analyze the performance results. We first look at the individual run times of the different simulation modules to understand the bottlenecks. Furthermore, we present a weak scaling benchmark for both cases up to 1024 CPU-GPU pairs. In addition, we show strong scaling results for three different problem sizes up to 1024 CPU-GPU pairs. Finally, we demonstrate the acceleration potential of hybrid implementations by comparing it to a large-scale CPU-only simulation from the literature. For all results, we average over 500 time steps. In the following, we will refer to the performance criteria formulated in Section 1.

4.2.1. Run times of different simulation modules

We investigate the run times of the different simulation modules to analyze the overhead introduced by the hybrid implementation, i.e., the CPU-GPU communication, assess the GPU performance, and detect the overall bottlenecks. To evaluate the performance of the PSM kernel on the GPU, we employ a roofline model (Hager and Wellein, 2010) for the LBM kernel. We determine the maximal possible performance for the given kernel and hardware when exploiting the maximal memory bandwidth (i.e., the performance ‘lightspeed’ estimation). The PSM kernel comprises the LBM kernel plus additional memory transfers depending on the number of overlapping particles. As this number differs from cell to cell, the PSM roofline model would not be straightforward. Therefore, we analyze the LBM model only, keeping in mind that this is a too optimistic performance estimation. Since we use the D3Q19 lattice model, we read and write 19 PDFs per cell and time step. This results in 19 reads and 19 writes (double-precision), i.e., 304 bytes to update one lattice cell (Feichtinger et al., 2015). The domain consists of 8e7 fluid cells. This results in the following minimal run time per time step according to the roofline model:

\begin{aligned} T_{\min} & = \frac{304 B/cell \cdot 8 e 7 cells/time step}{1400 GB/s} \\ = 17.4 ms/time step . \end{aligned}

(28)

We divide the total run time into the following modules: the PSM kernel (PSM), the CPU-GPU communication (comm), the particle mapping (mapping), setting the particle velocities (setU), reducing the hydrodynamic forces $F_{p, i}^{fp}$ and torques $T_{p, i}^{fp}$ on the particles (redF), computing the PD and finally the remaining tasks (other), e.g., the LBM boundary handling. Figure 7 reports the run times per time step for these modules using a single CPU-GPU pair. Furthermore, a dashed horizontal line indicates the run time of the PSM kernel for a simulation without any particles in the domain. While this physically behaves precisely like an LBM kernel, it allows the quantification of the overhead due to the different code structure of the PSM compared to the LBM kernel without the additional effort due to particles inside the domain.

Figure 7.

Individual run times of the different simulation modules on a CPU-GPU pair.

In the dilute case, the PSM kernel needs about 42% more time per time step than the LBM lightspeed estimation, and the dense case 83% more time. Compared to the PSM kernel without particles, the dilute case needs about 4% more time per time step, and the dense case 34% more time. The CPU-GPU communication is negligible for both cases. All modules take longer in the dense case than in the dilute case. While in the dilute case, the PSM kernel accounts for the majority of the run time, the PD simulation needs more time than the PSM kernel in the dense case. Still, most of the run time is spent on GPU routines in the dense case.

The overhead introduced by the hybrid implementation is negligible because we only transfer a small amount of double-precision values per particle but no fluid cells, and thanks to the problem-aware co-partitioning. The overhead, therefore, shows that a hybrid parallelization with the presented technique is a viable approach, and the first criterion is met. The performance of the PSM kernel is close to utilizing the total memory bandwidth of the A100, especially considering that the given roofline model does not take the memory traffic due to the solid part of the PSM kernel into account. Therefore, the second criterion is met. There is a significant performance gain for the PSM kernel on the GPU compared to a CPU-only implementation since the PSM kernel is utilizing almost the entire memory bandwidth, which is significantly lower on CPUs. Even in the dense case, the GPU run time accounts for most of the total time because the fluid and coupling workload is much higher than the particle workload, even though we use 10 particle sub-cycles. Therefore, the third criterion is met. The PSM kernel exhibits two remarkable effects that we analyze in detail below using NVIDIA Nsight Compute.²

First, the significant difference between the LBM main memory roofline (horizontal line) and the PSM kernel without particles (dashed horizontal line) in Figure 7 indicates that the PSM kernel cannot fully utilize the main memory bandwidth as the memory transfers of the PSM kernel in the absence of particles are very similar to the ones assumed in the LBM main memory roofline. This is noticeable because the PSM is an extension of the LBM, and it is known that the LBM can nearly fully utilize the main memory bandwidth on the A100 architecture (Holzer et al., 2024; Lehmann et al., 2022). The PSM kernel uses 196 registers per thread. This large number of registers per thread heavily limits the number of warps per Streaming Multiprocessor (SM), leading to a maximum possible occupancy of the SMs of only 12.50%. This low occupancy is insufficient to issue enough load/store instructions to fully exploit the main memory bandwidth, resulting in a difference between the PSM without particles and the LBM memory roofline.

Second, while the PSM kernel is in the dilute case only slightly slower than without particles, this difference gets more significant in the dense case. Several effects differentiate the dilute and dense cases. First, more particles increase the unfavorable GPU workload, i.e., warp divergence and coalesced access. The branch instructions increase by 2.06% in the dilute case and by 18.18% in the dense case compared to the PSM kernel without particles. The number of excessive sectors due to uncoalesced global accesses increases by 78.49% from the dilute case to the dense case resulting in 21% of the total sectors in the dense case. Besides that, the number of instructions issued increases by 4.32% in the dilute case and 39.51% in the dense case, which is in a similar range than the run time increase, corresponding to the instruction boundness of the code mentioned in the first effect (i.e., the low occupancy of the SMs limits the issued instructions). Since the increase in main memory traffic is 0.98% (dilute case) and 12.15% (dense case) and therefore smaller than the run time increase, the main memory utilization drops from 65.85% (dilute case) to 52.56% (dense case).

4.2.2. Weak scaling

When increasing the simulation domain further to simulate physically relevant scenarios, using a single CPU-GPU pair is often insufficient. Instead, multiple pairs or even multiple nodes of a supercomputer must be used. Therefore, a satisfactory weak scaling is desirable. For weak scaling, the problem size is increased with an increasing number of CPU-GPU pairs, keeping the workload per CPU-GPU pair constant. When having a perfect weak scaling, the performance per CPU-GPU pair stays constant, independent of the number of CPU-GPU pairs used. In the context of LBM, MLUPs is a standard performance metric for weak scaling (Holzer et al., 2021), meaning how many lattice cell updates the hardware performs per second. We use the total run time for computing the MLUPs, containing both the CPU (particles) and the GPU (fluid, coupling) time. For the weak-scaling plots, we have conducted at least three benchmarking runs and will use the best sample in the following. We start with a single CPU-GPU pair and a single domain block as described in Section 4.1. We then double the number of CPU-GPU pairs successively until we reach 1024. At the same time, we double the domain blocks/size alternately in each direction: $2 \times 1 \times 1$ blocks (two GPUs), $2 \times 2 \times 1$ blocks (four GPUs), $2 \times 2 \times 2$ (eight GPUs), etc. Figures 8 and 9 report the weak scaling for both cases. Dashed lines indicate the ideal curves. To our best knowledge, this is the most extensive weak scaling of a hybrid fluid-particle implementation presented in the literature.

Figure 8.

Weak scaling performance for both cases up to 1024 CPU-GPU pairs.

Figure 9.

Weak scaling parallel efficiency for both cases up to 1024 CPU-GPU pairs.

We observe a roughly three times higher performance for the dilute case than for the dense case. Both cases show a parallel efficiency decrease particularly strong in the beginning. The parallel efficiency is 71% in the dilute case and 53% in the dense case when using 1024 CPU-GPU pairs, which corresponds to a domain size of $8000 \times 1600 \times 6400 = 8.192 \times 1 0^{10}$ fluid cells. A similar scaling behavior has been observed in the literature both for other hybrid (Kotsalos et al., 2021) and CPU-only fluid-particle implementations (Rettinger et al., 2017).

Interpreting the overall weak scaling behavior requires more detailed analysis of the scaling of the different simulation modules. When using a single CPU-GPU pair, the dominating routines are the PD, the PSM kernel, and the coupling (i.e., particle mapping, setting the particle velocities and reducing the hydrodynamic forces $F_{p, i}^{fp}$ and torques $T_{p, i}^{fp}$ on the particles). Additionally, we must now consider the communication (comm) overhead that arises from using multiple CPU-GPU pairs. On the one hand, this is the PD communication (CPU-CPU communication), on the other hand, the PSM communication (GPU-GPU communication). Figures 10 and 11 show the weak scaling behavior of the dominating simulation modules for both cases. The communication numbers cover the communication steps themselves, but also load imbalances between two communication steps.

Figure 10.

Weak scaling performance of the dominating modules for the dilute case up to 1024 CPU-GPU pairs.

Figure 11.

Weak scaling performance of the dominating modules for the dense case up to 1024 CPU-GPU pairs.

The different modules show similar qualitative scaling behavior when comparing the two cases. The PSM kernel scales quite well in both cases. The corresponding GPU-GPU communication (PSM comm) is negligible. The PD run time increases initially and then shows saturation. The CPU-CPU communication (PD comm) increases drastically, surpassing the run time of the PSM kernel and the coupling in the dense case. The PD and the corresponding communication are more relevant for the overall scaling in the dense case than in the dilute case. The coupling scales similarly to the PD.

The PSM workload per GPU stays constant in the weak scaling, explaining the nearly perfect scaling. Since we are hiding the PSM communication (Section 3), we expect it to be negligible. We expect the PD and the coupling run time to increase initially because the number of neighboring blocks increases. More neighboring blocks lead to more ghost particles per block, resulting in a higher workload. This effect decreases when blocks have neighbors in all directions, resulting in an almost linear scaling from this point on. This phenomenon has been reported in the literature (Rettinger et al., 2017). The methodology requires 10 particle sub-cycles per time step and three communication steps per sub-cycle for a physically accurate simulation. Additionally, the simulation requires two CPU-CPU communication steps per time step apart from the sub-cycles (Figure 4). We have 32 CPU-CPU communication steps per time step, which cannot be hidden behind other routines. This becomes the dominating factor for the decrease of the overall weak scaling performance in both cases.

To gain more profound insights into the reasons for this significant increase of the PD comm time in the weak scaling and to estimate the scaling behavior beyond 1024 CPU-GPU pairs, we investigate the data transfers of the PD comm in more detail for the dense case. The PD comm consists of two parts: the synchronization of particle quantities between processes and the reduction of particle quantities (Section 3.2). In the following, we focus on the dominating part, the synchronization. Figure 12 plots the maximum and average amount of data a process either sends to or receives from its neighboring processes per synchronization call. The two arrows indicate the steps in which the weak scaling doubles the domain in the y-direction (the out-of-plane direction normal to the cross-section in Figure 6(a)) for the first two times. Furthermore, the figure illustrates the maximum number of synchronization (i.e., communication) partners per process.

Figure 12.

Weak scaling of the particle synchronization in the dense case up to 1024 CPU-GPU pairs. Both the maximum and average amount of data (in bytes) a process either sends to or receives from its neighboring processes per synchronization call are illustrated (left y-axis), and the maximum number of synchronization partners (right y-axis).

The maximum number of communication partners increases initially, reaching 26 starting from $2^{6}$ processes and then staying constant. The maximum amount of data a process sends to or receives from its neighboring processes increases to $2^{5}$ processes and then saturates. The maximum data volume per process already saturates one process earlier than the maximum number of communication partners. The average data volume per process increases until $2^{10}$ processes. However, the increase from $2^{0}$ to $2^{5}$ processes is larger than from $2^{5}$ to $2^{10}$ . For both the maximum and average data curves, the most significant spikes can be observed when going from $2^{1}$ to $2^{2}$ and from $2^{4}$ to $2^{5}$ processes, i.e., when the weak scaling doubles the domain size in the y-direction for the first two times. The maximum and average communication data show correlation with each other, as well as with the PD comm time in Figure 11.

The maximum number of communication partners is bounded from above at 26 because this is the maximum number of neighboring blocks in 3D (including blocks that only share a corner) because a $3 \times 3 \times 3$ setup of blocks consists of the center block and 26 neighboring blocks. The weak scaling obtains this situation for the first time for $2^{6}$ processes since this corresponds to $4 \times 4 \times 4$ blocks, and therefore, at least one block is entirely surrounded by neighboring blocks. The initial increase in maximum data transfer in the weak scaling is expected since the maximum number of neighbors increases. Therefore, more data has to be communicated per synchronization. The large spikes correspond to the cases where the domain is increased in the y-direction, which causes the most significant increase of the boundary layer between two blocks and, therefore, the most significant communication data increase. The step from $2^{4}$ to $2^{5}$ increases the maximum number of communication partners stronger than the step from $2^{1}$ to $2^{2}$ . Therefore, the maximum data volume increases stronger accordingly (see the two arrows in Figure 12). As expected for nearest-neighbor communication, the maximum communication saturates if other processes entirely surround at least one process. However, the maximum data volume per process already saturates one process earlier than the maximum number of communication partners is reached. This is because from $2^{5}$ to $2^{6}$ processes, the domain is increased in z-direction for the second time. However, due to the arrangement of the particles, synchronization only happens in positive z-direction. Therefore, doubling the domain in z-direction requires no additional communication since still, no process has to communicate in both z-directions. The average data size still increases after the saturation of the maximum data because the ratio of boundary blocks (with less than 26 communication partners) compared to the total number of blocks decreases, i.e., proportionally, more blocks are becoming blocks with the maximum data size. The maximum communication serves as an upper bound for the convergence of the average communication data. Due to the pure nearest neighbor communication and the resulting convergence of the average data size per process, we expect promising scalability on larger systems beyond 1024 CPU-GPU pairs. However, the scaling of the PD comm depends strongly on the PD methodology, its implementation, the available hardware, and the assignment of the blocks to the hardware. Reducing the collision history is part of PD comm, which includes swapping old and new contact information. Since this swap is also necessary without using multiple CPU-GPU pairs, PD comm is bigger zero even when using a single CPU-GPU pair. Overall, we observe a weak scaling performance that justifies using multiple supercomputer nodes. Therefore, the fourth criterion is met.

4.2.3. Strong scaling

For strong scaling, the problem size is fixed with an increasing number of CPU-GPU pairs, decreasing the workload per CPU-GPU pair. When having a perfect strong scaling, the run time decreases to the same extent as the number of CPU-GPU pairs increases. We present strong scaling results for three problem sizes: small, medium, and large. The small problem consists of $500 \times 200 \times 800$ fluid cells, as described in Section 4.1, which fits on a single CPU-GPU pair. For the medium problem, the small problem size is doubled in all three directions, resulting in $1000 \times 400 \times 1600$ fluid cells, which fits on eight CPU-GPU pairs. For the large problem, the medium problem size is again doubled in all three directions, resulting in $2000 \times 800 \times 3200$ fluid cells, which fits on 64 CPU-GPU pairs. For each problem size, we successively double the number of CPU-GPU pairs four times in x-, y-, z-, and again in x-direction while keeping the problem size constant. Figures 13 and 14 illustrate the strong scaling results employing both the dilute and dense case for the three different problem sizes. Dashed lines indicate the ideal curves.

Figure 13.

Strong scaling performance for both cases and three different problem sizes up to 1024 CPU-GPU pairs.

Figure 14.

Strong scaling parallel efficiency for both cases and three different problem sizes up to 1024 CPU-GPU pairs.

The dilute and dense cases exhibit a similar strong scaling behavior for all three problem sizes. In contrast to the weak scaling, the dense case does not show an overall lower parallel efficiency in the strong scaling. In all cases, the time per time step decreases when doubling the number of CPU-GPU pairs. However, the decrease is less than a factor of 2, resulting in a deviation from the ideal curve. The corresponding parallel efficiency is around 60% in all cases when doubling the number of CPU-GPU pairs twice, whereas it drops to about 40% when doubling the number of CPU-GPU pairs four times. The decrease in parallel efficiency is expected in strong scaling since, eventually, the problem size becomes too small to effectively utilize the computational resources while the communication overhead increases. The most relevant driving force for the decrease in parallel efficiency in the strong scaling originates, similar to the weak scaling analysis (Section 4.2.2), from the frequent synchronizations of the particle properties (PD comm). Similar strong scaling behavior, in particular the decrease in parallel efficiency, has been reported in the literature for other CFD codes on different supercomputers (Karp et al., 2023; Min et al., 2024).

4.2.4. Potential speedup of hybrid implementations

We expect that the speedup of the hybrid implementation compared to a CPU-only code $S_{hyb}$ can be estimated based on Amdahl’s law as:

S_{hyb} \approx \frac{1}{1 + {f r a c}_{acc} \cdot (\frac{B W_{CPU}}{B W_{GPU}} - 1)},

(29)where

B W_{CPU}

and

B W_{GPU}

are the CPU and GPU memory bandwidths.

{f r a c}_{acc}

is the CPU-only run time fraction of the module accelerated by the hybrid implementation. In our case, this is the PSM and the coupling. We assume

{f r a c}_{acc}

is memory bound. We compare the hybrid performance results of this paper with one of the largest CPU-only simulations of polydisperse sediment beds (Rettinger et al., 2022). The authors conducted the latter simulation on 320 Intel Xeon Platinum 8174 CPUs. In total, they computed

2.25 \cdot 1 0^{15}

lattice cell updates in 48 hours, which leads to a performance of around 41 MLUPs per CPU vs. 377 MLUPs per CPU-GPU pair in the dense case when using 1024 GPUs. The latter numbers result in a measured speedup of around 9.2. For the Intel Xeon Platinum 8174, we measured a memory bandwidth

B W_{CPU}

of 70 GB/s. The estimated speedup based on equation (29) is around 10.3 (assuming

{f r a c}_{acc} = 0.95

and

B W_{GPU} = 1400 GB/s

) which is similar to the measured speedup. The latter computation is only a rough estimate since it ignores effects due to different CPUs, networks, physical setups, numerical parameters, etc.

5. Implications and lessons learned

As a lesson learned, we discovered that the relatively small data transfers between CPU and GPU can become negligible for coupled applications exhibiting a significant imbalance of the data exchange between the coupled modules and the computations inside the modules in favor of the computations, making hybrid coupling a feasible approach. This is particularly true for the application used in this work, as only the several orders of magnitude smaller particle data need to be exchanged, not the entire fluid field processed by the GPU. As a result, only 0.35% of the total run time is devoted to CPU-GPU communication in the dense case. These findings can also be applied to other applications that exhibit this desired imbalance between computation data size and data transfer between the coupled modules. Other coupled applications, such as a huge flow field around complex deformable geometries like the deformation of blood cells in cellular blood flow, which is referenced in Section 1, can also exhibit this imbalance. Since the surface data acts as a boundary condition for the deformation simulation, just this surface data needs to be exchanged between the fluid and solid phases. Approaches with a high ratio of data exchange to computations (i.e., exchange entire fields between CPU and GPU in each time step) have been found in the literature to perform undesired on previous architectures, this may change in the future due to new hardware developments. High bandwidth memory transfer between CPU and GPU main memory is the trend of architectures like the NVIDIA GH200 Grace Hopper, which makes hybrid implementations promising for more applications (even up to transferring the entire GPU data in every time step) and should be further studied in the future.

The advantages of code generation (Section 3.1.2) are the subject of another lesson learned. We discovered that code generation adds an extra overhead during the implementation stage, which is only beneficial in a framework with long-term support if one depends on highly optimized similar codes, portability to other architectures, and a certain form of expandable, sustainable approach. However, because code generation is used, the code functions flawlessly with different LBM variations on other modern GPU architectures, such as the AMD MI250X, which uses the HIP API, as well as on consumer GPUs. Subsequent work will encompass a systematic performance comparison.

6. Conclusion

On heterogeneous systems, it is pragmatic and, therefore, attractive to use a hybrid parallelization, i.e., different simulation modules running on different hardware. However, hybrid implementations increase the complexity of achieving good performance and scalability, especially on large-scale systems. In this paper, we have examined a hybrid coupled fluid-particle simulation with geometrically resolved particles. We use GPUs for the fluid dynamics, whereas the particle simulation runs on the CPUs. We have reported and studied the performance of this approach for two cases of a fluidized bed simulation that differ in terms of the number of particles per volume. The overhead introduced by the hybrid implementation (i.e., CPU-GPU communication) is negligible because we are transferring only a small amount of data per particle but no fluid cells. The performance of the fluid simulation is close to utilizing the full memory bandwidth of the A100, implying that using the GPU is a good choice for the fluid simulation. In both cases, the GPU routines take most of the run time. In a weak scaling benchmark, the hybrid fluid-particle implementation reaches a parallel efficiency of 71% in the dilute case and 53% in the dense case when using 1024 CPU-GPU pairs. The current PD methodology requires 32 CPU-CPU communication steps per time step, which is the driving force for the decrease of the overall parallel efficiency. Our results are limited insofar as different numbers of particle sub-cycles, fluid cells per diameter, etc., will result in different performance results. We have formulated four criteria that a hybrid implementation must meet to be suitable for the responsible use of heterogeneous supercomputers. The performance results have shown that our hybrid implementation fulfills all criteria, making it suitable for large-scale simulations on heterogeneous supercomputers. In the future, we plan to investigate the particle communication steps in more detail regarding the bottleneck and optimization possibilities. We are employing sub-cycles to increase stability for stiff systems. Using other integrators may permit longer time steps and thus less communication due to sub-cycles. We have shown the acceleration potential of hybrid implementations. Therefore, we plan to run coupled fluid-particle simulations of even larger scenarios to better analyze, among others, the physical phenomena of erosion in sediment beds.

Footnotes

Acknowledgements

The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. () for funding this project by providing computing time on the GCS Supercomputer JUWELS at Jülich Supercomputing Centre (JSC). The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). The hardware is funded by the German Research Foundation (DFG).

Author contributions

S. Kemmler: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data Curation, Writing - Original Draft, Visualization, Project administration

C. Rettinger: Conceptualization, Writing - Review and Editing

U. Rüde: Writing - Review and Editing

P. Cuéllar: Writing - Review and Editing

H. Köstler: Resources, Writing - Review and Editing, Supervision, Funding acquisition

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors thank the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) for funding the project 433735254. The DFG had no direct involvement in this paper; and this work has received funding from the European High Performance Computing Joint Undertaking (JU) and Sweden, Germany, Spain, Greece, and Denmark under grant agreement No 101093393.

ORCID iDs

Samuel Kemmler

Christoph Rettinger

Ulrich Rüde

Pablo Cuéllar

Harald Köstler

Notes

Author biographies

Samuel Kemmler is a Ph.D. student at the Chair for System Simulation at the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and a research assistant at the Federal Institute for Materials Research and Testing (BAM) in Berlin. He holds an M.Sc. degree in Computational Engineering and is one of the core developers of the waLBerla HPC framework. His research interests are high-performance computing and particle-resolved sediment transport simulations.

Christoph Rettinger studied Computational Engineering at the FAU. In 2023, he finished his Ph.D. on fully resolved simulation of particulate flows at the Chair for System Simulation.

Ulrich Rüde heads the Chair for System Simulation at the FAU. He studied Mathematics and Computer Science at Technische Universität München (TUM) and the Florida State University. He holds a Ph.D. and Habilitation degrees from TUM. His research interest focuses on numerical simulation and high-end computing, particularly computational fluid dynamics, multilevel methods, and software engineering for high-performance computing. He is a Fellow of the Society of Industrial and Applied Mathematics.

Pablo Cuéllar studied Civil Engineering at the Universidad Politécnica de Madrid and received his Ph.D. in Civil Engineering from Technical University Berlin in 2011. He is a guest scientist at BAM.

Harald Köstler got his Ph.D. in Computer Science in 2008 on variational models and parallel multigrid methods in medical image processing. 2014, he finished his habilitation on Efficient Numerical Algorithms and Software Engineering for High-Performance Computing. Currently, he works at the Chair for System Simulation at the FAU. His research interests include software engineering concepts, especially using code generation for simulation software on HPC clusters, multigrid methods, and programming techniques for parallel hardware, especially GPUs. The application areas are computational fluid dynamics, rigid body dynamics, and medical imaging.

References

Bauer

Eibl

Godenschwager

, et al. (2021a) waLBerla: a block-structured high-performance framework for multiphysics simulations. Computers & Mathematics with Applications 81: 478–501. DOI: 10.1016/j.camwa.2020.01.007.

Bauer

Köstler

Rüde

(2021b) Lbmpy: automatic code generation for efficient parallel lattice Boltzmann methods. Journal of Computational Science 49: 101269. DOI: 10.1016/j.jocs.2020.101269.

Benseghier

Cuéllar

Luu

, et al. (2020) A parallel GPU-based computational framework for the micromechanical analysis of geotechnical and erosion problems. Computers and Geotechnics 120: 103404. DOI: 10.1016/j.compgeo.2019.103404.

Biegert

Vowinckel

Meiburg

(2017) A collision model for grain-resolving simulations of flows over dense, mobile, polydisperse granular sediment beds. Journal of Computational Physics 340: 105–127. DOI: 10.1016/j.jcp.2017.03.035.

Costa

Boersma

Westerweel

, et al. (2015) Collision model for fully resolved simulations of flows laden with finite-size particles. Physical Review E 92(5): 053012. DOI: 10.1103/PhysRevE.92.053012.

Cundall

Strack

ODL

(1979) A discrete numerical model for granular assemblies. Géotechnique 29(1): 47–65. DOI: 10.1680/geot.1979.29.1.47.

Eibl

Rüde

(2019) A modular and extensible software architecture for particle dynamics.

Ernst

Holzer

Hager

, et al. (2023) Analytical performance estimation during code generation on modern GPUs. Journal of Parallel and Distributed Computing 173: 152–167. DOI: 10.1016/j.jpdc.2022.11.003.

Feichtinger

(2012) Design and Performance Evaluation of a Software Framework for Multi-Physics Simulations on Heterogeneous Supercomputers. PhD Thesis.

10.

Feichtinger

Habich

Köstler

, et al. (2015) Performance modeling and analysis of heterogeneous lattice Boltzmann simulations on CPU–GPU clusters. Parallel Computing 46: 1–13. DOI: 10.1016/j.parco.2014.12.003.

11.

Fukumoto

Yang

Hosoyamada

, et al. (2021) 2-D coupled fluid-particle numerical analysis of seepage failure of saturated granular soils around an embedded sheet pile with no macroscopic assumptions. Computers and Geotechnics 136: 104234. DOI: 10.1016/j.compgeo.2021.104234.

12.

Hager

Wellein

(2010) Introduction to High Performance Computing for Scientists and Engineers. 0 edition. CRC Press. DOI: 10.1201/EBK1439811924.

13.

Luo

(1997) Lattice Boltzmann model for the incompressible Navier–Stokes equation. Journal of Statistical Physics 88(3/4): 927–944. DOI: 10.1023/B:JOSS.0000015179.12689.e4.

14.

Muller

Hassanpour

, et al. (2020) A CPU-GPU cross-platform coupled CFD-DEM approach for complex particle-fluid flows. Chemical Engineering Science 223: 115712. DOI: 10.1016/j.ces.2020.115712.

15.

Hennig

Holzer

Rüde

(2023) Advanced automatic code generation for multiple relaxation-time lattice Boltzmann methods. SIAM Journal on Scientific Computing 45(4): C233–C254. DOI: 10.1137/22M1531348.

16.

Holzer

Bauer

Köstler

, et al. (2021) Highly efficient lattice Boltzmann multiphase simulations of immiscible fluids at high-density ratios on CPUs and GPUs through code generation. The International Journal of High Performance Computing Applications 35(4): 413–427. DOI: 10.1177/10943420211016525.

17.

Holzer

Mitchell

Leonardi

, et al. (2024) Development of a central-moment phase-field lattice Boltzmann model for thermocapillary flows: droplet capture and computational performance. Journal of Computational Physics 518: 113337. DOI: 10.1016/j.jcp.2024.113337.

18.

Iglberger

Rüde

(2010) Massively parallel granular flow simulations with non-spherical particles. Computer Science - Research and Development 25(1-2): 105–113. DOI: 10.1007/s00450-010-0114-4.

19.

Jajcevic

Siegmann

Radeke

, et al. (2013) Large-scale CFD–DEM simulations of fluidized granular systems. Chemical Engineering Science 98: 298–310. DOI: 10.1016/j.ces.2013.05.014.

20.

Jones

Williams

(2017) Fast computation of accurate sphere-cube intersection volume. Engineering Computations 34(4): 1204–1216. DOI: 10.1108/EC-02-2016-0052.

21.

Junior

JRS

Clua

Montenegro

, et al. (2010) Fluid simulation with two-way interaction rigid body using a heterogeneous GPU and CPU environment. In: 2010 Brazilian Symposium on Games and Digital Entertainment. Florianpolis, Santa Catarina, TBD, Brazil: IEEE. pp. 156–164. DOI:10.1109/SBGAMES.2010.25.

22.

Karp

Massaro

Jansson

, et al. (2023) Large-Scale direct numerical simulations of turbulence using GPUs and modern Fortran. The International Journal of High Performance Computing Applications 37(5): 487–502. DOI: 10.1177/10943420231158616.

23.

Kidanemariam

Uhlmann

(2014) Direct numerical simulation of pattern formation in subaqueous sediment. Journal of Fluid Mechanics 750: R2. DOI: 10.1017/jfm.2014.284.

24.

Kim

Seo

Lee

, et al. (2012) SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters. In: Proceedings of the 26th ACM International Conference on Supercomputing. San Servolo Island, Venice Italy: ACM, 341–352. DOI: 10.1145/2304576.2304623.

25.

Kotsalos

Latt

Beny

, et al. (2021) Digital blood in massively parallel CPU/GPU systems for the study of platelet transport. Interface Focus 11(1): 20190116. DOI: 10.1098/rsfs.2019.0116.

26.

Krüger

Kusumaatmaja

Kuzmin

, et al. (2017) The lattice Boltzmann method: principles and practice. In: Graduate Texts in Physics. Cham: Springer International Publishing. DOI: 10.1007/978-3-319-44649-3.

27.

Kuznik

Obrecht

Rusaouen

, et al. (2010) LBM based flow simulation using GPU computing processor. Computers & Mathematics with Applications 59(7): 2380–2392. DOI: 10.1016/j.camwa.2009.08.052.

28.

Ladd

AJC

Verberg

(2001) Lattice-Boltzmann simulations of particle-fluid suspensions. Journal of Statistical Physics 104(5): 1191–1251. DOI: 10.1023/A:1010414013942.

29.

Lehmann

Krause

Amati

, et al. (2022) Accuracy and performance of the lattice Boltzmann method with 64-bit, 32-bit, and customized 16-bit number formats. Physical Review E 106(1): 015308. DOI: 10.1103/PhysRevE.106.015308.

30.

Machado

RRL

Schmitt

Eibl

, et al. (2021) tinyMD: mapping molecular dynamics simulations to heterogeneous hardware using partial evaluation. Journal of Computational Science 54: 101425. DOI: 10.1016/j.jocs.2021.101425.

31.

Min

Brazell

Tomboulides

, et al. (2024) Towards exascale for wind energy simulations. The International Journal of High Performance Computing Applications 38(4): 337–355. DOI: 10.1177/10943420241252511.

32.

Ngoma

Philippe

Bonelli

, et al. (2018) Two-dimensional numerical simulation of chimney fluidization in a granular medium using a combination of discrete element and lattice Boltzmann methods. Physical Review E 97(5): 052902. DOI: 10.1103/PhysRevE.97.052902.

33.

Noble

Torczynski

(1998) A lattice-Boltzmann method for partially saturated computational cells. International Journal of Modern Physics C 09(08): 1189–1201. DOI: 10.1142/S0129183198001084.

34.

Norouzi

Zarghami

Mostoufi

(2017) New hybrid CPU-GPU solver for CFD-DEM simulation of fluidized beds. Powder Technology 316: 233–244. DOI: 10.1016/j.powtec.2016.11.061.

35.

Oyarzun

Borrell

Gorobets

, et al. (2017) Portable implementation model for CFD simulations. Application to hybrid CPU/GPU supercomputers. International Journal of Computational Fluid Dynamics 31(9): 396–411. DOI: 10.1080/10618562.2017.1390084.

36.

Qian

D’Humières

Lallemand

(1992) Lattice BGK models for Navier-Stokes equation. Europhysics Letters (EPL) 17(6): 479–484. DOI: 10.1209/0295-5075/17/6/001.

37.

Rettinger

(2023) Fully Resolved Simulation of Particulate Flows with a Parallel Coupled Lattice Boltzmann and Discrete Element Method. PhD Thesis. Friedrich-Alexander-Universität Erlangen-Nürnberg.

38.

Rettinger

Rüde

(2017) A comparative study of fluid-particle coupling methods for fully resolved lattice Boltzmann simulations. Computers & Fluids 154: 74–89. DOI: 10.1016/j.compfluid.2017.05.033.

39.

Rettinger

Rüde

(2022) An efficient four-way coupled lattice Boltzmann – discrete element method for fully resolved simulations of particle-laden flows. Journal of Computational Physics 453: 110942. DOI: 10.1016/j.jcp.2022.110942.

40.

Rettinger

Godenschwager

Eibl

, et al. (2017) Fully resolved simulations of dune formation in riverbeds. In: Kunkel

Yokota

Balaji

, et al. (eds) High Performance Computing. Cham: Springer International Publishing, Vol. 10266, 3–21. DOI: 10.1007/978-3-319-58667-0_1.

41.

Rettinger

Eibl

Rüde

, et al. (2022) Rheology of mobile sediment beds in laminar shear flow: effects of creep and polydispersity. Journal of Fluid Mechanics 932: A1. DOI: 10.1017/jfm.2021.870.

42.

Rohr

Kalcher

Bach

, et al. (2014) An energy-efficient multi-GPU supercomputer. In: 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS). Paris: IEEE, 42–45. DOI: 10.1109/HPCC.2014.14.

43.

Schwarzmeier

Rettinger

Kemmler

, et al. (2023) Particle-resolved simulation of antidunes in free-surface flows. Journal of Fluid Mechanics 961: R1. DOI: 10.1017/jfm.2023.262.

44.

Sheikh

Pak

(2015) Numerical investigation of the effects of porosity and tortuosity on soil permeability using coupled three-dimensional discrete-element method and lattice Boltzmann method. Physical Review E 91(5): 053301. DOI: 10.1103/PhysRevE.91.053301.

45.

Shimokawabe

Endo

Onodera

, et al. (2017) A stencil framework to realize large-scale computations beyond device memory capacity on GPU supercomputers. In: 2017 IEEE International Conference on Cluster Computing (CLUSTER). Honolulu, HI: IEEE, 525–529. DOI: 10.1109/CLUSTER.2017.97.

46.

Sousani

Hobbs

Anderson

, et al. (2019) Accelerated heat transfer simulations using coupled DEM and CFD. Powder Technology 357: 367–376. DOI: 10.1016/j.powtec.2019.08.095.

47.

Vowinckel

Kempe

Fröhlich

(2014) Fluid–particle interaction in turbulent open channel flow with fully-resolved mobile beds. Advances in Water Resources 72: 32–44. DOI: 10.1016/j.advwatres.2014.04.019.

48.

Wang

Leonardi

Aminossadati

(2018) Improved coupling of time integration and hydrodynamic interaction in particle suspensions using the lattice Boltzmann and discrete element methods. Computers & Mathematics with Applications 75(7): 2593–2606. DOI: 10.1016/j.camwa.2018.01.002.

49.

Chen

Liu

, et al. (2012) Discrete particle simulation of gas–solid two-phase flows with multi-scale CPU–GPU hybrid computation. Chemical Engineering Journal 207(208): 746–757. DOI: 10.1016/j.cej.2012.07.049.

Efficiency and scalability of fully-resolved fluid-particle simulations on heterogeneous CPU-GPU architectures

Abstract

Keywords

1. Introduction

2. Numerical methods

2.1. Lattice Boltzmann method

2.2. Particle dynamics

2.2.1. Particle interactions using the discrete element method

2.2.2. Integration of the particle properties

2.3. Fully resolved fluid-particle coupling method

2.3.1. Lubrication correction

2.3.2. Particle mapping

3. Implementation

3.1. Fluid dynamics and coupling on the GPU

3.1.1. Coupling from the particles to the fluid

3.1.2. Fluid simulation

3.1.3. Coupling from the fluid to the particles

3.2. Particle dynamics on the CPU

4. Performance analysis

4.1. Simulation setups

4.2. Performance results

4.2.1. Run times of different simulation modules

4.2.2. Weak scaling

4.2.3. Strong scaling

4.2.4. Potential speedup of hybrid implementations

5. Implications and lessons learned

6. Conclusion

Footnotes

Acknowledgements

Author contributions

Declaration of conflicting interests

Funding

ORCID iDs

Notes

Author biographies

References