Towards exascale simulations of granular fluidization in offshore wind turbine foundations

Abstract

Fully-resolved micromechanical simulations coupling the Lattice Boltzmann Method (LBM) with the Discrete Element Method (DEM) provide high-fidelity insights into granular fluidization. However, the substantial computational demands of such simulations require efficient implementations on supercomputing architectures. This work presents a comprehensive performance analysis of a fully-resolved LBM-DEM model to study granular fluidization, i.e., piping erosion, during the installation of an offshore caisson foundation. The performance evaluation focuses on real-world workloads rather than simplified benchmark problems to obtain realistic performance insights. The study considers a diverse range of state-of-the-art HPC hardware architectures, namely the LUMI and MareNostrum 5 EuroHPC supercomputers, both CPU and GPU partitions. The results demonstrate that GPU-based systems generally outperform the CPU-based systems. Strong and weak scaling analyses were conducted with up to 512 nodes, and parallel efficiencies reached up to 92%. Nonetheless, the results also indicate that atomic add operations on GPUs can become a bottleneck for the parallel efficiency at large scales. Moreover, the study reveals a close link between physical variations of the setup and significant scaling implications, underscoring the need to consider physical model characteristics in scaling assessments.

Keywords

high-performance computing, performance engineering, hardware comparison, CPU, GPU, fluid-solid coupling, suction bucket

1. Introduction

Suction buckets have emerged as an increasingly popular foundation for offshore wind turbines, see Figure 1. These foundations are installed by inducing a suction pressure within the bucket and leveraging the structure’s self-weight to embed the bucket into the seabed. In contrast to conventional offshore wind foundations, suction buckets have numerous advantages, including fast and cost-efficient installation and environmentally friendly characteristics, such as low noise emissions during the installation process (Sturm, 2017). However, the installation process of these types of foundations involves considerable challenges, such as the risk of piping erosion, i.e., the local fluidization of soil particles beneath the foundation. This can result in a loss of suction pressure and ultimately lead to installation failure. The phenomenon of piping erosion is driven by complex physical interactions that are difficult to analyze and predict in practice due to the general lack of accurate models and the scarcity of available experimental data (Ragni et al., 2020).

Figure 1.

(left) Image and schematic of an offshore wind turbine with a suction bucket foundation; (middle) three-dimensional numerical setup; (right) schematic representation of a fully-resolved particle.

By focusing on the physical phenomena and interactions at the grain scale, numerical micromechanical simulations provide a convenient and detailed insight into the onset and triggering conditions of granular fluidization. To study piping erosion in the context of offshore wind turbine foundations, a fully-resolved numerical model representing a sub-region of a suction bucket foundation is presented in (Kemmler et al., 2025b, 2025c), see Figure 1. In such fully-resolved simulations, one particle spans over several fluid cells. According to the literature, $\geq 10$ fluid cells per particle diameter is an established resolution in sediment transport simulations (Biegert et al., 2017; Costa et al., 2015; Schwarzmeier et al., 2023), with 20 fluid cells per particle diameter being used in this work. This resolution is chosen to accurately capture coupled erosion/fluidization dynamics at system scale, consistent with the detailed model validation in (Kemmler et al., 2025b). The present setup is not intended as full boundary-layer DNS in the most strongly fluidized regions; unresolved near-wall scales are treated through the employed Large Eddy Simulation (LES) closure. Such fully-resolved simulations can well reproduce, both qualitatively and quantitatively, the complex physics of immersed granular media (Froiio et al., 2019; Fukumoto and Ohtsuka, 2018; Rettinger and Rüde, 2022), and recent coupled Lattice Boltzmann Method (LBM) frameworks further underline the relevance of robust coupling strategies and Graphics Processing Unit (GPU) acceleration for multiphase fluid-solid and fluid-structure interaction problems (Jiang et al., 2021, 2022). Nevertheless, the computational requirements of these simulations are substantial due to their high resolution, e.g., the bounding box of a single fully-resolved particle with 20 fluid cells per spatial direction consists of 8000 (20 × 20 × 20) fluid cells. 
To put the computational requirements into perspective, even a relatively small but physically relevant setup for granular fluidization can already lead towards exascale-level demands. For instance, considering a domain of 0.5 m × 0.05 m × 1.5 m containing fine gravel with an average grain size of 5 mm, and maintaining a resolution of 20 fluid cells per particle diameter, the resulting fluid grid comprises 2000 × 200 × 6000 cells. For a typical LBM implementation with 27 unknowns per cell as used in the present study, this corresponds to about 6.48 × 10¹⁰ degrees of freedom. To address these computational challenges, efficient implementations on modern High-Performance Computing (HPC) systems are essential. This task is complicated by the inherently diverse computational properties of multiphysics simulations and the heterogeneous architectures of current supercomputing platforms, which vary across Central Processing Units (CPUs) and GPUs and among vendors such as NVIDIA, AMD, and Intel. Furthermore, on HPC systems with electric power consumption of 20 megawatts or more, energy-to-solution emerges as a critical factor besides computational performance (Kulkarni et al., 2026; Suarez et al., 2025; Vysocky et al., 2024). Although multiphysics simulations play an important role in engineering and scientific research, there remains a lack of well-documented performance data for real-world applications rather than synthetic benchmarks, which are often based on strong simplifications of the real problem.
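The grid size and the number of degrees of freedom quoted above follow directly from the domain size, the grain size, and the resolution. The following minimal Python sketch retraces this arithmetic; the function name `lbm_grid` and its parameters are illustrative and not part of the simulation code.

```python
import math


def lbm_grid(domain_m, grain_size_m, cells_per_diameter, unknowns_per_cell):
    """Return (cells per direction, total cells, degrees of freedom)."""
    # Edge length of one fluid cell in metres.
    dx = grain_size_m / cells_per_diameter
    # round() guards against floating-point noise in the division.
    cells = tuple(round(length / dx) for length in domain_m)
    total_cells = math.prod(cells)
    return cells, total_cells, total_cells * unknowns_per_cell


# Values taken from the text: 0.5 m x 0.05 m x 1.5 m domain, 5 mm grains,
# 20 fluid cells per particle diameter, 27 unknowns (PDFs) per cell.
cells, total_cells, dof = lbm_grid((0.5, 0.05, 1.5), 5e-3, 20, 27)
print(cells, total_cells, f"{dof:.2e}")
```

The result is 2.4 × 10⁹ fluid cells and 6.48 × 10¹⁰ unknowns for the PDFs alone, before any particle-related fields are counted.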

In this paper, an extensive performance analysis of a fully-resolved micromechanical simulation is presented. The comparison is conducted on the CPU and GPU partitions of the state-of-the-art EuroHPC supercomputers LUMI and MareNostrum 5 and delivers insights by addressing the following aspects:

• Time- and energy-to-solution comparison of hardware architectures (CPU vs. GPU) and Roofline performance modeling.

• Scaling analysis focusing on differences between CPU and GPU partitions and the close link between the physical setup and scalability.

• Focus on real-world workloads, i.e., the installation process of suction bucket foundations for offshore wind turbines, rather than simplified benchmarks, ensuring relevance to practical engineering challenges.

2. Algorithms

In this section, the pseudocodes for the key simulation components required in every time step are presented. The fluid phase in the suction bucket model is represented using the LBM. The Lagrangian particles are modeled with the Discrete Element Method (DEM). Each time step consists of the following main components:

• Coupling (solid → fluid): transfer the influence of moving particles to the surrounding fluid.

• Fluid dynamics: simulate fluid behavior and its influence on the surrounding particles.

• Coupling (fluid → solid): translate fluid forces back onto the particles.

• Particle dynamics: update particle movement based on all acting forces.

Note that only the computationally expensive kernels, which contribute significantly to the run time, are described here; less impactful components such as particle dynamics and boundary handling are omitted. The computationally expensive algorithms are implemented for both CPUs and GPUs, with loops over fluid cells implemented as for-loops on CPU and as kernel calls on GPU.

2.1. Particle mapping

Algorithm 1 maps the particles onto the cells of the fluid domain, see Figure 1, to obtain the coupling from the solid to the fluid phase. The mapping output is the individual solid fraction field

B_{i} (x) = f (\sqrt{{(x - x_{p, i})}^{2}}, D_{p, i}),

(1)

which stores the overlap fraction of particle i with each fluid cell. Here, x denotes the iteration variable for the fluid cells. The function f is designed to efficiently handle sphere-cube overlaps and is detailed in (Jones and Williams, 2017). The required inputs to Algorithm 1 are the particle positions $x_{p, i}$ and diameters $D_{p, i}$, where the subscript “p” stands for particle. The algorithm loops over all fluid cells of the domain. For each cell, it iterates over all particles overlapping with that cell and computes the respective overlap fraction $B_{i} (x)$ with the function f. The first argument of f is the distance between the cell and the particle position, and the square root required for this distance is the computationally most expensive part of this algorithm.
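The loop structure of the mapping can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in the paper: the analytical overlap function f from the cited reference is replaced by a simple sub-sampling estimate, cells are unit cubes centered at half-integer coordinates, and every particle is tested against every cell instead of only the overlapping ones. All names are hypothetical.

```python
import math


def overlap_fraction(cell_center, particle_center, diameter, samples=4):
    """Estimate the sphere-cube overlap of a unit cell by sub-sampling.

    Stand-in for the analytical function f cited in the text (assumption).
    """
    radius = 0.5 * diameter
    # Distance between cell and particle: the square root from Eq. (1).
    distance = math.sqrt(
        sum((c - p) ** 2 for c, p in zip(cell_center, particle_center))
    )
    half_diagonal = 0.5 * math.sqrt(3.0)
    if distance >= radius + half_diagonal:
        return 0.0  # cell lies entirely outside the sphere
    if distance <= radius - half_diagonal:
        return 1.0  # cell lies entirely inside the sphere
    # Partially covered cell: count sub-sample points inside the sphere.
    offsets = [(k + 0.5) / samples - 0.5 for k in range(samples)]
    inside = 0
    for ox in offsets:
        for oy in offsets:
            for oz in offsets:
                point = (cell_center[0] + ox,
                         cell_center[1] + oy,
                         cell_center[2] + oz)
                dist_sq = sum((a - b) ** 2
                              for a, b in zip(point, particle_center))
                if dist_sq <= radius ** 2:
                    inside += 1
    return inside / samples ** 3


def map_particles(grid_shape, particles):
    """Return {((ix, iy, iz), particle id): B_i(x)} for all covered cells."""
    fractions = {}
    for ix in range(grid_shape[0]):
        for iy in range(grid_shape[1]):
            for iz in range(grid_shape[2]):
                center = (ix + 0.5, iy + 0.5, iz + 0.5)
                for pid, (position, diameter) in enumerate(particles):
                    b = overlap_fraction(center, position, diameter)
                    if b > 0.0:
                        fractions[(ix, iy, iz), pid] = b
    return fractions


# One particle with a diameter of 4 cells in an 8 x 8 x 8 grid.
solid = map_particles((8, 8, 8), [((4.0, 4.0, 4.0), 4.0)])
```

Summing the stored fractions of one particle approximates its volume in units of cells, which is a convenient sanity check for any overlap function.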

2.2. Particle velocity field computation

Algorithm 2 computes the velocity induced by the particle motion into the fluid phase. Thereby, the particle velocity field

u_{i} (x) = u_{p, i} + \Omega_{p, i} \times (x - x_{p, i}),

(2)

is updated. This field stores the velocity that the movement of particle i induces in the fluid cells. The inputs are the particle positions $x_{p, i}$, linear velocities $u_{p, i}$, and angular velocities $\Omega_{p, i}$. The algorithm iterates over all cells. For each cell x, it loops over all particles overlapping with that cell. The velocity induced by the motion of particle i in fluid cell x is the sum of the particle’s linear velocity and the contribution of its angular velocity, i.e., the cross product of the angular velocity with the offset of the cell from the particle position.
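Equation (2) is a rigid-body velocity evaluated at the cell position. A minimal Python sketch for a single cell and a single particle could look as follows; the function names are illustrative and the plain-tuple representation of vectors is an assumption made for brevity.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def particle_velocity_at(cell, position, linear_velocity, angular_velocity):
    """Evaluate u_i(x) = u_p,i + Omega_p,i x (x - x_p,i) for one cell."""
    # Offset of the cell from the particle position.
    arm = tuple(c - p for c, p in zip(cell, position))
    rotational = cross(angular_velocity, arm)
    return tuple(u + r for u, r in zip(linear_velocity, rotational))


# A particle at the origin that translates in x and spins about the z-axis.
u_cell = particle_velocity_at(
    cell=(1.0, 0.0, 0.0),
    position=(0.0, 0.0, 0.0),
    linear_velocity=(1.0, 0.0, 0.0),
    angular_velocity=(0.0, 0.0, 2.0),
)
```

In this example the rotation adds a velocity component in the y-direction at the cell located on the positive x-axis, as expected for a counter-clockwise spin about z.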

2.3. Fluid dynamics

The LBM discretizes the fluid domain using a Cartesian grid, where the velocity space at each fluid cell x is discretized into Q velocity directions $c_{q}$. In this study, Q = 27 is used. These directions are associated with Particle Distribution Functions (PDFs), i.e., the unknowns, that represent the likelihood of fluid molecules moving in the given direction.

LBM is particularly efficient for parallel computations, exhibiting a 1:1 read-to-write ratio, and being memory-bound on most hardware architectures (Holzer et al., 2021; Lehmann et al., 2022).

The Partially Saturated Cells Method (PSM) (Noble and Torczynski, 1998) extends the LBM algorithm to couple moving geometries into the fluid. The PSM is an established methodology in the context of sediment transport simulations, especially on GPU systems (Benseghier et al., 2020; Fukumoto et al., 2021), and is therefore used here. In the following, it is assumed that not more than two particles can overlap with a single fluid cell, a reasonable assumption for fully-resolved spherical particles, since their spatial extent does not permit more than two particles in a single fluid cell in most cases. The inputs to Algorithm 3 are the PDFs field from the previous time step $f_{q}^{old} (x)$, with streaming from neighboring cells implied in this notation, the individual solid fraction field $B_{i} (x)$, the solid fraction field $B (x) = \sum_{i} B_{i} (x)$, and the particle velocity field $u_{i} (x)$. The outputs are the updated PDFs field

\begin{align} f_{q}^{new} (x) = {} & f_{q}^{old} (x) + (1 - B (x)) C_{q}^{fluid} (f_{q}^{old} (x)) \\ & + \sum_{i \in \{0, 1\}} B_{i} (x) C_{q, i}^{solid} (f_{q}^{old} (x), u_{i} (x)), \end{align}

(3)

and the hydrodynamic force field

F_{i}^{hyd} (x) = B_{i} (x) \sum_{q \in Q} c_{\bar{q}} C_{q, i}^{solid} (f_{q}^{old} (x), u_{i} (x)),

(4)

i.e., the momentum of the fluid cells on the particles, which is one vector for each overlapping particle in the cell. The algorithm iterates over all fluid cells. For each fluid cell, it loops over the velocity directions Q, and updates the corresponding PDFs by adding a weighted contribution of $C_{q}^{fluid}$ to the old PDFs, which originates from the Boltzmann equations and relaxes the PDFs towards their thermodynamic equilibrium and incorporates possible body forces. If one particle is overlapping with the respective cell, i.e., $B_{0} (x) > 0$, a weighted solid contribution $C_{q, 0}^{solid}$ is added to the updated PDFs, using the particle velocity field $u_{i} (x)$. Furthermore, the hydrodynamic force contribution $F_{0}^{hyd} (x)$ of cell x on the overlapping particle is set. If a second particle overlaps with the respective cell, it is processed in the same way.
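The per-cell update of Equations (3) and (4) can be written down compactly once the collision terms are available. The Python sketch below is illustrative only: it takes the values of $C_{q}^{fluid}$ and $C_{q, i}^{solid}$ as precomputed inputs rather than deriving them, it assumes that $\bar{q}$ denotes the direction opposite to q so that $c_{\bar{q}} = -c_{q}$, and all names are hypothetical.

```python
def psm_cell_update(f_old, c, b, collide_fluid, collide_solid):
    """Apply Eqs. (3) and (4) to a single fluid cell.

    f_old:         list of Q PDFs (streaming already applied)
    c:             list of Q lattice velocities as 3-tuples
    b:             solid fractions B_i(x) of the (up to two) particles
    collide_fluid: list of Q values of C_q^fluid
    collide_solid: one list of Q values of C_{q,i}^solid per particle
    """
    total_b = sum(b)  # B(x) = sum_i B_i(x)
    f_new = []
    for q in range(len(f_old)):
        value = f_old[q] + (1.0 - total_b) * collide_fluid[q]
        for i in range(len(b)):
            if b[i] > 0.0:  # only particles that overlap with this cell
                value += b[i] * collide_solid[i][q]
        f_new.append(value)

    forces = []
    for i in range(len(b)):
        force = [0.0, 0.0, 0.0]
        for q in range(len(f_old)):
            for k in range(3):
                # Assumption: c_{q-bar} is the opposite direction, -c_q.
                force[k] += -c[q][k] * collide_solid[i][q]
        forces.append([b[i] * component for component in force])
    return f_new, forces


# Toy example with three directions along x (not the Q = 27 lattice).
f_new, forces = psm_cell_update(
    f_old=[1.0, 2.0, 3.0],
    c=[(0, 0, 0), (1, 0, 0), (-1, 0, 0)],
    b=[0.25],
    collide_fluid=[0.4, 0.4, 0.4],
    collide_solid=[[0.0, 0.8, -0.8]],
)
```

The branch on `b[i] > 0.0` corresponds to the conditional treatment of overlapping particles described above and is the source of thread divergence once the loop body runs as a GPU kernel.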

2.4. Hydrodynamic force and torque reduction

Algorithm 4 aggregates the hydrodynamic force field $F_{i}^{hyd} (x)$ into one hydrodynamic force and torque vector for each particle. The input to the algorithm consists of the hydrodynamic force field $F_{i}^{hyd} (x)$ and the particle positions $x_{p, i}$. The algorithm iterates over all cells in the domain. For each cell x, it iterates over all particles overlapping with the respective cell. The corresponding hydrodynamic force field contribution is added to the hydrodynamic force on particle i

F_{p, i}^{hyd} = \sum_{x} F_{i}^{hyd} (x),

(5)

and the hydrodynamic torque on particle i

T_{p, i}^{hyd} = \sum_{x} (x - x_{p, i}) \times F_{i}^{hyd} (x),

(6)

is updated accordingly. Since one particle spans over several fluid cells, race conditions can occur when updating the forces and torques in parallel, i.e., multiple threads writing to the same memory location in parallel, possibly corrupting the result. The race conditions require the use of atomic add operations on GPUs to ensure correctness.

In the implementation, the hydrodynamic force and torque reduction is realized differently on CPU and on GPU, while maintaining a consistent data layout across both architectures to facilitate portability and performance. In both cases, compact and contiguous arrays are employed to hold the per-particle accumulators for $F_{p, i}^{hyd}$ and $T_{p, i}^{hyd}$ , with a total length of #Particles × 3 to accommodate the three spatial dimensions. Likewise, the hydrodynamic force field $F_{i}^{hyd} (x)$ itself is stored in compact, contiguous arrays. On the CPU, each process/core iterates over its assigned block of cells, see Section 2.5, in a plain for-loop, thereby completely avoiding process-local race conditions without the need for synchronization primitives. In contrast, on the GPU, the reduction is accelerated by mapping each cell of $F_{i}^{hyd} (x)$ to exactly one CUDA or HIP thread. Each thread performs three reduction operations for the force components and three for the torque components into the corresponding global per-particle arrays, see lines 6 and 7 in Algorithm 4. To ensure correctness in the presence of concurrent updates from multiple threads, on CUDA, the reduction is implemented using atomic add, while on HIP unsafe atomic add is employed to bypass costly CAS loops and reduce synchronization overhead.
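The reduction of Equations (5) and (6) can be illustrated with a serial Python loop that mirrors the CPU variant described above. This is a sketch under simplifying assumptions: the force field is passed as a list of (cell position, particle index, force) entries and the accumulators are nested lists rather than flat arrays of length #Particles × 3. The names are hypothetical.

```python
def reduce_forces(force_field, positions):
    """Accumulate per-particle force and torque from per-cell contributions.

    force_field: iterable of (cell position, particle id, force vector)
    positions:   list of particle positions x_p,i
    """
    forces = [[0.0, 0.0, 0.0] for _ in positions]
    torques = [[0.0, 0.0, 0.0] for _ in positions]
    for cell, pid, f in force_field:
        # Lever arm from the particle position to the cell.
        arm = [cell[k] - positions[pid][k] for k in range(3)]
        torque = (arm[1] * f[2] - arm[2] * f[1],
                  arm[2] * f[0] - arm[0] * f[2],
                  arm[0] * f[1] - arm[1] * f[0])
        for k in range(3):
            # In a serial loop these updates are safe. With one GPU thread
            # per cell, each "+=" would have to be an atomic add.
            forces[pid][k] += f[k]
            torques[pid][k] += torque[k]
    return forces, torques


# Two opposite forces acting on opposite sides of one particle:
# the net force vanishes while the torques add up.
field = [((1.0, 0.0, 0.0), 0, (0.0, 1.0, 0.0)),
         ((-1.0, 0.0, 0.0), 0, (0.0, -1.0, 0.0))]
forces, torques = reduce_forces(field, [(0.0, 0.0, 0.0)])
```

All cells covered by the same particle write to the same six accumulator entries, which is why the number of cells per particle determines how much contention the atomic updates experience on the GPU.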

2.5. Implementation notes

The present suction bucket model is implemented in the massively parallel multiphysics framework waLBerla (Bauer et al., 2021a) (https://www.walberla.net). waLBerla employs a block-structured domain partitioning approach, where the simulation domain is subdivided into multiple uniform blocks, each of which is exclusively assigned to a single Message Passing Interface (MPI) process, enabling parallelism. On CPU nodes, the number of MPI processes is equal to the number of available cores, whereas on GPU nodes, the number of MPI processes is equal to the number of available GPUs or Graphics Compute Dies (GCDs). To maintain consistency across the domain, fluid and particle information must be exchanged between neighboring blocks. This inter-block communication is facilitated through a nearest-neighbor communication scheme, which is implemented using GPU-aware MPI, minimizing communication cost. More detailed implementation and parallelization information can be found in (Kemmler et al., 2025d). Providing highly optimized code for different combinations of LBM variants and hardware architectures is a challenging and time-consuming task when done manually. To address this, the code generation framework lbmpy (Bauer et al., 2021b; Hennig et al., 2023) is utilized for the PSM kernel, i.e., the fluid simulation. lbmpy automatically generates highly optimized lattice Boltzmann kernels for CPU using the general-purpose language C++, or for NVIDIA and AMD GPUs using their kernel languages CUDA and HIP, respectively. For the generated CUDA/HIP kernels, a standard structure-of-arrays memory layout is used. The use of code generation not only results in highly optimized code for different architectures but also guarantees that the same algorithm is deployed across all platforms. This consistency is essential for accurately comparing the performance of the simulation across different hardware architectures. 
Note that most cells in B_i( x ) would remain empty if a separate data structure were allocated for each particle i, since each particle occupies only a very limited spatial region. Therefore, in the actual implementation, all B_i( x ) fields are combined into a single data structure. The same approach is applied to u _i( x ) and $F_{i}^{hyd} (x)$ . The code, performance data, and corresponding scripts are available on Zenodo (Kemmler et al., 2025a).

3. Performance analysis

The performance analysis includes a description of the different hardware architectures employed, details about the problem sizes, single-node performance analysis, such as energy-to-solution and a Roofline performance model, as well as scalability analysis for both strong and weak scaling.

3.1. Benchmarking environment

For the performance analysis, the CPU and GPU partitions of the LUMI¹ and MareNostrum 5² EuroHPC supercomputers are used. In the following, the abbreviations LUMI-C, MN5-GPP, LUMI-G, and MN5-ACC will be used for the LUMI and MareNostrum 5 CPU and GPU partitions, respectively. The terms “general-purpose” and “accelerated” will be used for the CPU and GPU partitions, respectively. These HPC systems offer a diverse range of hardware architectures that are well-suited for HPC tasks: AMD EPYC 7763 CPUs, AMD MI250X GPUs, Intel Sapphire Rapids 8480+ CPUs, and NVIDIA H100 GPUs. The hardware configurations for each partition with detailed specifications for a single node are listed in Table 1, including the number of CPU and GPU units, the total Floating Point Operations per Second (FLOPS), the main memory size (either CPU or GPU memory), the memory bandwidth as specified in the data sheet, and the actual memory bandwidth achieved, as measured using a SCALE STREAM benchmark (McCalpin, 1995). Both general-purpose partitions comprise two CPU sockets per node, while both accelerated partitions are equipped with four GPUs per node. Note that in the MI250X architecture, each GPU contains two GCDs.

Table 1.

Hardware specifications per node: CPUs of general-purpose nodes, GPUs of accelerated nodes, corresponding FLOPS, main memory size and bandwidth. FLOPS of the CPU partitions are computed based on the number of cores, fused multiply-add units per core, register size, and turbo clock frequency.

Partition	Hardware	TFLOPS (FP64)	Main memory (GB)	Memory bandwidth, data sheet (GB/s)	Memory bandwidth, measured (GB/s)
LUMI-C	2 × AMD EPYC 7763 (64C each)	7.2	256	410	247
MN5-GPP	2 × Intel Sapphire Rapids 8480+ (56C each)	13.6	256	614	383
LUMI-G	4 × AMD MI250X (two GCDs each)	382.8	4 × 128	12,800	10,784
MN5-ACC	4 × NVIDIA H100	136	4 × 64	8000	5932

The two GCDs of the MI250X are connected via an Infinity Fabric link with a bidirectional bandwidth of 400 GB/s. Across different MI250X modules within a single node, the GCDs communicate using either a single or double Infinity Fabric connection, providing bidirectional bandwidths of 100 GB/s and 200 GB/s, respectively. In contrast, the H100 GPUs inside one MN5-ACC node are interconnected through NVLinks, with each GPU pair linked by six NVLinks, delivering a total interconnect bidirectional bandwidth of 300 GB/s. The network interconnects between different nodes in the systems are given in Table 2. Notably, accelerated nodes provide significantly higher bidirectional bandwidths compared to general-purpose nodes.

Table 2.

Network specifications per node: interconnect, bidirectional bandwidth, and topology.

Partition	Network interconnect	Bandwidth (Gbit/s)	Network topology
LUMI-C	1 × HPE Cray Slingshot-11	1 × 400	dragonfly
MN5-GPP	1 × NDR200 (per two nodes)	0.5 × 400	fat-tree
LUMI-G	4 × HPE Cray Slingshot-11	4 × 400	dragonfly
MN5-ACC	4 × NDR200	4 × 400	fat-tree

3.2. Single node simulation setup

The simulation domain as depicted in Figure 1, having a size of 448 × 224 × 896 = 89,915,392 fluid cells, is mapped onto one node of each partition. In addition to the fluid cells, 8796 particles are initialized within the domain. The particle bed extends up to a height of 650 fluid cells. In the following, the performance data is obtained for 500 time steps of the simulation. The domain is partitioned across the number of available CPU cores and GPUs on the node, with a corresponding number of MPI processes being initialized to handle the partitioning. As outlined in Table 3, the domain partitioning varies depending on the hardware configuration, i.e., the number of MPI processes in x-, y-, and z-direction is chosen differently for every partition such that the total number of MPI processes matches either the number of CPU cores or GPUs on the corresponding node. The problem size remains exactly the same across all partitions.

Table 3.

Domain partitioning on a single node.

Partition	Cores/GCDs per node	MPI processes (x)	MPI processes (y)	MPI processes (z)	Fluid cells per process (x)	Fluid cells per process (y)	Fluid cells per process (z)
LUMI-C	128	4	2	16	112	112	56
MN5-GPP	112	2	4	14	224	56	64
LUMI-G	8	2	1	4	224	224	224
MN5-ACC	4	1	1	4	448	224	224
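The block sizes in Table 3 follow from dividing the domain by the process grid in each direction. The short Python sketch below reproduces them from the values in the text and the table; the helper name `cells_per_process` is illustrative.

```python
import math

DOMAIN = (448, 224, 896)  # fluid cells in x, y, z

# MPI process grids (x, y, z) per partition, taken from Table 3.
PROCESS_GRIDS = {
    "LUMI-C": (4, 2, 16),
    "MN5-GPP": (2, 4, 14),
    "LUMI-G": (2, 1, 4),
    "MN5-ACC": (1, 1, 4),
}


def cells_per_process(domain, process_grid):
    """Return the uniform block size for a given process grid."""
    for n, p in zip(domain, process_grid):
        if n % p != 0:
            raise ValueError(f"{n} cells cannot be split evenly into {p} blocks")
    return tuple(n // p for n, p in zip(domain, process_grid))


for name, grid in PROCESS_GRIDS.items():
    block = cells_per_process(DOMAIN, grid)
    print(name, math.prod(grid), "processes,", block, "cells per process")
```

The divisibility check reflects the uniform block sizes of the partitioning: every process grid in Table 3 divides the domain without remainder.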

3.3. Single node performance analysis

The performance analysis is conducted at the node level, focusing on the computational efficiency and energy consumption across different hardware architectures. On the LUMI cluster, the “slurm” workload manager is used to obtain energy measurements via “sacct”, whereas on MareNostrum 5, energy data is collected using the “ear” tool. Note that these measurements do not account for the energy consumed by the broader infrastructure, such as network equipment and cooling systems. Figure 2 provides a comparison of the time and energy-to-solution for the setup described in Section 3.2 across the four partitions.

Figure 2.

(left) Time-to-solution and (right) energy-to-solution on a single node.

Both GPU nodes need less time and energy-to-solution compared to the CPU-based nodes. However, the decrease in energy-to-solution is not as significant as the decrease in time-to-solution. In addition, while the AMD CPU-based LUMI-C node exhibits a higher time-to-solution than the Intel CPU-based MN5-GPP node, its energy consumption is significantly lower. The AMD MI250X-based LUMI-G node is both faster and more energy efficient than the NVIDIA-based MN5-ACC node. Figure 3 illustrates how the time-to-solution is distributed among the simulation components.

Figure 3.

Shares of the simulation components in the total run time on a single node, with inner-node communication (Comm.), and load imbalances (Load imb.).

The total run time is decomposed into the algorithms “Fluid”, “Velocity”, “Mapping”, and “Reduction”, as described in Section 2. In addition, the breakdown includes communication between CPUs and GPUs within a node, load imbalance overheads, and other operations such as boundary handling and evaluation routines. Load imbalance overheads are measured explicitly by placing MPI barriers immediately before each communication step, so that waiting times caused by load imbalances are not attributed to the blocking MPI communication. Additionally, an MPI barrier is placed before the fluid simulation, which places high demands on the main memory bandwidth. Without this barrier, some processes could still be executing previous routines with lower memory pressure. The processes that have already entered the fluid simulation could then use a larger share of the memory bandwidth, artificially increasing the measured memory bandwidth. Such a discrepancy could lead to overly optimistic results in comparisons with a performance model, particularly on the CPU, where multiple processes share the same main memory connection. The fluid simulation is the computationally most expensive component across all architectures. The fraction of velocity computation, mapping, communication, and load imbalances remains similar across different hardware configurations. However, the reduction step is relatively more expensive on GPU systems. Similarly, the “Other” category is notably larger on GPU systems, primarily influenced by an evaluation routine that computes the differential pressure across the bucket wall.

In the following paragraph, an in-depth analysis of the high cost associated with the reduction step on the NVIDIA architecture is provided. On the CPU, reductions are implemented as simple for-loops without race conditions. In contrast, on the GPU, atomic add operations are employed to ensure thread safety during concurrent updates, see Section 2.4. As a result, the performance of the reduction kernel is strongly influenced by the underlying particle-to-cell ratio. If particles are small relative to the grid cell size, updates from different threads primarily target disjoint memory locations, thereby minimizing atomic contention. In this regime, the cost of atomic operations remains low. Conversely, when a small number of large particles occupy multiple cells, concurrent updates to the same memory address increase, resulting in heightened contention, longer latencies, and a significantly higher kernel run time. To analyze the reduction cost on the MN5-ACC node, it is essential to consider the domain decomposition. On the MN5-ACC node, the simulation domain is decomposed into four blocks in the z-direction, each assigned to a separate GPU, see Table 3. The lowest block, assigned to “GPU0”, is densely populated with particles, whereas the uppermost block, assigned to “GPU3”, primarily contains fluid and the bucket wall segment, see Figure 1 for the numerical setup. A comparison of the reduction performance between GPU0 and GPU3 was performed using NVIDIA’s Nsight Compute profiler.³ Interestingly, GPU0 executes nearly eight times as many atomic add operations as GPU3, yet requires only 2.26 ms per kernel invocation compared to 11.96 ms on GPU3. This behavior is further reflected in the measured memory throughput, which reaches 65.49 GB/s on GPU0 but drops to 2.71 GB/s on GPU3. The underlying cause is the disparity in memory stalls: for GPU0, 66.6 stalled cycles are reported on average waiting for memory, whereas 810.6 stalled cycles were measured on GPU3. 
These findings highlight that the high cost of reductions on GPUs arises from the use of atomic add operations, which induce memory contention, high latency, and reduced memory throughput when thread conflicts increase. An important insight from this study is that, for fully-resolved spherical particles with 20 fluid cells per spatial dimension, as used in the present setup, the reduction remains manageable. In this case, only a moderate number of threads reduce into the same memory address, thereby avoiding an all-to-one reduction. However, when individual large particles, such as the bucket wall segment in this configuration, are involved, the atomic add-based reduction can become a significant performance bottleneck. This observation underscores the need for further optimization of reduction strategies on GPUs, particularly for highly inhomogeneous particle distributions.

The evaluation routine for post-processing is performed on the CPU and, therefore, requires transferring the entire fluid field data from GPU to CPU. Although executed only once every 500 time steps, this operation constitutes a significant portion of the total run time, especially on the MN5-ACC node. Evaluation routines can become a bottleneck on GPU nodes, especially if they are executed on the CPU, as these evaluations require GPU-CPU memory transfers, and their run time accounts for a higher fraction of the overall run time since the overall run time on GPU nodes is lower. Load imbalances emerge due to heterogeneous domain compositions, where certain regions contain only fluid, while others are densely packed with particles, leading to highly variable workloads.

To analyze the fluid algorithm performance in detail, a Roofline performance model (Hager and Wellein, 2010) is employed in the following, estimating an upper bound for the achievable performance on each architecture. For assessing which parts of the fluid kernel cause which degradation from the Roofline performance, four different variations of Algorithm 3 are defined with increasing computational intensity:

• Kernel #1: Only line 6 of Algorithm 3, i.e., a pure LBM simulation. $C_{q}^{fluid}$ is chosen such that Floating Point Operations (FLOP) per byte is 1.21.

• Kernel #2: Only line 6 of Algorithm 3, with an active Smagorinsky LES contribution included in $C_{q}^{fluid}$ , resulting in 1.39 FLOP per byte.

• Kernel #3: Full Algorithm 3 with active LES, but conditions in lines 8 and 14 always evaluate to false, i.e., no particles being present in the domain, leading to 2.10 FLOP per byte.

• Kernel #4: Full Algorithm 3 with active LES, applied on the simulation domain depicted in Figure 1, resulting in 2.64 FLOP per byte, and leading to thread divergence on GPU.

It is important to note that the term 1 − B( x ) in line 6 of Algorithm 3 is removed in the Kernels #1 and #2, as it always evaluates to 1 in the absence of particles. In the implementation, this is realized via lbmpy by generating a dedicated LBM-only kernel variant, i.e., no PSM terms, rather than by executing the full PSM kernel with runtime conditionals. The workloads of the Kernels #1 to #4, both in terms of FLOP and memory transfer, and the corresponding best-case run times on the four partitions according to the Roofline model are given in Table 4.

Table 4.

Workload and resulting best-case run time according to the Roofline performance model for Kernels #1 to #4.

Kernel #	Total FLOP	Total memory transfer (B)	Best-case t on LUMI-C (s)	Best-case t on MN5-GPP (s)	Best-case t on LUMI-G (s)	Best-case t on MN5-ACC (s)
1	2.34 ⋅ 10¹³	1.94 ⋅ 10¹³	78.63	50.71	1.80	3.27
2	2.69 ⋅ 10¹³	1.94 ⋅ 10¹³	78.63	50.71	1.80	3.27
3	4.30 ⋅ 10¹³	2.05 ⋅ 10¹³	83.00	53.53	1.90	3.46
4	5.67 ⋅ 10¹³	2.15 ⋅ 10¹³	86.89	56.04	1.99	3.62

The best-case run time can be computed using the Roofline model as:

t = \max \left( \frac{\text{Total FLOP}}{\text{FLOPS}}, \frac{\text{Total memory transfer}}{\text{Memory bandwidth}} \right) .

(7)

Notably, in all cases, the code exhibits a memory-bound behavior, i.e., the minimal achievable run time is determined by dividing the total memory transfer volume by the measured main memory bandwidth, see Table 1. The total FLOP are computed as:

• Kernel #1: 89915392 fluid cells × 521 FLOP per cell × 500 simulated time steps.

• Kernel #2: 89915392 fluid cells × 598 FLOP per cell × 500 simulated time steps.

• Kernel #3: 89915392 fluid cells × 956 FLOP per cell × 500 simulated time steps.

• Kernel #4: 89915392 fluid cells × 956 FLOP per cell × 500 simulated time steps. In addition, 40031447 fluid cells being covered by solid × 683 FLOP per cell × 500 simulated time steps.

The FLOP per cell numbers are obtained from the source code. The data transfer volume is computed as:

• Kernels #1 and #2: 89915392 fluid cells × 27 unknowns per cell × 2 memory operations per unknown (read and write) × 8 bytes per double-precision value × 500 simulated time steps.

• Kernel #3: The volume of Kernels #1 and #2, plus 89915392 fluid cells × 3 read operations × 8 bytes per double-precision value × 500 simulated time steps.

• Kernel #4: The volume of Kernel #3, plus 40031447 fluid cells covered by solid × 2 fields to write or read ($F_{i}^{hyd}(x)$ and $u_i(x)$, respectively) × 3 spatial values per cell × 8 bytes per double-precision value × 500 simulated time steps.
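The workload bookkeeping and the Roofline bound of equation (7) can be reproduced with a few lines of Python. The following is a minimal sketch: the cell counts and per-cell costs are the ones listed above, whereas the function names, the bandwidth of 247 GB/s (back-calculated from the LUMI-C column of Table 4, not read from Table 1), and the peak rate of 2 TFLOP/s (a purely hypothetical placeholder) are assumptions made for illustration.

```python
# Workload per kernel, using the cell counts and per-cell costs listed in the text.
FLUID_CELLS = 89_915_392          # 448 x 224 x 896 fluid cells
SOLID_COVERED_CELLS = 40_031_447  # fluid cells covered by solid (Kernel #4 only)
TIME_STEPS = 500
BYTES_PER_DOUBLE = 8
FLOP_PER_CELL = {1: 521, 2: 598, 3: 956, 4: 956}
EXTRA_FLOP_PER_COVERED_CELL = 683  # Kernel #4 only


def total_flop(kernel: int) -> int:
    """Total floating-point operations of one kernel over all time steps."""
    flop = FLUID_CELLS * FLOP_PER_CELL[kernel] * TIME_STEPS
    if kernel == 4:
        flop += SOLID_COVERED_CELLS * EXTRA_FLOP_PER_COVERED_CELL * TIME_STEPS
    return flop


def total_bytes(kernel: int) -> int:
    """Total main-memory traffic of one kernel over all time steps."""
    # 27 unknowns per cell, each read once and written once.
    volume = FLUID_CELLS * 27 * 2 * BYTES_PER_DOUBLE * TIME_STEPS
    if kernel >= 3:
        # Three additional reads per fluid cell.
        volume += FLUID_CELLS * 3 * BYTES_PER_DOUBLE * TIME_STEPS
    if kernel == 4:
        # Two vector fields (3 components each) on solid-covered cells.
        volume += SOLID_COVERED_CELLS * 2 * 3 * BYTES_PER_DOUBLE * TIME_STEPS
    return volume


def best_case_runtime(kernel: int, peak_flops: float, bandwidth: float) -> float:
    """Roofline bound: the slower of the compute and memory limits, in seconds."""
    return max(total_flop(kernel) / peak_flops, total_bytes(kernel) / bandwidth)


# ASSUMPTIONS for illustration only (not taken from Table 1):
ASSUMED_BANDWIDTH = 247e9   # B/s, back-calculated from Table 4 (LUMI-C)
ASSUMED_PEAK_FLOPS = 2e12   # FLOP/s, hypothetical placeholder

for k in (1, 2, 3, 4):
    t = best_case_runtime(k, ASSUMED_PEAK_FLOPS, ASSUMED_BANDWIDTH)
    print(f"Kernel #{k}: {total_flop(k):.2e} FLOP, {total_bytes(k):.2e} B, t >= {t:.2f} s")
```

With these inputs the memory term dominates for every kernel, which is why Kernels #1 and #2 share the same bound despite their different FLOP counts.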

Figure 4 illustrates the achieved Roofline performance for Kernels #1 to #4 with the ideal efficiency highlighted by a dashed blue line.

Figure 4.

Fraction of Roofline performance achieved by Kernels #1 to #4. The ideal efficiency is highlighted by a dashed blue line.

As the FLOP per byte increases from Kernel #1 to #4, the fraction of the achieved Roofline performance tends to decrease on all architectures except the AMD EPYC CPU. This reduced bandwidth utilization at higher FLOP per byte is generally expected, since memory instructions are issued less efficiently, which lowers memory utilization. The most significant drop occurs from Kernel #3 to #4. As Kernel #4 introduces thread divergence, a significant reduction of memory efficiency is expected, especially on the GPU-based partitions. Accordingly, the MI250X exhibits the lowest memory efficiency for Kernel #4. In contrast, the NVIDIA H100 handles the complexity of Kernel #4 significantly better. Overall, the NVIDIA H100 demonstrates the best memory utilization of all considered architectures.
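The increase in FLOP per byte can be quantified directly from Table 4. The sketch below divides the tabulated total FLOP by the tabulated total memory transfer; the function and variable names are illustrative, and the inputs are the rounded table values, so the results are accurate to about two significant digits only.

```python
# (total FLOP, total memory transfer in bytes) per kernel, copied from Table 4.
TABLE_4 = {
    1: (2.34e13, 1.94e13),
    2: (2.69e13, 1.94e13),
    3: (4.30e13, 2.05e13),
    4: (5.67e13, 2.15e13),
}


def arithmetic_intensity(flop: float, nbytes: float) -> float:
    """FLOP performed per byte moved to or from main memory."""
    return flop / nbytes


intensities = {k: arithmetic_intensity(*v) for k, v in TABLE_4.items()}
for kernel, value in intensities.items():
    print(f"Kernel #{kernel}: {value:.2f} FLOP/B")
```

The intensity roughly doubles from about 1.2 FLOP/B for Kernel #1 to about 2.6 FLOP/B for Kernel #4, while the memory volume grows by only about 11%.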

3.4. Simulation setups for scaling analysis

The scaling setups are based on the configuration used for the single-node performance analysis, as shown in Figure 5 (middle), with the number of nodes ranging from 1 to 512. However, on MN5-ACC, the scaling is limited to 100 nodes due to system restrictions. The baseline setup used throughout the scaling study employs the same physics model as Kernel #4.

In the strong scaling analysis, the problem size remains fixed while the number of compute nodes increases. As a result, the workload per node decreases. Ideally, the total run time decreases in proportion to the number of nodes used. In this study, strong scaling results are presented for three different problem sizes. The small problem size consists of 448 × 224 × 896 fluid cells and is scaled from one to eight compute nodes, resulting in the setup shown in Figure 5 (left). The medium problem size comprises 896 × 448 × 1792 fluid cells and is scaled from eight to 64 nodes. The large problem size has 1792 × 896 × 3584 fluid cells and is scaled from 64 to 512 nodes.

In weak scaling, the problem size increases proportionally with the number of compute nodes, keeping the workload per node constant. Ideally, the total run time remains constant regardless of the number of nodes used. The weak scaling analysis starts with the setup shown in Figure 5 (middle) computed on a single node. The number of nodes is successively doubled from one to 512. The domain grows accordingly, with its extent doubled in each of the three spatial dimensions until it reaches 8 × 8 × 8 = 512 times the initial domain. For eight nodes, for example, the domain is doubled once in all three directions, resulting in the setup shown in Figure 5 (right). It is important to note that only the domain size and number of particles are scaled, while the bucket wall segment width remains unchanged.
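The per-node workload implied by these setups can be checked with a short script. This is a sketch under one explicit assumption: the text fixes the weak-scaling domain only at 1, 8, and 512 nodes, so the order in which the axes are doubled at intermediate node counts (round-robin over x, y, z) is hypothetical, as are all function names.

```python
BASE_DOMAIN = (448, 224, 896)  # single-node setup in fluid cells, Figure 5 (middle)

# Strong scaling: (domain, first node count, last node count) as stated in the text.
STRONG_SCALING = {
    "small": ((448, 224, 896), 1, 8),
    "medium": ((896, 448, 1792), 8, 64),
    "large": ((1792, 896, 3584), 64, 512),
}


def cells(domain):
    """Number of fluid cells in a box-shaped domain."""
    nx, ny, nz = domain
    return nx * ny * nz


def weak_scaling_domain(nodes, base=BASE_DOMAIN):
    """Domain for a power-of-two node count with constant cells per node."""
    if nodes < 1 or nodes & (nodes - 1):
        raise ValueError("node count must be a power of two")
    size = list(base)
    for i in range(nodes.bit_length() - 1):
        # ASSUMPTION: one axis is doubled per node doubling, in the order x, y, z.
        size[i % 3] *= 2
    return tuple(size)


for name, (domain, first, last) in STRONG_SCALING.items():
    print(f"{name}: {cells(domain) // first} -> {cells(domain) // last} cells per node")

for nodes in (1, 8, 512):
    print(f"weak scaling on {nodes} nodes: {weak_scaling_domain(nodes)}")
```

All three strong-scaling series start at the same load of 89915392 cells per node and end at one eighth of it, which makes their efficiencies directly comparable.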

Figure 5.

Numerical setup on (left) eight nodes in small strong scaling; (middle) a single node; (right) eight nodes in weak scaling. Black lines indicate the domain partitioning among the nodes. The visualization of the fluid phase is omitted for clarity.

3.5. Scaling analysis

This section analyzes the strong and weak scaling behavior of the application for different problem sizes.

3.5.1. Strong scaling

The results of the strong scaling analysis for the three problem sizes are presented in Figures 6 and 7. The ideal scaling behavior is highlighted by a dashed blue line in both figures.

Figure 6.

Strong scaling for three problem sizes up to 512 nodes.

Figure 7.

Strong scaling parallel efficiency for three problem sizes up to 512 nodes.

A similar strong scaling behavior was observed across all three problem sizes. In each case, the time-to-solution decreased as the number of nodes increased. The parallel efficiency on CPU-based nodes was higher than that on accelerated nodes. Specifically, after doubling the number of nodes three times, the parallel efficiency ranged from 90% on LUMI-C down to 40% on LUMI-G. Nevertheless, as expected based on the comparison of time-to-solution on a single node in Figure 2, the GPU partitions were significantly faster than the CPU partitions. For all three problem sizes, the GPU-based systems at their initial node count were faster than the CPU-based systems at their final node count, e.g., one GPU node was faster than eight CPU nodes for the small problem size.
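To relate these efficiencies to actual speedups, the following sketch assumes the usual strong-scaling definition, in which the efficiency is the ideal run time divided by the measured one relative to the smallest node count of a series. The function names and the run times in the usage comment are hypothetical; only the 90% and 40% values come from the text.

```python
def strong_scaling_efficiency(t_base, n_base, t_n, n):
    """Efficiency relative to the base run: (n_base * t_base) / (n * t_n)."""
    return (n_base * t_base) / (n * t_n)


def speedup_from_efficiency(efficiency, n_base, n):
    """Speedup over the base run that a given efficiency corresponds to."""
    return efficiency * n / n_base


# Three doublings of the node count (e.g. 1 -> 8 nodes); the ideal speedup is 8x.
for label, efficiency in (("LUMI-C", 0.90), ("LUMI-G", 0.40)):
    speedup = speedup_from_efficiency(efficiency, 1, 8)
    print(f"{label}: {speedup:.1f}x speedup on 8x the nodes")
```

For example, a hypothetical run that takes 100 s on one node and 31.25 s on eight nodes has `strong_scaling_efficiency(100.0, 1, 31.25, 8) == 0.4`, i.e., a 3.2-fold speedup instead of the ideal 8-fold.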

3.5.2. Weak scaling

The results of the weak scaling analysis are presented in Figures 8 and 9. The ideal scaling behavior is highlighted by a dashed blue line in both figures.

Figure 8.

Weak scaling up to 512 nodes.

Figure 9.

Weak scaling parallel efficiency up to 512 nodes.

The weak scaling behavior of LUMI-C is similar to that of MN5-GPP, and that of LUMI-G is similar to that of MN5-ACC. However, the CPU partitions differ from the GPU partitions. The overall time-to-solution was significantly shorter on the accelerated nodes than on the general-purpose nodes. Run time and parallel efficiency saturated on the CPU partitions LUMI-C and MN5-GPP. In contrast, no saturation was observed on LUMI-G. The parallel efficiency on CPU-based systems was higher than on LUMI-G for larger node counts: for 512 nodes, it was 92% on LUMI-C and 64% on LUMI-G. Because the overall computation took longer on the CPU partitions, the communication cost was less significant there than on the GPU partitions, and the parallel efficiency was therefore higher.

The parallel efficiency on the CPU partitions behaves as expected. At small node counts, a decrease in parallel efficiency is observed due to the increased cost of inter-process communication. As the number of nodes grows, the parallel efficiency saturates, since the communication pattern is dominated by nearest-neighbor exchanges. In contrast, two noteworthy features can be observed for the GPU partitions in Figure 9. First, on MN5-ACC up to four nodes, a superlinear scaling behavior is present, i.e., a parallel efficiency $> 1$. Second, on LUMI-G, no saturation of the parallel efficiency is observed.

The main difference between CPU and GPU partitions lies in the handling of reductions. Each process must locally reduce the hydrodynamic forces within its assigned block. On GPU systems, this local reduction relies on atomic add operations, which can introduce memory contention. This effect is particularly pronounced for processes containing the bucket wall segment. In contrast, on CPU partitions, the local reduction of each core’s domain is carried out serially, avoiding such contention.

The superlinear scaling observed at the beginning of the weak-scaling experiment on MN5-ACC can be explained by the initial distribution of the bucket wall segment across multiple processes. This effectively reduces the number of atomic add operations per process, thereby alleviating memory contention and leading to improved parallel performance. For larger node counts on LUMI-G, the opposite effect is seen: the bucket wall segment becomes more concentrated on certain GPUs, increasing the number of atomic add operations per process and thus preventing saturation of the parallel efficiency. The decreasing parallel efficiency on LUMI-G is analyzed in more detail in the following.
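To put the reported efficiencies into perspective, assume the usual weak-scaling definition, in which the efficiency on $N$ nodes is the single-node run time divided by the run time on $N$ nodes. The symbols $E_{weak}$ and $T$ are introduced here for illustration only. The efficiencies at 512 nodes then translate into the following run-time growth:

$$ E_{weak}(N) = \frac{T(1)}{T(N)} \quad \Rightarrow \quad \frac{T(512)}{T(1)} = \frac{1}{E_{weak}(512)} = \frac{1}{0.92} \approx 1.09 \ \text{(LUMI-C)}, \qquad \frac{1}{0.64} \approx 1.56 \ \text{(LUMI-G)}. $$

Under this definition, the 512-fold larger problem takes about 9% longer than the single-node case on LUMI-C and about 56% longer on LUMI-G.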

This paragraph provides more details on how physical variations of the setup and optimization techniques affect the weak scaling parallel efficiency on LUMI-G reported in Figure 9, which will be denoted as “base case” in the following. The following variations from the base case are considered: increasing the bucket wall width during the scaling, removing particle dynamics, and removing the bucket wall segment, which results in porous media. Furthermore, communication hiding for the fluid field communication between neighboring blocks is implemented. Figure 10 compares the LUMI-G weak-scaling efficiency for these setup variations and optimization techniques.

Figure 10.

Weak scaling parallel efficiency on LUMI-G for different setups and code optimizations.

When the bucket wall width was included in the scaling, the parallel efficiency decreased significantly. In contrast, the parallel efficiency increased incrementally for the other setup variations and optimizations applied, with the most significant improvement occurring when the bucket wall was removed. With all setup simplifications and optimizations applied, the parallel efficiency reached up to 89% on 512 nodes. The reduction of the forces on the bucket wall has a significant influence on the scaling because it accumulates the contributions of a large number of threads into a single variable. The remaining gap in parallel efficiency between 89% and 100% can be attributed to increasing communication costs that cannot be fully hidden beyond a certain point. Furthermore, load imbalances arose due to higher particle concentrations in deeper layers of the sediment bed, caused by increasing geomechanical stresses.

3.6. Discussion and lessons learned

The comparison between CPU and GPU partitions on a single node revealed notable differences in both time and energy-to-solution, with GPU partitions outperforming CPU partitions in both metrics. This is largely attributed to the significantly more powerful hardware of the GPUs. However, the energy savings were smaller than the time savings: the energy-to-solution of the accelerated nodes was on average 82% lower than on the general-purpose nodes, while the time-to-solution was reduced by 92%. Such trends have been observed in the literature, where a single-GPU setup showed a 1.77× improvement in time-to-solution over a CPU, but only a 1.49× improvement in energy-to-solution, a discrepancy attributed to the higher power consumption of the GPU (Cadenelli et al., 2019). Other studies explain similar observations by the presence of host CPUs in GPU nodes, which still draw a significant amount of power even if they are barely used during the computation (Calore et al., 2016).
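The relation between the two averages can be made explicit with a short calculation. It takes the reported mean reductions at face value, which is an assumption since a ratio of averages need not equal the average of ratios, and writes $\bar{P} = E / t$ for the average node power, a symbol introduced here for illustration:

$$ \frac{t_{GPU}}{t_{CPU}} = 1 - 0.92 = 0.08, \qquad \frac{E_{GPU}}{E_{CPU}} = 1 - 0.82 = 0.18, \qquad \frac{\bar{P}_{GPU}}{\bar{P}_{CPU}} = \frac{E_{GPU} / t_{GPU}}{E_{CPU} / t_{CPU}} = \frac{0.18}{0.08} = 2.25. $$

Under this reading, the accelerated nodes are about 12.5 times faster ($1/0.08$) while drawing roughly 2.25 times the average power, which leaves an energy gain of about 5.6 times ($1/0.18$). The same calculation applied to the cited single-GPU result gives a power ratio of $1.77 / 1.49 \approx 1.19$.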

It is important to note that, despite the superior time-to-solution of GPU nodes, the overall hardware costs of GPU-based nodes are several times higher than those of CPU-based nodes. As a result, when considering the cost-effectiveness of GPU nodes, one must balance the performance gains against the higher initial hardware investment.

Additionally, reduction operations on GPUs are inherently more challenging than on CPUs, largely due to the need for efficient synchronization across numerous parallel threads. Using warp-level reductions before the atomic add for computing hydrodynamic forces on the bucket wall segment, thereby reducing each warp to a single atomic operation, resulted in a speedup on the H100 node. However, this approach led to slower reductions on the MI250X. Despite these differences in reduction speed, the choice of reduction method had no significant effect on parallel efficiency in the weak scaling tests. When comparing the performance of the accelerated nodes, note that the H100 nodes have only four GPUs, whereas the MI250X nodes are equipped with eight GCDs.

Furthermore, evaluation routines can significantly impact the performance of GPU nodes, as communication between CPU and GPU, input and output operations, and serialization can account for a substantial portion of the overall run time. These factors should be considered when analyzing GPU performance in real-world applications. To optimize performance, it is essential to minimize GPU-CPU memory transfers. Therefore, evaluation routines, which are often computed on the CPU, should be analyzed for potential execution on the GPU as well to prevent them from becoming a bottleneck.

It was found that obtaining good performance on the AMD MI250X GPU was more challenging than on the NVIDIA H100. The MI250X exhibited significantly lower main memory bandwidth, particularly when thread divergence occurred, which is consistent with findings in the literature (Lehmann et al., 2022). Moreover, advanced API functionality, such as unified memory, was less performant in HIP than in CUDA, further complicating optimization efforts. Another critical consideration for GPU performance was register spilling, particularly when utilizing techniques like common subexpression elimination (CSE). While CSE reduced redundant computations, it increased register pressure, leading to excessive spilling to local memory, a problem that was less pronounced on CPU architectures.

In the strong scaling analysis, a significant decrease in parallel efficiency was observed, which was more pronounced on GPU-based nodes. The decrease was caused by the problem size eventually becoming too small to utilize the computational resources effectively, while the communication overhead increased. This effect is expected to be more pronounced on GPU-based systems, as they tend to be less efficient for small workloads. These findings align with those in recent literature (Karp et al., 2023; Kemmler et al., 2025d; Min et al., 2024). Similar behavior was observed in weak scaling, where the parallel efficiency of GPU nodes decreased more significantly. The performance of GPU systems in scaling was also closely linked to the physical setup, with small changes in configuration or workload distribution potentially leading to significant differences in performance, especially at large node counts.

Future work will focus on mitigating the observed load imbalance and memory contention during the reduction of hydrodynamic forces and torques on the bucket wall segment. In particular, load balancing techniques are to be explored, ranging from static approaches, suitable when sediment transport or fluidization remains moderate, to dynamic methods such as physics-aware domain decomposition. These approaches will assign subdomain weights based on local particle concentration and bucket-wall surface area, with optional dynamic remapping at runtime to maintain balanced workloads as sediment distribution evolves. On the GPU side, performance improvements due to employing more advanced reduction techniques beyond simple atomic add are to be studied in more detail. This includes privatized per-warp or per-block accumulators with shared-memory or tree-based reductions prior to a single global update. Additionally, the necessity for the computation of hydrodynamic forces on the bucket wall segment should be carefully assessed. When such forces are not strictly necessary for capturing the relevant physics due to minimal movement, they may be selectively omitted or approximated using surrogate models to trade negligible accuracy loss for substantial performance gains.
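The idea of reducing within a warp before issuing a single global update can be illustrated by counting operations. The sketch below is a plain Python emulation, not GPU code and not the implementation used in this work: it mimics a shuffle-down tree reduction per warp and counts how many updates reach the shared accumulator. All function names are hypothetical, and the warp size must be a power of two (for instance 32, or 64 for wavefronts).

```python
def atomic_add_reduction(values):
    """Baseline: every thread issues one atomic add on a single accumulator."""
    total, atomic_ops = 0.0, 0
    for value in values:
        total += value
        atomic_ops += 1
    return total, atomic_ops


def warp_reduce(lanes, warp_size):
    """Tree reduction over one warp, mimicking successive shuffle-down steps."""
    # Inactive lanes of a partially filled warp contribute zero.
    lanes = list(lanes) + [0.0] * (warp_size - len(lanes))
    offset = warp_size // 2
    while offset > 0:
        for lane in range(offset):
            lanes[lane] += lanes[lane + offset]
        offset //= 2
    return lanes[0]


def warp_then_atomic_reduction(values, warp_size=32):
    """Reduce each warp locally, then issue one atomic add per warp."""
    total, atomic_ops = 0.0, 0
    for start in range(0, len(values), warp_size):
        total += warp_reduce(values[start:start + warp_size], warp_size)
        atomic_ops += 1
    return total, atomic_ops


forces = [float(i) for i in range(1000)]  # hypothetical per-thread force contributions
print(atomic_add_reduction(forces))            # one atomic add per thread
print(warp_then_atomic_reduction(forces, 32))  # one atomic add per warp
```

The emulation only shows that the number of contended updates drops from one per thread to one per warp. It does not model timing, and the slower reductions observed on the MI250X indicate that fewer atomic operations do not automatically translate into shorter run times.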

4. Conclusion

Fully-resolved coupled fluid-particle simulations represent a powerful approach to gain insights into complex multiphysics systems such as suction bucket foundations for offshore wind turbines. Nonetheless, the computational requirements of these simulations are substantial due to their high resolution and, therefore, the use of highly efficient and scalable implementations on modern supercomputing architectures becomes necessary. This work has provided a detailed description of the algorithmic characteristics of the dominant modules of this granular fluidization application along with a comprehensive performance analysis on four distinct hardware architectures, namely AMD EPYC 7763, AMD MI250X, Intel Sapphire Rapids 8480+, and NVIDIA H100. The performance was quantified in terms of time and energy-to-solution, utilizing a Roofline performance model for the fluid simulation, and evaluated under strong and weak scaling scenarios on up to 512 nodes. The obtained results have shown that:

• GPU systems outperform CPU systems both in time and energy-to-solution, as expected, but the energy savings were less pronounced compared to the reduction in time-to-solution.

• Reaching the maximum main memory bandwidth becomes more challenging with increasing kernel complexity in terms of FLOP per byte and branching. The NVIDIA H100 achieved the best main memory efficiency. The AMD MI250X exhibited the lowest main memory efficiency, but achieved the lowest energy-to-solution.

• The implemented model demonstrated its potential for good scalability on modern supercomputing architectures, with parallel efficiencies reaching up to 92% in a weak scaling test on LUMI-C.

• Although the GPU systems show a superior performance for compute-intensive kernels, they are also more susceptible to scalability degradation from evaluation routines, communication overhead, and reduction operations under certain conditions, for instance when the reduced portion of the domain represents a substantial fraction of the whole domain, as observed for the bucket wall segment.

• Despite the lower energy-to-solution on accelerated nodes, the significantly higher hardware cost of accelerated nodes and reduced parallel efficiency might result in an overall higher cost-to-solution for large-scale simulations on GPU-based systems when both hardware and energy costs are considered.

• Parallel efficiency is closely linked to the physical setup, with small modifications potentially causing significant performance differences at large node counts.

The obtained results underscore the importance of application-specific performance evaluation rather than relying solely on benchmark applications.

Footnotes

Acknowledgements

We acknowledge the EuroHPC Joint Undertaking for awarding this project access to the EuroHPC supercomputer LUMI, hosted by CSC (Finland) and the LUMI consortium through a EuroHPC Development Access call. We acknowledge the EuroHPC Joint Undertaking for awarding this project access to the EuroHPC supercomputer MareNostrum 5, hosted by BSC (Spain) through a EuroHPC Development Access call.

ORCID iDs

Samuel Kemmler

Antoni Artinov

Pablo Cuéllar

Harald Köstler

Author contributions

S. Kemmler: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data Curation, Writing - Original Draft, Visualization, Project administration; A. Artinov: Conceptualization, Writing - Review and Editing; P. Cuéllar: Conceptualization, Writing - Review and Editing; H. Köstler: Resources, Writing - Review and Editing, Supervision, Funding acquisition.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work has received funding from the European High Performance Computing Joint Undertaking (JU) and Sweden, Germany, Spain, Greece, and Denmark under grant agreement No 101093393.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

Data is available on Zenodo: https://doi.org/10.5281/zenodo.15063619 (Kemmler et al., 2025a).

Author biographies

Samuel Kemmler is a research associate at the Bundesanstalt für Materialforschung und -prüfung (BAM) in Berlin and a Ph.D. student at the Chair for System Simulation at the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). He holds an M.Sc. degree in Computational Engineering and is one of the core developers of the waLBerla HPC framework. His research interests are HPC and particle-resolved simulations.

Antoni Artinov studied Engineering Science at TU Berlin and received his Ph.D. in 2024 on the mathematical analysis of laser beam welding. He is a researcher in buildings and structures for wind and railways at BAM in Berlin.

Pablo Cuéllar studied Civil Engineering at the Universidad Politécnica de Madrid and received his Ph.D. in Civil Engineering from Technical University Berlin in 2011. He is a member of the scientific staff at Bayerisches Landesamt für Umwelt.

Harald Köstler received his Ph.D. in Computer Science in 2008 on variational models and parallel multigrid methods in medical image processing. In 2014, he completed his habilitation on Efficient Numerical Algorithms and Software Engineering for HPC. Currently, he works at the Chair for System Simulation at the FAU. His research interests include software engineering concepts, especially using code generation for simulation software on HPC clusters, multigrid methods, and programming techniques for parallel hardware, especially GPUs. The application areas are computational fluid dynamics, rigid body dynamics, and medical imaging.

References

Bauer, Eibl, Godenschwager, et al. (2021a) waLBerla: a block-structured high-performance framework for multiphysics simulations. Computers & Mathematics with Applications 81: 478–501. https://doi.org/10.1016/j.camwa.2020.01.007

Bauer, Köstler and Rüde (2021b) lbmpy: automatic code generation for efficient parallel lattice Boltzmann methods. Journal of Computational Science 49: 101269. https://doi.org/10.1016/j.jocs.2020.101269

Benseghier, Cuéllar, Luu, et al. (2020) A parallel GPU-based computational framework for the micromechanical analysis of geotechnical and erosion problems. Computers and Geotechnics 120: 103404. https://doi.org/10.1016/j.compgeo.2019.103404

Biegert, Vowinckel and Meiburg (2017) A collision model for grain-resolving simulations of flows over dense, mobile, polydisperse granular sediment beds. Journal of Computational Physics 340: 105–127. https://doi.org/10.1016/j.jcp.2017.03.035

Cadenelli, Jaksić, Polo, et al. (2019) Considerations in using OpenCL on GPUs and FPGAs for throughput-oriented genomics workloads. Future Generation Computer Systems 94: 148–159. https://doi.org/10.1016/j.future.2018.11.028

Calore, Gabbana, Kraus, et al. (2016) Massively parallel lattice–Boltzmann codes on large GPU clusters. Parallel Computing 58: 1–24. https://doi.org/10.1016/j.parco.2016.08.005

Costa, Boersma, Westerweel, et al. (2015) Collision model for fully resolved simulations of flows laden with finite-size particles. Physical Review E 92(5): 053012. https://doi.org/10.1103/PhysRevE.92.053012

Froiio, Callari and Rotunno (2019) A numerical experiment of backward erosion piping: kinematics and micromechanics. Meccanica 54(14): 2099–2117. https://doi.org/10.1007/s11012-019-01071-7

Fukumoto and Ohtsuka (2018) 3-D direct numerical model for failure of non-cohesive granular soils with upward seepage flow. Computational Particle Mechanics 5(4): 443–454. https://doi.org/10.1007/s40571-017-0180-5

Fukumoto, Yang, Hosoyamada, et al. (2021) 2-D coupled fluid-particle numerical analysis of seepage failure of saturated granular soils around an embedded sheet pile with no macroscopic assumptions. Computers and Geotechnics 136: 104234. https://doi.org/10.1016/j.compgeo.2021.104234

Hager and Wellein (2010) Introduction to High Performance Computing for Scientists and Engineers. 1st edition. CRC Press. https://doi.org/10.1201/EBK1439811924

Hennig, Holzer and Rüde (2023) Advanced automatic code generation for multiple relaxation-time lattice Boltzmann methods. SIAM Journal on Scientific Computing 45(4): C233–C254. https://doi.org/10.1137/22M1531348

Holzer, Bauer, Köstler, et al. (2021) Highly efficient lattice Boltzmann multiphase simulations of immiscible fluids at high-density ratios on CPUs and GPUs through code generation. The International Journal of High Performance Computing Applications 35(4): 413–427. https://doi.org/10.1177/10943420211016525

Jiang, Matsumura, Ohgi, et al. (2021) A GPU-accelerated fluid–structure-interaction solver developed by coupling finite element and lattice Boltzmann methods. Computer Physics Communications 259: 107661. https://doi.org/10.1016/j.cpc.2020.107661

Jiang, Liu, Chen, et al. (2022) A coupled LBM-DEM method for simulating the multiphase fluid-solid interaction problem. Journal of Computational Physics 454: 110963. https://doi.org/10.1016/j.jcp.2022.110963

Jones and Williams (2017) Fast computation of accurate sphere-cube intersection volume. Engineering Computations 34(4): 1204–1216. https://doi.org/10.1108/EC-02-2016-0052

Karp, Massaro, Jansson, et al. (2023) Large-scale direct numerical simulations of turbulence using GPUs and modern Fortran. The International Journal of High Performance Computing Applications 37(5): 487–502. https://doi.org/10.1177/10943420231158616

Kemmler, Artinov, Cuéllar, et al. (2025a) Towards Exascale Simulations of Granular Fluidization in Offshore Wind Turbine Foundations. Zenodo. https://doi.org/10.5281/zenodo.15063619

Kemmler, Cuéllar, Artinov, et al. (2025b) A fully-resolved micromechanical simulation of piping erosion during a suction bucket installation. Computers and Geotechnics 186: 107375. https://doi.org/10.1016/j.compgeo.2025.107375

Kemmler, Cuéllar, Rettinger, et al. (2025c) A fluid-solid coupled micromechanical simulation for the analysis of piping erosion during the seabed installation of a suction bucket foundation. IOP Conference Series: Earth and Environmental Science 1480(1). https://doi.org/10.1088/1755-1315/1480/1/012024

Kemmler, Rettinger, Rüde, et al. (2025d) Efficiency and scalability of fully-resolved fluid-particle simulations on heterogeneous CPU-GPU architectures. The International Journal of High Performance Computing Applications 39(3): 345–363. https://doi.org/10.1177/10943420241313385

Kulkarni, Kemmler, Schwartz, et al. (2026) Harvesting energy consumption on European HPC systems: Sharing experience from the CEEC project. Proceedings of the Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region Workshops. https://doi.org/10.1145/3784828.3785161

Lehmann, Krause, Amati, et al. (2022) Accuracy and performance of the lattice Boltzmann method with 64-bit, 32-bit, and customized 16-bit number formats. Physical Review E 106(1): 015308. https://doi.org/10.1103/PhysRevE.106.015308

McCalpin (1995) Memory bandwidth and machine balance in current high performance computers. In: IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter 2(19). IEEE.

Min, Brazell, Tomboulides, et al. (2024) Towards exascale for wind energy simulations. The International Journal of High Performance Computing Applications 38(4): 337–355. https://doi.org/10.1177/10943420241252511

Noble and Torczynski (1998) A lattice-Boltzmann method for partially saturated computational cells. International Journal of Modern Physics C 09(08): 1189–1201. https://doi.org/10.1142/S0129183198001084

Ragni, Bienen, Stanier, et al. (2020) Observations during suction bucket installation in sand. International Journal of Physical Modelling in Geotechnics 20(3): 132–149. https://doi.org/10.1680/jphmg.18.00071

Rettinger and Rüde (2022) An efficient four-way coupled lattice Boltzmann – discrete element method for fully resolved simulations of particle-laden flows. Journal of Computational Physics 453: 110942. https://doi.org/10.1016/j.jcp.2022.110942

Schwarzmeier, Rettinger, Kemmler, et al. (2023) Particle-resolved simulation of antidunes in free-surface flows. Journal of Fluid Mechanics 961: R1. https://doi.org/10.1017/jfm.2023.262

Sturm (2017) Design aspects of suction caissons for offshore wind turbine foundations. International Conference on Soil Mechanics and Geotechnical Engineering 19: 45–63.

Suarez, Bockelmann, Eicker, et al. (2025) Energy-aware operation of HPC systems in Germany. Frontiers in High Performance Computing 3: 1520207. https://doi.org/10.3389/fhpcp.2025.1520207

Vysocky, Holzer, Staffelbach, et al. (2024) Energy-efficient implementation of the lattice Boltzmann method. Energies 17(2): 502. https://doi.org/10.3390/en17020502