Abstract
Fully-resolved micromechanical simulations coupling the Lattice Boltzmann Method (LBM) with the Discrete Element Method (DEM) provide high-fidelity insights into granular fluidization. However, the substantial computational demands of such simulations require efficient implementations on supercomputing architectures. This work presents a comprehensive performance analysis of a fully-resolved LBM-DEM model to study granular fluidization, i.e., piping erosion, during the installation of an offshore caisson foundation. The performance evaluation focuses on real-world workloads rather than simplified benchmark problems to obtain realistic performance insights. The study considers a diverse range of state-of-the-art HPC hardware architectures, namely the LUMI and MareNostrum 5 EuroHPC supercomputers, both CPU and GPU partitions. The results demonstrate that GPU-based systems generally outperform the CPU-based systems. Strong and weak scaling analyses were conducted with up to 512 nodes, and parallel efficiencies reached up to 92%. Nonetheless, the results also indicate that atomic add operations on GPUs can become a bottleneck for the parallel efficiency at large scales. Moreover, the study reveals a close link between physical variations of the setup and significant scaling implications, underscoring the need to consider physical model characteristics in scaling assessments.
Keywords
1. Introduction
Suction buckets have emerged as an increasingly popular foundation for offshore wind turbines, see Figure 1. These foundations are installed by inducing a suction pressure within the bucket and leveraging the structure’s self-weight to embed the bucket into the seabed. In contrast to conventional offshore wind foundations, suction buckets have numerous advantages, including fast and cost-efficient installation and environmentally friendly characteristics, such as low noise emissions during the installation process (Sturm, 2017). However, the installation process of these types of foundations involves considerable challenges, such as the risk of piping erosion, i.e., the local fluidization of soil particles beneath the foundation. This can result in a loss of suction pressure and ultimately lead to installation failure. The phenomenon of piping erosion is driven by complex physical interactions that are difficult to analyze and predict in practice due to the general lack of accurate models and the scarcity of available experimental data (Ragni et al., 2020).
Figure 1. (left) Image and schematic of an offshore wind turbine with a suction bucket foundation; (middle) three-dimensional numerical setup; (right) schematic representation of a fully-resolved particle.
By focusing on the physical phenomena and interactions at the grain scale, numerical micromechanical simulations provide a convenient and detailed insight into the onset and triggering conditions of granular fluidization. To study piping erosion in the context of offshore wind turbine foundations, a fully-resolved numerical model representing a sub-region of a suction bucket foundation is presented in (Kemmler et al., 2025b, 2025c), see Figure 1. In such fully-resolved simulations, one particle spans over several fluid cells. According to the literature,
In this paper, an extensive performance analysis of a fully-resolved micromechanical simulation is presented. The comparison is conducted on the CPU and GPU partitions of the state-of-the-art EuroHPC supercomputers LUMI and MareNostrum 5 and delivers insights by addressing the following aspects: • • •
2. Algorithms
In this section, the pseudocodes for the key simulation components required in every time step are presented. The fluid phase in the suction bucket model is represented using the LBM. The Lagrangian particles are modeled with the Discrete Element Method (DEM). Each time step consists of the following main components: • • • •
Note that only the computationally expensive kernels, which contribute significantly to the run time, are described; less impactful components such as particle dynamics and boundary handling are omitted. These computationally expensive algorithms are implemented for both CPUs and GPUs, with loops over fluid cells implemented as for-loops on the CPU and as kernel calls on the GPU.
2.1. Particle mapping
2.2. Particle velocity field computation
Algorithm 2 computes the velocity induced by the particle motion into the fluid phase. Thereby, the particle velocity field
2.3. Fluid dynamics
The LBM discretizes the fluid domain using a Cartesian grid, where each fluid cell
LBM is particularly efficient for parallel computations, exhibiting a 1:1 read-to-write ratio, and being memory-bound on most hardware architectures (Holzer et al., 2021; Lehmann et al., 2022).
The Partially Saturated Cells Method (PSM) (Noble and Torczynski, 1998) extends the LBM algorithm to couple moving geometries into the fluid. The PSM is an established methodology in the context of sediment transport simulations, especially on GPU systems (Benseghier et al., 2020; Fukumoto et al., 2021) and therefore used here. In the following, it is assumed that not more than two particles can overlap with a single fluid cell, a reasonable assumption for fully-resolved spherical particles, since their spatial extent does not permit more than two particles in a single fluid cell in most cases. The inputs to Algorithm 3 are the PDFs field from the previous time step
2.4. Hydrodynamic force and torque reduction
Algorithm 4 aggregates the hydrodynamic force field
and the hydrodynamic torque on particle i
In the implementation, the hydrodynamic force and torque reduction is realized differently on CPU and on GPU, while maintaining a consistent data layout across both architectures to facilitate portability and performance. In both cases, compact and contiguous arrays are employed to hold the per-particle accumulators for
2.5. Implementation notes
The present suction bucket model is implemented in the massively parallel multiphysics framework waLBerla (Bauer et al., 2021a) (https://www.walberla.net). waLBerla employs a block-structured domain partitioning approach, where the simulation domain is subdivided into multiple uniform blocks, each of which is exclusively assigned to a single Message Passing Interface (MPI) process, enabling parallelism. On CPU nodes, the number of MPI processes is equal to the number of available cores, whereas on GPU nodes, the number of MPI processes is equal to the number of available GPUs or Graphics Compute Dies (GCDs). To maintain consistency across the domain, fluid and particle information must be exchanged between neighboring blocks. This inter-block communication is facilitated through a nearest-neighbor communication scheme, which is implemented using GPU-aware MPI, minimizing communication cost. More detailed implementation and parallelization information can be found in (Kemmler et al., 2025d). Providing highly optimized code for different combinations of LBM variants and hardware architectures is a challenging and time-consuming task when done manually. To address this, the code generation framework
3. Performance analysis
The performance analysis comprises a description of the hardware architectures employed, details about the problem sizes, a single-node performance analysis including energy-to-solution and a Roofline performance model, and a scalability analysis for both strong and weak scaling.
3.1. Benchmarking environment
Hardware specifications per node: CPUs of general-purpose nodes, GPUs of accelerated nodes, corresponding FLOPS, main memory size and bandwidth. FLOPS of the CPU partitions are computed based on the number of cores, fused multiply-add units per core, register size, and turbo clock frequency.
Network specifications per node: interconnect, bidirectional bandwidth, and topology.
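The peak FLOPS computation described for the CPU partitions can be sketched as follows. This is a minimal illustration, assuming double precision and one fused multiply-add (two FLOP) per SIMD lane and cycle; the function name and the example node parameters are hypothetical and are not taken from the hardware table.

```python
def peak_flops(cores, fma_units_per_core, register_bits, clock_hz, word_bits=64):
    """Theoretical peak FLOP/s of a CPU node.

    Each FMA unit is assumed to retire one fused multiply-add
    (2 FLOP) per SIMD lane and clock cycle.
    """
    lanes = register_bits // word_bits  # doubles per SIMD register
    return cores * fma_units_per_core * 2 * lanes * clock_hz


# Hypothetical node (illustration only): 128 cores, 2 FMA units per core,
# 256-bit registers, 3.5 GHz turbo clock frequency.
example = peak_flops(cores=128, fma_units_per_core=2,
                     register_bits=256, clock_hz=3.5e9)
print(f"{example / 1e12:.3f} TFLOP/s")  # 7.168 TFLOP/s
```

Because the turbo clock frequency enters linearly, this value is an upper limit that sustained all-core workloads typically do not reach.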
3.2. Single node simulation setup
Domain partitioning on a single node.
3.3. Single node performance analysis
The performance analysis is conducted at the node level, focusing on the computational efficiency and energy consumption across different hardware architectures. On the LUMI cluster, the “slurm” workload manager is used to obtain energy measurements via “sacct”, whereas on MareNostrum 5, energy data is collected using the “ear” tool. Note that these measurements do not account for the energy consumed by the broader infrastructure, such as network equipment and cooling systems. Figure 2 provides a comparison of the time and energy-to-solution for the setup described in Section 3.2 across the four partitions.
Figure 2. (left) Time-to-solution and (right) energy-to-solution on a single node.
Both GPU nodes need less time and energy-to-solution compared to the CPU-based nodes. However, the decrease in energy-to-solution is not as significant as the decrease in time-to-solution. In addition, while the AMD CPU-based LUMI-C node exhibits a higher time-to-solution than the Intel CPU-based MN5-GPP node, its energy consumption is significantly lower. The AMD MI250X-based LUMI-G node is both faster and more energy efficient than the NVIDIA-based MN5-ACC node. Figure 3 illustrates how the time-to-solution is distributed among the simulation components.
Figure 3. Shares of the simulation components in the total run time on a single node, with inner-node communication (Comm.) and load imbalances (Load imb.).
The total run time is decomposed into the algorithms “Fluid”, “Velocity”, “Mapping”, and “Reduction”, as described in Section 2. In addition, the breakdown includes communication between CPUs and GPUs within a node, load imbalance overheads, and other operations such as boundary handling and evaluation routines. The load imbalance overheads are explicitly accounted for through MPI barriers right in front of communications to avoid load imbalances being included in the measurements of the blocking MPI communications. Additionally, an MPI barrier is placed before the fluid simulation, which exerts significant utilization on the main memory bandwidth. Without this barrier, processes could still be in previous routines with lower memory pressure. This would allow the processes being already in the fluid simulation to utilize more memory bandwidth, artificially increasing the measured memory bandwidth. Such a discrepancy could lead to overly optimistic results in comparisons with a performance model, particularly on the CPU, where multiple processes share the same main memory connection. The fluid simulation is the computationally most expensive component across all architectures. The fraction of velocity computation, mapping, communication, and load imbalances remains similar across different hardware configurations. However, the reduction step is relatively more expensive on GPU systems. Similarly, the “Other” category is notably larger on GPU systems, primarily influenced by an evaluation routine that computes the differential pressure across the bucket wall.
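The effect of placing a barrier in front of a communication step can be illustrated with a small model: every process waits at the barrier until the slowest one arrives, so the waiting time is booked as load imbalance instead of communication. The sketch below is a simplified serial model under that assumption; the function names and the per-process times are hypothetical.

```python
def barrier_wait_times(compute_times):
    """Time each process idles at a barrier that follows its compute phase."""
    slowest = max(compute_times)
    return [slowest - t for t in compute_times]


def load_imbalance_share(compute_times):
    """Fraction of the phase, averaged over processes, spent waiting."""
    waits = barrier_wait_times(compute_times)
    return sum(waits) / (len(compute_times) * max(compute_times))


# Hypothetical per-process times in seconds; the particle-dense block is slowest.
times = [4.0, 3.0, 2.0, 1.0]
print(barrier_wait_times(times))    # [0.0, 1.0, 2.0, 3.0]
print(load_imbalance_share(times))  # 0.375
```

Without the barrier, the same waiting time would be hidden inside the blocking communication call and would inflate the measured communication cost.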
In the following paragraph, an in-depth analysis of the high cost associated with the reduction step on the NVIDIA architecture is provided. On the CPU, reductions are implemented as simple for-loops without race conditions. In contrast, on the GPU, atomic add operations are employed to ensure thread safety during concurrent updates, see Section 2.4. As a result, the performance of the reduction kernel is strongly influenced by the underlying particle-to-cell ratio. If particles are small relative to the grid cell size, updates from different threads primarily target disjoint memory locations, thereby minimizing atomic contention. In this regime, the cost of atomic operations remains low. Conversely, when a small number of large particles occupy multiple cells, concurrent updates to the same memory address increase, resulting in heightened contention, longer latencies, and a significantly higher kernel run time. To analyze the reduction cost on the MN5-ACC node, it is essential to consider the domain decomposition. On the MN5-ACC node, the simulation domain is decomposed into four blocks in the z-direction, each assigned to a separate GPU, see Table 3. The lowest block, assigned to “GPU0”, is densely populated with particles, whereas the uppermost block, assigned to “GPU3”, primarily contains fluid and the bucket wall segment, see Figure 1 for the numerical setup. A comparison of the reduction performance between GPU0 and GPU3 was performed using NVIDIA’s Nsight Compute profiler. Interestingly, GPU0 executes nearly eight times as many atomic add operations as GPU3, yet requires only 2.26 ms per kernel invocation compared to 11.96 ms on GPU3. This behavior is further reflected in the measured memory throughput, which reaches 65.49 GB/s on GPU0 but drops to 2.71 GB/s on GPU3. The underlying cause is the disparity in memory stalls: for GPU0, 66.6 stalled cycles are reported on average waiting for memory, whereas 810.6 stalled cycles were measured on GPU3.
These findings highlight that the high cost of reductions on GPUs arises from the use of atomic add operations, which induce memory contention, high latency, and reduced memory throughput when thread conflicts increase. An important insight from this study is that, for fully-resolved spherical particles with 20 fluid cells per spatial dimension, as used in the present setup, the reduction remains manageable. In this case, only a moderate number of threads reduce into the same memory address, thereby avoiding an all-to-one reduction. However, when individual large particles, such as the bucket wall segment in this configuration, are involved, the atomic add-based reduction can become a significant performance bottleneck. This observation underscores the need for further optimization of reduction strategies on GPUs, particularly for highly inhomogeneous particle distributions.
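The contention pattern described above can be made concrete with a serial stand-in for the atomic add-based reduction: every cell covered by a body adds its force to that body's accumulator, and the number of cells per accumulator indicates how many GPU threads would compete for the same address. This is a minimal sketch, not the implementation used in this work; the function names, the one-dimensional strip of cells, and the unit forces are hypothetical.

```python
from collections import Counter


def scatter_add(cell_owner, cell_force):
    """Accumulate per-cell forces into per-body totals (serial stand-in for atomic add)."""
    totals = {}
    for owner, force in zip(cell_owner, cell_force):
        if owner is None:  # pure fluid cell, nothing to reduce
            continue
        totals[owner] = totals.get(owner, 0.0) + force
    return totals


def contributions_per_accumulator(cell_owner):
    """Number of cells (GPU threads) that update each body's accumulator."""
    return Counter(owner for owner in cell_owner if owner is not None)


# Hypothetical strip of 12 cells: two small particles and one wall segment.
owners = ["p0", "p0", None, "p1", "p1", None,
          "wall", "wall", "wall", "wall", "wall", "wall"]
forces = [1.0] * len(owners)
print(scatter_add(owners, forces))  # {'p0': 2.0, 'p1': 2.0, 'wall': 6.0}
print(contributions_per_accumulator(owners).most_common(1))  # [('wall', 6)]
```

In three dimensions the imbalance is far larger: a sphere resolved with 20 cells per diameter covers a few thousand cells, whereas a wall segment spanning the block can cover orders of magnitude more, all reducing into one accumulator.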
The evaluation routine for post-processing is performed on the CPU and, therefore, requires transferring the entire fluid field data from GPU to CPU. Although executed only once every 500 time steps, this operation constitutes a significant portion of the total run time, especially on the MN5-ACC node. Evaluation routines can become a bottleneck on GPU nodes, especially if they are executed on the CPU, as these evaluations require GPU-CPU memory transfers, and their run time accounts for a higher fraction of the overall run time since the overall run time on GPU nodes is lower. Load imbalances emerge due to heterogeneous domain compositions, where certain regions contain only fluid, while others are densely packed with particles, leading to highly variable workloads.
To analyze the fluid algorithm performance in detail, a Roofline performance model (Hager and Wellein, 2010) is employed in the following, estimating an upper bound for the achievable performance on each architecture. For assessing which parts of the fluid kernel cause which degradation from the Roofline performance, four different variations of Algorithm 3 are defined with increasing computational intensity: • • • •
Workload and resulting best-case run time according to the Roofline performance model for Kernels #1 to #4.
The best-case run time can be computed using the Roofline model as:
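A common way to write the Roofline bound is given below. The symbols are chosen here for illustration and are assumptions: $W$ denotes the total FLOP, $P_{\text{peak}}$ the peak FLOPS of the node, $V$ the total data transfer volume in bytes, and $B$ the measured main memory bandwidth.

```latex
T_{\min} = \max\left( \frac{W}{P_{\text{peak}}},\; \frac{V}{B} \right)
```

For a memory-bound kernel the second term dominates, so that $T_{\min} = V / B$.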
Notably, in all cases, the code exhibits a memory-bound behavior, i.e., the minimal achievable run time is determined by dividing the total memory transfer volume by the measured main memory bandwidth, see Table 1. The total FLOP are computed as: • • • •
The FLOP per cell numbers are obtained from the source code. The data transfer volume is computed as: • • •
Figure 4 illustrates the achieved Roofline performance for Kernels #1 to #4 with the ideal efficiency highlighted by a dashed blue line.
Figure 4. Fraction of Roofline performance achieved by Kernels #1 to #4. The ideal efficiency is highlighted by a dashed blue line.
As the FLOP per byte increase from Kernel #1 to #4, the fraction of the achieved Roofline performance tends to decrease on all architectures, with the exception of the AMD EPYC CPU. This trend of reduced bandwidth utilization for increased FLOP per byte is generally expected, as less efficient issuing of memory instructions leads to reduced efficiency in memory utilization. The most significant drop can be observed from Kernel #3 to #4. As Kernel #4 introduces thread divergence, a significant reduction of memory efficiency is expected, especially on the GPU-based partitions. Thus, the MI250X exhibits the lowest memory efficiency for Kernel #4. In contrast, the NVIDIA H100 handles the complexity of Kernel #4 significantly better. Overall, the NVIDIA H100 demonstrates the best memory utilization of all considered architectures.
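The fraction of Roofline performance discussed above is the ratio of the best-case run time to the measured run time. The following sketch shows this calculation under the standard Roofline assumptions; the function names and all numbers in the example are hypothetical and do not correspond to the measured kernels.

```python
def roofline_time(flop, bytes_moved, peak_flops, bandwidth):
    """Best-case run time: the slower of the compute limit and the memory limit."""
    return max(flop / peak_flops, bytes_moved / bandwidth)


def roofline_fraction(measured_time, flop, bytes_moved, peak_flops, bandwidth):
    """Share of the Roofline performance achieved (1.0 corresponds to ideal)."""
    return roofline_time(flop, bytes_moved, peak_flops, bandwidth) / measured_time


# Hypothetical kernel: 2e11 FLOP and 4e11 bytes on a node with
# 1e13 FLOP/s peak performance and 4e11 B/s memory bandwidth.
t_min = roofline_time(2e11, 4e11, 1e13, 4e11)
print(t_min)  # 1.0 (memory-bound: 0.02 s compute limit vs 1.0 s memory limit)
print(roofline_fraction(1.25, 2e11, 4e11, 1e13, 4e11))  # 0.8
```

In this example the compute limit is fifty times smaller than the memory limit, which mirrors the memory-bound behavior reported for all four kernels.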
3.4. Simulation setups for scaling analysis
The scaling setups are based on the configuration used for the single-node performance analysis, as shown in Figure 5 (middle), with the number of nodes ranging from 1 to 512. However, on MN5-ACC, the scaling is limited to 100 nodes due to system restrictions. The baseline setup used throughout the scaling study employs the same physics model as Kernel #4.
To perform the strong scaling analysis, the problem size remains fixed while the number of compute nodes increases. As a result, the workload per node decreases. Ideally, the total run time should decrease proportionally to the number of nodes used. In this study, strong scaling results are presented for three different problem sizes. The small problem size consists of 448 × 224 × 896 fluid cells and is scaled from one to eight compute nodes, resulting in the setup shown in Figure 5 (left). The medium problem size comprises 896 × 448 × 1792 fluid cells and is scaled from eight to 64 nodes. The large problem size has 1792 × 896 × 3584 fluid cells and is scaled from 64 to 512 nodes.
In weak scaling, the problem size increases proportionally with the number of compute nodes, keeping the workload per node constant. Ideally, the total run time remains constant regardless of the number of nodes used. The weak scaling analysis starts with the setup shown in Figure 5 (middle) computed on a single node. The number of nodes is successively doubled, starting at one and ending at 512 nodes. At the same time, the domain size is doubled in all three spatial dimensions up to 8 × 8 × 8 = 512 times the initial domain. Specifically, for eight nodes, the domain size is doubled in all three directions once, resulting in the setup shown in Figure 5 (right). It is important to note that only the domain size and number of particles are scaled, while the bucket wall segment width remains unchanged.
Figure 5. Numerical setup on (left) eight nodes in small strong scaling; (middle) a single node; (right) eight nodes in weak scaling. Black lines indicate the domain partitioning among the nodes. The visualization of the fluid phase is omitted for clarity.
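The per-node workload implied by the three strong scaling series can be checked with a few lines of arithmetic. The sketch below uses the problem sizes and node ranges stated above and assumes, for simplicity, that the fluid cells are distributed evenly among the nodes; the dictionary and function names are hypothetical.

```python
def cells(nx, ny, nz):
    """Total number of fluid cells of a Cartesian grid."""
    return nx * ny * nz


# (problem size, first node count, last node count) as given in the text
series = {
    "small": ((448, 224, 896), 1, 8),
    "medium": ((896, 448, 1792), 8, 64),
    "large": ((1792, 896, 3584), 64, 512),
}


def cells_per_node(name):
    """Fluid cells per node at the first and last node count of a series."""
    size, first, last = series[name]
    total = cells(*size)
    return total // first, total // last


for name in series:
    print(name, cells_per_node(name))
# small (89915392, 11239424)
# medium (89915392, 11239424)
# large (89915392, 11239424)
```

Under this assumption, all three series start at the same per-node workload of about 90 million fluid cells and end at one eighth of it, which makes their parallel efficiencies directly comparable.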
3.5. Scaling analysis
The scaling analysis section explores the system’s performance for different problem sizes, focusing on both strong and weak scaling approaches.
3.5.1. Strong scaling
The results of the strong scaling analysis for the three problem sizes are presented in Figures 6 and 7. The ideal scaling behavior is highlighted by a dashed blue line in both figures.
Figure 6. Strong scaling for three problem sizes up to 512 nodes.
Figure 7. Strong scaling parallel efficiency for three problem sizes up to 512 nodes.

A similar strong scaling behavior across all three problem sizes was observed. In each case, the time-to-solution decreased as the number of nodes increased. The parallel efficiency on CPU-based nodes was higher than that on accelerated nodes. Specifically, when doubling the number of nodes three times, the parallel efficiency ranged from 40% on LUMI-G to 90% on LUMI-C. Nevertheless, as expected based on the comparison of time-to-solution on a single node in Figure 2, the GPU partitions were significantly faster than the CPU partitions. For all three problem sizes, the GPU-based systems at their initial node count were faster than the CPU-based systems at their final node count, e.g., one GPU node was faster than eight CPU nodes for the small problem size.
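To relate the reported efficiencies to speedups, the usual definition of strong scaling parallel efficiency can be used: the run time at the reference node count, scaled by the ratio of node counts, divided by the measured run time. The sketch below assumes this standard definition; the function names are hypothetical.

```python
def strong_scaling_efficiency(t_ref, n_ref, t, n):
    """Parallel efficiency relative to the reference run on n_ref nodes."""
    return (t_ref * n_ref) / (t * n)


def speedup_from_efficiency(efficiency, n_ref, n):
    """Speedup over the reference run implied by a given parallel efficiency."""
    return efficiency * n / n_ref


# Three doublings of the node count (factor 8) at the reported efficiencies.
print(speedup_from_efficiency(0.90, 1, 8))  # 7.2
print(speedup_from_efficiency(0.40, 1, 8))  # 3.2
```

With eight times the nodes, an efficiency of 90% thus corresponds to a 7.2-fold speedup, whereas 40% corresponds to only a 3.2-fold speedup.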
3.5.2. Weak scaling
The results of the weak scaling analysis are presented in Figures 8 and 9. The ideal scaling behavior is highlighted by a dashed blue line in both figures.
Figure 8. Weak scaling up to 512 nodes.
Figure 9. Weak scaling parallel efficiency up to 512 nodes.

The weak scaling behavior of LUMI-C is similar to MN5-GPP, and LUMI-G is similar to MN5-ACC. However, the CPU partitions differ from the GPU partitions. The overall time-to-solution was significantly lower on the accelerated nodes compared to the general-purpose nodes. Run time and parallel efficiency saturated on the CPU partitions LUMI-C and MN5-GPP. In contrast, no saturation was observed on LUMI-G. The parallel efficiency on CPU-based systems was higher than on LUMI-G for larger node counts. For 512 nodes, the parallel efficiency ranged from 64% on LUMI-G to 92% on LUMI-C. In contrast to the GPU partitions, the communication cost was less significant on CPU partitions since the overall computation took longer; as a result, the parallel efficiency was higher. The parallel efficiency on the CPU partitions behaves as expected. At small node counts, a decrease in parallel efficiency is observed due to the increased cost of inter-process communication. As the number of nodes grows, the parallel efficiency saturates, since the communication pattern is dominated by nearest-neighbor exchanges. In contrast, two noteworthy features can be observed for the GPU partitions in Figure 9. First, on MN5-ACC up to four nodes, a superlinear scaling behavior is present, i.e., a parallel efficiency
This paragraph provides more details on how physical variations of the setup and optimization techniques affect the weak scaling parallel efficiency on LUMI-G reported in Figure 9, which will be denoted as “base case” in the following. The following variations from the base case are considered: increasing the bucket wall width during the scaling, removing particle dynamics, and removing the bucket wall segment, which results in porous media. Furthermore, communication hiding for the fluid field communication between neighboring blocks is implemented. Figure 10 compares the LUMI-G weak-scaling efficiency for these setup variations and optimization techniques.
Figure 10. Weak scaling parallel efficiency on LUMI-G for different setups and code optimizations.
When the bucket wall width was included in the scaling, the parallel efficiency decreased significantly. In contrast, the parallel efficiency increased incrementally for the other setup variations and optimizations applied, with the most significant improvement occurring when the bucket wall was removed. With all setup simplifications and optimizations applied, the parallel efficiency can reach up to 89% on 512 nodes. The reduction of the forces on the bucket wall has a significant influence on the scaling because a large number of threads reduce into a single variable. The remaining gap in parallel efficiency between 89% and 100% can be attributed to increasing communication costs that cannot be fully hidden after a certain point. Furthermore, load imbalances arose due to higher particle concentrations in deeper layers of the sediment bed, caused by increasing geomechanical stresses.
3.6. Discussion and lessons learned
The comparison between CPU and GPU partitions on a single node revealed notable differences in both time and energy-to-solution, with GPU partitions outperforming CPU partitions in both metrics. This was largely attributed to the significantly more powerful hardware of the GPUs. However, the gain in energy was smaller than the gain in time: the energy-to-solution of the accelerated nodes was on average 82% lower than on the general-purpose nodes, while the time-to-solution was reduced by 92%. Such trends have been observed in the literature, where a single-GPU setup showed a 1.77× improvement in time-to-solution over a CPU, but only a 1.49× improvement in energy-to-solution, a discrepancy attributed to the higher power consumption of the GPU (Cadenelli et al., 2019). Other studies explain similar observations by the presence of host CPUs in GPU nodes, which still draw a significant amount of power, even if these CPUs are little used during the computation (Calore et al., 2016).
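The reported average reductions can be converted into improvement factors, and their ratio gives a rough estimate of the average power draw of the accelerated nodes relative to the general-purpose nodes, since power equals energy divided by time. The sketch below is a back-of-the-envelope calculation with hypothetical function names; because both percentages are averages over partitions, the resulting power ratio is only indicative.

```python
def improvement_factor(relative_reduction):
    """Convert a relative reduction (e.g., 0.92 for 92% lower) into a factor."""
    return 1.0 / (1.0 - relative_reduction)


def implied_power_ratio(time_reduction, energy_reduction):
    """Average power of the faster node relative to the slower one."""
    return (1.0 - energy_reduction) / (1.0 - time_reduction)


print(round(improvement_factor(0.92), 2))         # 12.5 (time-to-solution)
print(round(improvement_factor(0.82), 2))         # 5.56 (energy-to-solution)
print(round(implied_power_ratio(0.92, 0.82), 2))  # 2.25
```

A time-to-solution improvement of about 12.5× combined with an energy-to-solution improvement of about 5.6× therefore implies that the accelerated nodes draw roughly twice the average power during the run.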
It is important to note that, despite the superior performance of GPU nodes in terms of time, the overall hardware costs of GPU-based nodes exceed those of CPU-based nodes by several multiples. As a result, when considering the cost-effectiveness of the GPU nodes, one must balance the gains in performance against the higher initial hardware investment.
Additionally, reduction operations on GPUs are inherently more challenging than on CPUs, largely due to the need for efficient synchronization across numerous parallel threads. Using warp-level reductions before the atomic add for computing hydrodynamic forces on the bucket wall segment, thereby reducing each warp to a single atomic operation, resulted in a performance speedup on the H100 node. However, this approach led to slower reductions on the MI250X. Despite these differences in reduction speed, the choice of reduction method had no significant effect on parallel efficiency in the weak scaling tests.
When comparing the performance of the accelerated nodes, note that the H100 nodes only have four GPUs compared to the MI250X nodes, which are equipped with eight GCDs. Furthermore, evaluation routines can significantly impact the performance of GPU nodes, as communication between CPU and GPU, input and output operations, and serialization can account for a substantial portion of the overall run time. These factors should be considered when analyzing GPU performance in real-world applications. To optimize performance, it is essential to minimize GPU-CPU memory transfers. Therefore, evaluation routines, which are often computed on the CPU, should be analyzed for potential execution on the GPU as well to prevent them from becoming a bottleneck.
It was found that obtaining good performance on the AMD MI250X GPU was more challenging than on the NVIDIA H100. The MI250X exhibited significantly lower main memory bandwidth, particularly when thread divergence occurred, which was consistent with findings in the literature (Lehmann et al., 2022). Moreover, advanced API functionality, such as unified memory, was less performant in HIP than in CUDA, further complicating optimization efforts. Another critical consideration for GPU performance was register spilling, particularly when utilizing techniques like common subexpression elimination (CSE). While CSE reduced redundant computations, it increased register pressure, leading to excessive spilling to local memory, a problem that was less pronounced in CPU architectures.
In the strong scaling analysis, a significant decrease in parallel efficiency was observed, which was more pronounced on GPU-based nodes. The decrease in parallel efficiency was caused by the problem size eventually becoming too small to effectively utilize the computational resources, while communication overhead increases. This effect is expected to be more pronounced on GPU-based systems, as they tend to be less efficient for small workloads. These findings align with those in recent literature (Karp et al., 2023; Kemmler et al., 2025d; Min et al., 2024). Similar behavior was observed in weak scaling, where the parallel efficiency of GPU nodes decreased more significantly. The performance of GPU systems in scaling was also closely linked to the physical setup, with small changes in configuration or workload distribution potentially leading to significant differences in performance, especially at large node counts.
Future work will focus on mitigating the observed load imbalance and memory contention during the reduction of hydrodynamic forces and torques on the bucket wall segment. In particular, load balancing techniques are to be explored, ranging from static approaches, suitable when sediment transport or fluidization remains moderate, to dynamic methods such as physics-aware domain decomposition. These approaches will assign subdomain weights based on local particle concentration and bucket-wall surface area, with optional dynamic remapping at runtime to maintain balanced workloads as sediment distribution evolves. On the GPU side, performance improvements due to employing more advanced reduction techniques beyond simple atomic add are to be studied in more detail. This includes privatized per-warp or per-block accumulators with shared-memory or tree-based reductions prior to a single global update. Additionally, the necessity for the computation of hydrodynamic forces on the bucket wall segment should be carefully assessed. When such forces are not strictly necessary for capturing the relevant physics due to minimal movement, they may be selectively omitted or approximated using surrogate models to trade negligible accuracy loss for substantial performance gains.
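The idea of privatized per-warp accumulators with a tree-based reduction prior to a single global update can be illustrated with a serial model. The sketch below is not GPU code and not the implementation used in this work; it only counts how many global updates remain when each warp first combines its contributions. The function names and the example data are hypothetical, all threads of a warp are assumed to target the same accumulator, and the warp sizes of 32 and 64 are the values commonly associated with NVIDIA warps and AMD wavefronts.

```python
def tree_reduce(values):
    """Pairwise (tree) sum, mimicking a reduction among the threads of one warp."""
    values = list(values)
    while len(values) > 1:
        if len(values) % 2:  # pad odd lengths with the additive identity
            values.append(0.0)
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0] if values else 0.0


def reduce_with_warp_accumulators(contributions, warp_size=32):
    """Return (total, global_updates) when each warp issues one global update."""
    total, global_updates = 0.0, 0
    for start in range(0, len(contributions), warp_size):
        # One "atomic add" per warp instead of one per thread.
        total += tree_reduce(contributions[start:start + warp_size])
        global_updates += 1
    return total, global_updates


# Hypothetical wall segment covered by 4096 cells, each contributing a force of 0.5.
forces = [0.5] * 4096
print(reduce_with_warp_accumulators(forces, warp_size=32))  # (2048.0, 128)
print(reduce_with_warp_accumulators(forces, warp_size=64))  # (2048.0, 64)
```

The number of updates to the contended address drops by the warp size, at the price of additional intra-warp work; whether this trade-off pays off depends on the architecture, as the differing results on the H100 and the MI250X indicate.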
4. Conclusion
Fully-resolved coupled fluid-particle simulations represent a powerful approach to gain insights into complex multiphysics systems such as suction bucket foundations for offshore wind turbines. Nonetheless, the computational requirements of these simulations are substantial due to their high resolution and, therefore, the use of highly efficient and scalable implementations on modern supercomputing architectures becomes necessary. This work has provided a detailed description of the algorithmic characteristics of the dominant modules of this granular fluidization application along with a comprehensive performance analysis on four distinct hardware architectures, namely AMD EPYC 7763, AMD MI250X, Intel Sapphire Rapids 8480+, and NVIDIA H100. The performance was quantified in terms of time and energy-to-solution, utilizing a Roofline performance model for the fluid simulation, and evaluated under strong and weak scaling scenarios on up to 512 nodes. The obtained results have shown that:
• GPU systems outperform CPU systems both in time and energy-to-solution, as expected, but the energy savings were less pronounced compared to the reduction in time-to-solution.
• Reaching the maximum main memory bandwidth becomes more challenging with increasing kernel complexity in terms of FLOP per byte and branching. The NVIDIA H100 achieved the best main memory efficiency. The AMD MI250X exhibited the lowest main-memory efficiency, but used the lowest energy-to-solution.
• The implemented model demonstrated its potential for good scalability on modern supercomputing architectures, with parallel efficiencies reaching up to 92% in a weak scaling test on LUMI-C.
• Although the GPU systems show a superior performance for compute-intensive kernels, they are also more susceptible to scalability degradation from evaluation routines, communication overhead, and reduction operations under certain conditions, for instance when the reduced portion of the domain represents a substantial fraction of the whole domain, as observed for the bucket wall segment.
• Despite the lower energy-to-solution on accelerated nodes, the significantly higher hardware cost of accelerated nodes and reduced parallel efficiency might result in an overall higher cost-to-solution for large-scale simulations on GPU-based systems when both hardware and energy costs are considered.
• Parallel efficiency is closely linked to the physical setup, with small modifications potentially causing significant performance differences at large node counts.
The obtained results underscore the importance of application-specific performance evaluation rather than relying solely on benchmark applications.
Footnotes
Acknowledgements
We acknowledge the EuroHPC Joint Undertaking for awarding this project access to the EuroHPC supercomputer LUMI, hosted by CSC (Finland) and the LUMI consortium through a EuroHPC Development Access call. We acknowledge the EuroHPC Joint Undertaking for awarding this project access to the EuroHPC supercomputer MareNostrum 5, hosted by BSC (Spain) through a EuroHPC Development Access call.
Author contributions
S. Kemmler: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data Curation, Writing - Original Draft, Visualization, Project administration; A. Artinov: Conceptualization, Writing - Review and Editing; P. Cuéllar: Conceptualization, Writing - Review and Editing; H. Köstler: Resources, Writing - Review and Editing, Supervision, Funding acquisition.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work has received funding from the European High Performance Computing Joint Undertaking (JU) and Sweden, Germany, Spain, Greece, and Denmark under grant agreement No 101093393.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
Data is available on Zenodo: https://doi.org/10.5281/zenodo.15063619 (Kemmler et al., 2025a).
