Abstract
Looking for high-performance hydrocode simulations on heterogeneous architectures, we detail a performance-portable implementation of a second-order accurate 2-D Cartesian explicit CFD solver using Julia’s Just-in-Time (JIT) compilation. In this work, a custom abstraction layer targets two Julia packages: Polyester.jl for efficient shared-memory multithreading on CPUs and KernelAbstractions.jl for the appropriate backends on GPUs. Using the same optimizations and data structures as in the Julia version, comparisons to static C++ Kokkos compilation are then provided, including speedups and energy consumption on high-end CPUs and GPUs available in mid-2022. Using a single 64-core CPU with a few million cells, which benefit from cache effects in multithreaded mode, the Julia code (≈0.5 × 10⁹ cell-cycles/s) outperforms its C++ Kokkos counterpart, with the same lower limit (≈0.16 × 10⁹ cell-cycles/s) for larger numbers of cells. Using one GPU, the C++ Kokkos implementation is slightly superior, the Julia implementation tending to the same upper limit (≈1.5 × 10⁹ cell-cycles/s) when the GPU memory (40 GiB) is fully used. With a small number of floating-point operations per cell and time step, Cartesian solvers are singular in the CFD landscape: such solvers are essentially memory-bandwidth bound on both CPUs and GPUs. In this context, at the compute-node level, the compute capability of the CPU(s) should not be underestimated, with (much) more memory available per cell for multi-physics variables and, year over year, improved memory bandwidths, larger caches and higher floating-point capabilities. Indeed, for high-performance computing (HPC) simulations involving many MPI processes, communications between compute nodes become significant, and best efforts are required to overlap communications with computations.
The performance-portable Julia implementation of the CFD solver presented here combines domain decomposition and directional splitting using a static scheduling approach. Benefits from asynchronous communications appear with 16 GPUs on 4 nodes. At best, on this small-size configuration, the GPU mode of the performance-portable Julia code brings, at full GPU memory capacity, a factor of 14× in performance and a factor of 8× in device energy efficiency compared to the CPU mode. This work, among others, confirms the potential of the Julia programming language and its emerging HPC software stack, offering (i) the power of a scripting language, (ii) the performance of a compiled language, and, perhaps even more importantly, (iii) access to a compilation toolchain with new opportunities for developers to tackle heterogeneous computing architectures.
