Abstract
Looking for high-performance hydrocode simulations on heterogeneous architectures, we detail a performance-portable implementation of a second-order accurate 2-D Cartesian explicit CFD solver using Julia’s Just-in-Time (JIT) compilation. In this work, a custom abstraction layer targets two Julia packages: Polyester.jl for efficient shared-memory multithreading on CPUs and KernelAbstractions.jl for the appropriate backends on GPUs. Using the same optimizations and data structures as in the Julia version, comparisons to static C++ Kokkos compilation are then provided, including speedups and energy consumption on high-end CPUs and GPUs available in mid-2022. Using a single 64-core CPU with a few million cells, which benefit from cache effects in multithreaded mode, the Julia code (≈0.5 × 10⁹ cell-cycles/s) outperforms its C++ Kokkos counterpart, with the same lower limit (≈0.16 × 10⁹ cell-cycles/s) for larger numbers of cells. Using one GPU, the C++ Kokkos implementation is slightly superior, the Julia implementation tending to the same upper limit (≈1.5 × 10⁹ cell-cycles/s) when the GPU memory (40 GiB) is fully used. With a small number of floating-point operations per cell and time step, Cartesian solvers are singular in the CFD landscape: such solvers are essentially memory-bandwidth bound on both CPUs and GPUs. In this context, at the compute-node level, the compute capability of the CPU(s) should not be underestimated, with (much) more memory available per cell for multi-physics variables and, year over year, improved memory bandwidths, larger caches and higher floating-point capabilities. Indeed, for high-performance computing (HPC) simulations involving many MPI processes, communications between compute nodes become significant, and best efforts are required to overlap communications with computations.
The performance-portable Julia implementation of the CFD solver presented here combines domain decomposition and directional splitting using a static scheduling approach. Benefits from asynchronous communications appear with 16 GPUs on 4 nodes. At best, on this small-size configuration, the GPU mode of the performance-portable Julia code brings, at full GPU memory capacity, a factor of 14× in performance and a factor of 8× in device energy efficiency compared to the CPU mode. This work, among others, confirms the potential of the Julia programming language and its emerging HPC software stack, offering (i) the power of a scripting language, (ii) the performance of a compiled language, and, perhaps even more importantly, (iii) access to a compilation toolchain with new opportunities for developers to tackle heterogeneous computing architectures.
