Abstract
A high-performance implementation of a multiphase lattice Boltzmann method based on the conservative Allen-Cahn model supporting high-density ratios and high Reynolds numbers is presented. Meta-programming techniques are used to generate optimized code for CPUs and GPUs automatically. The coupled model is specified in a high-level symbolic description and optimized through automatic transformations. The memory footprint of the resulting algorithm is reduced through the fusion of compute kernels. A roofline analysis demonstrates the excellent efficiency of the generated code on a single GPU. The resulting single GPU code has been integrated into the multiphysics framework waLBerla to run massively parallel simulations on large domains. Communication hiding and GPUDirect-enabled MPI yield near-perfect scaling behavior. Scaling experiments are conducted on the Piz Daint supercomputer with up to 2048 GPUs, simulating several hundred fully resolved bubbles. Further, validation of the implementation is shown in a physically relevant scenario—a three-dimensional rising air bubble in water.
1. Introduction
The numerical simulation of multiphase flow is a challenging field of computational fluid dynamics (see Prosperetti and Tryggvason, 2007). Although a wide variety of different approaches have been developed, simulating the dynamics of immiscible fluids with high density ratios and at high Reynolds numbers is still considered complicated (see Huang et al., 2015). Such multiphase flows require models for the interfacial dynamics (see Yan et al., 2011). A full resolution of these phenomena is usually impractical for macroscopic CFD techniques since the interface is only a few nanometers thick, as pointed out by Fakhari et al. (2017b).
Therefore, sharp interface techniques model the interface as two-sided boundary conditions on a free surface and can thus achieve a discontinuous transition (see Bogner et al., 2015; Körner et al., 2005; Thürey et al., 2009). The modeling and implementation of these boundary conditions can be complicated as stated by Bogner et al. (2016), especially in a parallel setting. Diffuse-interface models, in contrast, represent the interface in a transition region of a finite thickness that is typically much wider than the true physical interface (see Anderson et al., 1998). Thus, the sharp interface between the fluids is replaced by a smooth transition with thickness
Typically, the simulation domain is discretized with a Cartesian grid for 3D LBM simulations. Due to its explicit time-stepping scheme and its high data locality, the LBM is well suited for extensive parallelization (see Bauer et al., 2020a). For the simulation of multiphase flows, however, additional force terms are needed that rely on non-local data points, which degrades this otherwise ideal locality somewhat. Fakhari et al. have shown how to improve the locality for the conservative ACM (see Fakhari et al., 2017b). In order to resolve physically relevant phenomena, a sufficiently high resolution at the interface between the fluid phases is necessary. Thus we may need simulation domains containing billions of grid points and beyond. This creates the need for high-performance computing (HPC) systems and for highly parallelized and run-time-efficient algorithms. However, highly optimized implementations in low-level languages like Fortran or C often suffer from poor flexibility and extensibility, so that algorithmic changes and the development of new models can become tedious.
To overcome this problem, we use a workflow where the compute-intensive kernels are generated at compile time with a code generator realized in Python. Thus we obtain the highest possible performance while maintaining maximal flexibility (see Bauer et al., 2019). Furthermore, the symbolic description of the complete ACM allows us to use IPython (see Pérez and Granger, 2007) as an interactive prototyping environment. Thus, changes in the model, like additional terms, different discretization schemes, different versions of the LBM, or different LB stencils, can be incorporated directly on the level of the defining mathematical equations. These equations can be rendered in LaTeX form. The generated code can then run in parallel with OpenMP on a single CPU or on a single GPU. Once a working prototype has been created in this mode, the automatically generated code kernels can be integrated into existing HPC software as external C++ files. In this way, we can execute massively parallel simulations. Note that this workflow permits describing physical models in symbolic form while still running with maximal efficiency on parallel supercomputers. Using code generation, we realize a higher level of abstraction and thus an improved separation of concerns.
The remainder of the article is structured as follows. In Section 2, we summarize work related to the conservative ACM. In Section 3, we introduce the governing equations of the conservative ACM. Section 4 presents details of the implementation, first introducing the code generation toolkit lbmpy by Bauer et al. (2020b), which constitutes the basis of our implementation. Then, we present the phase-field algorithm itself in a straightforward and an improved form, where the improvements essentially lie in the minimization of the memory footprint. This is primarily achieved by changing the structure of the algorithm in order to be able to fuse several compute kernels. The performance of our implementation is discussed in Section 5. We first show a comparison of the straightforward and the improved algorithm. Then the performance of the improved version is analyzed on a single GPU with a roofline approach. After that, the scaling behavior on up to 2048 GPUs is examined in a weak scaling benchmark on the Piz Daint supercomputer. Section 6 validates the implementation with a single rising air bubble in water and a large bubble field, before Section 7 concludes the article.
2. Related work
The interface tracking in this work is carried out with the Allen-Cahn equation (ACE) (see Allen et al., 1976). A modification of the ACE to a phase-field model was proposed by Sun and Beckermann (2007). Nevertheless, it is the Cahn-Hilliard theory (see Cahn and Hilliard, 1958) that is most often used in phase-field models to perform the interface tracking. A reason for this is the implicit conservation of the phase-field and thus the conservation of mass. As a drawback, it includes fourth-order spatial derivatives, which worsens the locality of the LB framework, as pointed out by Geier et al. (2015a). In order to make the ACE accessible for phase-field models, Pao-Hsiung and Yan-Ting (2011) presented it in conservative form. Furthermore, the conservative ACE contains only second-order derivatives, which allows a more efficient implementation.
In the work of Geier et al. (2015a), the conservative ACE was first solved using a single relaxation time (SRT) algorithm. Additionally, they proposed an improvement of the algorithm by solving the collision step in central moment space and adapting the equilibrium formulation, which makes it possible to directly calculate the gradient of the phase-field locally via the moments. This promising approach, however, leads to a loss of accuracy (see Fakhari et al., 2019). On the other hand, Fakhari et al. (2017b) used an SRT formulation to solve the conservative ACE with isotropic finite differences (see Kumar, 2004) to compute the curvature of the phase-field. This approach was later extended by Mitchell et al. (2018a) to the three-dimensional case. A disadvantage of this approach is that it becomes complicated to apply single-array streaming patterns (see Wittmann et al., 2016), such as the AA-pattern or the Esoteric Twist, as stated by Geier et al. (2015a). This is due to a newly introduced non-locality in the update process, originating in the finite difference calculation (see Geier et al., 2015a). In this publication, the phase-field LBE is presented with a multiple relaxation time (MRT) formulation, which was first published by Ren et al. (2016). This formulation was also used in recent studies by Dinesh Kumar et al. (2019).
3. Model description
3.1. LB model for interface tracking
As described by Fakhari et al. (2017b), the phase-field
where
is related to the phase-field relaxation time
In the equilibrium state the profile of the phase-field
The LB model for equation (1) to update the phase-field distribution function hi can be written as (see Geier et al., 2015a)
in which the forcing term is given by (see Fakhari et al., 2017b)
In equation (4)
is the dimensionless distribution function. By taking the zeroth moment of the phase-field distribution functions the phase-field
The density
where
3.2. LB model for hydrodynamics
In a macroscopic form the continuity and the incompressible Navier-Stokes equations describe the evolution of a flow field and can be written as
where
where the hydrodynamic forcing is given by
and gi is the velocity-based distribution function for incompressible fluids. The equilibrium distribution function is
where
The pressure force can be obtained as
where the normalized pressure is calculated as the zeroth moment of the hydrodynamic distribution function
The surface tension force
is the product of the chemical potential
and the gradient of the phase-field. The coefficients
where the viscosity
There are a few different ways to interpolate the hydrodynamic relaxation time, as shown in the work of Fakhari et al. (2017b). Overall, they demonstrated that a linear interpolation yields the most stable results. Therefore, we use it in this work
We relax the second-order moments with the hydrodynamic relaxation rate
when solving equation (10) to ensure the correct viscosity of the fluid. All other moments are relaxed with a rate of one. The velocity u is obtained via the first moments of the hydrodynamic distribution function and is shifted by the external forces
In order to approximate the gradient in equations (5), (14), (16) and (18) a second-order isotropic stencil can be applied (see Kumar, 2004; Ramadugu et al., 2013)
The Laplacian in equation (17) can be approximated with
4. Software design for a flexible implementation
4.1. Code generation
Our implementation of the conservative ACM is based on the open-source LBM code generation framework lbmpy (see Bauer et al., 2020b). Using this meta-programming approach, we address the often encountered trade-off between code flexibility, readability, and maintainability on the one hand, and platform-specific performance engineering on the other hand. Especially when targeting modern heterogeneous HPC architectures, a highly optimized compute kernel may require that loops are unrolled, common subexpressions are extracted, and possibly hardware-specific intrinsics are used. In state-of-the-art optimized software (see Hager and Wellein, 2010), these transformations are essential and must be performed manually for each target architecture. Clearly, the resulting codes are time-consuming to develop, error-prone, hard to read, difficult to maintain, and often very hard to adapt and extend. Flexibility and maintainability have been sacrificed, since such complex programming techniques are essential to get the full performance available on the system.
Here, in contrast, we employ the LBM code generation framework lbmpy. Thanks to the automated code transformations, the LB scheme can be specified in a high-level symbolic representation. The hardware- and problem-specific transformations are applied automatically so that starting from an abstract representation, highly efficient C-code for CPUs or CUDA/OpenCL code for GPUs can be generated with little effort.
Our new tool lbmpy is realized as a Python package that in turn is built on the stencil code generation and transformation framework pystencils (see Bauer et al., 2019). The flexibility of lbmpy results from the fully symbolic representation of collision operators and compute kernels, utilizing the computer algebra system SymPy (see Meurer et al., 2017). The package offers an interactive environment for method prototyping and development on a single workstation, similar to what FEniCS (see Alnæs et al., 2015) is in the context of finite element methods. Generated kernels can then be easily integrated into the HPC framework waLBerla, which is designed to run massively parallel simulations for a wide range of scientific applications (see Bauer et al., 2020a). In this workflow, lbmpy is employed for generating optimized compute and communication kernels, whereas waLBerla provides the software structure to use these kernels in large scale scenarios on supercomputers. lbmpy can generate kernels for moment-based LB schemes, namely single-relaxation-time (SRT), two-relaxation-time (TRT), and multiple-relaxation-time (MRT) methods. Additionally, modern cumulant and entropically stabilized collision operators are supported (see Geier et al., 2015b; Karlin et al., 1998).
When implementing the coupled multiphase scheme as described in Section 3 with lbmpy, we can reuse several major building blocks that are already part of lbmpy (see Figure 1). First, we can choose between different single-phase collision operators for the Allen-Cahn and the hydrodynamic LBM. We can easily switch between different lattices, allowing us to quickly explore the accuracy-performance trade-off between stencils with more or fewer neighbors; a short configuration sketch is given after Figure 1. A native 2D implementation is also quickly available by selecting the D2Q9 lattice model.

Flexibility of the conservative ACM with the lbmpy code generation framework. The boxes on the right show the two LB steps. On the left options are shown which can be applied to the two LB steps by lbmpy. The connecting lines show a possible configuration which will be used for the benchmark in this section.
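As a rough illustration of this flexibility, a configuration of the two LB steps might look as follows. The string-based keyword interface shown here is an assumption based on older lbmpy releases; newer versions pass these options via a configuration object, so the exact parameter names may differ.

```python
# Sketch: configuring the two LB steps with lbmpy. The string-based keyword
# interface reflects older lbmpy releases (newer versions use an LBMConfig object),
# so the exact parameter names are an assumption.
from lbmpy.creationfunctions import create_lb_update_rule

# interface-tracking step: SRT collision on a D3Q15 lattice
phase_field_rule = create_lb_update_rule(stencil='D3Q15', method='srt',
                                         relaxation_rate=1.2)   # placeholder rate

# hydrodynamic step: shown with SRT for brevity; 'mrt' or 'cumulant' can be
# selected through the method argument, as done for the benchmarks in Section 5
hydro_rule = create_lb_update_rule(stencil='D3Q27', method='srt',
                                   relaxation_rate=1.6)         # placeholder rate

# a native 2D prototype only requires exchanging the stencil string, e.g. 'D2Q9'
```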
Then, the selected collision operators of lbmpy can be adapted to the specific requirements of the scheme. In our case, we have to add the forcing terms of equations (5) and (13). This is done on the symbolic level, such that no additional arrays for storing these terms have to be introduced, as would typically be the case when extending an existing LB method implemented in C/C++. Also, no additional iteration passes are needed to compute the force terms. The additional forces are computed directly within the loops that update the LB distributions, which significantly reduces memory traffic and operational overhead. Furthermore, since optimization passes like common subexpression elimination, SIMD vectorization via intrinsics, or CUDA index mapping are applied automatically by transformations further down the pipeline, the new force terms are fully included in the optimization. Note how this leads to a clean separation of concerns between model development and optimization, with obvious benefits for code maintainability and flexibility, without sacrificing the possibility to achieve the best possible performance.
On the modeling level, this code generation approach and our tools allow the application developer to express the methods using a concise mathematical notation. LB collision operators are formulated in a so-called collision space spanned by moments or cumulants (see Coreixas et al., 2019). For each moment or cumulant, a relaxation rate and its respective equilibrium value are chosen. For a detailed description of this formalism and its realization in Python, see Bauer et al. (2020b).
Similarly, our system supports the mathematical formulation of differential operators that can be discretized automatically with various numerical approximations of derivatives. This functionality is employed to express the forcing terms. The Python formulation directly mimics the mathematical definition, as shown in equations (23) and (24), i.e. it provides a gradient and a Laplacian operator. The user can then choose between different finite difference discretizations, selecting the stencil neighborhood, approximation order, and isotropy requirements.
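As an illustration of what such a discretization produces, the following standalone SymPy sketch writes out a standard second-order isotropic gradient and Laplacian on a 3×3 neighborhood (see Kumar, 2004); the generated kernels contain equivalent expressions for whichever stencil is selected.

```python
# Sketch: second-order isotropic finite differences on a 3x3 neighborhood, written
# out with SymPy. lbmpy/pystencils generate equivalent expressions automatically
# from the symbolic gradient and Laplacian operators.
import sympy as sp

h = sp.Symbol('h')  # lattice spacing
# phi[(dx, dy)] denotes the phase-field value at the neighbor offset (dx, dy)
phi = {(dx, dy): sp.Symbol(f'phi_{dx}_{dy}') for dx in (-1, 0, 1) for dy in (-1, 0, 1)}

# isotropic gradient: axis neighbors weighted 4, diagonal neighbors weighted 1
grad_x = (4*(phi[1, 0] - phi[-1, 0])
          + (phi[1, 1] - phi[-1, 1]) + (phi[1, -1] - phi[-1, -1])) / (12*h)
grad_y = (4*(phi[0, 1] - phi[0, -1])
          + (phi[1, 1] - phi[1, -1]) + (phi[-1, 1] - phi[-1, -1])) / (12*h)

# isotropic Laplacian: axis neighbors weighted 4, diagonals weighted 1, center -20
laplacian = (4*(phi[1, 0] + phi[-1, 0] + phi[0, 1] + phi[0, -1])
             + (phi[1, 1] + phi[1, -1] + phi[-1, 1] + phi[-1, -1])
             - 20*phi[0, 0]) / (6*h**2)

print(sp.simplify(grad_x), sp.simplify(laplacian))
```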
Starting from the symbolic representations, we create the compute kernels for our application. Knowing the details of the model, in particular the stencil types, at compile time allows the system to simplify expressions, and run common subexpression elimination to reduce the number of floating point operations (FLOPs) drastically.
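The effect of this simplification step can be illustrated with plain SymPy, on which lbmpy is built; the expressions below are small stand-ins for the much larger collision expressions.

```python
# Sketch: common subexpression elimination with SymPy, the mechanism lbmpy uses
# to reduce the FLOP count of the generated collision kernels.
import sympy as sp

rho, ux, uy, w = sp.symbols('rho u_x u_y w')
# two stand-in equilibrium-like expressions sharing the term (u_x**2 + u_y**2)
e1 = w*rho*(1 + 3*ux + sp.Rational(9, 2)*ux**2 - sp.Rational(3, 2)*(ux**2 + uy**2))
e2 = w*rho*(1 + 3*uy + sp.Rational(9, 2)*uy**2 - sp.Rational(3, 2)*(ux**2 + uy**2))

subexpressions, reduced = sp.cse([e1, e2])
for symbol, expr in subexpressions:
    print(symbol, '=', expr)   # shared terms are computed only once
print(reduced)                 # the main expressions reference those temporaries
```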
An overview of the complete workflow, including the combination of lbmpy and waLBerla for MPI-parallel execution, is illustrated in Figure 2. As described above, the creation of the phase-field model is accomplished directly with lbmpy, which forms a convenient prototyping environment since all equations can be stated as symbolic representations. lbmpy not only produces the compute kernels, but also generates the pack and unpack information needed for the MPI communication routines. This is again completely automatic, since the symbolic representations expose all field accesses and thus the data that must be kept in the ghost layers. A ghost layer is a single layer of cells around each subdomain used for the communication between neighboring subdomains. Furthermore, the routines implementing the boundary conditions are generated as well.
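Conceptually, the generated pack and unpack routines only copy boundary slices of a field into contiguous buffers and back; the following NumPy sketch illustrates the idea and is not the generated code.

```python
# Sketch: packing and unpacking a single ghost layer for communication in x-direction.
# waLBerla/lbmpy generate equivalent (and much faster) kernels automatically.
import numpy as np

def pack_west(field):
    """Copy the first interior slice into a contiguous send buffer."""
    return np.ascontiguousarray(field[1, :, :])

def unpack_west(field, buffer):
    """Write the received buffer into the west ghost layer."""
    field[0, :, :] = buffer

field = np.zeros((10, 10, 10))        # one ghost layer on each side included
send_buffer = pack_west(field)        # sent via MPI to the western neighbor
unpack_west(field, send_buffer)       # the neighbor stores it in its ghost layer
```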

Complete workflow of combining lbmpy and waLBerla for MPI parallel execution. Furthermore, lbmpy can be used as a stand-alone package for prototyping.
The complete generation process can be configured to produce C code or code for GPUs with CUDA or, alternatively, OpenCL. These kernels can be directly called as Python functions in an interactive environment or combined with the HPC framework waLBerla.
To demonstrate the usage of lbmpy, we show how the update rule for the hydrodynamic distribution functions gi is realized. Following Mitchell et al. (2018b), equation (10) should be formulated as
where
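Since the corresponding listing is not reproduced here, the following sketch only indicates how such an update rule with a symbolic force term could be assembled; the force symbols are hypothetical placeholders for the total force of equation (13), and the keyword names again follow older lbmpy versions.

```python
# Sketch (not the listing from the text): a velocity-based LB update rule with a
# symbolic force term. F_x, F_y, F_z are hypothetical placeholders standing in for
# the sum of pressure, surface-tension, viscous, and body forces of equation (13).
import sympy as sp
from lbmpy.creationfunctions import create_lb_update_rule

F_x, F_y, F_z = sp.symbols('F_x F_y F_z')

hydro_update_rule = create_lb_update_rule(
    stencil='D3Q27',
    method='srt',                        # the actual scheme uses an MRT operator
    relaxation_rate=sp.Symbol('omega'),  # hydrodynamic relaxation rate
    force=(F_x, F_y, F_z),               # lbmpy inserts its built-in forcing scheme;
)                                        # the scheme of Section 3 adds its terms symbolically
```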
4.2. Algorithm
To discuss how the model of Section 3 can be realized, we will first present a straightforward implementation. The corresponding algorithm is displayed in Algorithm 1 and will be discussed briefly in the following. We start the time loop with time step size

Straightforward algorithm for the conservative ACM.
Based on this straightforward algorithm, we will now outline substantial improvements that can be made. To lower the memory footprint of the phase-field model, we combine the collision and streaming of the phase-field distribution functions and the update of the phase-field into one phase-field LB step. Accordingly, the collision and streaming of the velocity PDFs and the update of the velocity field are combined into one hydrodynamic LB step. In this manner, the phase-field and velocity PDFs, as well as the phase-field and velocity field, are updated in only two instead of six compute kernels. A detailed overview of the proposed algorithm for the conservative ACM is presented in Algorithm 2. In our proposed algorithm, we subdivide each LB step into iterations over an outer and an inner domain, similarly to Feichtinger et al. (2015) but with a freely selectable frame width (see the sketch after Figure 3). This is illustrated in Figure 3. For illustration purposes, the figure shows only the two-dimensional case; the three-dimensional case is completely analogous. As shown, the frame width controls the iteration space of the outer and inner domain. This can be done for all directions independently. In the case illustrated, we have a frame width of four cells in x-direction and two cells in y-direction.

Subdivision of the domain for communication hiding.
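The iteration-space split itself is simple index arithmetic; the following standalone sketch (a hypothetical helper, not waLBerla code) computes the inner region for a given per-direction frame width.

```python
# Sketch: computing inner and outer iteration spaces for a block of a given size.
# frame_width holds the frame thickness per direction, e.g. (4, 2) cells in x and y.
import numpy as np

def inner_region(shape, frame_width):
    """Return the slices of the inner region; the outer region is the remaining frame."""
    return tuple(slice(w, s - w) for s, w in zip(shape, frame_width))

block = np.zeros((64, 64))
inner_slices = inner_region(block.shape, frame_width=(4, 2))
outer_mask = np.ones(block.shape, dtype=bool)
outer_mask[inner_slices] = False       # True exactly on the frame cells

print(block[inner_slices].shape)       # (56, 60): interior updated while communication runs
print(outer_mask.sum())                # number of frame cells updated afterwards
```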
After the initialization of all fields, we start the time loop with time step size
When the LB step for interface tracking is carried out, we start the communication of the velocity-based distribution function together with the phase-field. Note that these communication requirements can now be combined so that only one MPI message must be sent. Simultaneously, the inner part of the domain is updated with the hydrodynamic LB step in a collide-stream-push manner. According to equation (18), we need to form the non-equilibrium for the viscous force, which makes a collide-stream-push scheme more convenient to use. To lower the memory pressure, we update the velocity field with equation (22) in the same kernel. To finish a single time step, we wait for the communication and update the outer part of the domain. Consequently, in Algorithm 2, a one-step two-grid algorithm is applied for both LB steps. In the literature, improvements to the algorithm have been proposed which do not rely on calculating finite differences (see Geier et al., 2015a). This allows the use of all optimization techniques available for the LBM, such as single-grid streaming patterns, e.g. the AA-pattern or the Esoteric Twist (see Wittmann et al., 2016). However, it comes with a loss of accuracy and makes the algorithm less stable. Therefore, we do not use this technique in this work and rely on two population sets instead.
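A minimal sketch of this overlap with mpi4py, assuming a one-dimensional decomposition and hypothetical kernel callables, is shown below; the actual implementation uses waLBerla's GPUDirect-enabled communication and the generated pack/unpack kernels instead.

```python
# Sketch: one hydrodynamic LB step with communication hiding. update_inner and
# update_outer are hypothetical stand-ins for the generated kernels; the 1D
# neighborhood and the simple slice-based buffers are illustrative only.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

def hydrodynamic_step(pdfs, update_inner, update_outer):
    # start the non-blocking exchange of the data written by the previous LB step
    send_buf = np.ascontiguousarray(pdfs[-2])          # last interior slice in x
    recv_buf = np.empty_like(send_buf)
    requests = [comm.Isend(send_buf, dest=right),
                comm.Irecv(recv_buf, source=left)]

    # update the interior while the messages are in flight
    update_inner(pdfs)

    # wait for the ghost values, store them, then finish the outer frame
    MPI.Request.Waitall(requests)
    pdfs[0] = recv_buf                                  # west ghost layer
    update_outer(pdfs)
```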

Improved algorithm for the conservative ACM.
5. Benchmark results
In the following, we compare Algorithm 1 and Algorithm 2. For the straightforward algorithm, we need to load each PDF field three times during a single time step. In the improved algorithm, it is only necessary to load each field once. As we will discuss in Section 5.1 in more detail, the performance-limiting factor of the model is the memory bandwidth. Therefore, we expect the improved algorithm to be faster by approximately a factor of three. To measure the performance of both algorithms, we initialize a cubic domain of
In the following sections we first investigate the performance of Algorithm 2 for a single GPU. Afterward, we analyze the scaling behavior on an increasing number of GPUs in a weak scaling benchmark. For all investigations, we use a D3Q15 SRT LB scheme for the interface tracking and a D3Q27 MRT LB scheme for the velocity distribution function similarly to Mitchell et al. (2018a).
5.1. Single GPU
For the performance analysis on a single GPU, we focus on the NVIDIA Tesla V100 due to its wide distribution and its usage in the top supercomputers Summit and Sierra. Further, we discuss the performance on an NVIDIA Tesla P100 because it is used in the Piz Daint supercomputer, where we ran the weak scaling benchmark shown in Section 5.2. In this section, the two LB steps are analyzed independently. In order to determine whether the LB steps are memory- or compute-bound, the balance model is used, which is based on the code balance
and the machine balance
The machine balance describes the ratio of the machine bandwidth
If the “light speed” balance l is less than one, a code is memory limited. To be able to calculate l, values for
As specified by the vendor, the V100 has a nominal bandwidth of
The peak performances for the accelerator hardware are taken from the white paper by NVIDIA (see NVIDIA Corporation, 2017). For the V100, a double-precision peak performance of
To determine
For the hydrodynamic LB step, we have a D3Q27 stencil resulting in
In order to obtain the number of operations executed per cell in one iteration,
Combining the obtained values in Table 1, we can see that both LB steps are highly memory bound. Therefore, the maximal performance is given by
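The balance argument can be reproduced with a few lines of arithmetic. In the sketch below, the byte and FLOP counts per cell are rough assumptions for a double-precision D3Q27 step with two population sets rather than the exact numbers of Table 1, and the hardware figures are the vendor's nominal V100 values.

```python
# Sketch of the balance model with assumed numbers (double precision, D3Q27
# hydrodynamic step with two population sets); the exact counts are in Table 1.
BYTES_PER_VALUE = 8

# assumed data volume per cell and time step: read 27 PDFs + write 27 PDFs
# plus a few extra field accesses (phase-field, velocity)
bytes_per_cell = (27 + 27 + 10) * BYTES_PER_VALUE         # assumption: 512 bytes
flops_per_cell = 400.0                                     # assumption

machine_bandwidth = 900e9      # V100 nominal memory bandwidth in byte/s (vendor value)
machine_peak = 7.8e12          # V100 double-precision peak in FLOP/s (vendor value)

code_balance = bytes_per_cell / flops_per_cell             # byte per FLOP required
machine_balance = machine_bandwidth / machine_peak         # byte per FLOP available
lightspeed = min(1.0, machine_balance / code_balance)      # < 1  =>  memory bound

max_mlups = machine_bandwidth / bytes_per_cell / 1e6       # attainable cell updates/s
print(f'l = {lightspeed:.2f}, max performance = {max_mlups:.0f} MLUP/s')
```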
Estimated performance results of the phase-field and hydrodynamic LB step. The memory bandwidth
Additionally, we profiled the two LB steps with the NVIDIA profiler nvprof for both GPU architectures. The results for the limiting factors are illustrated in Figure 4. These measurements confirm the theoretical performance model very well and show that the memory bandwidth is almost fully utilized on both GPUs.

Compute units utilization and memory transfer measured with nvprof (https://docs.nvidia.com/cuda/profiler-users-guide/index.html). The memory transfer is based on the STREAM copy bandwidth of the hardware. The measurements are conducted for the phase-field and the hydrodynamic LB step separately on a Tesla V100 (a) and a Tesla P100 (b).
Consequently, the next step is to determine whether the memory traffic itself is reasonable, meaning that no values are transferred that are not needed for the calculation. By running the phase-field LB step independently, we measure a performance of about
Calculated effective bandwidth
With the usage of code generation, we can easily change the discrete velocities used for the LB steps of the implementation. This makes it possible to evaluate the performance for different two- and three-dimensional stencils. As shown in Figure 5, we employed a D3Q27, a D3Q19, a D3Q15, and a D2Q9 stencil for both the phase-field and the hydrodynamic LB step. We show that we are able to reach about 86% of the theoretical peak performance for different three-dimensional stencils. This number increases to about 94% on a Tesla V100 and on a Tesla P100 in the two-dimensional case because the assumption that every value of the phase-field

Performance measurement for different LB stencils (D3Q27, D3Q19, D3Q15 and D2Q9) for the phase-field and the hydrodynamic LB step compared to the theoretical peak performances which are illustrated as black lines respectively. The white number in each bar shows the ratio between theoretical peak performance and measured performance. (a) Performance on a Tesla V100. (b) Performance on a Tesla P100.
Due to the memory-boundedness of our implementation, we expect at most a performance increase by a factor of two when using single precision. However, a detailed analysis of single-precision computation is beyond the scope of this work.
5.2. Weak scaling benchmark
The idea of communication hiding for LBM simulations was already studied in Feichtinger et al. (2015). As in that work, we partition each subdomain (block) into an outer part, the block frame, and an inner part, the block interior. In contrast to Feichtinger et al. (2015), the width of the block frame can be chosen freely in our implementation. We first execute the LBM kernels on the frame and then send its values asynchronously as ghost values to the neighboring blocks. While the communication takes place, the LBM kernels are executed on the inner domains. Due to our flexible implementation, we ran benchmarks for different frame widths on an increasing number of GPUs in order to find an optimal choice of the frame width. This weak scaling performance benchmark was carried out on the Piz Daint supercomputer on up to 2048 GPUs.

Weak scaling performance benchmark on the Piz Daint supercomputer. The gray line shows the theoretical peak performance. With a thicker frame width (dark blue), we reach a parallel efficiency of almost 98% and also 70% of the theoretical peak performance. Furthermore, it can be seen that a thin frame (red) shows worse performance and that no separation of the domain (light blue) shows worse parallel efficiency.
Both, a frame width of
6. Numerical validation
6.1. Single rising bubble
The motion of a single gas bubble rising in liquid has been studied for many centuries by various authors and is still a problem of great interest today (see Bhaga and Weber, 1981; Clift et al., 1978; Fakhari et al., 2017a; Grace, 1973; Lote et al., 2018; Mitchell et al., 2019; Tomiyama et al., 1998). This is due to its vast importance in many industrial applications and natural phenomena like the aerosol transfer from the sea, oxygen dissolution in lakes due to rain, bubble column reactors, and the flow of foams and suspensions, to name just a few. Because of the three-dimensional nature and the nonlinear effects of the problem, the numerical simulation still remains a challenging task (see Tripathi et al., 2015). The evolution of the gas bubble in a stationary fluid depends on a large variety of different parameters. These are the surface tension, the density difference between the fluids, the viscosity of the bulk media, and the external pressure gradient or gravitational field through which buoyancy effects are observed in the gas phase. These parameters are combined into dimensionless groups in order to acquire comprehensive theories describing the problem (see Mitchell et al., 2019).
In this study we set up a computational domain of
where gy is the magnitude of the gravitational acceleration, which is applied in the vertical direction
where D is the initial diameter of the bubble. The Eötvös number
which is also called the Bond number, describes the influence of gravitational forces compared to surface tension forces. Further, we use the density ratio
Thus, the dimensionless time can be calculated as
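For reference, the standard textbook definitions of these dimensionless quantities can be evaluated as follows; the parameter values are placeholders and not the lattice parameters of this study.

```python
# Sketch: standard definitions of the dimensionless groups used above; the values
# are placeholders (an air-water-like pair), not the parameters of this study.
import math

g = 9.81                       # gravitational acceleration
D = 0.01                       # initial bubble diameter
rho_l, rho_g = 1000.0, 1.2     # liquid and gas density
sigma = 0.072                  # surface tension

Eo = g * (rho_l - rho_g) * D**2 / sigma     # Eoetvoes (Bond) number
density_ratio = rho_l / rho_g               # density ratio between the phases

def dimensionless_time(t):
    """Time scaled with the gravitational reference time sqrt(D / g)."""
    return t / math.sqrt(D / g)

print(f'Eo = {Eo:.1f}, rho_l/rho_g = {density_ratio:.0f}, '
      f't* at 0.1 s = {dimensionless_time(0.1):.2f}')
```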

Terminal shape of a single rising bubble with
Additionally, we calculate the drag coefficients with the terminal velocity of the bubbles
and compare it to the experiments carried out by Bhaga and Weber (1981). Based on their observations they set up the following empirical equation to calculate the drag coefficient of a rising bubble described by the gravity Reynolds number
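The drag evaluation can be written down compactly. The correlation in the sketch below is a commonly cited form attributed to Bhaga and Weber (1981) and is given for illustration only; it is not necessarily the exact gravity-Reynolds-number formulation referred to above.

```python
# Sketch: drag coefficient of a rising bubble. drag_from_terminal_velocity follows
# from the buoyancy-drag force balance; bhaga_weber_fit is a commonly cited form of
# their empirical correlation and may differ from the exact expression used above.
def drag_from_terminal_velocity(rho_l, rho_g, g, D, u_terminal):
    """C_D from the force balance of a bubble rising at its terminal velocity."""
    return 4.0 * (rho_l - rho_g) * g * D / (3.0 * rho_l * u_terminal**2)

def bhaga_weber_fit(Re):
    """Empirical fit C_D(Re) for rising bubbles (commonly cited form)."""
    return (2.67**0.9 + (16.0 / Re)**0.9)**(1.0 / 0.9)

print(bhaga_weber_fit(10.0))   # example evaluation at Re = 10
```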
As can be seen in Figure 8, our results are in good agreement with the experimental investigations.

Drag coefficient plotted against the Reynolds number of a single rising bubble. The dots show the results of the LB simulation with a constant Eötvös number of
6.2. Bubble field
To demonstrate the robustness of the method, as well as the possibilities opened up by the efficient and scalable implementation, we show a large-scale bubble rise scenario with several hundred bubbles. The simulation is carried out on a

Large scale bubble rise scenario simulated on the Piz Daint supercomputer with several hundred air bubbles. (a) Initialization. (b) Time step 125 000. (c) Time step 250 000. (d) Time step 375 000. (e) Time step 500 000.
7. Conclusion
In this work we have presented an implementation of the conservative ACM based on meta-programming. With this technique we can generate highly efficient C, OpenCL, and CUDA kernels which can be integrated into other frameworks for simulating large scale scenarios. For this work we have used the waLBerla framework to integrate our code. We have measured the efficiency of our implementation on single GPUs. Excellent performance results compared to the roofline model could be shown for a Tesla V100 and a Tesla P100, where we achieved about 85% of the theoretical peak performance for both architectures. Additionally, we have shown that our code not only performs very well for one configuration but keeps its excellent efficiency even for different stencils and different methods to solve the LBEs. It is even possible to directly generate 2D cases for testing, which also show very good performance results. By separating our iteration region into an inner and an outer part, we enable communication hiding, which is relevant for multi-GPU simulations with MPI. With this technique we are able to run large scale simulations with almost perfect scalability. To show this, we have run a weak scaling benchmark on the Piz Daint supercomputer on up to 2048 GPUs.
In the last part of this paper, future work is identified. The interface between two fluids is resolved with several cells in the phase-field model. Thus it is crucial to use a very high resolution. This makes adaptive mesh refinement (AMR) a fundamental approach, which should resolve the interface with a higher resolution than the bulk fluid (see Fakhari et al., 2017a). Successful usage of AMR with the waLBerla framework in extreme-scale scenarios has been shown in several publications (see Bauer et al., 2020a; Schornbaum and Rüde, 2016, 2018). However, an adaptation to GPUs will be future work. With the usage of AMR, additional problems will occur which need to be tackled. One of those is the strong scaling capability of our implementation. While for weak scaling benchmarks a single block in our block-structured domain can use the full GPU memory, this is no longer true if AMR is applied, and thus smaller refined blocks will be used in parts of the domain. Therefore, it will be important that a single GPU kernel can apply the computation simultaneously to several blocks allocated on the GPU. Otherwise, a GPU compute kernel's performance will deteriorate once a block is too small to keep all threads busy. This problem is critical when applying communication hiding strategies as done in this work, since these divide the individual blocks even further.
While more advanced LB collision operators like the cumulant LBM are available in the lbmpy framework, this has not yet been employed in the phase-field algorithm. As described by Geier et al. (2021), a careful numeric design becomes crucial when using the LBM, especially with more sophisticated collision operators. The annotated symbolic derivation of the algorithm can be made aware of roundoff and stability concerns. In fact, even an automated roundoff error analysis can be developed. This is, however, left to future work.
Acknowledgments
We appreciate the support by Travis Mitchell for this project. Furthermore, we thank Christoph Schwarzmeier and Christoph Rettinger for fruitful discussions on the topic.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: We are grateful to the Swiss National Supercomputing Centre (CSCS) for providing computational resources and access to the Piz Daint supercomputer. Further, the authors would like to thank the Bavarian Competence Network for Technical and Scientific High Performance Computing (KONWIHR), the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) for supporting project 408059952 and the Bundesministerium für Bildung und Forschung (BMBF, Federal Ministry of Education and Research) for supporting project 01IH15003A (SKAMPY).
Author biographies
Markus Holzer is a research assistant at the Chair for System Simulation at the University Erlangen-Nürnberg. He holds a MSc degree in Computational Engineering and is one of the core developers of the waLBerla HPC framework and leading developer of the code generation frameworks pystencils and lbmpy. His research interests are code generation for lattice Boltzmann methods and large scale fluid simulations on heterogeneous and massively parallel hardware.
Martin Bauer is a research assistant at the Chair for System Simulation at the University Erlangen-Nürnberg. He holds a MSc(hons) degree in Computational Engineering and is one of the core developers of the waLBerla HPC framework. His research interests are efficient parallel algorithms for large scale multiphysics simulations, especially in the context of computational fluid dynamics and the lattice Boltzmann method.
Harald Köstler got his PhD in computer science in 2008 on variational models and parallel multigrid methods in medical image processing. 2014 he finished his habilitation on Efficient Numerical Algorithms and Software Engineering for High Performance Computing. Currently, he works at the Chair for System Simulation at the University of Erlangen-Nuremberg in Germany. His research interests include software engineering concepts especially using code generation for simulation software on HPC clusters, multigrid methods, and programming techniques for parallel hardware, especially GPUs. The application areas are computational fluid dynamics, rigid body dynamics, and medical imaging.
Ulrich Rüde heads the Chair for System Simulation at the University Erlangen Nürnberg. He studied Mathematics and Computer Science at Technische Universität München (TUM) and The Florida State University. He holds a PhD and Habilitation degrees from TUM. His research interest focuses on numerical simulation and high end computing, in particular computational fluid dynamics, multilevel methods, and software engineering for high performance computing. He is a Fellow of the Society of Industrial and Applied Mathematics.
