202403-Articles-CAF-Symmetric-FSM-Published
Abstract: A strategy to improve the performance and reduce the memory footprint of simulations on meshes with spatial reflection symmetries is presented in this work. By using an appropriate mirrored ordering of the unknowns, discrete partial differential operators are represented by matrices with a regular block structure that allows replacing the standard sparse matrix–vector product with a specialised version of the sparse matrix–matrix product, which has a significantly higher arithmetic intensity. Consequently, matrix multiplications are accelerated, whereas their memory footprint is reduced, making massive simulations more affordable. As an example of practical application, we consider the numerical simulation of turbulent incompressible flows using a low-dissipation discretisation on unstructured collocated grids. All the required matrices are classified into three sparsity patterns that correspond to the discrete Laplacian, gradient, and divergence operators. Therefore, the above-mentioned benefits of exploiting spatial reflection symmetries are tested for these three matrices on both CPU and GPU, showing up to 5.0x speed-ups and 8.0x memory savings. Finally, a roofline performance analysis of the symmetry-aware sparse matrix–vector product is presented.

Keywords: Reflection symmetries; Arithmetic intensity; Memory footprint; SpMV; SpMM; MPI+OpenMP+OpenCL/CUDA
1. Introduction

The design of digital processors constantly evolves to overcome limitations and bottlenecks. The formerly compute-bound nature of processors led to compute-centric programming languages and simulation codes. However, raw computing power grows at a (much) faster pace than the speed of memory access, turning around the problem. Increasingly complex memory hierarchies are found nowadays in computing systems, and optimising traditional applications for these systems is cumbersome. Moreover, new parallel programming languages and frameworks emerged to target modern hardware (e.g., OpenMP, CUDA, OpenCL, HIP), and porting algorithms and applications has become restrictive. This scenario poses multiple challenges to the development of efficient and scalable scientific simulation codes.

Firstly, only applications with very high arithmetic intensity can approach the theoretical peak performance of modern high-performance computing (HPC) systems. However, this is not the case in most problems in computational physics since they usually rely on sparse linear algebra (or equivalent) operations. Examples can be found in computational fluid dynamics (CFD), linear elasticity, structural mechanics or electromagnetic modelling, among others. Strategies helping to mitigate this include adopting more compute-intensive methods, which may have a higher arithmetic intensity because of the numerical schemes used [1,2] or thanks to the presence of Riemann solvers [3], using mixed-precision algorithms [4], merging simulations of multiple flow ensembles [5], exploiting uniform mesh directions with periodic conditions [6], or pursuing more efficient implementations.

Secondly, the above-mentioned heterogeneity in HPC systems makes cross-platform portability crucial. In this regard, our strategy is to break the interdependency between algorithms and their software implementation by casting calculations into a minimalist set of universal kernels. There is an increasing interest towards the development of more abstract implementations. For instance, the PyFR framework [1] is mostly based on matrix multiplications and point-wise operations. Another example is the Kokkos programming model [7], which includes computation abstractions for frequently used parallel computing patterns and data structures. Namely, implementing an algorithm in terms of Kokkos entities allows mapping the algorithm onto multiple architectures. Some authors propose domain-specific tools to address this, generalising the stencil computations for specific fields. For instance, a framework that automatically translates stencil functions written in C++ to both central processing unit (CPU) and graphics processing unit
(GPU) codes is proposed in [8]. In this regard, in our previous works [9,10] we showed that all the operations involved in a typical CFD algorithm for large-eddy simulation (LES) or direct numerical simulation (DNS) of incompressible turbulent flows can be simplified to three basic linear algebra subroutines: a sparse matrix–vector product (SpMV), a linear combination of vectors and a dot product. From now on, we will refer to implementation models heavily based on algebraic subroutines as algebraic or algebra-based. In such an implementation approach, the kernel code shrinks to hundreds or even dozens of lines; the portability becomes natural, and maintaining multiple implementations takes little effort. Besides, standard libraries optimised for particular architectures (e.g., cuSPARSE [11], clSPARSE [12]) can be linked in addition to specialised in-house implementations. Nevertheless, the algebraic approach imposes restrictions and challenges that must be addressed, such as the inherently low arithmetic intensity of the SpMV, which makes the simulation algorithm pronouncedly memory-bound.

In this context, the present work proposes a strategy to exploit spatial reflection symmetries for accelerating virtually all matrix multiplications, which generally constitute the most computationally expensive kernel involved in the simulations [10]. This is done by replacing the standard SpMV with a specialised version of the sparse matrix–matrix product (SpMM), a considerably more compute-intensive kernel thanks to the lower memory traffic it entails [13]. Besides increasing the arithmetic intensity of the simulations, exploiting s symmetries allows reducing both the setup costs and the memory footprint of the discrete operators by a factor of 2^s, thus making high-fidelity simulations significantly more affordable. Although out of the scope of this work, symmetries can be further exploited to accelerate the solution of Poisson's equation [14-17]. Remarkably enough, although focusing on CFD applications, the approach presented is naturally extensible to other physical problems. However, we target the DNS and LES of incompressible turbulent flows, which usually exhibit spatial reflection symmetries. Indeed, vehicles generally present one symmetry [18]. Examples with two symmetries include jets [19], flames [20], multiphase flows [21], and building and urban simulations [22,23]. Finally, domains with three (or more) symmetries range from canonical flows [24] to industrial devices such as nuclear reactors [25,26] or heat exchangers [27].

The remaining parts of this paper are organised as follows. Section 2 defines the adequate discretisation of domains with symmetries and derives the resulting structure exhibited by the discrete operators. Section 3 details the replacement of SpMV with the specialised version of SpMM. Section 4 applies the previous results to the solution of the Navier–Stokes equations, Section 5 presents the numerical results and, finally, Section 6 overviews the strategy and discusses future lines of work.

2. Exploiting symmetries

This section aims to show how to exploit spatial reflection symmetries to increase the arithmetic intensity of the matrix multiplications. Although applying identically to arbitrarily complex geometries, for clarity, let us first present our strategy in its simplest form, i.e., on a one-dimensional mesh with a single reflection symmetry.

Hence, given the one-dimensional mesh of Fig. 1, let us order its grid points by first indexing the ones lying on one half and then those in the other. Then, if we impose on the resulting subdomains the same local ordering (mirrored by the symmetry's hyperplane), we ensure that all the scalar fields satisfy the following:

𝒙 = ⎛𝒙_1⎞ ∈ R^n,  (1)
    ⎝𝒙_2⎠

where n stands for the mesh size and 𝒙_1, 𝒙_2 ∈ R^{n/2} for 𝒙's restriction to each subdomain. Remarkably enough, mirrored grid points are in the same position within the subvectors, and discrete versions of all partial differential operators satisfy the following block structure:

𝖠 = ⎛𝖠_{1,1} 𝖠_{1,2}⎞ ∈ R^{n×n},  (2)
    ⎝𝖠_{2,1} 𝖠_{2,2}⎠

where 𝖠_{i,j} ∈ R^{n/2×n/2} accounts for the couplings between the ith and jth subdomains. As long as 𝖠 only depends on geometric quantities (which is typically the case), given that both subdomains are identical and thanks to the mirrored ordering, we have that:

𝖠_{1,1} = 𝖠_{2,2} and 𝖠_{1,2} = 𝖠_{2,1},  (3)

and, by denoting 𝖠_i ≡ 𝖠_{1,i}, we can rewrite Eq. (2) as:

𝖠 = ⎛𝖠_1 𝖠_2⎞.  (4)
    ⎝𝖠_2 𝖠_1⎠

The procedure above can be applied recursively to exploit an arbitrary number of symmetries, s. For instance, taking advantage of s = 2 symmetries results in 4 mirrored subdomains on which, analogously to Eq. (4), virtually all discrete operators satisfy the following:

𝖠 = ⎛𝖠_1 𝖠_2 𝖠_3 𝖠_4⎞
    ⎜𝖠_2 𝖠_1 𝖠_4 𝖠_3⎟,  (5)
    ⎜𝖠_3 𝖠_4 𝖠_1 𝖠_2⎟
    ⎝𝖠_4 𝖠_3 𝖠_2 𝖠_1⎠

where 𝖠_i ∈ R^{n/4×n/4} corresponds to the couplings between the first and the ith subdomains.

Thanks to the discretisation presented above, exploiting s symmetries allows meshing a 1/2^s fraction of the entire domain, henceforth named base mesh (see Fig. 6). Then, instead of building the full operators, 𝖠 ∈ R^{n×n}, it is only needed to build the base mesh's couplings with itself, 𝖠_1 ∈ R^{n/2^s × n/2^s}, and with its 2^s − 1 mirrored counterparts, 𝖠_2, …, 𝖠_{2^s} ∈ R^{n/2^s × n/2^s}. As a result, both the setup costs and the memory footprint of the matrices are considerably reduced.

Furthermore, while the sparsity pattern of 𝖠_1 matches that of the actual operator built upon the base mesh, the outer-subdomain couplings, 𝖠_2, …, 𝖠_{2^s}, have very few non-zero entries (if any), making the following splitting very advantageous:

𝖠 = I_{2^s} ⊗ 𝖠_inn + 𝖠_out,  (6)

where 𝖠_inn := 𝖠_1 ∈ R^{n/2^s × n/2^s} and 𝖠_out := 𝖠 − I_{2^s} ⊗ 𝖠_inn ∈ R^{n×n}. Indeed, as will be shown in Section 3, Eq. (6) allows accelerating the matrix multiplications by replacing the standard SpMV with the more compute-intensive SpMM.

Finally, let us note that the splitting of Eq. (6) only requires the geometric domain to be symmetric and is perfectly compatible with asymmetric boundary conditions. Certainly, even if they are introduced within the discrete operators, it is enough to assign their corresponding matrix entries to 𝖠_out.
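For illustration purposes, the block structure above can be reproduced with a short, self-contained C++ sketch (it is not the code used in this work): it builds the graph Laplacian of a one-dimensional mesh under the mirrored ordering, checks the identities of Eq. (3), and applies the operator through the splitting of Eq. (6). Dense storage and a plain graph-Laplacian stencil are simplifications chosen only for readability.

```cpp
// Sketch only: 1D mesh with one reflection symmetry and mirrored ordering.
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <vector>

int main() {
    const int n = 8, h = n / 2;  // number of cells and half-size
    // Mirrored ordering: subdomain 1 holds positions 0..h-1 (left to centre),
    // subdomain 2 holds positions n-1..h (right to centre), so that mirrored
    // cells share the same local index.
    auto pos = [&](int i) { return i < h ? i : n - 1 - (i - h); };

    // Graph Laplacian of the 1D mesh, assembled in the mirrored ordering.
    std::vector<std::vector<double>> A(n, std::vector<double>(n, 0.0));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (std::abs(pos(i) - pos(j)) == 1) {
                A[i][j] = 1.0;
                A[i][i] -= 1.0;
            }

    // Block structure of Eqs. (3)-(4): A11 == A22 and A12 == A21.
    for (int i = 0; i < h; ++i)
        for (int j = 0; j < h; ++j) {
            assert(A[i][j] == A[h + i][h + j]);
            assert(A[i][h + j] == A[h + i][j]);
        }

    // Apply A through the splitting of Eq. (6): the diagonal block A_inn = A11
    // is reused for both subvectors, and A_out holds the few outer couplings.
    std::vector<double> x(n), y_ref(n, 0.0), y_sym(n, 0.0);
    for (int i = 0; i < n; ++i) x[i] = std::sin(1.0 + i);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) y_ref[i] += A[i][j] * x[j];
    for (int i = 0; i < h; ++i)
        for (int j = 0; j < h; ++j) {
            y_sym[i]     += A[i][j] * x[j];          // A_inn * x_1
            y_sym[h + i] += A[i][j] * x[h + j];      // A_inn * x_2 (same block)
            y_sym[i]     += A[i][h + j] * x[h + j];  // A_out couplings
            y_sym[h + i] += A[h + i][j] * x[j];
        }

    double err = 0.0;
    for (int i = 0; i < n; ++i) err = std::max(err, std::abs(y_ref[i] - y_sym[i]));
    std::cout << "max |y_ref - y_sym| = " << err << "\n";  // expected: 0
}
```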
3. Faster and lighter matrix multiplications

3.1. Optimising sparse matrix–vector multiplication on symmetric domains

SpMV is the most computationally expensive routine in many large-scale simulations relying on iterative methods. Namely, it is a strongly memory-bound kernel with a low arithmetic intensity (I), which is the ratio of computing work in flop to memory traffic in bytes, and requires indirect memory accessing to the input vector, harming the
To evaluate the benefits of the proposed implementation in terms of memory accesses and footprint, we recall the memory traffic according to Eq. (8). Given a square matrix satisfying the splitting of Eq. (6), i.e., m = n and 𝖠 = I_{2^s} ⊗ 𝖠_inn + 𝖠_out, we have that the number of rows of 𝖠_inn is n/2^s. Table 1 outlines the memory accesses required for the three implementations: the standard SpMV implementation (without exploiting mesh symmetries), the direct implementation casting the product into one SpMM plus one SpMV, and the fused SymSpMV implementation. Therefore, exploiting mesh symmetries reduces the memory accesses and storage by 12(2^s − 1) nnz(𝖠_inn) − 4(n/2^s + 1) bytes.

Table 1
Estimation of the memory accesses required to compute 𝖠𝒙 = (I_{2^s} ⊗ 𝖠_inn + 𝖠_out)𝒙 using different implementations and considering a square matrix, double-precision coefficients and the CSR matrix format.

Implementation | Memory accesses (bytes)
SpMV           | 12 nnz(I_{2^s} ⊗ 𝖠_inn + 𝖠_out) + 4(n + 1) + 16n
SpMM + SpMV    | [12 nnz(𝖠_inn) + 4(n/2^s + 1) + 16n] + [12 nnz(𝖠_out) + 4(n + 1) + 16n]
SymSpMV        | 12[nnz(𝖠_inn) + nnz(𝖠_out)] + 4(n + 1) + 4(n/2^s + 1) + 16n
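The counts of Table 1 follow from the CSR format in double precision: each non-zero entry costs 12 bytes (an 8-byte coefficient plus a 4-byte column index), the row pointers add 4 bytes per row plus one, and reading the input and writing the output vectors add 16n bytes. The small helper below (an illustrative sketch, not part of the paper's code; the sizes in main are made-up values, roughly matching a 17.7M-cell mesh with s = 3) simply evaluates the three expressions of the table.

```cpp
#include <cstdint>
#include <cstdio>

// Byte-traffic estimates of Table 1 for computing A*x with
// A = I_{2^s} (x) A_inn + A_out, CSR storage and double precision.
struct Traffic { std::uint64_t spmv, spmm_spmv, symspmv; };

Traffic traffic(std::uint64_t nnz_inn, std::uint64_t nnz_out,
                std::uint64_t n, unsigned s) {
    const std::uint64_t k = 1ull << s;                 // 2^s mirrored subdomains
    Traffic t;
    t.spmv      = 12 * (k * nnz_inn + nnz_out) + 4 * (n + 1) + 16 * n;
    t.spmm_spmv = (12 * nnz_inn + 4 * (n / k + 1) + 16 * n)   // SpMM with A_inn
                + (12 * nnz_out + 4 * (n + 1) + 16 * n);      // SpMV with A_out
    t.symspmv   = 12 * (nnz_inn + nnz_out) + 4 * (n + 1) + 4 * (n / k + 1) + 16 * n;
    return t;
}

int main() {
    Traffic t = traffic(/*nnz_inn=*/15'000'000, /*nnz_out=*/500'000,
                        /*n=*/17'700'000, /*s=*/3);    // illustrative sizes only
    std::printf("SpMV        : %llu bytes\n", (unsigned long long)t.spmv);
    std::printf("SpMM + SpMV : %llu bytes\n", (unsigned long long)t.spmm_spmv);
    std::printf("SymSpMV     : %llu bytes\n", (unsigned long long)t.symspmv);
}
```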
twice as fast on dual-socket compute servers, as was shown in [10] (Fig. 6).

Regarding the implementation of SymSpMV on GPUs, both for CUDA and OpenCL, one work item (thread) is responsible for computing one component of the output vector. Therefore, there is no inner loop over symmetries, which is in line 4 of Algorithm 2. The number of work items, or the grid size in CUDA terminology, is m × d. Since symmetrical instances are contiguously numbered and the local workgroup size is divisible by d, the work items of symmetrical instances are always in the same workgroup and even in the same warp, in CUDA terminology. This ensures that matrix coefficients are shared by d neighbouring work items processing symmetrical instances with a coalescing of memory transactions. This approach appears to be 2–3 times faster in the case of 3 symmetries (d = 8) than the naïve implementation of Algorithm 2 with m work items and the inner loop over symmetrical instances. Note that in the case of such a naïve implementation, using a loop with a constant limit is also notably faster than with a variable limit.
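As a reference for the discussion above, a CPU-side sketch of the fused kernel is given below. It is written in plain C++ with OpenMP and is only meant to convey the structure described in this section (it is not the paper's Algorithm 2 nor its actual implementation): every coefficient of 𝖠_inn is reused for all mirrored instances, whereas 𝖠_out and the input/output vectors are traversed only once. On GPUs, as explained above, the inner loop over instances disappears and one work item is assigned to every (row, instance) pair instead.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// CSR storage: row pointers, column indices and values.
struct Csr {
    std::vector<std::int64_t> ptr;   // size rows+1
    std::vector<std::int32_t> col;
    std::vector<double>       val;
};

// Fused symmetry-aware product  y = (I_{2^s} (x) A_inn + A_out) x.
// A_inn has nb = n/2^s rows; A_out is n x n and holds only the (few)
// inter-subdomain couplings. Each A_inn coefficient is reused for the
// 2^s mirrored instances, which is what raises the arithmetic intensity.
void symspmv(const Csr& A_inn, const Csr& A_out, int num_instances,
             const std::vector<double>& x, std::vector<double>& y) {
    const std::int64_t nb = static_cast<std::int64_t>(A_inn.ptr.size()) - 1;
    #pragma omp parallel for schedule(static)
    for (std::int64_t r = 0; r < nb; ++r) {
        for (int d = 0; d < num_instances; ++d) {        // mirrored instances
            const std::int64_t g = d * nb + r;           // global row index
            double acc = 0.0;
            for (std::int64_t k = A_inn.ptr[r]; k < A_inn.ptr[r + 1]; ++k)
                acc += A_inn.val[k] * x[d * nb + A_inn.col[k]];  // block d of x
            for (std::int64_t k = A_out.ptr[g]; k < A_out.ptr[g + 1]; ++k)
                acc += A_out.val[k] * x[A_out.col[k]];   // outer couplings
            y[g] = acc;
        }
    }
}

int main() {
    // Toy check: A_inn = [[2,1],[1,2]] (nb = 2), two instances, no outer
    // couplings, so y equals A_inn applied to each half of x.
    Csr A_inn{{0, 2, 4}, {0, 1, 0, 1}, {2, 1, 1, 2}};
    Csr A_out{{0, 0, 0, 0, 0}, {}, {}};                  // 4x4, all zero
    std::vector<double> x{1, 2, 3, 4}, y(4);
    symspmv(A_inn, A_out, 2, x, y);
    for (double v : y) std::printf("%g\n", v);           // 4 5 10 11
}
```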
the variables are cell-centred or staggered at the faces. The collocated velocity, 𝒖_c ∈ R^{3n}, is arranged as a column vector containing the three spatial velocity components as 𝒖_c = (𝒖_1, 𝒖_2, 𝒖_3)^T, where 𝒖_i = ([𝒖_i]_1, [𝒖_i]_2, …, [𝒖_i]_n) ∈ R^n are vectors containing the velocity components corresponding to the x_i-spatial direction. The staggered velocity vector 𝒖_s = ([u_s]_1, [u_s]_2, …, [u_s]_m)^T ∈ R^m, which is needed to compute the convective term, 𝖢(𝒖_s), results from the projection of a staggered predictor velocity, 𝒖_s^p ∈ R^m (see Algorithm 3), where m is the number of faces. The matrices Ω ∈ R^{3n×3n}, 𝖢(𝒖_s) ∈ R^{3n×3n} and 𝖣 ∈ R^{3n×3n} are square block diagonal matrices given by:

Ω = I_3 ⊗ Ω_c,  𝖢(𝒖_s) = I_3 ⊗ 𝖢_c(𝒖_s),  𝖣 = I_3 ⊗ 𝖣_c,  (13)

where I_3 ∈ R^{3×3} is the identity matrix, Ω_c ∈ R^{n×n} is a diagonal matrix containing the sizes of the cell-centred control volumes and 𝖢_c(𝒖_s) ∈ R^{n×n} and 𝖣_c ∈ R^{n×n} are the collocated convective and diffusive operators, respectively. Finally, 𝖦_c ∈ R^{3n×n} represents the discrete gradient operator, whereas the matrix 𝖬 ∈ R^{n×m} is the face-to-cell discrete divergence operator.

Algorithm 3 Projection of a staggered velocity field, 𝒖_s^p ∈ R^m. It returns a divergence-free staggered velocity field, 𝒖_s ∈ R^m, i.e., 𝖬𝒖_s = 𝟎_c.
Input: 𝖬, 𝖫, 𝖦, 𝒖_s^p
Output: 𝒖_s, 𝒑̃_c
1: Solve the Poisson equation for the (pseudo-)pressure: 𝖫𝒑̃_c = 𝖬𝒖_s^p
2: Correct the staggered velocity field: 𝒖_s = 𝒖_s^p − 𝖦𝒑̃_c

Algorithm 4 (Pseudo-)projection of a collocated velocity, 𝒖_c^p ∈ R^{3n}. It returns a (quasi-)divergence-free collocated velocity, 𝒖_c ∈ R^{3n}, i.e., 𝖬𝛤_{c→s}𝒖_c ≈ 𝟎_c.
Input: 𝖬, 𝖫, 𝖦, 𝛤_{c→s}, 𝛤_{s→c}, 𝒖_c^p
Output: 𝒖_c, 𝒖_s, 𝒑̃_c
1: Cell-to-face interpolation of the velocity field: 𝒖_s^p = 𝛤_{c→s}𝒖_c^p
2: Projection of 𝒖_s^p with Algorithm 3: it returns 𝒖_s (𝖬𝒖_s = 𝟎_c) and 𝒑̃_c
3: Correct the collocated velocity field: 𝒖_c = 𝒖_c^p − 𝖦_c𝒑̃_c = 𝒖_c^p − 𝛤_{s→c}𝖦𝒑̃_c

4.2. Solving NS equations on collocated grids

Let us firstly consider the projection of a staggered (predictor) velocity field, 𝒖_s^p, onto a divergence-free space. This is a well-posed problem: it can be uniquely decomposed into a solenoidal velocity, 𝒖_s, and the gradient of a scalar field, 𝖦𝒑̃_c. It requires the solution of a Poisson equation for the pressure (or a pseudo-pressure) and the subsequent projection of the velocity field (see Algorithm 3). Here, the tilde in 𝒑̃_c is to stress that it does not need to be the actual pressure, 𝒑_c, but instead some sort of pseudo-pressure. The matrix 𝖦 ∈ R^{m×n} is the cell-to-face discrete gradient, and it is related with the discrete (integrated) divergence operator, 𝖬, via:

𝖦 ≡ −Ω_s^{-1}𝖬^T.  (14)

Then, the discrete Laplacian operator, 𝖫 ∈ R^{n×n}, is, by construction, a symmetric negative semi-definite matrix:

𝖫 ≡ 𝖬𝖦 = −𝖬Ω_s^{-1}𝖬^T.  (15)

Notice that Ω_s ∈ R^{m×m} is a diagonal matrix with strictly positive diagonal elements that contains the staggered control volumes associated with the staggered velocity components.

Nevertheless, the momentum equation, Eq. (11), requires the computation of a cell-centred pressure gradient, 𝖦_c𝒑̃_c, which is approximated via a face-to-cell (momentum) interpolation, 𝛤_{s→c} ∈ R^{3n×m}, as follows:

𝖦_c ≡ 𝛤_{s→c}𝖦 ∈ R^{3n×n}.  (16)

Actually, the overall process can be viewed as a (pseudo-)projection of a collocated velocity field, 𝒖_c^p ∈ R^{3n}, outlined in Algorithm 4. Notice that the core of this algorithm (step 2) is indeed the projection of a staggered velocity field, 𝒖_s^p, using Algorithm 3. Then, steps 1 and 3 require a cell-to-face, 𝛤_{c→s} ∈ R^{m×3n}, and a face-to-cell, 𝛤_{s→c} ∈ R^{3n×m}, interpolation. They must be related as follows:

𝛤_{s→c} = Ω^{-1}𝛤_{c→s}^T Ω_s,  (17)

to preserve the duality between the collocated gradient, 𝖦_c, defined in Eq. (16), and the (integrated) collocated divergence operator, 𝖬_c ≡ 𝖬𝛤_{c→s} ∈ R^{n×3n}, i.e.,

𝖦_c = −Ω^{-1}𝖬_c^T.  (18)

Finally, the sequence of operations to advance one time-step is outlined in Algorithm 5. Namely, the spatially discrete momentum equation, Eq. (11), is discretised in time using an explicit second-order Adams–Bashforth (AB2) scheme for both convection and diffusion (steps 1 and 3), whereas the pressure–velocity coupling is solved using a fractional step method [33]. Here, the AB2 scheme is chosen for the sake of simplicity, although more appropriate temporal schemes can be used [34].

Algorithm 5 Numerical resolution of the NS equations using a fractional step method on a collocated grid.
Input: 𝖬, 𝖫, 𝖦, 𝛤_{c→s}, 𝛤_{s→c}, Ω, 𝒖_c^n, 𝒖_s^n (𝖬𝒖_s^n = 𝟎_c), 𝖱_𝒖^{n−1}
Output: 𝒖_c^{n+1}, 𝒖_s^{n+1} (𝖬𝒖_s^{n+1} = 𝟎_c), 𝒑̃_c^{n+1} (and 𝖱_𝒖^n)
1: Computation of the convective, 𝖢(𝒖_s), and the diffusive, 𝖣, terms:
𝖱_𝒖^n ≡ Ω^{-1}(−𝖢(𝒖_s^n) + 𝖣)𝒖_c^n.  (19)
2: Determination of Δt using a CFL condition [34].
3: Computation of the predictor velocity, 𝒖_c^p:
𝒖_c^p = 𝒖_c^n + Δt((3/2)𝖱_𝒖^n − (1/2)𝖱_𝒖^{n−1})  (20)
4: Projection of 𝒖_c^p with Algorithm 4: it returns 𝒖_c^{n+1}, 𝒖_s^{n+1} and 𝒑̃_c^{n+1}
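To emphasise that the whole time step reduces to the algebraic subroutines mentioned in Section 1 (matrix products, linear combinations of vectors and dot products), the following C++ sketch mirrors the structure of Algorithms 3–5. It is schematic and matrix-free on purpose: the operator handles and the Poisson solver are placeholders (any storage, including the symmetry-aware one of Section 3, fits behind them), and the CFL-based determination of Δt (step 2 of Algorithm 5) is omitted; it is not the implementation used in this work.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
// Matrix-free handles for the discrete operators of Section 4; how they are
// stored (CSR, symmetry-aware SymSpMV, ...) is irrelevant to the algorithm.
using Op = std::function<Vec(const Vec&)>;

struct Operators {
    Op M, G, Gamma_cs, Gamma_sc;     // divergence, gradient, interpolations
    Op R;                            // R(u_c) = Omega^{-1}(-C(u_s) + D) u_c, Eq. (19)
    std::function<Vec(const Vec&)> solve_poisson;   // solves L p = rhs
};

Vec axpy(double a, const Vec& x, const Vec& y) {    // a*x + y
    Vec z(y);
    for (std::size_t i = 0; i < z.size(); ++i) z[i] += a * x[i];
    return z;
}

// One explicit AB2 + fractional-step time step (Algorithms 3-5, sketched).
void step(const Operators& op, double dt,
          Vec& u_c, Vec& u_s, Vec& R_prev, Vec& p_c) {
    Vec R_n = op.R(u_c);                                            // Eq. (19)
    Vec u_c_p = axpy(1.5 * dt, R_n, axpy(-0.5 * dt, R_prev, u_c));  // Eq. (20)
    Vec u_s_p = op.Gamma_cs(u_c_p);                // Algorithm 4, step 1
    p_c = op.solve_poisson(op.M(u_s_p));           // Algorithm 3, step 1
    Vec Gp = op.G(p_c);
    u_s = axpy(-1.0, Gp, u_s_p);                   // Algorithm 3, step 2
    u_c = axpy(-1.0, op.Gamma_sc(Gp), u_c_p);      // Algorithm 4, step 3
    R_prev = R_n;
}

int main() {
    // Trivial one-cell smoke test with identity-like placeholder operators.
    Operators op;
    op.M = op.G = op.Gamma_cs = op.Gamma_sc = [](const Vec& v) { return v; };
    op.R = [](const Vec& v) { return Vec(v.size(), 0.0); };
    op.solve_poisson = [](const Vec& rhs) { return rhs; };
    Vec u_c{1.0}, u_s{1.0}, R_prev{0.0}, p_c{0.0};
    step(op, 0.1, u_c, u_s, R_prev, p_c);
}
```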
4.3. Constructing the discrete operators

This subsection briefly revises the construction of all the discrete operators needed to solve the NS equations. The above-explained constraints imposed by the (skew-)symmetries strongly simplify "the discretisation problem" to a set of five basic discrete operators. Namely,

{Ω_c, Ω_s, 𝖭_s, 𝖬, 𝛱_{c→s}}.  (21)

The first three correspond to basic geometrical information of the mesh: namely, the diagonal matrices containing the cell-centred and staggered control volumes, Ω_c and Ω_s, and the matrix containing the face normal vectors, 𝖭_s ≡ (𝖭_{s,1}, 𝖭_{s,2}, 𝖭_{s,3}) ∈ R^{m×3m}, where 𝖭_{s,i} ∈ R^{m×m} are diagonal matrices containing the x_i-spatial components of the face normal vectors, 𝒏_f. The staggered control volumes, Ω_s, are given by

[Ω_s]_{f,f} ≡ A_f δ_f,  (22)

where A_f is the area of the face f and δ_f = |𝒏_f ⋅ c1c2⃗| is the projected distance between adjacent cell centres (see Fig. 4). In this way, the sum of volumes is exactly preserved, tr(Ω_s) = tr(Ω) = d tr(Ω_c) (d = 2 for 2D and d = 3 for 3D), regardless of the mesh quality and the location of the cell centres.

Then, the face-to-cell discrete (integrated) divergence operator, 𝖬, is defined as follows:

[𝖬𝒖_s]_k = Σ_{f∈F_f(k)} [𝒖_s]_f A_f,  (23)

where F_f(k) is the set of faces bordering the cell k. Finally, 𝛱_{c→s} ∈ R^{m×n} is an unweighted cell-to-face scalar field interpolation satisfying:

φ_f ≈ [𝛱_{c→s}𝝓_c]_f = (φ_{c1} + φ_{c2})/2,  (24)
where c1 and c2 are the cells adjacent to the face f (see Fig. 4, left). This is needed to construct the skew-symmetric convective operator according to Eq. (13) and:

𝖢_c(𝒖_s) ≡ 𝖬𝖴_s𝛱_{c→s},  (25)

where 𝖴_s ≡ diag(𝒖_s) ∈ R^{m×m} is a diagonal matrix that contains the face velocities, 𝒖_s ∈ R^m. Although the local truncation error is only first-order for non-uniform grids, numerical tests showed that its global truncation error is indeed second-order [32].

Fig. 4. Left: face normal and neighbour labelling criterion. Right: definition of the volumes, Ω_s, associated with the face-normal velocities, 𝒖_s. The thick dashed rectangle is the volume associated with the staggered velocity U_4 = [𝒖_s]_4, i.e., [Ω_s]_{4,4} = A_4δ_4, where A_4 is the face area and δ_4 = |𝒏_4 ⋅ c1c2⃗| is the projected distance between adjacent cell centres. Thin dash-dotted lines are placed to illustrate that the sum of volumes is exactly preserved, tr(Ω_s) = tr(Ω) = d tr(Ω_c) (d = 2 for 2D and d = 3 for 3D), regardless of the location of the cell nodes.

The cell-to-face gradient, 𝖦, follows from Eqs. (14) and (22), leading to:

[Ω_s𝖦𝒑_c]_f = (p_{c1} − p_{c2})A_f  ⟹  [𝖦𝒑_c]_f = (p_{c1} − p_{c2})/δ_f,  (26)

and subsequently using Eqs. (13) and (15) yields the discrete Laplacian and diffusive operators:

[𝖫𝝓_c]_k = Σ_{f∈F_f(k)} (φ_{c1} − φ_{c2})A_f/δ_f  and  𝖣_c ≡ ν𝖫,  (27)

where ν is the kinematic viscosity. Notice that this discretisation of the diffusive operator is valid for incompressible fluids with constant viscosity. For non-constant viscosity values, the discretisation method has to be modified accordingly [35]. Finally, the cell-to-face (momentum) interpolation is constructed as follows:

𝛤_{c→s} ≡ Ω_s^{-1}𝖭_s𝛱Ω,  where 𝛱 = I_3 ⊗ 𝛱_{c→s},  (28)

which corresponds to a volume-weighted interpolation. It must be noted that an unweighted interpolation, 𝛤_{c→s} = 𝖭_s𝛱, was proposed in the original paper [32]. However, as mentioned above, this can lead to stability issues.
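As an illustration of how these operators act (the paper assembles them as sparse matrices; here they are applied matrix-free for brevity), the following C++ sketch loops over the faces of an unstructured mesh and evaluates Eqs. (23) and (27). The Face structure and the orientation convention (face-normal velocities and fluxes taken positive from c1 towards c2) are assumptions made for the example only.

```cpp
#include <cstdio>
#include <vector>

// Face-based description of an unstructured mesh: each interior face stores
// its adjacent cells (c1, c2), its area A and the projected distance delta
// between the two cell centres (see Fig. 4).
struct Face { int c1, c2; double A, delta; };

// Divergence, Eq. (23): sum of face fluxes around each cell; u_s holds the
// face-normal velocities, oriented from c1 towards c2 (assumed convention).
std::vector<double> divergence(const std::vector<Face>& faces,
                               const std::vector<double>& u_s, int n_cells) {
    std::vector<double> div(n_cells, 0.0);
    for (std::size_t f = 0; f < faces.size(); ++f) {
        const double flux = u_s[f] * faces[f].A;
        div[faces[f].c1] += flux;          // outflow of c1 ...
        div[faces[f].c2] -= flux;          // ... is inflow of c2
    }
    return div;
}

// Laplacian, Eq. (27): compact face-based stencil with weights A_f/delta_f.
std::vector<double> laplacian(const std::vector<Face>& faces,
                              const std::vector<double>& phi, int n_cells) {
    std::vector<double> lap(n_cells, 0.0);
    for (const Face& f : faces) {
        const double flux = (phi[f.c2] - phi[f.c1]) * f.A / f.delta;
        lap[f.c1] += flux;
        lap[f.c2] -= flux;
    }
    return lap;
}

int main() {
    // Two unit cells sharing one face: L*phi = (phi2 - phi1, phi1 - phi2).
    std::vector<Face> faces{{0, 1, 1.0, 1.0}};
    std::vector<double> phi{1.0, 3.0};
    for (double v : laplacian(faces, phi, 2)) std::printf("%g\n", v);  // 2 -2
}
```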
4.4. Sparsity patterns

The set of eight matrices that are in practice needed to carry out a simulation is listed in Table 2. Namely,

{Ω_c, Ω_s, 𝖬, 𝛱_{c→s}, 𝖦, 𝖫, 𝛤_{s→c}, 𝛤_{c→s}}.  (29)

Table 2
Set of matrices needed to carry out a simulation, classified according to their input and output spaces and their sparsity patterns.

Input | Output | Diagonal | Non-diagonal                 | Sparsity pattern
cells | cells  | {Ω_c}    | {𝖫}                          | |𝖳_sc𝖳_cs|
cells | faces  | ×        | {𝖦, 𝛱_{c→s}, 𝛤_{c→s}}        | |𝖳_cs|
faces | cells  | ×        | {𝖬, 𝛤_{s→c}}                 | |𝖳_sc|
faces | faces  | {Ω_s}    | {∅}                          | –

The first four are already in the set of five basic discrete operators given in (21), whereas the other four can be written as a linear combination of basic ones (see Eqs. (14), (15), (17) and (28)). They can be classified according to their input and output spaces and their sparsity patterns. Apart from the two diagonal matrices Ω_c and Ω_s, there are only three types of pattern. Namely, the pattern corresponding to (i) the cell-to-face incidence matrix, 𝖳_cs ∈ R^{m×n}, which has two non-zero elements per row (a +1 and a −1 corresponding to the cells adjacent to a face), (ii) the face-to-cell incidence matrix, 𝖳_sc = 𝖳_cs^T ∈ R^{n×m}, and (iii) the graph Laplacian matrix, 𝖳_sc𝖳_cs ∈ R^{n×n}. For instance, for the mesh with 4 control volumes and 8 faces shown in Fig. 4 (right), the face-to-cell incidence matrix reads:

𝖳_sc = 𝖳_cs^T = ⎛ 0  0 −1 +1  0  0 +1  0⎞
                ⎜+1  0  0 −1  0 −1  0  0⎟.  (30)
                ⎜−1 +1  0  0  0  0  0 +1⎟
                ⎝ 0 −1 +1  0 +1  0  0  0⎠
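The three patterns can be generated directly from a face list, as the following toy C++ example shows (the face numbering is invented for the example and is not that of Fig. 4): each row of 𝖳_cs gets a +1 and a −1 for the two adjacent cells, 𝖳_sc is its transpose, and the Boolean pattern of 𝖳_sc𝖳_cs is the one shared by the Laplacian-like operators of Table 2.

```cpp
#include <cstdio>
#include <vector>

int main() {
    // A small face list (NOT the face numbering of Fig. 4): four cells in a
    // 2x2 arrangement with four interior faces, each face oriented c1 -> c2.
    struct Face { int c1, c2; };
    std::vector<Face> faces{{0, 1}, {2, 3}, {0, 2}, {1, 3}};
    const int n = 4, m = static_cast<int>(faces.size());

    // Cell-to-face incidence T_cs (m x n): one +1 and one -1 per row.
    std::vector<std::vector<int>> Tcs(m, std::vector<int>(n, 0));
    for (int f = 0; f < m; ++f) { Tcs[f][faces[f].c1] = +1; Tcs[f][faces[f].c2] = -1; }

    // Pattern of the graph Laplacian |T_sc T_cs| with T_sc = T_cs^T.
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            int a = 0;
            for (int f = 0; f < m; ++f) a += Tcs[f][i] * Tcs[f][j];
            std::printf("%c ", a != 0 ? 'x' : '.');
        }
        std::printf("\n");
    }
}
```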
base mesh, needs to be discretised. Then, to build 𝖫, 𝖦 and 𝖬, we have
followed the discretisation of Section 4 but taking advantage of Eq. (6).
The set of eight matrices that are in practice needed to carry out a
That is, instead of building the entire operators, we have only built the
simulation are listed in Table 2. Namely,
base mesh’s couplings with itself and with its 2𝑠 mirrored counterparts.
{𝑐 , 𝑠 , 𝖬, 𝛱𝑐→𝑠 , 𝖦, 𝖫, 𝛤𝑠→𝑐 , 𝛤𝑐→𝑠 }. (29) The first immediate benefit of this approach is, apart from considerably
and, again, the resulting speed-ups are ordered according to their arithmetic intensity. In particular, up to 3.3×, 2.8× and 2.2× accelerations are attained in the products by 𝖫, 𝖣 and 𝖦, respectively.

Fig. 10. Roofline model for the SymSpMV. Dashed lines correspond to fixed problem size (and halved base mesh), and solid lines to fixed base mesh size (and doubled problem size).

In order to evaluate the efficiency of our SymSpMV, we have made a roofline performance analysis [36]. It consists of a scatter plot that shows whether a kernel is memory- or compute-bound and compares its implementation's performance with the theoretical peak offered by the hardware. It is displayed in Fig. 10 both for the CPU and GPU implementations of the SymSpMV and for all the matrices and grids considered. As discussed in Section 3, the more symmetries are exploited, the higher the arithmetic intensity and, therefore, the closer the kernel is to being compute-bound. Although SymSpMV was memory-bound for all the cases considered, the relatively low double-precision peak performance of the NVIDIA RTX A5000 made it approach the compute-bound region when exploiting three symmetries. Hence, under such circumstances, using higher-order schemes would certainly make SymSpMV enter that region.
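For reference, the roofline bound used in Fig. 10 can be evaluated with a one-line helper: a kernel of arithmetic intensity I cannot exceed min(peak, I·BW) [36], and it is memory-bound whenever I lies below the machine balance peak/BW. In the sketch below, only the memory bandwidths (128 GB/s and 768 GB/s) are taken from Section 5; the double-precision peaks and the intensity value are illustrative placeholders, not the paper's figures. For SymSpMV, I follows from roughly 2 flops per non-zero divided by the byte counts in the last row of Table 1.

```cpp
#include <algorithm>
#include <cstdio>

// Roofline model [36]: attainable performance of a kernel with arithmetic
// intensity I (flop/byte) on a machine with a given peak and bandwidth.
double attainable_gflops(double intensity, double peak_gflops, double bw_gbs) {
    return std::min(peak_gflops, intensity * bw_gbs);
}

int main() {
    const double I = 0.25;                            // flop/byte, illustrative
    const double cpu_peak = 3000.0, cpu_bw = 128.0;   // peak: placeholder value
    const double gpu_peak = 900.0,  gpu_bw = 768.0;   // peak: placeholder value
    std::printf("CPU node : %.0f GF/s attainable (balance %.2f flop/byte)\n",
                attainable_gflops(I, cpu_peak, cpu_bw), cpu_peak / cpu_bw);
    std::printf("RTX A5000: %.0f GF/s attainable (balance %.2f flop/byte)\n",
                attainable_gflops(I, gpu_peak, gpu_bw), gpu_peak / gpu_bw);
}
```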
Regarding the performance of our implementation, despite attaining up to 5.0× speed-ups, SymSpMV is far from reaching its theoretical peak, which leaves room for further improvements. In any case, the numerical results confirmed the advantages of exploiting spatial reflection symmetries for making the simulations lighter and faster, particularly benefiting GPU devices, given their limited memory capacity.

6. Conclusions

Due to the FLOP-oriented design of modern supercomputers, only applications with very high arithmetic intensities can approach HPC systems' theoretical peak performances. This is certainly not the case for most computational physics applications, which generally rely on sparse linear algebra (or equivalent) kernels. This work presented a strategy to mitigate this limitation by exploiting spatial reflection symmetries for accelerating virtually all matrix multiplications. In particular, we presented a spatial discretisation and an unknowns' ordering making virtually all the discrete operators exhibit a regular block structure. Then, we showed how to use such a structure to replace the standard SpMV with the more compute-intensive SymSpMV, a specialisation of the SpMM allowing for significant memory savings.

The strategy presented applies identically regardless of the geometric complexity of the problem. Furthermore, although focusing on CFD problems, it is naturally extensible to other applications. However, we target the DNS and LES of incompressible turbulent flows and, therefore, included numerical experiments based on structured and unstructured discretisations of meaningful configurations, namely a cubic box and a finned tube heat exchanger.

The results obtained demonstrated the advantages of exploiting symmetries. Indeed, by replacing the standard SpMV with our SymSpMV, CPU and GPU executions reached up to 5.0× and 3.3× speed-ups, respectively. On top of that, thanks to exploiting the operators' regular block structure, we could reduce their memory footprint by up to 8.0×. Finally, the roofline performance analysis of our implementation of the SymSpMV revealed that, thanks to exploiting symmetries for increasing the arithmetic intensity, we successfully brought the operators' multiplication closer to the compute-bound region.

It is clear that exploiting symmetries is a very effective strategy to make high-fidelity simulations significantly more affordable. In particular, our method is especially well-suited for GPUs, given their limited memory capacity. In this sense, future lines of work include
extending the current approach throughout an incompressible CFD simulation and analysing the resulting speed-ups, as well as combining it with similar strategies for accelerating the convergence of the Poisson solvers. Additionally, despite the excellent speed-ups obtained, we aim to enhance our implementation of the SymSpMV to make it approach its theoretical peak performance. We also plan to extend these analyses to other architectures and numerical schemes.

CRediT authorship contribution statement

Àdel Alsalti-Baldellou: Conceptualization, Methodology, Software. Xavier Álvarez-Farré: Methodology, Software. Guillem Colomer: Software. Andrey Gorobets: Methodology, Software. Carlos David Pérez-Segarra: Funding acquisition. Assensi Oliva: Funding acquisition, Resources. F. Xavier Trias: Conceptualization, Methodology, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgements

A.A.B., X.A.F., G.C., C.D.P.S., A.O. and F.X.T. have been financially supported by two competitive R+D projects: RETOtwin (PDC2021-120970-I00), given by MCIN/AEI/10.13039/501100011033 and European Union Next GenerationEU/PRTR, and FusionCAT (001-P-001722), given by Generalitat de Catalunya RIS3CAT-FEDER. A.A.B. has also been supported by the predoctoral grants DIN2018-010061 and 2019-DI-90, given by MCIN/AEI/10.13039/501100011033 and the Catalan Agency for Management of University and Research Grants (AGAUR), respectively. The numerical experiments have been conducted on the MareNostrum4 supercomputer at the Barcelona Supercomputing Center under the project IM-2022-3-0026. The authors thankfully acknowledge these institutions.

References

[1] Witherden FD, Vermeire BC, Vincent PE. Heterogeneous computing on mixed unstructured grids with PyFR. Comput & Fluids 2015;120:173–86.
[2] Borrell R, Dosimont D, Garcia-Gasulla M, Houzeaux G, Lehmkuhl O, Mehta V, et al. Heterogeneous CPU/GPU co-execution of CFD simulations on the POWER9 architecture: Application to airplane aerodynamics. Future Gener Comput Syst 2020;107:31–48.
[3] Gorobets A, Bakhvalov P. Heterogeneous CPU+GPU parallelization for high-accuracy scale-resolving simulations of compressible turbulent flows on hybrid supercomputers. Comput Phys Comm 2022;271:108231.
[4] Baboulin M, Buttari A, Dongarra J, Kurzak J, Langou J, Langou J, et al. Accelerating scientific computations with mixed precision algorithms. Comput Phys Comm 2009;180:2526–33.
[5] Krasnopolsky BI. An approach for accelerating incompressible turbulent flow simulations based on simultaneous modelling of multiple ensembles. Comput Phys Comm 2018;229:8–19.
[6] Gorobets A, Trias FX, Oliva A. A parallel MPI+OpenMP+OpenCL algorithm for hybrid supercomputations of incompressible flows. Comput & Fluids 2013;88:764–72.
[7] Edwards HC, Trott CR, Sunderland D. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. J Parallel Distrib Comput 2014;74(12):3202–16.
[8] Shimokawabe T, Aoki T, Onodera N. High-productivity framework for large-scale GPU/CPU stencil applications. Procedia Comput Sci 2016;80:1646–57.
[9] Álvarez X, Gorobets A, Trias FX, Borrell R, Oyarzun G. HPC2 – a fully portable algebra-dominant framework for heterogeneous computing. Application to CFD. Comput & Fluids 2018;173:285–92.
[10] Álvarez-Farré X, Gorobets A, Trias FX. A hierarchical parallel implementation for heterogeneous computing. Application to algebra-based CFD simulations on hybrid supercomputers. Comput & Fluids 2021;214:104768.
[11] cuSPARSE: The API reference guide for cuSPARSE, the CUDA sparse matrix library. Tech. Rep. March, NVIDIA Corporation; 2020.
[12] Greathouse JL, Knox K, Poła J, Varaganti K, Daga M. clSPARSE: A vendor-optimized open-source sparse BLAS library. In: IWOCL '16: Proceedings of the 4th international workshop on OpenCL. New York, NY, USA: ACM; 2016.
[13] Anzt H, Tomov S, Dongarra J. On the performance and energy efficiency of sparse linear algebra on GPUs. Int J High Perform Comput Appl 2017;31:375–90.
[14] Alsalti-Baldellou À, Álvarez-Farré X, Trias FX, Oliva A. Exploiting spatial symmetries for solving Poisson's equation. J Comput Phys 2023;486:112133.
[15] Alsalti-Baldellou À, Janna C, Álvarez-Farré X, Trias FX. Exploiting symmetries for preconditioning Poisson's equation in CFD simulations. In: Proceedings of the Platform for Advanced Scientific Computing Conference. New York, NY, USA: ACM; 2023, p. 1–9.
[16] Gorobets A, Trias FX, Borrell R, Lehmkuhl O, Oliva A. Hybrid MPI+OpenMP parallelization of an FFT-based 3D Poisson solver with one periodic direction. Comput & Fluids 2011;49:101–9.
[17] Shishkina O, Shishkin A, Wagner C. Simulation of turbulent thermal convection in complicated domains. J Comput Appl Math 2009;226:336–44.
[18] Löhner R, Othmer C, Mrosek M, Figueroa A, Degro A. Overnight industrial LES for external aerodynamics. Comput & Fluids 2021;214.
[19] Capuano F, Palumbo A, de Luca L. Comparative study of spectral-element and finite-volume solvers for direct numerical simulation of synthetic jets. Comput & Fluids 2019;179:228–37.
[20] Esclapez L, Ma PC, Mayhew E, Xu R, Stouffer S, Lee T, et al. Fuel effects on lean blow-out in a realistic gas turbine combustor. Combust Flame 2017;181:82–99.
[21] Alsalti-Baldellou A, Álvarez-Farré X, Gorobets A, Trias FX. Efficient strategies for solving the variable Poisson equation with large contrasts in the coefficients. In: 8th European Congress on Computational Methods in Applied Sciences and Engineering. 273, CIMNE; 2022, p. 416–34.
[22] Yuan C, Ng E, Norford LK. Improving air quality in high-density cities by understanding the relationship between air pollutant dispersion and urban morphologies. Build Environ 2014;71:245–58.
[23] Hang J, Li Y, Sandberg M, Buccolieri R, Di Sabatino S. The influence of building height variability on pollutant dispersion and pedestrian ventilation in idealized high-rise urban areas. Build Environ 2012;56:346–60.
[24] Kooij GL, Botchev MA, Frederix EM, Geurts BJ, Horn S, Lohse D, et al. Comparison of computational codes for direct numerical simulations of turbulent Rayleigh–Bénard convection. Comput & Fluids 2018;166:1–8.
[25] Fang J, Shaver DR, Tomboulides A, Min M, Fischer P, Lan Y-H, et al. Feasibility of full-core pin resolved CFD simulations of small modular reactor with momentum sources. Nucl Eng Des 2021;378:111143.
[26] Liu B, He S, Moulinec C, Uribe J. Sub-channel CFD for nuclear fuel bundles. Nucl Eng Des 2019;355:110318.
[27] Paniagua L, Lehmkuhl O, Oliet C, Pérez-Segarra C-D. Large eddy simulations (LES) on the flow and heat transfer in a wall-bounded pin matrix. Numer Heat Transfer B 2014;65:103–28.
[28] Filippone S, Cardellini V, Barbieri D, Fanfarillo A. Sparse matrix–vector multiplication on GPGPUs. ACM Trans Math Software 2017;43(4).
[29] Liu X, Smelyanskiy M, Chow E, Dubey P. Efficient sparse matrix–vector multiplication on x86-based many-core processors. In: Proceedings of the 27th international ACM conference on international conference on supercomputing. ICS '13, New York, NY, USA: Association for Computing Machinery; 2013, p. 273–82.
[30] Krasnopolsky BI. An approach for accelerating incompressible turbulent flow simulations based on simultaneous modelling of multiple ensembles. Comput Phys Comm 2018;229:8–19.
[31] Imamura S, Ono K, Yokokawa M. Iterative-method performance evaluation for multiple vectors associated with a large-scale sparse matrix. Int J Comput Fluid Dyn 2016;30(6):395–401.
[32] Trias FX, Lehmkuhl O, Oliva A, Pérez-Segarra C, Verstappen R. Symmetry-preserving discretization of Navier–Stokes equations on collocated unstructured meshes. J Comput Phys 2014;258:246–67.
[33] Chorin AJ. Numerical solution of the Navier–Stokes equations. Math Comp 1968;22:745–62.
[34] Trias FX, Lehmkuhl O. A self-adaptive strategy for the time-integration of Navier–Stokes equations. Numer Heat Transfer B 2011;60(2):116–34.
[35] Trias FX, Gorobets A, Oliva A. A simple approach to discretize the viscous term with spatially varying (eddy-)viscosity. J Comput Phys 2013;253:405–17.
[36] Williams S, Waterman A, Patterson D. Roofline: An insightful visual performance model for multicore architectures. Commun ACM 2009;52(4):65–76.