
Computers and Fluids 275 (2024) 106247


Lighter and faster simulations on domains with symmetries


Àdel Alsalti-Baldellou a,b , Xavier Álvarez-Farré a,c , Guillem Colomer a , Andrey Gorobets d ,
Carlos David Pérez-Segarra a , Assensi Oliva a , F. Xavier Trias a ,∗
a Heat and Mass Transfer Technological Center, Technical University of Catalonia, Carrer de Colom 11, 08222, Terrassa (Barcelona), Spain
b Termo Fluids SL, Carrer de Magí Colet 8, 08204, Sabadell (Barcelona), Spain
c High-Performance Computing and Visualization Team, SURF, Science Park 140, 1098 XG, Amsterdam, The Netherlands
d Keldysh Institute of Applied Mathematics, Russian Academy of Sciences, Miusskaya Sq. 4, 125047, Moscow, Russia

ARTICLE INFO

Keywords: Reflection symmetries; Arithmetic intensity; Memory footprint; SpMV; SpMM; MPI+OpenMP+OpenCL/CUDA

ABSTRACT

A strategy to improve the performance and reduce the memory footprint of simulations on meshes with spatial reflection symmetries is presented in this work. By using an appropriate mirrored ordering of the unknowns, discrete partial differential operators are represented by matrices with a regular block structure that allows replacing the standard sparse matrix–vector product with a specialised version of the sparse matrix–matrix product, which has a significantly higher arithmetic intensity. Consequently, matrix multiplications are accelerated, whereas their memory footprint is reduced, making massive simulations more affordable. As an example of practical application, we consider the numerical simulation of turbulent incompressible flows using a low-dissipation discretisation on unstructured collocated grids. All the required matrices are classified into three sparsity patterns that correspond to the discrete Laplacian, gradient, and divergence operators. Therefore, the above-mentioned benefits of exploiting spatial reflection symmetries are tested for these three matrices on both CPU and GPU, showing up to 5.0x speed-ups and 8.0x memory savings. Finally, a roofline performance analysis of the symmetry-aware sparse matrix–vector product is presented.

∗ Corresponding author.
E-mail address: [email protected] (F.X. Trias).

https://doi.org/10.1016/j.compfluid.2024.106247
Received 31 December 2022; Received in revised form 22 December 2023; Accepted 13 March 2024
Available online 20 March 2024
0045-7930/© 2024 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

The design of digital processors constantly evolves to overcome limitations and bottlenecks. The formerly compute-bound nature of processors led to compute-centric programming languages and simulation codes. However, raw computing power grows at a (much) faster pace than the speed of memory access, turning the problem around. Increasingly complex memory hierarchies are found nowadays in computing systems, and optimising traditional applications for these systems is cumbersome. Moreover, new parallel programming languages and frameworks emerged to target modern hardware (e.g., OpenMP, CUDA, OpenCL, HIP), and porting algorithms and applications has become restrictive. This scenario poses multiple challenges to the development of efficient and scalable scientific simulation codes.

Firstly, only applications with very high arithmetic intensity can approach the theoretical peak performance of modern high-performance computing (HPC) systems. However, this is not the case in most problems in computational physics since they usually rely on sparse linear algebra (or equivalent) operations. Examples can be found in computational fluid dynamics (CFD), linear elasticity, structural mechanics or electromagnetic modelling, among others. Strategies helping to mitigate this include adopting more compute-intensive methods, which may have higher arithmetic intensity because of the numerical schemes used [1,2] or the presence of Riemann solvers [3], using mixed precision algorithms [4], merging simulations of multiple flow ensembles [5], exploiting uniform mesh directions with periodic conditions [6], or pursuing more efficient implementations.

Secondly, the above-mentioned heterogeneity in HPC systems makes cross-platform portability crucial. In this regard, our strategy is breaking the interdependency between algorithms and their software implementation by casting calculations into a minimalist set of universal kernels. There is an increasing interest towards the development of more abstract implementations. For instance, the PyFR framework [1] is mostly based on matrix multiplications and point-wise operations. Another example is the Kokkos programming model [7], which includes computation abstractions for frequently used parallel computing patterns and data structures. Namely, implementing an algorithm in terms of Kokkos entities allows mapping the algorithm onto multiple architectures. Some authors propose domain-specific tools to address this, generalising the stencil computations for specific fields. For instance, a framework that automatically translates stencil functions written in C++ to both central processing unit (CPU) and graphics processing unit
(GPU) codes is proposed in [8]. In this regard, in our previous works [9,10] we showed that all the operations involved in a typical CFD algorithm for large-eddy simulation (LES) or direct numerical simulation (DNS) of incompressible turbulent flows can be simplified to three basic linear algebra subroutines: a sparse matrix–vector product (SpMV), a linear combination of vectors and a dot product. From now on, we will refer to implementation models heavily based on algebraic subroutines as algebraic or algebra-based. In such an implementation approach, the kernel code shrinks to hundreds or even dozens of lines; the portability becomes natural, and maintaining multiple implementations takes little effort. Besides, standard libraries optimised for particular architectures (e.g., cuSPARSE [11], clSPARSE [12]) can be linked in addition to specialised in-house implementations. Nevertheless, the algebraic approach imposes restrictions and challenges that must be addressed, such as the inherent low arithmetic intensity of the SpMV, which makes the simulation algorithm pronouncedly memory-bound.

In this context, the present work proposes a strategy to exploit spatial reflection symmetries for accelerating virtually all matrix multiplications, which generally are the most computationally expensive kernels involved in the simulations [10]. This is done by replacing the standard SpMV with a specialised version of the sparse matrix–matrix product (SpMM), a considerably more compute-intensive kernel thanks to the lower memory traffic it entails [13]. Besides increasing the arithmetic intensity of the simulations, exploiting s symmetries allows reducing both the setup costs and memory footprint of the discrete operators by a factor of 2^s, thus making high-fidelity simulations significantly more affordable. Although out of the scope of this work, symmetries can be further exploited to accelerate the solution of Poisson's equation [14–17]. Remarkably enough, although focusing on CFD applications, the approach presented is naturally extensible to other physical problems. However, we target the DNS and LES of incompressible turbulent flows, which usually exhibit spatial reflection symmetries. Indeed, vehicles generally present one symmetry [18]. Examples with two symmetries include jets [19], flames [20], multiphase flows [21], and building and urban simulations [22,23]. Finally, domains with three (or more) symmetries range from canonical flows [24] to industrial devices such as nuclear reactors [25,26] or heat exchangers [27].

The remaining parts of this paper are organised as follows. Section 2 defines the adequate discretisation of domains with symmetries and derives the resulting structure exhibited by the discrete operators. Section 3 details the replacement of SpMV with the specialised version of SpMM. Section 4 applies the previous results to the solution of the Navier–Stokes equations, Section 5 presents the numerical experiments and, finally, Section 6 overviews the strategy and discusses future lines of work.

2. Exploiting symmetries

This section aims to show how to exploit spatial reflection symmetries to increase the arithmetic intensity of the matrix multiplications. Although applying identically to arbitrarily complex geometries, for clarity, let us first present our strategy in its simplest form, i.e., on a one-dimensional mesh with a single reflection symmetry.

Fig. 1. Single-symmetry 1D mesh with mirrored ordering.

Hence, given the one-dimensional mesh of Fig. 1, let us order its grid points by first indexing the ones lying on one half and then those in the other. Then, if we impose on the resulting subdomains the same local ordering (mirrored by the symmetry's hyperplane), we ensure that all the scalar fields satisfy the following:

𝒙 = [ 𝒙_1 ; 𝒙_2 ] ∈ R^n,   (1)

where n stands for the mesh size and 𝒙_1, 𝒙_2 ∈ R^{n/2} for 𝒙's restriction to each subdomain. Remarkably enough, mirrored grid points are in the same position within the subvectors, and discrete versions of all partial differential operators satisfy the following block structure:

𝖠 = [ 𝖠_{1,1}  𝖠_{1,2} ; 𝖠_{2,1}  𝖠_{2,2} ] ∈ R^{n×n},   (2)

where 𝖠_{i,j} ∈ R^{n/2×n/2} accounts for the couplings between the ith and jth subdomains. As long as 𝖠 only depends on geometric quantities (which is typically the case), given that both subdomains are identical and thanks to the mirrored ordering, we have that:

𝖠_{1,1} = 𝖠_{2,2}  and  𝖠_{1,2} = 𝖠_{2,1},   (3)

and, by denoting 𝖠_i ≡ 𝖠_{1,i}, we can rewrite Eq. (2) as:

𝖠 = [ 𝖠_1  𝖠_2 ; 𝖠_2  𝖠_1 ].   (4)

The procedure above can be applied recursively to exploit an arbitrary number of symmetries, s. For instance, taking advantage of s = 2 symmetries results in 4 mirrored subdomains on which, analogously to Eq. (4), virtually all discrete operators satisfy the following:

𝖠 = [ 𝖠_1  𝖠_2  𝖠_3  𝖠_4 ; 𝖠_2  𝖠_1  𝖠_4  𝖠_3 ; 𝖠_3  𝖠_4  𝖠_1  𝖠_2 ; 𝖠_4  𝖠_3  𝖠_2  𝖠_1 ],   (5)

where 𝖠_i ∈ R^{n/4×n/4} corresponds to the couplings between the first and ith subdomains.

Thanks to the discretisation presented above, exploiting s symmetries allows meshing a 1/2^s fraction of the entire domain, henceforth named base mesh (see Fig. 6). Then, instead of building the full operators, 𝖠 ∈ R^{n×n}, it is only needed to build the base mesh's couplings with itself, 𝖠_1 ∈ R^{n/2^s×n/2^s}, and with its 2^s − 1 mirrored counterparts, 𝖠_2, …, 𝖠_{2^s} ∈ R^{n/2^s×n/2^s}. As a result, both the setup and memory footprint of the matrices are considerably reduced.

Furthermore, while the sparsity pattern of 𝖠_1 matches that of the actual operator built upon the base mesh, the outer-subdomain couplings, 𝖠_2, …, 𝖠_{2^s}, have very few non-zero entries (if any), making the following splitting very advantageous:

𝖠 = I_{2^s} ⊗ 𝖠_inn + 𝖠_out,   (6)

where 𝖠_inn := 𝖠_1 ∈ R^{n/2^s×n/2^s} and 𝖠_out := 𝖠 − I_{2^s} ⊗ 𝖠_inn ∈ R^{n×n}. Indeed, as will be shown in Section 3, Eq. (6) allows accelerating the matrix multiplications by replacing the standard SpMV with the more compute-intensive SpMM.

Finally, let us note that the splitting of Eq. (6) only requires the geometric domain to be symmetric and is perfectly compatible with asymmetric boundary conditions. Certainly, even if they are introduced within the discrete operators, it is enough to assign their corresponding matrix entries to 𝖠_out.
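To make the block structure above concrete, the following self-contained C++ sketch assembles a small 1D Laplacian with the mirrored ordering of Fig. 1 and checks both the block equalities of Eq. (3) and the splitting of Eq. (6). It is an illustration written for this discussion (dense storage, invented names such as buildMirroredLaplacian), not part of the authors' code.

    // Illustrative only: 1D Laplacian on a mesh with one reflection symmetry,
    // assembled with the mirrored ordering of Section 2 (dense storage for clarity).
    #include <cassert>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    using Mat = std::vector<std::vector<double>>;

    // Unknowns 0..n/2-1 are the left half, ordered from the outer boundary towards
    // the symmetry plane; unknowns n/2..n-1 are their mirror images in the same
    // local order, so mirrored cells share the same local index.
    Mat buildMirroredLaplacian(int n) {
        Mat A(n, std::vector<double>(n, 0.0));
        const int h = n / 2;
        auto couple = [&](int i, int j) {           // symmetric negative semi-definite
            A[i][j] += 1.0; A[j][i] += 1.0;         // off-diagonal coupling
            A[i][i] -= 1.0; A[j][j] -= 1.0;         // diagonal compensation
        };
        for (int half = 0; half < 2; ++half)        // couplings inside each half
            for (int k = 0; k + 1 < h; ++k)
                couple(half * h + k, half * h + k + 1);
        couple(h - 1, n - 1);                       // the single face crossing the plane
        return A;
    }

    int main() {
        const int n = 8, h = n / 2;
        Mat A = buildMirroredLaplacian(n);

        // Block structure of Eqs. (2)-(4): A11 == A22 and A12 == A21.
        for (int i = 0; i < h; ++i)
            for (int j = 0; j < h; ++j) {
                assert(A[i][j] == A[h + i][h + j]);
                assert(A[i][h + j] == A[h + i][j]);
            }

        // Splitting of Eq. (6): A = I_2 (x) A_inn + A_out, with A_inn := A11 and
        // A_out keeping only the (very sparse) cross-plane couplings.
        std::vector<double> x(n), y_ref(n, 0.0), y_sym(n, 0.0);
        for (int i = 0; i < n; ++i) x[i] = std::sin(1.0 + i);

        for (int i = 0; i < n; ++i)                 // reference: full product
            for (int j = 0; j < n; ++j) y_ref[i] += A[i][j] * x[j];

        for (int half = 0; half < 2; ++half)        // I_2 (x) A_inn part
            for (int i = 0; i < h; ++i)
                for (int j = 0; j < h; ++j)
                    y_sym[half * h + i] += A[i][j] * x[half * h + j];
        y_sym[h - 1] += A[h - 1][n - 1] * x[n - 1]; // A_out part: two entries only
        y_sym[n - 1] += A[n - 1][h - 1] * x[h - 1];

        for (int i = 0; i < n; ++i) assert(std::fabs(y_ref[i] - y_sym[i]) < 1e-12);
        std::printf("block structure and splitting verified for n=%d\n", n);
        return 0;
    }

The same check carries over unchanged to s > 1 symmetries: the loop over the two halves simply becomes a loop over the 2^s mirrored instances.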

3. Faster and lighter matrix multiplications

3.1. Optimising sparse matrix–vector multiplication on symmetric domains

SpMV is the most computationally expensive routine in many large-scale simulations relying on iterative methods. Namely, it is a strongly memory-bound kernel with a low arithmetic intensity (I), which is the ratio of computing work in flop to memory traffic in bytes, and requires indirect memory accessing to the input vector, harming the memory access efficiency. To top it off, in distributed-memory parallel processing, vector elements and matrix rows are distributed among a group of processes, inducing data exchanges between them. Therefore, the efficient execution of SpMV requires a fine-tuning process (e.g., right choice of the sparse matrix storage format, proper workload balancing, unknowns reordering to reduce the matrix bandwidth or memory access optimisation to minimise the cache misses).

Significant effort is devoted to studying and optimising SpMV for different applications and state-of-the-art computing environments. The introduction of GPUs into HPC systems motivated the research of new sparse matrix storage formats and SpMV implementations, as reviewed by Filippone et al. in [28]. The continuous evolution of CPUs also motivates the research for efficient SpMV kernels on such architectures [29]. However, all these efforts are still limited by the low arithmetic intensity of SpMV.
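For reference, the baseline kernel discussed in this subsection is the standard CSR SpMV, sketched below in plain C++. The struct and field names (ptr, idx, val) simply mirror the notation used later in Algorithms 1 and 2; they are illustrative assumptions, not the paper's actual data structures.

    // Minimal sketch of the baseline CSR SpMV, y = A*x, in plain C++.
    #include <vector>

    struct CSRMatrix {
        int m = 0;                    // number of rows
        std::vector<int> ptr;         // row pointers, size m+1
        std::vector<int> idx;         // column indices, size nnz
        std::vector<double> val;      // coefficients, size nnz
    };

    // One multiply-add per non-zero (2*nnz flops) against roughly 12*nnz + O(m+n)
    // bytes of traffic, hence the low arithmetic intensity discussed in the text.
    void spmv(const CSRMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
        for (int i = 0; i < A.m; ++i) {
            double s = 0.0;
            for (int j = A.ptr[i]; j < A.ptr[i + 1]; ++j)
                s += A.val[j] * x[A.idx[j]];   // indirect access to x via idx
            y[i] = s;
        }
    }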
In some cases, a sparse matrix is to be multiplied by a set of vectors. For instance:

(αI_3 ⊗ 𝖠)(𝒙_1, 𝒙_2, 𝒙_3)^T = [ α𝖠  0  0 ; 0  α𝖠  0 ; 0  0  α𝖠 ] [ 𝒙_1 ; 𝒙_2 ; 𝒙_3 ].   (7)

Such a formulation applies to several scenarios in numerical algorithm implementations that are increasingly common. Examples are spatial reflection symmetries, parallel-in-time methods [30], multiple transport equations or multiple parameter simulations [31], among others.

In this context, the SpMM kernel is described in Algorithm 1. It represents the product of a sparse matrix by a set of dense vectors. It results in significantly greater data reuse: in line 5, the coefficients of the sparse matrix are reused as many times as there are blocks in the input vector. Needless to say, SpMM is applicable to matrices factored as the Kronecker product of a diagonal matrix times another, as in Eq. (7).

It is important to note that the way sets of vectors are stored has a significant impact on the implementation. There are three primary layouts: structure of arrays (strided), array of structures (interleaved), and array of structures of arrays (tiled), each with its own unique properties. This work only considers the interleaved approach, and evaluating the various possible implementations is beyond its scope.

Algorithm 1 SpMM implementation using the standard CSR matrix format and an interleaved (array of structures) ordering.
Input: 𝖠, 𝒙, 𝒄
Output: 𝒚
1: for i ← 1 to m do
2:   𝒔 ← zeros(d)
3:   for j ← 𝖠.ptr[i] to 𝖠.ptr[i + 1] do
4:     for k ← 1 to d do
5:       𝒔[k] ← 𝒔[k] + 𝖠.val[j] ⋅ 𝒙[d ⋅ 𝖠.idx[j] + k]
6:     end for
7:   end for
8:   for k ← 1 to d do
9:     𝒚[d ⋅ i + k] ← 𝒄[k] ⋅ 𝒔[k]
10:  end for
11: end for

According to Algorithm 1, using double-precision coefficients, compressed sparse row (CSR) format, and assuming ideal temporal locality when accessing the matrix and input vector coefficients, the arithmetic intensity of SpMM on an arbitrary matrix 𝖠 reads:

I_SpMM(d) = (2 nnz(𝖠) + m) ⋅ d / [ 8 nnz(𝖠) + 4 nnz(𝖠) + 4(m + 1) + (8m + 8n + 8) ⋅ d ],   (8)

where nnz(𝖠), m and n are the number of non-zero elements, rows and columns in the matrix, respectively, and d is the number of vectors. In the denominator, the memory traffic required to perform the operation is the sum of the size of the matrix coefficients, which amounts to 8 bytes ⋅ nnz(𝖠); the size of the column indices, 4 bytes ⋅ nnz(𝖠); the size of the row pointer values, 4 bytes ⋅ (m + 1); the size of the input and output vector coefficients, d ⋅ 8 bytes ⋅ (m + n); and the auxiliary array, 𝒔, which is d ⋅ 8 bytes. In the numerator, the number of operations required is calculated as follows: in line 5 of the algorithm, for each matrix coefficient, there is one multiplication and one addition operation performed per vector, 2 nnz(𝖠) ⋅ d; in line 9, for each output element, there is one multiplication operation, m ⋅ d.

The arithmetic intensity of the standard SpMV is I_SpMV = I_SpMM(1). Consequently, the maximum speed-up achievable by replacing d recursive SpMV calls with a single SpMM equals I_SpMM(d)/I_SpMM(1). This upper bound is plotted in Fig. 2. It is noteworthy that, since the upper bound is proportional to the average number of non-zeros per row, nnz(𝖠)/m, the use of high-order schemes may strengthen the benefits of the SpMM. A lower bound is also given considering zero temporal locality when accessing the input vector coefficients. This is calculated by accounting for one access to 𝒙 for each non-zero coefficient, that is, by replacing 8n ⋅ d with 8 nnz(𝖠) ⋅ d in Eq. (8). The lower bound is also plotted in Fig. 2. In this case, it does not vary with the number of non-zeros per row.

Fig. 2. Theoretical speed-up of SpMM vs SpMV with respect to the number of vectors.

The splitting described in Section 2 and summarised in Eq. (6) can be directly cast into one SpMM plus one SpMV. However, calling two distinct kernels leads to storing intermediate results, which is a rather undesired behaviour. Therefore, a straightforward kernel fusion results in a more efficient implementation, to which we refer as SymSpMV.

Algorithm 2 SymSpMV implementation using the standard CSR matrix format and an interleaved (array of structures) ordering.
Input: 𝖠_inn, 𝖠_out, 𝒙, 𝒄
Output: 𝒚
1: for i ← 1 to m do
2:   𝒔 ← zeros(d)
3:   for j ← 𝖠_inn.ptr[i] to 𝖠_inn.ptr[i + 1] do
4:     for k ← 1 to d do
5:       𝒔[k] ← 𝒔[k] + 𝖠_inn.val[j] ⋅ 𝒙[d ⋅ 𝖠_inn.idx[j] + k]
6:     end for
7:   end for
8:   for k ← 1 to d do
9:     for j ← 𝖠_out.ptr[d ⋅ i + k] to 𝖠_out.ptr[d ⋅ i + k + 1] do
10:      𝒔[k] ← 𝒔[k] + 𝖠_out.val[j] ⋅ 𝒙[𝖠_out.idx[j]]
11:    end for
12:    𝒚[d ⋅ i + k] ← 𝒄[k] ⋅ 𝒔[k]
13:  end for
14: end for

Algorithm 2 describes the implementation of such a kernel. Recall that 𝖠_inn := 𝖠_1 ∈ R^{n/2^s×n/2^s} and 𝖠_out := 𝖠 − I_{2^s} ⊗ 𝖠_inn ∈ R^{n×n}. The resulting expression reads:

𝖠𝒙 = (I_{2^s} ⊗ 𝖠_inn + 𝖠_out)𝒙 = SpMM(𝖠_inn, 𝒙) + SpMV(𝖠_out, 𝒙) = SymSpMV(𝖠_inn, 𝖠_out, 𝒙).   (9)

To evaluate the benefits of the proposed implementation in terms of memory accesses and footprint, we recall the memory traffic according to Eq. (8). Given a square matrix satisfying the splitting of Eq. (6), i.e., m = n and 𝖠 = I_{2^s} ⊗ 𝖠_inn + 𝖠_out, we have that the number of rows of 𝖠_inn is n/2^s. Table 1 outlines the memory accesses required for the three implementations: the standard SpMV implementation (without exploiting mesh symmetries), the direct implementation casting the product into one SpMM plus one SpMV, and the fused SymSpMV implementation. Therefore, exploiting mesh symmetries reduces memory access and storage by 12(2^s − 1) nnz(𝖠_inn) − 4(n/2^s + 1).

Table 1
Estimation of the memory accesses required to compute 𝖠𝒙 = (I_{2^s} ⊗ 𝖠_inn + 𝖠_out)𝒙 using different implementations and considering a square matrix, double-precision coefficients and CSR matrix format.

Implementation | Memory accesses
SpMV           | 12 nnz(I_{2^s} ⊗ 𝖠_inn + 𝖠_out) + 4(n + 1) + 16n
SpMM + SpMV    | [12 nnz(𝖠_inn) + 4(n/2^s + 1) + 16n] + [12 nnz(𝖠_out) + 4(n + 1) + 16n]
SymSpMV        | 12[nnz(𝖠_inn) + nnz(𝖠_out)] + 4(n + 1) + 4(n/2^s + 1) + 16n

3.2. Parallel implementation details

Mesh cells are ordered according to mesh symmetries in such a way that symmetrical instances are grouped together in order to be placed in memory compactly with a unit stride. Hierarchical parallelisation is based on the decomposition of the mesh graph or, in other words, a graph whose adjacency matrix portrait is the same as the off-diagonal portrait of a sparse matrix of a discrete operator. To ensure that all symmetrical instances are always in the same subdomain, only the graph of the base mesh (one of the subgraphs that correspond to symmetrical parts of the computational domain) is decomposed, and this distribution is directly applied to all of the symmetrical instances, as shown in Fig. 3. At the upper level, mesh cells are distributed between cluster nodes. Then, subdomains of cluster nodes are split further between computational devices of a hybrid node, CPUs or GPUs. Finally, subdomains of CPUs are further divided between OpenMP threads. In MPI communications, all symmetric instances are packed together into a single buffer, so the number of MPI messages remains the same.

Fig. 3. Adequate partitioning of a mesh with 2 symmetries.

Mesh objects (vertices, cells or faces) are ordered by subdomains, so that objects of one subdomain are placed continuously in the numeration. OpenMP parallelisation is decomposition-based; it does not use loop parallelism, so no loop work-sharing directives are applied. Instead, a range of indices is assigned to each thread, which defines the range of objects of its subdomain or, in other words, a range of rows of a matrix or elements of a vector. By doing so, each thread is working with a fixed dataset, allowing for a data placement aware of non-uniform memory access (NUMA) through the first-touch rule. Thread affinity and NUMA placement make OpenMP-parallel SpMV about twice as fast on dual-socket compute servers, as was shown in [10] (Fig. 6).

Regarding the implementation of SymSpMV on GPUs, both for CUDA and OpenCL, one work item (thread) is responsible for computing one component of the output vector. Therefore, there is no inner loop over symmetries, which is in line 4 of Algorithm 2. The number of work items, or the grid size in CUDA terminology, is m × d. Since symmetrical instances are continuously numbered and the local workgroup size is divisible by d, the work items of symmetrical instances are always in the same workgroup and even in the same warp in CUDA terminology. This ensures that matrix coefficients are shared by d neighbouring work items processing symmetrical instances with a coalescing of memory transactions. This approach appears to be 2–3 times faster in the case of 3 symmetries (d = 8) than the naïve implementation of Algorithm 2 with m work items and that inner loop over symmetrical instances. Note that in the case of such a naïve implementation, using a loop with a constant limit is also notably faster than with a variable limit.
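The decomposition-based OpenMP scheme described earlier in this subsection can be sketched as follows: each thread owns a fixed, contiguous range of rows instead of relying on loop work-sharing directives, so that the first-touch rule places its data on the corresponding NUMA node. The sketch reuses the illustrative CSRMatrix struct from Section 3 and is an assumption-laden illustration, not the authors' implementation.

    // Sketch: decomposition-based OpenMP SpMV with fixed per-thread row ranges.
    #include <omp.h>
    #include <vector>
    // CSRMatrix: the illustrative CSR struct (m, ptr, idx, val) defined earlier.

    void spmvDecompositionBased(const CSRMatrix& A, const std::vector<double>& x,
                                std::vector<double>& y) {
        #pragma omp parallel
        {
            const int nt  = omp_get_num_threads();
            const int tid = omp_get_thread_num();
            // Fixed, contiguous row range per thread: the same thread always works
            // on the same dataset, which is what enables first-touch NUMA placement.
            const int begin = static_cast<int>((static_cast<long long>(A.m) * tid) / nt);
            const int end   = static_cast<int>((static_cast<long long>(A.m) * (tid + 1)) / nt);
            for (int i = begin; i < end; ++i) {
                double s = 0.0;
                for (int j = A.ptr[i]; j < A.ptr[i + 1]; ++j)
                    s += A.val[j] * x[A.idx[j]];
                y[i] = s;
            }
        }
    }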

4. Application to CFD simulations

In the previous sections, a strategy to exploit spatial reflection symmetries has been presented. Shortly, given a mesh with s spatial reflection symmetries and an appropriate mirrored ordering of the unknowns (see Fig. 1), discrete differential operators would typically be represented by matrices satisfying Eq. (6). Hence, the standard SpMV can be replaced by the more compute-intensive SpMM, whose implementation was discussed in the previous section. This strategy can be applied to virtually any linear differential operator discretised on a mesh with spatial reflection symmetries and, more generally, to the numerical resolution of PDEs on this type of meshes. Here, as an example of practical application, we consider the numerical simulation of turbulent incompressible flows of Newtonian fluids with constant physical properties. Under these assumptions, the governing Navier–Stokes (NS) equations read:

∂𝒖/∂t + (𝒖 ⋅ ∇)𝒖 = ν∇²𝒖 − ∇p,   ∇ ⋅ 𝒖 = 0,   (10)

where 𝒖(𝒙, t) and p(𝒙, t) denote the velocity and kinematic pressure fields, and ν is the kinematic viscosity. Then, these equations have to be discretised both in space and time. The spatial discretisation determines the coefficients of the discrete operators and, therefore, their sparsity patterns. Then, the temporal discretisation defines the overall algorithm to solve the NS equations. This altogether determines the number of calls to basic algebraic kernels; in particular, the number of calls to SpMV and SpMM kernels and the corresponding sparsity patterns of the different matrices. Therefore, both the spatial and time-integration methods are outlined in the next paragraphs.

4.1. Symmetry-preserving discretisation of the NS equations

A symmetry-preserving discretisation (in terms of symmetries of the underlying differential operators, not of the mesh symmetries) of the NS equations (10) on collocated unstructured grids is briefly described in this section. Unless otherwise stated, we follow the same matrix–vector notation as in the original paper [32]. The spatial discretisation exactly preserves the symmetries of the underlying differential operators: the convective operator is represented by a skew-symmetric matrix and the diffusive operator by a symmetric negative semi-definite matrix. Shortly, the temporal evolution of the collocated velocity vector, 𝒖_c ∈ R^{3n}, is governed by the following algebraic system:

Ω d𝒖_c/dt + 𝖢(𝒖_s)𝒖_c = 𝖣𝒖_c − Ω𝖦_c𝒑_c,   (11)
𝖬𝒖_s = 𝟎_c,   (12)

where 𝒑_c ∈ R^n is the cell-centred pressure and n is the number of control volumes. The sub-indices c and s are used to indicate whether the variables are cell-centred or staggered at the faces. The collocated velocity, 𝒖_c ∈ R^{3n}, is arranged as a column vector containing the three spatial velocity components as 𝒖_c = (𝒖_1, 𝒖_2, 𝒖_3)^T, where 𝒖_i = ([𝒖_i]_1, [𝒖_i]_2, …, [𝒖_i]_n) ∈ R^n are vectors containing the velocity components corresponding to the x_i-spatial direction. The staggered velocity vector 𝒖_s = ([u_s]_1, [u_s]_2, …, [u_s]_m)^T ∈ R^m, which is needed to compute the convective term, 𝖢(𝒖_s), results from the projection of a staggered predictor velocity, 𝒖_s^p ∈ R^m (see Algorithm 3), where m is the number of faces. The matrices Ω ∈ R^{3n×3n}, 𝖢(𝒖_s) ∈ R^{3n×3n}, 𝖣 ∈ R^{3n×3n} are square block diagonal matrices given by:

Ω = I_3 ⊗ Ω_c,   𝖢(𝒖_s) = I_3 ⊗ 𝖢_c(𝒖_s),   𝖣 = I_3 ⊗ 𝖣_c,   (13)

where I_3 ∈ R^{3×3} is the identity matrix, Ω_c ∈ R^{n×n} is a diagonal matrix containing the sizes of the cell-centred control volumes and 𝖢_c(𝒖_s) ∈ R^{n×n} and 𝖣_c ∈ R^{n×n} are the collocated convective and diffusive operators, respectively. Finally, 𝖦_c ∈ R^{3n×n} represents the discrete gradient operator whereas the matrix 𝖬 ∈ R^{n×m} is the face-to-cell discrete divergence operator.

Algorithm 3 Projection of a staggered velocity field, 𝒖_s^p ∈ R^m. It returns a divergence-free staggered velocity field, 𝒖_s ∈ R^m, i.e., 𝖬𝒖_s = 𝟎_c.
Input: 𝖬, 𝖫, 𝖦, 𝒖_s^p
Output: 𝒖_s, 𝒑̃_c
1: Solve Poisson equation for (pseudo)pressure: 𝖫𝒑̃_c = 𝖬𝒖_s^p
2: Correct the staggered velocity field: 𝒖_s = 𝒖_s^p − 𝖦𝒑̃_c

Algorithm 4 (Pseudo-)projection of a collocated velocity, 𝒖_c^p ∈ R^{3n}. It returns a (quasi-)divergence-free collocated velocity, 𝒖_c ∈ R^{3n}, i.e., 𝖬𝛤_{c→s}𝒖_c ≈ 𝟎_c.
Input: 𝖬, 𝖫, 𝖦, 𝛤_{c→s}, 𝛤_{s→c}, 𝒖_c^p
Output: 𝒖_c, 𝒖_s, 𝒑̃_c
1: Cell-to-face interpolation of the velocity field: 𝒖_s^p = 𝛤_{c→s}𝒖_c^p
2: Projection of 𝒖_s^p with Algorithm 3: it returns 𝒖_s (𝖬𝒖_s = 𝟎_c) and 𝒑̃_c
3: Correct the collocated velocity field: 𝒖_c = 𝒖_c^p − 𝖦_c𝒑̃_c = 𝒖_c^p − 𝛤_{s→c}𝖦𝒑̃_c

4.2. Solving NS equations on collocated grids

Let us firstly consider the projection of a staggered (predictor) velocity field, 𝒖_s^p, onto a divergence-free space. This is a well-posed problem: it can be uniquely decomposed into a solenoidal velocity, 𝒖_s, and the gradient of a scalar field, 𝖦𝒑̃_c. It requires the solution of a Poisson equation for pressure (or a pseudo-pressure) and the subsequent projection of the velocity field (see Algorithm 3). Here, the tilde in 𝒑̃_c is to stress that it does not need to be the actual pressure, 𝒑_c, but instead some sort of pseudo-pressure. The matrix 𝖦 ∈ R^{m×n} is the cell-to-face discrete gradient, and it is related with the discrete (integrated) divergence operator, 𝖬, via:

𝖦 ≡ −Ω_s^{-1}𝖬^T.   (14)

Then, the discrete Laplacian operator, 𝖫 ∈ R^{n×n}, is, by construction, a symmetric negative semi-definite matrix:

𝖫 ≡ 𝖬𝖦 = −𝖬Ω_s^{-1}𝖬^T.   (15)

Notice that Ω_s ∈ R^{m×m} is a diagonal matrix with strictly positive diagonal elements that contains the staggered control volumes associated with the staggered velocity components.

Nevertheless, the momentum equation, Eq. (11), requires the computation of a cell-centred pressure gradient, 𝖦_c𝒑̃_c, which is approximated via a face-to-cell (momentum) interpolation, 𝛤_{s→c} ∈ R^{3n×m}, as follows:

𝖦_c ≡ 𝛤_{s→c}𝖦 ∈ R^{3n×n}.   (16)

Actually, the overall process can be viewed as a (pseudo-)projection of a collocated velocity field, 𝒖_c^p ∈ R^{3n}, outlined in Algorithm 4. Notice that the core of this algorithm (step 2) is indeed the projection of a staggered velocity field, 𝒖_s^p, using Algorithm 3. Then, steps 1 and 3 require a cell-to-face, 𝛤_{c→s} ∈ R^{m×3n}, and a face-to-cell, 𝛤_{s→c} ∈ R^{3n×m}, interpolation. They must be related as follows:

𝛤_{s→c} = Ω^{-1}𝛤_{c→s}^T Ω_s,   (17)

to preserve the duality between the collocated gradient, 𝖦_c, defined in Eq. (16), and the (integrated) collocated divergence operator, 𝖬_c ≡ 𝖬𝛤_{c→s} ∈ R^{n×3n}, i.e.,

𝖦_c = −Ω^{-1}𝖬_c^T.   (18)

Finally, the sequence of operations to advance one time-step is outlined in Algorithm 5. Namely, the spatially discrete momentum equation, Eq. (11), is discretised in time using an explicit second-order Adams–Bashforth (AB2) scheme for both convection and diffusion (steps 1 and 3), whereas the pressure–velocity coupling is solved using a fractional step method [33]. Here, the AB2 scheme is chosen for the sake of simplicity, although more appropriate temporal schemes can be used [34].

Algorithm 5 Numerical resolution of NS equations using a Fractional Step Method on a collocated grid.
Input: 𝖬, 𝖫, 𝖦, 𝛤_{c→s}, 𝛤_{s→c}, Ω, 𝒖_c^n, 𝒖_s^n (𝖬𝒖_s^n = 𝟎_c), 𝖱_𝒖^{n−1}
Output: 𝒖_c^{n+1}, 𝒖_s^{n+1} (𝖬𝒖_s^{n+1} = 𝟎_c), 𝒑̃_c^{n+1} (and 𝖱_𝒖^n)
1: Computation of the convective, 𝖢(𝒖_s^n), and the diffusive, 𝖣, terms:
   𝖱_𝒖^n ≡ Ω^{-1}(−𝖢(𝒖_s^n) + 𝖣)𝒖_c^n.   (19)
2: Determination of Δt using a CFL condition [34].
3: Computation of the predictor velocity, 𝒖_c^p:
   𝒖_c^p = 𝒖_c^n + Δt( (3/2)𝖱_𝒖^n − (1/2)𝖱_𝒖^{n−1} )   (20)
4: Projection of 𝒖_c^p with Algorithm 4: it returns 𝒖_c^{n+1}, 𝒖_s^{n+1} and 𝒑̃_c^{n+1}

4.3. Constructing the discrete operators

This subsection briefly revises the construction of all the discrete operators needed to solve the NS equations. The above-explained constraints imposed by the (skew-)symmetries strongly simplify ''the discretisation problem'' to a set of five basic discrete operators. Namely,

{Ω_c, Ω_s, 𝖭_s, 𝖬, 𝛱_{c→s}}.   (21)

The first three correspond to basic geometrical information of the mesh: namely, the diagonal matrices containing the cell-centred and staggered control volumes, Ω_c and Ω_s, and the matrix containing the face normal vectors, 𝖭_s ≡ (𝖭_{s,1}, 𝖭_{s,2}, 𝖭_{s,3}) ∈ R^{m×3m}, where 𝖭_{s,i} ∈ R^{m×m} are diagonal matrices containing the x_i-spatial components of the face normal vectors, 𝒏_f. The staggered control volumes, Ω_s, are given by

[Ω_s]_{f,f} ≡ A_f δ_f,   (22)

where A_f is the area of the face f and δ_f = |𝒏_f ⋅ c1c2⃗| is the projected distance between adjacent cell centres (see Fig. 4). In this way, the sum of volumes is exactly preserved, tr(Ω_s) = tr(Ω) = d tr(Ω_c) (d = 2 for 2D and d = 3 for 3D), regardless of the mesh quality and the location of the cell centres.

Then, the face-to-cell discrete (integrated) divergence operator, 𝖬, is defined as follows:

[𝖬𝒖_s]_k = Σ_{f∈F_f(k)} [𝒖_s]_f A_f,   (23)

where F_f(k) is the set of faces bordering the cell k.
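As an illustration of how an operator with the sparsity of 𝖬 can be built from the mesh connectivity, the following sketch assembles Eq. (23) in triplet form. The Face struct, the function name and the explicit ±A_f sign convention (face normal assumed to point from cell c1 towards cell c2, consistent with the ±1 entries of the incidence matrix shown in Eq. (30)) are our own illustrative choices, not taken from the paper's code.

    // Sketch: face-to-cell divergence operator of Eq. (23) in triplet form.
    #include <vector>

    struct Face    { int c1, c2; double area; };   // adjacent cells and face area A_f
    struct Triplet { int row, col; double val; };  // (cell, face, signed area)

    // Rows of M are cells, columns are faces: each face contributes its signed
    // area to the two adjacent cells (c2 < 0 marks a boundary face).
    std::vector<Triplet> assembleDivergence(const std::vector<Face>& faces) {
        std::vector<Triplet> M;
        for (int f = 0; f < static_cast<int>(faces.size()); ++f) {
            M.push_back({faces[f].c1, f, +faces[f].area});
            if (faces[f].c2 >= 0)
                M.push_back({faces[f].c2, f, -faces[f].area});
        }
        return M;
    }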

Finally, 𝛱_{c→s} ∈ R^{m×n} is an unweighted cell-to-face scalar field interpolation satisfying:

φ_f ≈ [𝛱_{c→s}𝝓_c]_f = (φ_{c1} + φ_{c2})/2,   (24)

where c1 and c2 are the cells adjacent to the face f (see Fig. 4, left). This is needed to construct the skew-symmetric convective operator according to Eq. (13) and:

𝖢_c(𝒖_s) ≡ 𝖬𝖴_s𝛱_{c→s},   (25)

where 𝖴_s ≡ diag(𝒖_s) ∈ R^{m×m} is a diagonal matrix that contains the face velocities, 𝒖_s ∈ R^m. Although the local truncation error is only first-order for non-uniform grids, numerical tests showed that its global truncation error is indeed second-order [32].

Fig. 4. Left: face normal and neighbour labelling criterion. Right: definition of the volumes, Ω_s, associated with the face-normal velocities, 𝒖_s. The thick dashed rectangle is the volume associated with the staggered velocity U4 = [𝒖_s]_4, i.e., [Ω_s]_{4,4} = A_4 δ_4, where A_4 is the face area and δ_4 = |𝒏_4 ⋅ c1c2⃗| is the projected distance between adjacent cell centres. Thin dash-dotted lines are placed to illustrate that the sum of volumes is exactly preserved, tr(Ω_s) = tr(Ω) = d tr(Ω_c) (d = 2 for 2D and d = 3 for 3D), regardless of the location of the cell nodes.

The cell-to-face gradient, 𝖦, follows from Eqs. (14) and (22), leading to:

[Ω_s𝖦𝒑_c]_f = (p_{c1} − p_{c2})A_f  ⟹  [𝖦𝒑_c]_f = (p_{c1} − p_{c2})/δ_f,   (26)

and subsequently using Eqs. (13) and (15) yields the discrete Laplacian and diffusive operators:

[𝖫𝝓_c]_k = Σ_{f∈F_f(k)} (φ_{c1} − φ_{c2})A_f/δ_f   and   𝖣_c ≡ ν𝖫,   (27)

where ν is the kinematic viscosity. Notice that this discretisation of the diffusive operator is valid for incompressible fluids with constant viscosity. For non-constant viscosity values, the discretisation method has to be modified accordingly [35]. Finally, the cell-to-face (momentum) interpolation is constructed as follows:

𝛤_{c→s} ≡ Ω_s^{-1}𝖭_s𝛱Ω,   where 𝛱 = I_3 ⊗ 𝛱_{c→s},   (28)

which corresponds to a volume-weighted interpolation. It must be noted that an unweighted interpolation, 𝛤_{c→s} = 𝖭_s𝛱, was proposed in the original paper [32]. However, as mentioned above, this can lead to stability issues.

4.4. Sparsity patterns

The set of eight matrices that are in practice needed to carry out a simulation is listed in Table 2. Namely,

{Ω_c, Ω_s, 𝖬, 𝛱_{c→s}, 𝖦, 𝖫, 𝛤_{s→c}, 𝛤_{c→s}}.   (29)

The first four are already in the set of five basic discrete operators given in (21), whereas the other four can be written as a linear combination of basic ones (see Eqs. (14), (15), (17) and (28)). They can be classified according to their input and output spaces and their sparsity patterns. Apart from the two diagonal matrices Ω_c and Ω_s, there are only three types of pattern. Namely, the pattern corresponding to (i) the cell-to-face incidence matrix, 𝖳_cs ∈ R^{m×n}, which has two non-zero elements per row (a +1 and a −1 corresponding to the cells adjacent to a face), (ii) the face-to-cell incidence matrix, 𝖳_sc = 𝖳_cs^T ∈ R^{n×m}, and (iii) the graph Laplacian matrix, 𝖳_sc𝖳_cs ∈ R^{n×n}. For instance, for the mesh with 4 control volumes and 8 faces shown in Fig. 4 (right), the face-to-cell incidence matrix reads:

𝖳_sc = 𝖳_cs^T = [  0  0 −1 +1  0  0 +1  0 ;
                  +1  0  0 −1  0 −1  0  0 ;
                  −1 +1  0  0  0  0  0 +1 ;
                   0 −1 +1  0 +1  0  0  0 ].   (30)

Table 2
Set of matrices needed to carry out a simulation classified according to their input and output spaces and their sparsity patterns.

Input | Output | Diagonal | Non-diagonal                      | Sparsity pattern
cells | cells  | {Ω_c}    | {𝖫}                               | |𝖳_sc𝖳_cs|
cells | faces  | ×        | {𝖦, 𝛱_{c→s}, 𝛤_{c→s}}            | |𝖳_cs|
faces | cells  | ×        | {𝖬, 𝛤_{s→c}}                      | |𝖳_sc|
faces | faces  | {Ω_s}    | {∅}                               |

5. Numerical results

This section investigates the advantages of exploiting spatial reflection symmetries in CFD simulations. With this aim, we have exploited up to s = 3 symmetries on a cubic domain and a finned tube heat exchanger. Apart from being relevant geometries, they have been selected to show the performance of our strategy both on structured and unstructured discretisations (see Figs. 5 and 6). Similarly, we have run all the tests both on CPU and GPU architectures. On the one hand, the CPU tests have been performed on a single node of the MareNostrum4 supercomputer at the Barcelona Supercomputing Center. Its non-uniform memory access (NUMA) nodes are equipped with two Intel Xeon 8160 CPUs (24 cores, 2.1 GHz, 33 MB L3 cache and 128 GB/s memory bandwidth) linked to 96 GB of RAM and interconnected through 12.5 GB/s Intel Omni-Path. On the other hand, the GPU tests have been performed on an NVIDIA RTX A5000 GPU (8192 CUDA cores, 24 GB GDDR6 and 768 GB/s memory bandwidth). The two grids considered are of comparable size, with the structured one containing 15.5M control volumes and the unstructured one 17.7M.

As explained in Section 2 and illustrated in Fig. 6, when exploiting s symmetries, only a 1/2^s fraction of the entire domain, denoted as base mesh, needs to be discretised. Then, to build 𝖫, 𝖦 and 𝖬, we have followed the discretisation of Section 4 but taking advantage of Eq. (6). That is, instead of building the entire operators, we have only built the base mesh's couplings with itself and with its mirrored counterparts. The first immediate benefit of this approach is, apart from considerably
accelerating the operators' setup, to reduce their memory footprint by up to a factor of approximately 2^s.

Fig. 5. Coarse representation of the finned tube heat exchanger discretisation.

Fig. 6. Schema of the domain's portions meshed when exploiting s symmetries.

Fig. 7 shows the memory footprint of 𝖫, 𝖦 and 𝖬 for a varying number of symmetries. The fact that 𝖦 and 𝖬 take the same space follows from their sparsity patterns, which, according to Table 2, are identical but transposed. Indeed, as detailed in Section 4.4, we have that 𝖳_sc = 𝖳_cs^T. Remarkably enough, the relative memory savings are not exactly 2^s because our current implementation explicitly stores all the outer-couplings of Eq. (6). Nevertheless, Fig. 7 confirms that being so sparse makes them practically negligible. On the other hand, the fact that the structured grid is heavier than the unstructured one despite being smaller is an immediate consequence of each control volume having more neighbours. In this sense, higher-order schemes would benefit from considerably larger absolute memory savings.

Fig. 7. Operators' memory footprint on the structured (left) and unstructured (right) grids.

Fig. 8 summarises the benefits of replacing SpMV with SymSpMV, the SpMM's specialisation discussed in Section 3. While it is clear that increasing the arithmetic intensity of the matrix multiplications results in considerably higher performances, several architecture-specific aspects deserve to be discussed. Before delving into the CPU results of Fig. 8(a), let us recall that, according to Eq. (8), the most compute-intensive products are those by 𝖫, and the least compute-intensive, those by 𝖦. As a result, their performances follow the same order. Additionally, the lighter weight of the unstructured operators makes them enjoy more pronounced cache effects, particularly benefiting the unstructured Laplacian, which reaches up to 5.0× speed-ups by replacing SpMV with SymSpMV. In fact, by exploiting a single reflection symmetry, a good deal of its inner-couplings, 𝖫_inn ∈ R^{n/2×n/2}, fit in the cache, leading to 2.9× faster multiplications. Remarkably enough, having considerably more rows and roughly the same amount of non-zeros prevents the gradient from enjoying such extra accelerations, not exceeding 2.5× speed-ups.

Fig. 8. Speed-up on the application of 𝖫, 𝖦 and 𝖣 on the structured (left) and unstructured (right) grids for a fixed problem size (and halved base mesh).

Our primary aim for exploiting symmetries is to make extreme-scale CFD simulations faster and more affordable. Then, if exploiting 3 symmetries allows tackling a 100³ problem at the cost (roughly) of a 50³ one, our aim is, instead, to solve a 200³ problem at the price of a 100³ one. For this reason, we do not expect to enjoy the extra speed-ups granted by the greater CPU cache reuse in our targeted applications. To better analyse this, we have included the results of Fig. 9. In it, instead of fixing the size of the problem and exploiting an increasing number of symmetries, thus halving the size of the base mesh, as in Fig. 6, we proceed inversely. Hence, we fix the size of the base mesh and exploit an increasing number of symmetries, thus doubling the size of the problem. Although preventing the cache effects, Fig. 9(a) confirms the advantages of exploiting symmetries for tackling larger problems. Indeed, our current CPU implementation attained up to 2.0×, 1.7× and 1.4× speed-ups on the products by 𝖫, 𝖣 and 𝖦, respectively.

When it comes to the GPU results of Figs. 8(b) and 9(b), the lack of cache memory makes it attain similar accelerations on both analyses
À. Alsalti-Baldellou et al. Computers and Fluids 275 (2024) 106247

and, again, the resulting speed-ups are ordered according to their arithmetic intensity. In particular, up to 3.3×, 2.8× and 2.2× accelerations are attained in the products by 𝖫, 𝖣 and 𝖦, respectively.

Fig. 9. Speed-up on the application of 𝖫, 𝖦 and 𝖣 on the structured (left) and unstructured (right) grids for a fixed base mesh size (and doubled problem size).

In order to evaluate the efficiency of our SymSpMV, we have made a roofline performance analysis [36]. It consists of a scatter plot that determines if a kernel is either memory- or compute-bound and compares its implementation's performance with the theoretical peak offered by the hardware. It is displayed in Fig. 10 both for the CPU and GPU implementations of the SymSpMV and for all the matrices and grids considered. As discussed in Section 3, the more symmetries are exploited, the higher the arithmetic intensity and, therefore, the closer to being compute-bound. Although SymSpMV was memory-bound for all the cases considered, the relatively low double-precision peak performance of the NVIDIA RTX A5000 made it approach the compute-bound region when exploiting three symmetries. Hence, under such circumstances, using higher-order schemes would certainly make SymSpMV enter that region.

Fig. 10. Roofline model for the SymSpMV. Dashed lines correspond to fixed problem size (and halved base mesh), and solid lines to fixed base mesh size (and doubled problem size).

Regarding the performance of our implementation, despite attaining up to 5.0× speed-ups, SymSpMV is far from reaching its theoretical peak, which leaves room for further improvements. In any case, the numerical results confirmed the advantages of exploiting spatial reflection symmetries for making the simulations lighter and faster, particularly benefiting GPU devices, given their limited memory capacity.
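For completeness, the roofline bound used in the analysis of Fig. 10 can be reproduced with a few lines: the attainable performance is the minimum of the peak FLOP rate and the product of arithmetic intensity and memory bandwidth [36]. The bandwidth values below are those quoted in Section 5, whereas the peak GFLOP/s figures and the arithmetic intensity are placeholders to be replaced with actual values; this is an illustrative sketch, not part of the authors' code.

    // Sketch of the roofline bound: attainable = min(peak_flops, AI * bandwidth).
    #include <algorithm>
    #include <cstdio>

    double rooflineGflops(double arithmeticIntensity,   // flop/byte of the kernel
                          double peakGflops,            // double-precision peak (placeholder)
                          double bandwidthGBs)          // sustained memory bandwidth
    {
        return std::min(peakGflops, arithmeticIntensity * bandwidthGBs);
    }

    int main() {
        const double ai = 0.5;       // assumed arithmetic intensity of a SymSpMV call
        // 128 GB/s (Intel Xeon 8160 node) and 768 GB/s (NVIDIA RTX A5000) as quoted
        // in Section 5; peak GFLOP/s values are placeholders, not datasheet figures.
        std::printf("CPU bound: %.1f GFLOP/s\n", rooflineGflops(ai, 1500.0, 128.0));
        std::printf("GPU bound: %.1f GFLOP/s\n", rooflineGflops(ai, 800.0, 768.0));
        return 0;
    }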

6. Conclusions

Due to the FLOP-oriented design of modern supercomputers, only applications with very high arithmetic intensities can approach HPC systems' theoretical peak performances. This is certainly not the case for most computational physics applications, which generally rely on sparse linear algebra (or equivalent) kernels. This work presented a strategy to mitigate this limitation by exploiting spatial reflection symmetries for accelerating virtually all matrix multiplications. In particular, we presented a spatial discretisation and an unknowns' ordering making virtually all the discrete operators exhibit a regular block structure. Then, we showed how to use such a structure to replace the standard SpMV with the more compute-intensive SymSpMV, a specialisation of the SpMM allowing for significant memory savings.

The strategy presented applies identically regardless of the geometric complexity of the problem. Furthermore, although focusing on CFD problems, it is naturally extensible to other applications. However, we target the DNS and LES of incompressible turbulent flows and, therefore, included numerical experiments based on structured and unstructured discretisations of meaningful configurations: namely, a cubic box and a finned tube heat exchanger.

The results obtained demonstrated the advantages of exploiting symmetries. Indeed, by replacing the standard SpMV with our SymSpMV, CPU and GPU executions reached up to 5.0× and 3.3× speed-ups, respectively. On top of that, thanks to exploiting the operators' regular block structure, we could reduce their memory footprint by up to 8.0×. Finally, the roofline performance analysis of our implementation of the SymSpMV revealed that, thanks to exploiting symmetries for increasing the arithmetic intensity, we successfully approached the operators' multiplication to the compute-bound region.

It is clear that exploiting symmetries is a very effective strategy to make high-fidelity simulations significantly more affordable. In this sense, our method is particularly well-suited for GPUs, given their limited memory capacity. Future lines of work include extending the current approach throughout an incompressible CFD simulation and analysing the resulting speed-ups, as well as combining it with similar strategies for accelerating the convergence of the Poisson solvers. Additionally, despite the excellent speed-ups obtained, we aim to enhance our implementation of the SymSpMV to make it approach its theoretical peak performance. We also plan to extend these analyses to other architectures and numerical schemes.

CRediT authorship contribution statement

Àdel Alsalti-Baldellou: Conceptualization, Methodology, Software. Xavier Álvarez-Farré: Methodology, Software. Guillem Colomer: Software. Andrey Gorobets: Methodology, Software. Carlos David Pérez-Segarra: Funding acquisition. Assensi Oliva: Funding acquisition, Resources. F. Xavier Trias: Conceptualization, Methodology, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgements

A.A.B., X.A.F., G.C., C.D.P.S., A.O. and F.X.T. have been financially supported by two competitive R+D projects: RETOtwin (PDC2021-120970-I00), given by MCIN/AEI/10.13039/501100011033 and European Union Next GenerationEU/PRTR, and FusionCAT (001-P-001722), given by Generalitat de Catalunya RIS3CAT-FEDER. A.A.B. has also been supported by the predoctoral grants DIN2018-010061 and 2019-DI-90, given by MCIN/AEI/10.13039/501100011033 and the Catalan Agency for Management of University and Research Grants (AGAUR), respectively. The numerical experiments have been conducted on the MareNostrum4 supercomputer at the Barcelona Supercomputing Center under the project IM-2022-3-0026. The authors thankfully acknowledge these institutions.

References

[1] Witherden FD, Vermeire BC, Vincent PE. Heterogeneous computing on mixed unstructured grids with PyFR. Comput & Fluids 2015;120:173–86.
[2] Borrell R, Dosimont D, Garcia-Gasulla M, Houzeaux G, Lehmkuhl O, Mehta V, et al. Heterogeneous CPU/GPU co-execution of CFD simulations on the POWER9 architecture: Application to airplane aerodynamics. Future Gener Comput Syst 2020;107:31–48.
[3] Gorobets A, Bakhvalov P. Heterogeneous CPU+GPU parallelization for high-accuracy scale-resolving simulations of compressible turbulent flows on hybrid supercomputers. Comput Phys Comm 2022;271:108231.
[4] Baboulin M, Buttari A, Dongarra J, Kurzak J, Langou J, Langou J, et al. Accelerating scientific computations with mixed precision algorithms. Comput Phys Comm 2009;180:2526–33.
[5] Krasnopolsky BI. An approach for accelerating incompressible turbulent flow simulations based on simultaneous modelling of multiple ensembles. Comput Phys Comm 2018;229:8–19.
[6] Gorobets A, Trias FX, Oliva A. A parallel MPI+OpenMP+OpenCL algorithm for hybrid supercomputations of incompressible flows. Comput & Fluids 2013;88:764–72.
[7] Edwards HC, Trott CR, Sunderland D. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. J Parallel Distrib Comput 2014;74(12):3202–16.
[8] Shimokawabe T, Aoki T, Onodera N. High-productivity framework for large-scale GPU/CPU stencil applications. Procedia Comput Sci 2016;80:1646–57.
[9] Álvarez X, Gorobets A, Trias FX, Borrell R, Oyarzun G. HPC2 – a fully portable algebra-dominant framework for heterogeneous computing. Application to CFD. Comput & Fluids 2018;173:285–92.
[10] Álvarez-Farré X, Gorobets A, Trias FX. A hierarchical parallel implementation for heterogeneous computing. Application to algebra-based CFD simulations on hybrid supercomputers. Comput & Fluids 2021;214:104768.
[11] cuSPARSE: The API Reference Guide for cuSPARSE, the CUDA Sparse Matrix Library. Tech. Rep. March, NVIDIA Corporation; 2020.
[12] Greathouse JL, Knox K, Poła J, Varaganti K, Daga M. clSPARSE: A vendor-optimized open-source sparse BLAS library. In: IWOCL '16: Proceedings of the 4th international workshop on OpenCL. New York, NY, USA: ACM; 2016.
[13] Anzt H, Tomov S, Dongarra J. On the performance and energy efficiency of sparse linear algebra on GPUs. Int J High Perform Comput Appl 2017;31:375–90.
[14] Alsalti-Baldellou À, Álvarez-Farré X, Trias FX, Oliva A. Exploiting spatial symmetries for solving Poisson's equation. J Comput Phys 2023;486:112133.
[15] Alsalti-Baldellou À, Janna C, Álvarez-Farré X, Trias FX. Exploiting symmetries for preconditioning Poisson's equation in CFD simulations. In: Proceedings of the Platform for Advanced Scientific Computing Conference. New York, NY, USA: ACM; 2023, p. 1–9.
[16] Gorobets A, Trias FX, Borrell R, Lehmkuhl O, Oliva A. Hybrid MPI+OpenMP parallelization of an FFT-based 3D Poisson solver with one periodic direction. Comput & Fluids 2011;49:101–9.
[17] Shishkina O, Shishkin A, Wagner C. Simulation of turbulent thermal convection in complicated domains. J Comput Appl Math 2009;226:336–44.
[18] Löhner R, Othmer C, Mrosek M, Figueroa A, Degro A. Overnight industrial LES for external aerodynamics. Comput & Fluids 2021;214.
[19] Capuano F, Palumbo A, de Luca L. Comparative study of spectral-element and finite-volume solvers for direct numerical simulation of synthetic jets. Comput & Fluids 2019;179:228–37.
[20] Esclapez L, Ma PC, Mayhew E, Xu R, Stouffer S, Lee T, et al. Fuel effects on lean blow-out in a realistic gas turbine combustor. Combust Flame 2017;181:82–99.
[21] Alsalti-Baldellou A, Álvarez-Farré X, Gorobets A, Trias FX. Efficient strategies for solving the variable Poisson equation with large contrasts in the coefficients. In: 8th European Congress on Computational Methods in Applied Sciences and Engineering. 273, CIMNE; 2022, p. 416–34.
[22] Yuan C, Ng E, Norford LK. Improving air quality in high-density cities by understanding the relationship between air pollutant dispersion and urban morphologies. Build Environ 2014;71:245–58.
[23] Hang J, Li Y, Sandberg M, Buccolieri R, Di Sabatino S. The influence of building height variability on pollutant dispersion and pedestrian ventilation in idealized high-rise urban areas. Build Environ 2012;56:346–60.
[24] Kooij GL, Botchev MA, Frederix EM, Geurts BJ, Horn S, Lohse D, et al. Comparison of computational codes for direct numerical simulations of turbulent Rayleigh–Bénard convection. Comput & Fluids 2018;166:1–8.
[25] Fang J, Shaver DR, Tomboulides A, Min M, Fischer P, Lan Y-H, et al. Feasibility of full-core pin resolved CFD simulations of small modular reactor with momentum sources. Nucl Eng Des 2021;378:111143.
[26] Liu B, He S, Moulinec C, Uribe J. Sub-channel CFD for nuclear fuel bundles. Nucl Eng Des 2019;355:110318.
[27] Paniagua L, Lehmkuhl O, Oliet C, Pérez-Segarra C-D. Large eddy simulations (LES) on the flow and heat transfer in a wall-bounded pin matrix. Numer Heat Transfer B 2014;65:103–28.
[28] Filippone S, Cardellini V, Barbieri D, Fanfarillo A. Sparse matrix-vector multiplication on GPGPUs. ACM Trans Math Software 2017;43(4).
[29] Liu X, Smelyanskiy M, Chow E, Dubey P. Efficient sparse matrix-vector multiplication on x86-based many-core processors. In: Proceedings of the 27th international ACM conference on international conference on supercomputing. ICS '13, New York, NY, USA: Association for Computing Machinery; 2013, p. 273–82.
[30] Krasnopolsky BI. An approach for accelerating incompressible turbulent flow simulations based on simultaneous modelling of multiple ensembles. Comput Phys Comm 2018;229:8–19.
[31] Imamura S, Ono K, Yokokawa M. Iterative-method performance evaluation for multiple vectors associated with a large-scale sparse matrix. Int J Comput Fluid Dyn 2016;30(6):395–401.
[32] Trias FX, Lehmkuhl O, Oliva A, Pérez-Segarra C, Verstappen R. Symmetry-preserving discretization of Navier-Stokes equations on collocated unstructured meshes. J Comput Phys 2014;258:246–67.
[33] Chorin AJ. Numerical solution of the Navier-Stokes equations. Math Comp 1968;22:745–62.
[34] Trias FX, Lehmkuhl O. A self-adaptive strategy for the time-integration of Navier-Stokes equations. Numer Heat Transfer B 2011;60(2):116–34.
[35] Trias FX, Gorobets A, Oliva A. A simple approach to discretize the viscous term with spatially varying (eddy-)viscosity. J Comput Phys 2013;253:405–17.
[36] Williams S, Waterman A, Patterson D. Roofline: An insightful visual performance model for multicore architectures. Commun ACM 2009;52(4):65–76.