A Tutorial On NIMROD Physics Kernel Code Development

Carl Sovinec
University of Wisconsin–Madison
Department of Engineering Physics
Contents

Preface

II FORTRAN Implementation
   A Implementation map
   B Data storage and manipulation
      B.1 Solution Fields
      B.2 Interpolation
      B.3 Vector_types
      B.4 Matrix_types
      B.5 Data Partitioning
      B.6 Seams

III Parallelization
   A Parallel communication introduction
   B Grid block decomposition
   C Fourier “layer” decomposition
   D Global-line preconditioning
Preface
The notes contained in this report were written for a tutorial given to members of the
NIMROD Code Development Team and other users of NIMROD. The presentation was
intended to provide a substantive introduction to the FORTRAN coding used by the kernel, so
that attendees would learn enough to be able to further develop NIMROD or to modify it
for their own applications.
This written version of the tutorial contains considerably more information than what
was given in the three half-day sessions. It is the author’s hope that these notes will serve
as a manual for the kernel for some time. Some of the information will inevitably become
dated as development continues. For example, we are already considering changes to the
time-advance that will alter the predictor-corrector approach discussed in Section C.2. In
addition, the 3.0.3 version described here is not complete at this time, but the only difference
reflected in this report from version 3.0.2 is the advance of temperature instead of pressure
in Section C.5. For those using earlier versions of NIMROD (prior to and including 2.3.4),
note that basis function flexibility was added at 3.0 and data structures and array ordering
are different than what is described in Section B.
If you find errors or have comments, please send them to me at [email protected].
Chapter I

Framework of NIMROD Kernel

A Brief Reviews

A.1 Equations
NIMROD started from two general-purpose PDE solvers, Proto and Proteus. Though rem-
nants of these solvers are a small minority of the present code, most of NIMROD is modular
and not specific to our applications. Therefore, what NIMROD solves can and does change,
providing flexibility to developers and users.
Nonetheless, the algorithm implemented in NIMROD is tailored to solve fluid-based mod-
els of fusion plasmas. The equations are the Maxwell equations without displacement current
and the single fluid form (see [1]) of velocity moments of the electron and ion distribution
functions, neglecting terms of order me/mi smaller than other terms.
Maxwell without displacement current:

Ampere’s Law:
µ0 J = ∇ × B

Gauss’s Law:
∇ · B = 0

Faraday’s Law:
∂B/∂t = −∇ × E
Fluid Moments → Single Fluid Form

Quasi-neutrality is assumed for the time and spatial scales of interest, i.e.

• Z ni ≈ ne → n
• However, ∇ · E ≠ 0
Continuity:
∂n/∂t + V · ∇n = −n∇ · V

Center-of-mass Velocity Evolution:
ρ ∂V/∂t + ρV · ∇V = J × B − ∇p − ∇ · Π

• Π is often set to ρν∇V
• Qe includes ηJ²
• α = i, e
• pα = nα k Tα
Generalized Ohm’s law:

E = −V × B + ηJ + (1/ne) J × B + (me/ne²) [ ∂J/∂t + ∇ · (JV + VJ) ] − (1/ne) (∇pe + ∇ · Πe)

with the terms activated by the following options:

• −V × B : ‘mhd’
• ηJ : elecd > 0
• (1/ne) J × B : ‘hall’, ‘mhd & hall’, ‘2fl’
• (me/ne²) ∂J/∂t : ‘2fl’
• (me/ne²) ∇ · (JV + VJ) : advect=‘all’
• (1/ne) ∇pe : ‘mhd & hall’, ‘2fl’
• (1/ne) ∇ · Πe : neoclassical
A steady-state part of the solution is separated from the solution fields in NIMROD. For resistive MHD, for example, ∂/∂t → 0 implies

1. ρs Vs · ∇Vs = Js × Bs − ∇ps + ∇ · ν ρs∇Vs

2. ∇ · (ns Vs) = 0

3. ∇ × Es = ∇ × (ηJs − Vs × Bs) = 0

4. (ns/(γ−1)) Vs · ∇Ts = −ps ∇ · Vs + ∇ · ns χ · ∇Ts + Qs
Notes:
• Loading an ‘equilibrium’ into NIMROD means that 1 – 4 are assumed. This is appro-
priate and convenient in many cases, but it’s not appropriate when the ‘equilibrium’
already contains expected contributions from electromagnetic activity.
Decomposing each field into steady and evolving parts and canceling steady terms gives (for MHD) [A → As + A]:

∂n/∂t + Vs · ∇n + V · ∇ns + V · ∇n = −ns ∇ · V − n∇ · Vs − n∇ · V

(ρs + ρ) ∂V/∂t + ρ [(Vs + V) · ∇(Vs + V)] + ρs [Vs · ∇V + V · ∇Vs + V · ∇V]
   = Js × B + J × Bs + J × B − ∇p + ∇ · ν [ρ∇Vs + ρs∇V + ρ∇V]

∂B/∂t = ∇ × (Vs × B + V × Bs + V × B − ηJ)

((n + ns)/(γ−1)) ∂T/∂t + (n/(γ−1)) [(Vs + V) · ∇(Ts + T)] + (ns/(γ−1)) [Vs · ∇T + V · ∇Ts + V · ∇T]
   = −ps ∇ · V − p∇ · Vs − p∇ · V + ∇ · [ns χ · ∇T + nχ · ∇Ts + nχ · ∇T]
Notes:
• The code further separates χ (and other dissipation coefficients ?) into steady and
evolving parts.
• The motivation is that time-evolving parts of fields can be orders of magnitude smaller
than the steady part, and linear terms tend to cancel, so skipping decompositions would
require tremendous accuracy in the large steady part. It would also require complete
source terms.
• Nonetheless, decomposition adds to computations per step and code complexity.
General Notes:
1. Collisional closures lead to local spatial derivatives only. Kinetic closures lead to
integro-differential equations.
2. The ∂/∂t equations are the basis for the NIMROD time-advance.
• Global magnetic changes occur over the global resistive diffusion time.
• MHD waves propagate 10⁴ – 10¹⁰ times faster.
• Topology-changing modes grow on intermediate time scales.
* All effects are present in equations sufficiently complete to model behavior in experi-
ments.
B Spatial Discretization
B.1 Fourier Representation
A pseudospectral method is used to represent the periodic direction of the domain. A
truncated Fourier series yields a set of coefficients suitable for numerical computation:
A(φ) = A0 + Σ_{n=1}^{N} ( An e^{inφ} + An* e^{−inφ} )

or

A(z) = A0 + Σ_{n=1}^{N} ( An e^{i2πnz/L} + An* e^{−i2πnz/L} )
All physical fields are real functions of space, so we solve for the complex coefficients, An ,
n > 0, and the real coefficient A0 .
• Linear computations find An (t) (typically) for a single n-value. (Input ‘lin_nmax’,
‘lin_nmodes’)
• Nonlinear simulations use FFTs to compute products on a uniform mesh in φ or z, with coefficients En for 0 ≤ n ≤ N, where 2N = 2^lphi.
• One can show that setting Vn, Bn, etc. = 0 for n > 2N/3 prevents aliasing from quadratic nonlinearities (products of two functions), so nmodes_total = 2^lphi/3 + 1.
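As a concrete check of this bookkeeping (assuming the integer arithmetic implied above): with lphi = 5 the φ mesh has 2^lphi = 32 planes, so 2N = 32 and the dealiasing cutoff is 2N/3 ≈ 10.7; the retained components are n = 0, 1, . . . , 10, giving nmodes_total = 32/3 + 1 = 11.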
1. Substituting a truncated series for f (φ) changes the PDE problem into a search for
the best solution on a restricted solution space.
2. Orthogonality of basis functions is used to find a system of discrete equations for the
coefficients.
e.g. Faraday’s law:

∂B(φ)/∂t = −∇ × E(φ)

∂/∂t [ B0 + Σ_{n=1}^{N} Bn e^{inφ} + c.c. ] = −∇ × [ E0 + Σ_{n=1}^{N} En e^{inφ} + c.c. ]

apply ∮ dφ e^{−in′φ} for 0 ≤ n′ ≤ N:

2π ∂Bn′/∂t = − ∮ dφ [ ∇pol × En′ + (in′/R) φ̂ × En′ ],  0 ≤ n′ ≤ N
[Figure I.1: 1D linear basis function αi(x), equal to 1 at node xi and supported between x_{i−1} and x_{i+1}. Figure I.2: 1D quadratic basis functions αi(x) and βi(x).]
• Terms in the weak form must be integrable after any integration-by-parts; step func-
tions are admissible, but delta functions are not.
• Our fluid equations require a continuous solution space in this sense, but derivatives
need to be piecewise continuous only.
Using the restricted space implies that we seek solutions of the form

A(R, Z) = Σ_j Aj αj(R, Z)
• αj (R, Z) here may represent different polynomials. [αj &βj in 1D quadratic example
(Figure I.2)]
[Figure: unit vectors for toroidal geometry (êR, êZ, êφ) and for linear geometry (êx, êy, êz).]
For toroidal geometries, we must use the fact that êR, êφ are functions of φ, but we always
have êi · êj = δij .
• Expanding scalars

S(R, Z, φ) → Σ_j αj(R, Z) [ Sj0 + Σ_{n=1}^{N} Sjn e^{inφ} + c.c. ]

• Expanding vectors

V(R, Z, φ) → Σ_j Σ_l αj(R, Z) êl [ Vjl0 + Σ_{n=1}^{N} Vjln e^{inφ} + c.c. ]
* For convenience, ᾱjl ≡ αj êl; I’ll suppress c.c. and make n sums run from 0 to N, where N here is nmodes_total − 1.
While our poloidal basis functions do not have orthogonality properties like e^{inφ} and êR, their finite extent means that

∫∫ dR dZ D^s(αi(R, Z)) D^t(αj(R, Z)) ≠ 0,

where D^s is a spatial derivative operator of allowable order s (here 0 or 1), only where nodes i and j are in the same element (inclusive of element borders).
Integrands of this form arise from a procedure analogous to that used with Fourier series
alone; we create a weak form of the PDEs by multiplying by a conjugated basis function.
Returning to Faraday’s law for resistive MHD, and using E to represent the ideal −V × B field (after pseudospectral calculations):
• Integrate by parts to symmetrize the diffusion term and to avoid inadmissible deriva-
tives of our solution space.
∫∫∫ dR dZ dφ (ᾱj′l′ e^{−in′φ}) · Σ_{jln} (∂Bjln/∂t) ᾱjl e^{inφ}

= − ∫∫∫ dR dZ dφ ∇ × (ᾱj′l′ e^{−in′φ}) · [ (η/µ0) Σ_{jln} Bjln ∇ × (ᾱjl e^{inφ}) + Σ_{jln} Ejln ᾱjl e^{inφ} ]

+ ∯ (ᾱj′l′ e^{−in′φ}) · Σ_{jln} Ejln ᾱjl e^{inφ} × dS

Carrying out the φ integration (orthogonality in n, and êl′ · êl = δl′l in the mass term) gives

2π Σ_j (∂Bjl′n′/∂t) ∫∫ dR dZ αj′ αj

+ (η/µ0) Σ_{jl} Bjln′ ∫∫∫ dR dZ dφ [ ∇ × (ᾱj′l′ e^{−in′φ}) ] · [ ∇ × (ᾱjl e^{in′φ}) ]

= − Σ_{jl} ∫∫∫ dR dZ dφ [ ∇ × (ᾱj′l′ e^{−in′φ}) ] · Ejln′ ᾱjl e^{in′φ}

+ ∯ ᾱj′l′ · Σ_{jl} Ejln′ ᾱjl × dS

for all {j′, l′, n′}.
• {Bjln′} constitutes the solution vector at an advanced time (see C).

• For self-adjoint operators, such as I + ∆t ∇× (η/µ0) ∇×, using the same functions for test and basis functions (Galerkin) leads to the same equations as minimizing the corresponding variational form [2], i.e. the Rayleigh–Ritz–Galerkin problem.
• The mapping for non-uniform meshing has not been expressed explicitly. It entails:

1. ∫∫ dR dZ → ∫∫ J dξ dη, where ξ, η are logical coordinates and J is the Jacobian,

2. ∂α/∂R → (∂α/∂ξ)(∂ξ/∂R) + (∂α/∂η)(∂η/∂R), and similarly for Z, where α is a low-order polynomial of ξ and η with uniform meshing in (ξ, η).

3. The integrals are computed numerically with Gaussian quadrature, ∫₀¹ f(ξ) dξ → Σ_i wi f(ξi), where {wi, ξi}, i = 1, . . . , ng are prescribed by the quadrature method [3] (see the sketch following this list).
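As an illustration of the quadrature rule in item 3, the following fragment (not NIMROD code; variable names are arbitrary) applies the two-point Gauss-Legendre rule on [0, 1], which integrates polynomials through degree 3 exactly:

      INTEGER, PARAMETER :: ng=2
      REAL(8) :: wg(ng),xg(ng),total
      INTEGER :: ig
      wg=0.5_8                          ! weights for the 2-point rule on [0,1]
      xg(1)=0.5_8-0.5_8/SQRT(3._8)      ! quadrature positions
      xg(2)=0.5_8+0.5_8/SQRT(3._8)
      total=0._8
      DO ig=1,ng
        total=total+wg(ig)*xg(ig)**3    ! integrand f(xi)=xi**3; exact result is 0.25
      ENDDO

NIMROD tabulates the {wi, ξi} once and also absorbs the Jacobian and mapping information into stored arrays, as described in Chapter II.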
C Temporal Discretization
NIMROD uses finite difference methods to discretize time. The solution field is marched in
a sequence of time-steps from initial conditions.
BMHD = B^n − ∆t ∇ × EMHD
• Synergistic wave-flow effects require having the semi-implicit operator in the predic-
tor step and centering wave-producing terms in the corrector step when there are no
dissipation terms [7]
• NIMROD does not have separate centering parameters for wave and advective terms,
so we set centerings to 1/2 + δ and rely on having a little dissipation.
C.4 ∇ · B Constraint
As in many MHD codes capable of using nonuniform, nonorthogonal meshes, ∇ · B is not identically 0 in NIMROD. Left uncontrolled, the error will grow until it crashes the simulation. We have used an error-diffusion approach [8] that adds the unphysical diffusive term κ∇·B ∇∇ · B to Faraday’s law:

∂B/∂t = −∇ × E + κ∇·B ∇∇ · B      (I.1)
Applying ∇· gives

∂(∇ · B)/∂t = ∇ · κ∇·B ∇(∇ · B)      (I.2)

so that the error is diffused if the boundary conditions maintain constant ∯ dS · B [9].
(∆n)pass = −∆t [ V^{n+1} · ∇n* + n* ∇ · V^{n+1} ]

• pass = predict → n* = n^n
• pass = correct → n* = n^n + fn (∆n)predict
• n^{n+1} = n^n + (∆n)correct
Note: the f* coefficients are centering coefficients, and the Cs* coefficients are semi-implicit coefficients.
[ 1 + (me/ne²) ∇× + ∆t fη ∇× (η/µ0) ∇× − ∆t f∇·B κ∇·B ∇∇· ] (∆B)pass

= ∆t ∇× [ V^{n+1} × B* − (η/µ0) ∇ × B^n + (1/ne) ∇ · Πe^neo ] + ∆t κ∇·B ∇∇ · B^n

• pass = predict → B* = B^n
[ n^{n+1}/(γ−1) + ∆t fχ ∇ · n^{n+1} χα · ∇ ] (∆Tα)pass

= −∆t [ (n^{n+1}/(γ−1)) Vα^{n+1} · ∇Tα* + pα^n ∇ · Vα^{n+1} − ∇ · n^{n+1} χα · ∇Tα^n + Q ]
• α = e, i
[Figure: staggering of the time advance: V is defined at t^{n−1/2} and t^{n+1/2}, while {B, n, T} are defined at t^{n−1} and t^n.]
• The different centering of the Hall time split is required for numerical stability (see
Nebel notes).
• The electron inertia term ∂J/∂t necessarily appears in both time-splits for B when it is included in E.

• Fields used in semi-implicit operators are updated only after changing more than a set level, to reduce unnecessary computation of the matrix elements (which are costly).
2. Time Advance
• Write global, energy, and probe diagnostics to binary and/or text files at desired
frequency in time-step (nhist)
• Write complete dump files at desired frequency in time-step (ndump)
• Problems such as lack of iterative solver convergence can stop simulations at any
point
Chapter II
FORTRAN Implementation
Preliminary remarks
As implied in the Framework section, the NIMROD kernel must complete a variety of
tasks each time step. Modularity has been essential in keeping the code manageable as the
physics model has grown in complexity.
FORTRAN 90 extensions are also an important ingredient. Arrays of data structures pro-
vide a clean way to store similar sets of data that have differing numbers of elements. FORTRAN
90 modules organize different types of data and subprograms. Array syntax leads to cleaner
code. Use of KIND when defining data settles platform-specific issues of precision. ([10] is
recommended.)
A Implementation map
All computationally intensive operations arise during either 1) the finite element calculations to find rhs vectors and matrices, or 2) solution of the linear systems. Most of the operations are repeated for each advance, and equation-specific operations occur at only two levels in
a hierarchy of subprograms for 1). The data and operation needs of subprograms at a level
in the hierarchy tend to be similar, so FORTRAN module grouping of subprograms on a level
helps orchestrate the hierarchy.
[Figure II.1: Implementation map. The adv_* routines manage an advance of one field and call into the following hierarchy of modules and subroutines:]

• FINITE_ELEMENT (·matrix_create, ·get_rhs): organize the finite element computations.
• BOUNDARY: essential conditions.
• REGULARITY
• ITER_3D_CG_MOD: perform 3D matrix solves.
• ITER_CG_F90: perform real 2D matrix solves.
• ITER_CG_COMP: perform complex 2D matrix solves.
• RBLOCK (TBLOCK) (·rblock_make_matrix, ·rblock_get_rhs): perform numerical integration in rblocks (tblocks).
• SURFACE (·surface_comp_rhs): perform numerical integration in surface integrals.
• EDGE (·edge_network): carry out communication across block borders.
• INTEGRANDS: perform most equation-specific pointwise computations.
• NEOCLASS_INT: perform computations for neoclassical-specific equations.
• SURFACE_INTS: perform surface-specific computations.
• VECTOR_TYPE_MOD: perform operations on vector coefficients.
• FFT_MOD
• GENERIC_EVALS: interpolate or find storage for data at quadrature points.
• MATRIX_MOD: compute 2D matrix-vector products.
• MATH_TRAN: do routine math operations.

NOTE: 1. All caps indicate a module name. 2. ‘·’ indicates a subroutine name or an interface to a module procedure.
Before delving into the element-specific structures, let’s back up to the fields module
level. Here, there are also 1D pointer arrays of cvector_type and vector_type for each
solution and steady-state field, respectively. These structures are used to give block-type-
independent access to coefficients. They are allocated 1:nbl. These structures hold pointer
arrays that are used as pointers (as opposed to using them as ALLOCATABLE arrays in struc-
tures). There are pointer assignments from these arrays to the lagr_quad_type structure
arrays that have allocated storage space. (See [10])
In general, the element-specific structures (lagr_quad_type here) are used in conjunction with the basis functions (as occurs in getting field values at quadrature point locations). The element-independent structures (vector_types) are used in linear algebra operations with the coefficients.
The lagr_quad_types are defined in the lagr_quad_mod module, along with the in-
terpolation routines that use them. [The tri_linear_types are the analogous structures
for triangular elements in the tri_linear module.] The lagr_quad_2D_type is used for
φ-independent fields.
The structures are self-describing, containing array dimensions and character variables
for names. There are also some arrays used for interpolation. The main storage locations
are the fs, fsh, fsv, and fsi arrays.
• fs(iv,ix,iy,im) → coefficients for grid-vertex nodes.
• fsh, fsv, and fsi → coefficients for horizontal-side, vertical-side, and cell-interior nodes, respectively, with an added index for the bases of that type.

[Figure: node-coefficient layout within cell (ix,iy) of an rblock for poly_degree = 3. Grid-vertex coefficients such as fs(:,ix−1,iy−1,:) and fs(:,ix,iy−1,:) sit at the corners; horizontal-side coefficients fsh(:,1:2,ix,iy−1,:) lie along the bottom edge; vertical-side coefficients fsv(:,1:2,ix−1,iy,:) and fsv(:,1:2,ix,iy,:) lie along the left and right edges; and cell-interior coefficients fsi(:,1:4,ix,iy,:) fill the interior.]
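To give a feel for what these structures contain, the following is a simplified sketch of a lagr_quad_type-like derived type; it is illustrative only (member names follow the discussion above, but the actual declarations in lagr_quad_mod include additional self-description and differ in detail):

      TYPE :: lagr_quad_type_sketch
        INTEGER :: nqty,mx,my,nfour,n_side,n_int       ! self-describing dimensions
        CHARACTER(6) :: name                           ! field name
        COMPLEX(8), POINTER :: fs(:,:,:,:)             ! grid-vertex coefficients (iv,ix,iy,im)
        COMPLEX(8), POINTER :: fsh(:,:,:,:,:)          ! horizontal-side coefficients
        COMPLEX(8), POINTER :: fsv(:,:,:,:,:)          ! vertical-side coefficients
        COMPLEX(8), POINTER :: fsi(:,:,:,:,:)          ! cell-interior coefficients
        COMPLEX(8), POINTER :: f(:,:),fx(:,:),fy(:,:)  ! interpolation results (see B.2)
      END TYPE lagr_quad_type_sketch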
B.2 Interpolation
The lagr_quad_mod module also contains subroutines to carry out the interpolation of fields
to arbitrary positions. This is a critical aspect of the finite element algorithm, since interpo-
lations are needed to carry out the numerical integrations.
For both real and complex data structures, there are single point evaluation routines
(*_eval) and routines for finding the field at the same offset in every cell of a block
(*_all_eval routines). All routines start by finding αi|(ξ′,η′) [and ∂αi/∂ξ|(ξ′,η′) and ∂αi/∂η|(ξ′,η′) if requested], the (element-independent) basis function values at the requested offset (ξ′, η′). The lagr_quad_bases routine is used for this operation. [It’s also used at startup to find the basis function information at quadrature point locations, stored in the rb%alpha, dalpdr, and dalpdz arrays for use as test functions in the finite element computations.] The interpolation routines then multiply basis function values with the respective coefficients.
The data structure, logical positions for the interpolation point, and desired derivative
order are passed into the single-point interpolation routine. Results for the interpolation are
returned in the structure arrays f, fx, and fy.
The all_eval routines need logical offsets (from the lower-left corner of each cell) instead
of logical coordinates, and they need arrays for the returned values and logical derivatives in
each cell.
See routine struct_set_lay in diagnose.f for an example of a single point interpolation
call. See generic_3D_all_eval for an example of an all_eval call.
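Schematically, a single-point interpolation looks like the following (the routine name and argument list here are illustrative of the pattern only; consult struct_set_lay for an actual call):

      ! evaluate field be of block ibl at logical position (x,y), first derivatives requested
      CALL lagr_quad_eval(rb(ibl)%be,x,y,1_i4)
      bval=rb(ibl)%be%f         ! interpolated values
      bdx =rb(ibl)%be%fx        ! derivative results
      bdy =rb(ibl)%be%fy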
Interpolation operations are also needed in tblocks. The computations are analogous
but somewhat more complicated due to the unstructured nature of these blocks. I won’t
cover tblocks in any detail, since they are due for a substantial upgrade to allow arbitrary
polynomial degree basis functions, like rblocks. However, development for released versions
must be suitable for simulations with tblocks.
B.3 Vector_types
We often perform linear algebra operations on vectors of coefficients. [Management routines,
iterative solves, etc.] Vector_type structures simplify these operations, putting block and
basis-specific array idiosyncrasies into module routines. The developer writes one call to a
subroutine interface in a loop over all blocks.
Consider an example from the adv_b_iso management routine. There is a loop over Fourier components that calls two real 2D matrix solves per component. One solve is for Re{BR}, Re{BZ}, and −Im{Bφ} for the Fourier component, and the other solves for Im{BR}, Im{BZ}, and Re{Bφ}. After a solve in a predictor step (b_pass = ‘bmhd pre’ or ‘bhall pre’), we need to create a linear combination of the old field and the predicted solution. At this point, calls to vector_type module routines perform the assignment operations, component array to component array or scalar to each component array element.
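A schematic of that predictor-step blending is shown below; the subroutine name, structure names, and weight fb are placeholders for the actual vector_type calls and variables, and the real kind r8 is assumed from the local modules:

      ! form work = fb*work + (1-fb)*be in every block with one interfaced call
      DO ibl=1,nbl
        CALL vector_add(work(ibl),be(ibl),v1fac=fb,v2fac=1._r8-fb)
      ENDDO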
B.4 Matrix_types
The matrix_type_mod module has definitions for structures used to store matrix elements
and for structures that store data for preconditioning iterative solves. The structures are
somewhat complicated in that there are levels containing pointer arrays of other structures.
The necessity of such complication arises from varying array sizes associated with different
blocks and basis types.
At the bottom of the structure, we have arrays for matrix elements that will be used
in matrix-vector product routines. Thus, array ordering has been chosen to optimize these
product routines which are called during every iteration of a conjugate gradient solve. Fur-
thermore, each lowest-level array contains all matrix elements that are multiplied with a
grid-vertex, horizontal-side, vertical-side, or cell-interior vector type array (arr, arrh, arrv,
and arri respectively) i.e. all coefficients for a single “basis-type” within a block. The matri-
ces are 2D, so there are no Fourier component indices at this level, and our basis functions do
not produce matrix couplings from the interior of one grid block to the interior of another.
The 6D matrix is unimaginatively named arr. Referencing one element appears as
arr(jq,jxoff,jyoff,iq,ix,iy)
where
jx = ix + jxoff
jy = iy + jyoff
For each dimension, vertex to vertex offsets are −1 ≤ j ∗ off ≤ 1, with storage zeroed
out where the offset extends beyond the block border. The offset range is set by the extent
of the corresponding basis functions.
Figure II.2: 1-D cubic: extent of αi (red) and αj (green); integration gives joff = −1.
Offsets are more limited for basis functions that have their nonzero node in a cell interior.
Integration here gives joff=-1 matrix elements, since cells are numbered from 1, while
vertices are numbered from 0. Thus for grid-vertex-to-cell connections −1 ≤ jxoff ≤
0, while for cell-to-grid-vertex connections 0 ≤ jxoff ≤ 1. Finally, j*off=0 for cell-to-
cell. The ‘outer-product’ nature implies unique offset ranges for grid-vertex, horizontal-side,
vertical-side, and cell-interior centered coefficients.
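The offset storage is easiest to see in a matrix-vector product. The fragment below is a schematic for grid-vertex coefficients only (the actual matrix_mod routines cover all basis types and use fixed index limits), with nq, mx, and my assumed defined; vec carries a zeroed halo so that offsets reaching beyond the block border contribute nothing:

      REAL(8) :: arr(nq,-1:1,-1:1,nq,0:mx,0:my)    ! vertex-vertex matrix elements
      REAL(8) :: vec(nq,-1:mx+1,-1:my+1)           ! operand, halo entries set to 0
      REAL(8) :: pr(nq,0:mx,0:my)                  ! product
      INTEGER :: ix,iy,iq,jq,jxoff,jyoff
      pr=0._8
      DO iy=0,my
        DO ix=0,mx
          DO jyoff=-1,1
            DO jxoff=-1,1
              DO iq=1,nq
                DO jq=1,nq
                  pr(iq,ix,iy)=pr(iq,ix,iy) &
                    +arr(jq,jxoff,jyoff,iq,ix,iy)*vec(jq,ix+jxoff,iy+jyoff)
                ENDDO
              ENDDO
            ENDDO
          ENDDO
        ENDDO
      ENDDO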
Though offset requirements are the same, we also need to distinguish different bases of
the same type, such as the two sketched here.
Figure II.3: 1-D cubic: extent of αi (red) and βj (green); note j is an internal node.

Figure II.4: 1-D cubic: extent of βi (red) and βj (green); note both i and j are internal nodes.
Instead of giving the arrays 16 different names, we place arr in a structure contained by
a 4 × 4 array mat. One of the matrix element arrays is then referenced as
mat(jb,ib)%arr
where
1≡grid-vertex type
2≡horizontal-vertex type
3≡vertical-vertex type
4≡cell-interior type
Bilinear elements are an exception. There are only grid-vertex bases, so mat is a 1 × 1
array.
This level of the matrix structure also contains self-descriptions, such as nbtype=1 or
4; starting horizontal and vertical coordinates (=0 or 1) for each of the different types;
nb_type(1:4) is equal to the number of bases for each type (poly_degree−1 for types 2-3 and (poly_degree − 1)² for type 4); and nq_type is equal to the quantity-index dimension
for each type.
All of this is placed in a structure, rbl_mat, so that we can have an array of rbl_mat’s with one array component per grid block. A similar tbl_mat structure is also set at this level, along with self-description.
Collecting all block_mats in the global_matrix_type allows one to define an array of
these structures with one array component for each Fourier component.
This is a 1D array since the only stored matrices are diagonal in the Fourier component
index, i.e. 2D.
matrix_type_mod also contains definitions of structures used to hold data for the pre-
conditioning step of the iterative solves. NIMROD has several options, and each has its own
storage needs.
So far, only matrix_type definitions have been discussed. The storage is located in the
matrix_storage_mod module. For most cases, each equation has its own matrix storage
that holds matrix elements from time-step to time-step with updates as needed. Anisotropic
operators require complex arrays, while isotropic operators use the same real matrix for two
sets of real and imaginary components, as in the B advance example in II.B.3.
The matrix_vector product subroutines are available in the matrix_mod module. Inter-
face matvec allows one to use the same name for different data types. The large number of
low-level routines called by the interfaced matvecs arose from optimization. The operations
were initially written into one subroutine with adjustable offsets and starting indices. What’s
there now is ugly but faster due to having fixed index limits for each basis type resulting in
better compiler optimization.
Note that tblock matrix_vector routines are also kept in matrix_mod. They will undergo
major changes when tblock basis functions are generalized.
The matelim routines serve a related purpose - reducing the number of unknowns in a
linear system prior to calling an iterative solve. This amounts to a direct application of
matrix partitioning. For the linear system equation,

[ A11  A12 ] [ x1 ]   [ b1 ]
[ A21  A22 ] [ x2 ] = [ b2 ]      (II.1)

where the Aij are sub-matrices, the xj and bi are the parts of the vectors, and A and A22 are invertible,

x2 = A22⁻¹ (b2 − A21 x1)      (II.2)

and eliminating x2 from the first block row leaves the Schur complement system

( A11 − A12 A22⁻¹ A21 ) x1 = b1 − A12 A22⁻¹ b2      (II.3)

which is a reduced system. If A11 is sparse but A11 − A12 A22⁻¹ A21 is not, partitioning may slow the computation. However, if A11 − A12 A22⁻¹ A21 has the same structure as A11, it can help. For
cell-interior to cell-interior sub-matrices, this is true due to the single-cell extent of interior
basis functions. Elimination is therefore used for 2D linear systems when poly_degree> 1.
Additional notes regarding matrix storage: the stored coefficients are not summed
across block borders. Instead, vectors are summed, so that Ax is found through operations
separated by block, followed by a communication step to sum the product vector across block
borders.
encouraged repeated coding with slight modifications instead of taking the time to develop
general purpose tools for each task. Eventually it became too cumbersome, and the changes
made to achieve modularity at 2.1.8 were extensive.
Despite many changes since, the modularity revision has stood the test of time. Changing terms or adding equations requires a relatively minimal amount of coding. But even major
changes, like generalizing finite element basis functions, are tractable. The scheme will need
to evolve as new strides are taken. [Particle closures come to mind.] However, thought and
planning for modularity and data partitioning reward in the long term.
NOTE
• These comments relate to coding for release. If you are only concerned about one
application, do what is necessary to get a result
• However, in many cases, a little extra effort leads to a lot more usage of development
work.
B.6 Seams
The seam structure holds information needed at the borders of grid blocks, including the
connectivity of adjacent blocks and geometric data and flags for setting boundary conditions.
The format is independent of block type to avoid block-specific coding at the step where
communication occurs.
The seam_storage_mod module holds the seam array, a list of blocks that touch the boundary of the domain (exblock_list), and seam0. The seam0 structure is of edge_type,
like the seam array itself. seam0 was intended as a functionally equivalent seam for all space
beyond the domain, but the approach hampered parallelization. Instead, it is only used
during initialization to determine which vertices and cell sides lie along the boundary. We
still require nimset to create seam0.
As defined in the edge_type_mod module, an edge_type holds vertex and segment struc-
tures. Separation of the border data into a vertex structure and segment structure arises
from the different connectivity requirements; only two blocks can share a cell side, while
many blocks can share a vertex. The three logical arrays expoint, excorner, and r0point are flags that indicate whether a boundary condition should be applied. Additional arrays
could be added to indicate where different boundary conditions should be applied.
Aside: For flux injection boundary conditions, we have added the logical array applV0(1:2,:). This array is set at NIMROD startup and is used to determine where to apply injection current and the appropriate boundary condition. The first component applies to the vertex and the second to the associated segment.
The seam array is dimensioned from one to the number of blocks. For each block the
vertex and segment arrays have the same size; the number of border vertices equals the number of border segments, but the starting locations are offset (see Figure II.5). For rblocks, the ordering is unique.

Figure II.5: iv = vertex index, is = segment index, nvert = number of segments and of vertices around the block border.
Both indices proceed counterclockwise around the block. For tblocks, there is no requirement that the scan start at a particular internal vertex number, but the seam indexing always progresses counterclockwise around its block (around the entire domain for seam0).
Within the vertex_type structure, the ptr and ptr2 arrays hold the connectivity infor-
mation. The ptr array is in the original format with connections to seam0. It gets defined
in nimset and is saved in dump files. The ptr2 array is a duplicate of ptr, but references
to seam0 are removed. It is defined during the startup phase of a nimrod simulation and
is not saved to dump files. While ptr2 is the one used during simulations, ptr is the one
that must be defined during pre-processing. Both arrays have two dimensions. The second
index selects a particular connection, and the first index is either: 1 to select (global) block
numbers (running from 1 to nbl_total, not 1 to nbl, see information on parallelization), or
2 to select seam vertex number. (see Figure II.6)
Connections for the one vertex common to all blocks are stored as:
[Figure II.6: three grid blocks meeting at a common corner vertex. The shared point is seam vertex iv=5 of block1, iv=2 of block2, and iv=11 of block3.]
• seam(1)%vertex(5)
– seam(1)%vertex(5)% ptr(1,1) = 3
– seam(1)%vertex(5)% ptr(2,1) = 11
– seam(1)%vertex(5)% ptr(1,2) = 2
– seam(1)%vertex(5)% ptr(2,2) = 2
• seam(2)%vertex(2)
– seam(2)%vertex(2)%ptr(1,1) = 1
– seam(2)%vertex(2)%ptr(2,1) = 5
– seam(2)%vertex(2)%ptr(1,2) = 3
– seam(2)%vertex(2)%ptr(2,2) = 11
• seam(3)%vertex(11)
– seam(3)%vertex(11)%ptr(1,1) = 2
– seam(3)%vertex(11)%ptr(2,1) = 2
– seam(3)%vertex(11)%ptr(1,2) = 1
– seam(3)%vertex(11)%ptr(2,2) = 5
Notes on ptr:
– seam(3)%vertex(11)%ptr(1,3) = 3
– seam(3)%vertex(11)%ptr(2,3) = 11
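A schematic traversal of this connectivity data (using ptr2, which omits seam0 references) looks like:

      ! gather the neighbors of border vertex iv in block ibl
      DO ip=1,SIZE(seam(ibl)%vertex(iv)%ptr2,2)
        jbl=seam(ibl)%vertex(iv)%ptr2(1,ip)   ! global block number of the neighbor
        jv =seam(ibl)%vertex(iv)%ptr2(2,ip)   ! seam vertex index within that block
      ENDDO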
The vertex_type stores interior vertex labels for its block in the intxy array. In rblocks
vertex(iv)%intxy(1:2) saves (ix,iy) for seam index iv. In tblocks, there is only one
block-interior vertex label. intxy(1) holds its value, and intxy(2) is set to 0.
Aside: Though discussion of tblocks has been avoided due to expected revisions, the vec-
tor_type section should have discussed their index ranges for data storage. Triangle vertices
are numbered from 0 to mvert, analogous to the logical coordinates in rblocks. The list is
1D in tblocks, but an extra array dimension is carried so that vector_type pointers can be
used. For vertex information, the extra index follows the rblock vertex index and has the
ranges 0:0. The extra dimension for cell information has the range 1:1, also consistent with
rblock cell numbering.
The various %seam_* arrays in vertex_type are temporary storage locations for the data
that is communicated across block boundaries. The %order array holds a unique prescription
(determined during NIMROD startup) of the summation order for the different contributions
at a border vertex, ensuring identical results for the different blocks. The %ave_factor and
%ave_factor_pre arrays hold weights that are used during the iterative solves.
For vertices along a domain boundary, tang and norm are allocated (1:2) to hold the R and
Z components of unit vectors in the boundary tangent and normal directions, respectively,
with respect to the poloidal plane. These unit vectors are used for setting Dirichlet boundary
conditions. The scalar rgeom is set to R for all vertices in the seam. It is used to find vertices
located at R = 0, where regularity conditions are applied.
The %segment array (belonging to edge_type and of type segment_type) holds similar information; however, segment communication differs from vertex communication in three ways.
• First, as mentioned above, only two blocks can share a cell side.
• Second, segments are used to communicate off-diagonal matrix elements that extend
along the cell side.
The 1D ptr array reflects the limited connectivity. Matrix element communication makes use of the intxyp and intxyn internal vertex indices for the previous and next vertices along the seam (intxys holds the internal cell indices). Finally, the tang and norm arrays have an extra dimension corresponding to the different element side nodes if poly_degree > 1.
Before describing the routines that handle block border communication, it is worth noting
what is written to the dump files for each seam. The subroutine dump_write_seam in the
dump module shows that very little is written. Only the ptr, intxy, and excorner arrays
plus descriptor information are saved for the vertices, but seam0 is also written. None of
the segment information is saved. Thus, NIMROD creates much of the vertex and all of
the segment information at startup. This ensures self-consistency among the different seam
structures and with the grid.
The routines called to perform block border communication are located in the edge
module. Central among these routines is edge_network, which invokes serial or parallel op-
erations to accumulate vertex and segment data. The edge_load and edge_unload routines
are used to transfer block vertex-type data to or from the seam storage. There are three sets
of routines for the three different vector types.
A typical example occurs at the end of the get_rhs routine in finite_element_mod.
The code segment starts with
DO ibl=1,nbl
CALL edge_load_carr(rhsdum(ibl),nqty,1_i4,nfour,nside,seam(ibl))
ENDDO
which loads vector component indices 1:nqty (starting location assumed) and Fourier component indices 1:nfour (starting location set by the third parameter in the CALL statement) for the vertex nodes and nside cell-side nodes from the cvector_type rhsdum to the seam_cin arrays in seam. The single ibl loop includes both block types.
The next step is to perform the communication.
CALL edge_network(nqty,nfour,nside,.false.)
There are a few subtleties in the passed parameters. Were the operation for a real (necessarily 2D) vector_type, nfour must be 0 (0_i4 in the statement to match the dummy argument’s kind). Were the operation for a cvector_2D_type, as occurs in iter_cg_comp, nfour must be 1 (1_i4). Finally, the fourth argument is a flag to perform the load and
unload steps within edge_network. If true, it communicates the crhs and rhs vector types
in the computation_pointers module, which is something of an archaic remnant of the
pre-data-partitioning days.
The final step is to unload the border-summed data.
DO ibl=1,nbl
CALL edge_unload_carr(rhsdum(ibl),nqty,1_i4,nfour,nside,seam(ibl))
ENDDO
The pardata module also has structures for the seams that are used for parallel communi-
cation only. They are accessed indirectly through edge_network and do not appear in the
finite element or linear algebra coding.
The first type of management routine performs two matrix solves (using the same matrix) for each Fourier component. The adv_v_clap
is an example; the matrix is the sum of a mass matrix and the discretized Laplacian. The
second type is diagonal in Fourier component, but all real and imaginary vector components
must be solved simultaneously, usually due to anisotropy. The full semi-implicit operator
has such anisotropy, hence adv_v_aniso is this type of management routine. The third type
of management routine deals with coupling among Fourier components, i.e. 3D matrices.
Full use of an evolving number density field requires a 3D matrix solve for velocity, and this
advance is managed by adv_v_3d.
All these management routine types are similar with respect to creating a set of 2D ma-
trices (used only for preconditioning in the 3D matrix-systems) and with respect to finding
the rhs vector. The call to matrix_create passes an array of global_matrix_type and an
array of matrix_factor_type, so that operators for each Fourier component can be saved.
Subroutine names for the integrand and essential boundary conditions are the next argu-
ments, followed by a b.c. flag and the ’pass’ character variable. If elimination of cell-interior
data is appropriate (see Sec. B.4), matrix elements will be modified within matrix_create,
and interior storage [rbl_mat(ibl)%mat(4,4)%arr] is then used to hold the inverse of the
interior submatrix (A22⁻¹).
The get_rhs calls are similar, except the parameter list is (integrand routine name,
cvector_type array, essential b.c. routine name, b.c. flags (2), logical switch for using
a surface integral, surface integrand name, global_matrix_type). The last argument is
optional and its presence indicates that interior elimination is used. [Use rmat_elim= or cmat_elim= to signify real or complex matrix types.]
Once the matrix and rhs are formed, there are vector and matrix operations to find the
product of the matrix and the old solution field, and the result is added to the rhs vector
before calling the iterative solve. This step allows us to write integrand routines for the
change of a solution field rather than its new value, but then solve for the new field so that
the relative tolerance of the iterative solver is not applied to the change, since |∆x| ≪ |x| is often true.
Summarizing:
• Integrands are written for A and b with A∆x = b to avoid coding semi-implicit terms
twice.
• Before the iterative solve, find b − A xold.

• The iterative solves of 3D matrices use a ‘matrix-free’ approach, where the full 3D matrix is never formed. Instead, an rhs integrand name is passed.
• Although the names of the integrand, surface integrand, and boundary condition rou-
tines are passed through matrix_create and get_rhs, finite_element does not use
the integrands, surface_ints, or boundary modules. finite_element just needs
parameter-list information to call any routine out of the three classes. The parameter
list information, including assumed-shape array definitions, are provided by the inter-
face blocks. Note that all integrand routines suitable for comp_matrix_create, for
example, must use the same argument list as that provided by the interface block for
‘integrand’ in comp_matrix_create.
Figure II.8: four quadrature points (ig = 1, . . . , 4) in each of the four elements of a block; dots denote quadrature positions, not nodes.
The rblock, tblock, and surface modules are separated to provide a partitioning of the different geometric information and integration procedures. This approach would allow us to incorporate additional block types in a straightforward manner should the need arise.
The rblock_get_comp_rhs routine serves as an example of the numerical integration
performed in NIMROD. It starts by setting the storage arrays in the passed cvector_type
to 0. It then allocates the temporary array, integrand, which is used to collect contributions for test functions that are nonzero in a cell, one quadrature point at a time. Our quadrilateral elements have (poly_degree+1)² nonzero test functions in every cell (see Figure II.7 and observe the number of node positions).
The integration is performed as a loop over quadrature points, using saved data for the logical offsets from the bottom-left corner of each cell, R, and wi J(ξi, ηi). Thus each iteration of the do-loop finds the contributions for the same respective quadrature point in all elements of a block.
The call to the dummy name for the passed integrand then finds the equation and
algorithm-specific information from a quadrature point. A blank tblock is passed along
with the rb structure for the block, since the respective tblock integration routine calls the
same integrands and one has to be able to use the same argument list. In fact, if there were
only one type of block, we would not need the temporary integrand storage at all.
The final step in the integration is the scatter into each basis function coefficient storage. Note that wi J appears here, so in the integral ∫ dξ dη J(ξ, η) f(ξ, η), the integrand routine finds f(ξ, η) only.
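The overall pattern can be summarized with the following schematic (the names integrand, intgd, wjac, rhs, nbasis, and ncell are placeholders, not the actual variables of rblock_get_comp_rhs):

      DO ig=1,ng                              ! one quadrature point at a time
        CALL integrand(intgd,ig)              ! fills intgd(1:nq,1:ncell,1:nbasis) with the
                                              ! test-function-weighted physics terms
        DO iv=1,nbasis                        ! scatter, applying the w_i*J factor here
          DO icell=1,ncell
            rhs(:,icell,iv)=rhs(:,icell,iv)+wjac(icell,ig)*intgd(:,icell,iv)
          ENDDO
        ENDDO
      ENDDO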
The matrix integration routines are somewhat more complicated, because contributions
have to be sorted for each basis type and offset pair, as described in B.4 [NIMROD uses two
different labeling schemes for the basis function nodes. The vector_type and global_matrix_type
separate basis types (grid vertex, horizontal side, vertical side, cell interior) for efficiency in
linear algebra. In contrast, the integrand routines use a single list for nonzero basis func-
tions in each element. The latter is simpler and more tractable for the block-independent
integrand coding, but each node may have more than one label in this scheme. The matrix
integration routines have to do the work of translating between the two labeling schemes.]
C.4 Integrands
Before delving further into the coding, let’s review what is typically needed for an integrand
computation. The Faraday’s law example in Section A.1 suggests the following:
1. Finite element basis functions and their derivatives with respect to R and Z.
2. Wavenumbers for the toroidal direction.
3. Values of solution fields and their derivatives.
4. Vector operations.
5. Pseudospectral computations.
6. Input parameters and global simulation data.
Some of this information is already stored at the quadrature point locations. The basis
function information depends on the grid and the selected solution space. Neither vary during
a simulation, so the basis function information is evaluated and saved in block-structure arrays as described in B.1. Solution field data is interpolated to the quadrature points and
stored at the end of the equation advances. It is available through the qp_type storage, also
described in B.1. Therefore, generic_alpha_eval and generic_ptr_set routines merely
set pointers to the appropriate storage locations and do not copy the data to new locations.
It is important to remember that integrand computations must not change any values
in the pointer arrays. Doing so will lead to unexpected and confusing consequences in other
integrand routines.
The input and global parameters are satisfied by the F90 USE statement of modules given
those names. The array keff(1:nmodes), is provided to find the value of the wavenumber
associated with each Fourier component. The factor keff(1:nmodes)/bigr is the toroidal
wavenumber for the data at Fourier index imode. There are two things to note:
1. In toroidal geometry, keff holds the n-value, and bigr(:,:) holds R. In linear geometry, keff holds kn = 2πn/per_length and bigr = 1 (a schematic use is sketched after these notes).
2. Do not assume that a particular n-value is associated with some value of imode. Lin-
ear computations often have nmodes=1 and keff(1) equal to the input parameter
lin_nmax. Domain decomposition also affects the values and range of keff. This is
why there is a keff array.
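A schematic use of keff inside an integrand is sketched below (the array names dfdphi and f are illustrative); multiplying a Fourier coefficient by i keff(imode)/bigr produces its toroidal-derivative contribution in either geometry:

      DO imode=1,nmodes
        dfdphi(:,:,imode)=(0._8,1._8)*keff(imode)*f(:,:,imode)/bigr(:,:)
      ENDDO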
All of the stored matrices are 2D, i.e. diagonal in n, so separate indices for row and column
n indices are not used.
The get_mass subroutine then consists of a dimension check to determine the number of nonzero finite element basis functions in each cell, a pointer set for α, and a double loop over
basis indices. Note that our operators have full storage, regardless of symmetry. This helps
make our matrix-vector products faster and it implies that the same routines are suitable
for forming nonsymmetric matrices.
The next term in the lhs (from the weak form in Sec. B.2) arises from implicit diffusion. The adv_b_iso management routine passes the curl_de_iso integrand name, so let’s examine that routine.
It is more complicated than get_mass, but the general form is similar. The routine is used
for creating the resistive diffusion operator and the semi-implicit operator for Hall terms, so
there is coding for two different coefficients. The character variable integrand_flag, that
is set in the advance subroutine of nimrod.f and saved in the global module, determines
which coefficient to use. Note that calls to generic_ptr_set pass both rblock and tblock structures for a field. One of them is always a blank structure, but this keeps block-specific
coding out of integrands. Instead, the generic_evals module selects the appropriate data
storage.
After the impedance coefficient is found, the int array is filled. The basis loops appear in a conditional statement to use or skip the ∇∇· term for divergence cleaning. We will skip them in this discussion. Thus we need ∇ × (αj′ êl′ e^{−in′φ}) · ∇ × (αj êl e^{inφ}) only, as in the weak form of Sec. B.2. To understand what is in the basis function loops, we need to evaluate ∇ × (αj êl e^{inφ}). Though êl is êR, êZ, or êφ only, it is straightforward to use a general
direction vector and then restrict attention to the three basis vectors of our coordinate
system.
∇ × (αj êl e^{inφ}) = ∇(αj e^{inφ}) × êl + αj e^{inφ} ∇ × êl

∇(αj e^{inφ}) = [ (∂α/∂R) R̂ + (∂α/∂Z) Ẑ + (inα/R) φ̂ ] e^{inφ}

∇(αj e^{inφ}) × êl = { [ (∂α/∂Z)(êl)φ − (inα/R)(êl)Z ] R̂ + [ (inα/R)(êl)R − (∂α/∂R)(êl)φ ] Ẑ + [ (∂α/∂R)(êl)Z − (∂α/∂Z)(êl)R ] φ̂ } e^{inφ}

∇ × êl = 0 for l = R, Z

∇ × êl = −(1/R) Ẑ for l = φ and toroidal geometry
Thus, the scalar product ∇ × (αj′ êl′ e^{−in′φ}) · ∇ × (αj êl e^{inφ}) is

[ (∂αj′/∂Z)(êl′)φ + (in′αj′/R)(êl′)Z ] [ (∂αj/∂Z)(êl)φ − (inαj/R)(êl)Z ]

+ [ −(in′αj′/R)(êl′)R − ( αj′/R + ∂αj′/∂R )(êl′)φ ] [ (inαj/R)(êl)R − ( αj/R + ∂αj/∂R )(êl)φ ]

+ [ (∂αj′/∂R)(êl′)Z − (∂αj′/∂Z)(êl′)R ] [ (∂αj/∂R)(êl)Z − (∂αj/∂Z)(êl)R ]

Finding the appropriate scalar product for a given (jq,iq) pair is then a matter of substituting R̂, Ẑ, φ̂ for êl′, with l′ ⇒ iq = 1, 2, 3, respectively. This simplifies the scalar product greatly for each case. Note how the symmetry of the operator is preserved by construction. For (jq, iq) = (1, 1), (êl′)R = (êl)R = 1 and (êl′)Z = (êl′)φ = (êl)Z = (êl)φ = 0, so

int(1,1,:,:,jv,iv) = [ (n²/R²) αjv αiv + (∂αjv/∂Z)(∂αiv/∂Z) ] × ziso
where ziso is an effective impedance.
Like other isotropic operators the resulting matrix elements are either purely real or
purely imaginary, and the only imaginary elements are those coupling poloidal and toroidal
vector components. Thus, only real poloidal coefficients are coupled to imaginary toroidal
coefficients and vice versa. Furthermore, the coupling between Re(BR, BZ ) and Im(Bφ ) is the
same as that between Im(BR , BZ ) and Re(Bφ ), making the operator phase independent. This
lets us solve for these separate groups of components as two real matrix equations, instead
of one complex matrix equation, saving CPU time. An example of the poloidal-toroidal
coupling for real coefficients is
int(1,3,:,:,jv,iv) = (n αjv/R) ( αiv/R + ∂αiv/∂R ) × ziso

which appears through a transpose of the (3, 1) element. [Compare with −(in αjv/R)( αiv/R + ∂αiv/∂R ) from the scalar product above.]
Notes on matrix integrands:
• The Fourier component index, jmode, is taken from the global module. It is the
do-loop index set in matrix_create and must not be changed in integrand routines.
• Other vector-differential operations acting on αj êl exp (inφ) are needed elsewhere. The
computations are derived in the same manner as ∇ × (αj êl exp (inφ)) given above.
1. The vector aspect (in a linear algebra sense) of the result implies one set of basis
function indices in the int array.
2. Coefficients for all Fourier components are created in one integrand call.
3. The int array has 5 dimensions, int(iq,:,:,iv,im), where im is the Fourier index
(imode).
Returning to Faraday’s law as an example, the brhs_mhd routine finds ∇ × (αj′ êl′ e^{−in′φ}) · E(R, Z, φ), where E may contain ideal MHD, resistive, and neoclassical contributions. As with the matrix integrand, E is only needed at the quadrature points.
From the first executable statement, there are a number of pointers set to make various
fields available. The pointers for B are set to the storage for data from the end of the last
time-split, and the math_tran routine, math_curl is used to find the corresponding current
density. Then, ∇ · B is found for the error diffusion term. After that, the be, ber, and bez arrays are reset to the predicted B for the ideal electric field during corrector steps (indicated by the integrand_flag). More pointers are then set for neoclassical calculations.
The first nonlinear computation appears after the neoclassical_init select block. The
V and B data is transformed to functions of φ, where the cross product, V ×B is determined.
The resulting nonlinear ideal E is then transformed to Fourier components (see Sec B.1).
Note that fft_nim concatenates the poloidal position indices into one index, so that real_be
has dimensions (1:3,1:mpseudo,1:nphi) for example. The mpseudo parameter can be less
than the number of cells in a block due to domain decomposition (see Sec C), so one should
not relate the poloidal index with the 2D poloidal indices used with Fourier components.
A loop over the Fourier index follows the pseudospectral computations. The linear ideal
E terms are created using the data for the specified steady solution, completing (Vs × B + V × Bs + V × B) (see Sec. A.1). Then, the resistive and neoclassical terms are computed.
Near the end of the routine, we encounter the loop over test-function basis function indices (j′, l′). The terms are similar to those in curl_de_iso, except ∇ × (αj êl e^{−inφ}) is replaced by the local E, and there is a sign change for −∇ × (αj′ êl′ e^{−in′φ}) · E.
The integrand routines used during iterative solves of 3D matrices are very similar to
rhs integrand routines. They are used to find the dot product of a 3D matrix and a vector
without forming matrix elements themselves (see Sec D.2). The only noteworthy difference
with rhs integrands is that the operand vector is used only once. It is therefore interpolated to the quadrature locations in the integrand routine with a generic_all_eval call, and the interpolation result is left in local arrays. In p_aniso_dot, for example, pres, presr, and presz are
local arrays, not pointers.
where j is a basis function node along the boundary, and l and n are the direction-vector and
Fourier component indices. In some cases, a linear combination is set to zero. For Dirichlet
conditions on the normal component, for example, the rhs of Ax = b is modified to

b̃ = (I − n̂j n̂j) · b

where n̂j is the unit normal direction at boundary node j. The matrix becomes
(I − n̂j n̂j ) · A · (I − n̂j n̂j ) + n̂j n̂j (II.6)
Dotting n̂j into the modified system gives
xjn = n̂j · x = 0      (II.7)
Dotting (I − n̂j n̂j ) into the modified system gives
(I − n̂j n̂j ) · A · x̃ = b̃ (II.8)
The net effect is to apply the boundary condition and to remove xj n from the rest of the
linear system.
Notes:
• Changing boundary node equations is computationally more tractable than changing
the size of the linear system, as implied by finite element theory.
• Removing coefficients from the rest of the system [through ·(I − n̂n̂) on the right side of
A] may seem unnecessary. However, it preserves the symmetric form of the operator,
which is important when using conjugate gradients.
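For one boundary node, the (I − n̂n̂) projection above amounts to the following sketch (vec holds the R and Z components at that node, and norm holds the components of the unit normal from the seam; names are illustrative):

      vdotn=vec(1)*norm(1)+vec(2)*norm(2)   ! normal component at the node
      vec(1)=vec(1)-vdotn*norm(1)           ! subtract it, leaving the tangential part
      vec(2)=vec(2)-vdotn*norm(2)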
The dirichlet_rhs routine in the boundary module modifies a cvector_type through
operations like (I − n̂n̂)·. The ‘component’ character input is a flag describing which compo-
nent(s) should be set to 0. The seam data tang, norm, and intxy(s) for each vertex (segment)
are used to minimize the amount of computation.
The dirichlet_op and dirichlet_comp_op routines perform the related matrix opera-
tions described above. Though mathematically simple, the coding is involved, since it has
to address the offset storage scheme and multiple basis types for rblocks.
At present, the kernel always applies the same Dirichlet conditions to all boundary nodes
of a domain. Since the dirichlet_rhs is called one block at a time, different component
input could be provided for different blocks. However, the dirichlet_op routines would
need modification to be consistent with the rhs. It would also be straightforward to make
the ‘component’ information part of the seam structure, so that it could vary along the
boundary independent of the block decomposition.
Since our equations are constructed for the change of a field and not its new value, the
existing routines can be used for inhomogeneous Dirichlet conditions, too. If the conditions
do not change in time, one only needs to comment out the calls to dirichlet_rhs at the end
of the respective management routine and in dirichlet_vals_init. For time-dependent
inhomogeneous conditions, one can adjust the boundary node values from the management
routine level, then proceed with homogeneous conditions in the system for the change in the
field. Some consideration for time-centering the boundary data may be required. However,
the error is no more than O(∆t) in any case, provided that the time rate of change does not
appear explicitly.
where ψ is a series of nonnegative powers of its argument. For vectors, the Z-direction
component satisfies the relation for scalars, but
1. At R = 0 nodes only, change variables to V+ = (Vr + iVφ )/2 and V− = (Vr − iVφ )/2.
3. To preserve symmetry with nonzero V− column entries, add −i times the Vφ -row to
the VR-row and set all elements of the Vφ -row to 0 except for the diagonal.
After the linear system is solved, the regular_ave routine is used to set Vφ = iV− at the
appropriate nodes.
Returning to the power series behavior for R → 0, the leading-order behavior for S0, VR1, and Vφ1 is the vanishing radial derivative. This is like a Neumann boundary condition. Unfortunately, we cannot rely on a ‘natural B.C.’ approach because there is no surface with
finite area along an R = 0 side of a computational domain. Instead, we add the penalty matrix,

∫ dV (w/R²) (∂αj′/∂R)(∂αj/∂R),      (II.12)
saved in the dr_penalty matrix structure, with suitable normalization to matrix rows for
S0, VR1, and Vφ1. The weight w is nonzero in elements along R = 0 only. Its positive value penalizes changes leading to ∂/∂R ≠ 0 in these elements. It is not diffusive, since it is added to
linear equations for the changes in physical fields. The regular_op and regular_comp_op
routines add this penalty term before addressing the other regularity conditions.
[Looking through the subroutines, it appears that the penalty has been added for VR1
and Vφ1 only. We should keep this in mind if a problem arises with S0 (or VZ0 ) near R = 0.]
D Matrix Solution
At present, NIMROD relies on its own iterative solvers that are coded in FORTRAN 90 like
the rest of the algorithm. They perform the basic conjugate gradient steps in NIMROD’s
block decomposed vector_type structures, and the matrix_type storage arrays have been
arranged to optimize matrix-vector multiplication.
D.1 2D matrices
The iter_cg_f90 and iter_cg_comp modules contain the routines needed to solve symmetric-
positive-definite and Hermitian-positive-definite systems, respectively. The iter_cg module
holds interface blocks to give common names and the real and complex routines. The rest
of the iter_cg.f file has external subroutines that address solver-specific block-border com-
munication operations.
Within the long iter_cg_f90.f and iter_cg_comp.f files are separate modules for routines for the different preconditioning schemes, a module for managing partial factorization (iter_*_fac), and a module for managing routines that solve a system (iter_cg_*).
The basic cg steps are performed by iter_solve_*, and may be compared with textbook
descriptions of cg, once one understands NIMROD’s vector_type manipulation.
Although normal seam communication is used during matrix-vector multiplication, there are also solver-specific block communication operations. The partial factorization routines for preconditioning need matrix elements to be summed across block borders (including off-diagonal elements connecting border nodes), unlike the matrix-vector product routines. The iter_mat_com routines execute this communication using special functionality in edge_seg_network for the off-diagonal elements. After the partial factors are found and stored in the matrix_factor_type structure, border elements of the matrix storage are restored to their original values by iter_mat_rest.
Other seam-related oddities are the averaging factors for border elements. The scalar
product of two vector_types is computed by iter_dot. The routine is simple, but it has
to account for redundant storage of coefficients along block borders, hence ave_factor. In
the preconditioning step (the solve of Ãz = r, where Ã is an approximation of A and r is the
residual), results from the block-based partial factors (the “direct”, “bl_ilu_*”, and “bl_diag*”
choices) are averaged at block borders. However, simply multiplying border z-values by
ave_factor and seaming after preconditioning destroys the symmetry of the operation (and
convergence). Instead, we symmetrize the averaging by multiplying border elements of r by
ave_factor_pre (∼ √ave_factor), inverting Ã, then multiplying z by ave_factor_pre before
summing.
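A schematic of the symmetrized averaging might look like the following, where border marks redundantly stored border coefficients, afac stands in for ave_factor, a diagonal scaling stands in for the block-based partial-factor solve, and the concluding seam sum is only indicated by a comment. All names here are illustrative rather than NIMROD's.

SUBROUTINE precon_symm_ave(n,border,afac,adiag,r,z)
  ! Symmetrized border averaging around an approximate-inverse application.
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: n
  LOGICAL, INTENT(IN) :: border(n)
  REAL(KIND(1.d0)), INTENT(IN) :: afac(n),adiag(n),r(n)
  REAL(KIND(1.d0)), INTENT(OUT) :: z(n)
  REAL(KIND(1.d0)) :: rtmp(n)

  rtmp=r
  WHERE (border) rtmp=SQRT(afac)*rtmp   ! ave_factor_pre ~ sqrt(ave_factor)
  z=rtmp/adiag                          ! stand-in for inverting A-tilde
  WHERE (border) z=SQRT(afac)*z         ! second factor keeps the operator symmetric
  ! ...a seam sum of border coefficients would follow here in NIMROD...
END SUBROUTINE precon_symm_ave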
Each preconditioner has its own set of routines and factor storage, except for the global
and block line-Jacobi algorithms, which share 1D matrix routines. The block-direct option
uses LAPACK library routines. The block-incomplete factorization options are complicated but
effective for mid-range condition numbers (arising roughly when ∆texplicit−limit ≪ ∆t ≪ τA).
Neither of these two schemes has been updated to function with higher-order node data. The
block and global line-Jacobi schemes use 1D solves of couplings along logical directions, and
the latter has its own parallel communication (see Sec. D of Chapter III). There is also diagonal
preconditioning, which inverts local direction-vector couplings only. This simple scheme is the
only one that functions in both rblocks and tblocks. Computations with both block types and
solver ≠ ‘diagonal’ will use the specified solver in rblocks and ‘diagonal’ in tblocks.
NIMROD grids may have periodic rblocks, degenerate points in rblocks, and cell-interior
data may or may not be eliminated. These idiosyncrasies have little effect on the basic
cg steps, but they complicate the preconditioning operations. They also make coupling to
external library solver packages somewhat challenging.
D.2 3D matrices
The conjugate gradient solver for 3D systems is mathematically similar to the 2D solvers.
Computationally, it is quite different. The Fourier representation leads to matrices that are
dense in n-index when φ-dependencies appear in the lhs of an equation. Storage requirements
for such a matrix would have to grow as the number of Fourier components is increased, even
with parallel decomposition. Achieving parallel scaling would be very challenging.
To avoid these issues, we use a ‘matrix-free’ approach, where the matrix-vector products
needed for cg iteration are formed directly from rhs-type finite element computation. For
example, were the resistivity in our form of Faraday’s law on p. ? a function of φ, it would
generate off-diagonal-in-n contributions:
∫∫∫ dR dZ dφ (η(R, Z, φ)/µ0) ∇ × (ᾱ_{j′l′} e^{−in′φ}) · [ ∑_{jln} B_{jln} ∇ × (ᾱ_{jl} e^{inφ}) ] ,    (II.13)

which may be nonzero for all n′.
Identifying the term in brackets as µ0J(R, Z, φ), the curl of the interpolated and FFT’ed
B, shows how the product can be found as a rhs computation. The Bjln data are coeffi-
cients of the operand, but when they are associated with finite-element structures, calls to
generic_all_eval and math_curl create µ0Jn. Thus, instead of calling an explicit matrix-
vector product routine, iter_3d_solve calls get_rhs in finite_element, passing a dot-product
integrand name. This integrand routine uses the coefficients of the operand in finite-element
interpolations.
The scheme uses 2D matrix structures to generate partial factors for preconditioning that
do not address n → n′ coupling. It also accepts a separate 2D matrix structure to avoid
repetition of diagonal-in-n operations in the dot-integrand; the product is then the sum of
the finite element result and a 2D-matrix-vector multiplication. Further, iter_3d_cg solves
systems for changes in fields directly. The solution field at the beginning of the time split is
passed into iter_3d_cg_solve to scale norms appropriately for the stop condition.
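The essential structure is a matrix-vector product assembled from two pieces: an integrand-based finite-element computation that carries the n → n′ coupling, and an ordinary product with the optional 2D matrix. A schematic with hypothetical names (matvec_3d, fe_dot_integrand, a2d) and plain arrays in place of vector_types is:

SUBROUTINE matvec_3d(n,fe_dot_integrand,a2d,x,y)
  ! Matrix-free product: no full 3D matrix is ever stored.
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: n
  INTERFACE
    SUBROUTINE fe_dot_integrand(n,x,y)  ! rhs-type finite element computation
      INTEGER, INTENT(IN) :: n
      REAL(KIND(1.d0)), INTENT(IN) :: x(n)
      REAL(KIND(1.d0)), INTENT(OUT) :: y(n)
    END SUBROUTINE fe_dot_integrand
  END INTERFACE
  REAL(KIND(1.d0)), INTENT(IN) :: a2d(n,n),x(n)
  REAL(KIND(1.d0)), INTENT(OUT) :: y(n)

  CALL fe_dot_integrand(n,x,y)          ! part that couples Fourier components
  y=y+MATMUL(a2d,x)                     ! separate diagonal-in-n 2D-matrix part
END SUBROUTINE matvec_3d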
E Start-up Routines
Development of the physics model often requires new *block_type and matrix storage. [The
storage modules and structure types are described in Sec. B.] Allocation and initialization
of storage are primary tasks of the start-up routines located in file nimrod_init.f. Adding
new variables and matrices is usually a straightforward task of identifying an existing similar
data structure and copying and modifying calls to allocation routines.
The variable_alloc routine creates quadrature-point storage with calls to rblock_qp_alloc
and tblock_qp_alloc. There are also lagr_quad_alloc and tri_linear_alloc calls to
create finite-element structures for nonfundamental fields (like J which is computed from
the fundamental B field) and work structures for predicted fields. Finally, variable_alloc
creates vector_type structures used for diagnostic computations. The quadrature_save
routine allocates and initializes quadrature-point storage for the steady-state (or ‘equilib-
rium’) fields, and it initializes quadrature-point storage for dependent fields.
Additional field initialization information:
• The ê3 component of be_eq structures uses the covariant component (RBφ) in toroidal
geometry, and the ê3 of ja_eq is the contravariant Jφ/R. The data is converted to
cylindrical components after evaluating at quadrature point locations. The cylindrical
components are saved in their respective qp structures.
• Initialization routines have conditional statements that determine what storage is cre-
ated for different physics model options.
• The pointer_init routine links the vector_type and finite element structures via
pointer assignment. New fields will need to be added to this list too.
• The boundary_vals_init routine enforces the homogeneous Dirichlet boundary con-
ditions on the initial state. Selected calls can be commented out if inhomogeneous
conditions are appropriate.
The matrix_init routine allocates matrix and preconditioner storage structures. Opera-
tions common to all matrix allocations are coded in the *_matrix_init_alloc, iter_fac_alloc,
and *_matrix_fac_degen subroutines. Any new structure allocations can be coded by copy-
ing and modifying existing allocations.
F Utilities
The utilities.f file has subroutines for operations that are not encompassed by
finite_element and normal vector_type operations. The new_dt and ave_field_check
routines are particularly important, though not elegantly coded. The new_dt
routine computes the rate of flow through mesh cells to determine if advection should limit
the time step. Velocity vectors are averaged over all nodes in an element, and rates of ad-
vection are found from |v · xi/|xi|²| in each cell, where xi is a vector displacement across the
cell in the i-th logical coordinate direction. A similar computation is performed for electron
flow when the two-fluid model is used.
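A stripped-down version of that rate computation is sketched below with generic array names; the actual new_dt coding loops over blocks and also performs the electron-flow check.

SUBROUTINE advective_dt_sketch(ncell,vave,dx1,dx2,dt_adv)
  ! Advective rate |v.xi|/|xi|^2 per cell for the two logical directions;
  ! the largest rate sets a candidate time-step limit.
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: ncell
  REAL(KIND(1.d0)), INTENT(IN) :: vave(3,ncell)              ! node-averaged velocity
  REAL(KIND(1.d0)), INTENT(IN) :: dx1(3,ncell),dx2(3,ncell)  ! displacements across cell
  REAL(KIND(1.d0)), INTENT(OUT) :: dt_adv
  REAL(KIND(1.d0)) :: rate,rmax
  INTEGER :: ic

  rmax=0.d0
  DO ic=1,ncell
    rate=ABS(DOT_PRODUCT(vave(:,ic),dx1(:,ic)))/SUM(dx1(:,ic)**2)
    rmax=MAX(rmax,rate)
    rate=ABS(DOT_PRODUCT(vave(:,ic),dx2(:,ic)))/SUM(dx2(:,ic)**2)
    rmax=MAX(rmax,rate)
  ENDDO
  dt_adv=HUGE(0.d0)
  IF (rmax>0.d0) dt_adv=1.d0/rmax       ! candidate advective time-step limit
END SUBROUTINE advective_dt_sketch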
The ave_field_check routine is an important part of our semi-implicit advance. The
‘linear’ terms in the operators use the steady-state fields and the n = 0 part of the solution.
The coefficient for the isotropic part of the operator is based on the maximum difference be-
tween total pressures and the steady_plus_n=0 pressures, over the φ-direction. Thus, both
the ‘linear’ and ‘nonlinear’ parts of the semi-implicit operator change in time. However, com-
puting matrix elements and finding partial factors for preconditioning are computationally
intensive operations. To avoid calling these operations during every time step, we determine
how much the fields have changed since the last matrix update, and we compute new matrices
when the change exceeds a tolerance (ave_change_limit). [Matrices are also recomputed
when ∆t changes.]
The ave_field_check subroutine uses the vector_type pointers to facilitate the test.
It also considers the grid-vertex data only, assuming it would not remain fixed while other
coefficients change. If the tolerance is exceeded, flags such as b0_changed in the global
module are set to true, and the storage arrays for the n = 0 fields or nonlinear pressures
are updated. Parallel communication is required when the Fourier components are domain
decomposed (see Sec. C of Chapter III).
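In outline, the test amounts to the following sketch, with plain arrays in place of the vector_type pointers and an illustrative norm; ave_change_limit is the input tolerance named above.

SUBROUTINE field_change_sketch(n,bnow,bsaved,ave_change_limit,b0_changed)
  ! Flag a matrix rebuild when the field has drifted from the saved copy.
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: n
  REAL(KIND(1.d0)), INTENT(IN) :: bnow(n),ave_change_limit
  REAL(KIND(1.d0)), INTENT(INOUT) :: bsaved(n)
  LOGICAL, INTENT(OUT) :: b0_changed

  b0_changed=MAXVAL(ABS(bnow-bsaved))> &
             ave_change_limit*MAX(MAXVAL(ABS(bsaved)),TINY(0.d0))
  IF (b0_changed) bsaved=bnow           ! refresh storage for the next test
END SUBROUTINE field_change_sketch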
Before leaving utilities.f, let’s take a quick look at find_bb. This routine is used to
find the toroidally symmetric part of the b̂b̂ dyad, which is used for the 2D preconditioner
matrix for anisotropic thermal conduction. The computation requires information from
all Fourier components, so it cannot occur in a matrix integrand routine. [Moving the
jmode loop from matrix_create to the matrix integrands is neither practical nor preferable.]
Thus, b̂b̂ is created and stored at quadrature-point locations, consistent with an integrand-
type of computation, but separate from the finite-element hierarchy. The routine uses the
mpi_allreduce command to sum contributions from different Fourier ’layers’.
Routines like find_bb may become common if we find it necessary to use more 3D linear
systems in our advances.
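The layer sum itself is an ordinary mpi_allreduce over the communicator that groups the layers holding the same blocks (comm_mode in Chapter III). The routine name and argument list below are illustrative, not the actual find_bb interface.

SUBROUTINE layer_sum_sketch(nqty,local_contrib,total,comm_mode)
  ! Sum layer-local contributions to a quadrature-point quantity across layers.
  IMPLICIT NONE
  INCLUDE 'mpif.h'
  INTEGER, INTENT(IN) :: nqty,comm_mode
  REAL(KIND(1.d0)), INTENT(IN) :: local_contrib(nqty)
  REAL(KIND(1.d0)), INTENT(OUT) :: total(nqty)
  INTEGER :: ierr

  CALL mpi_allreduce(local_contrib,total,nqty,MPI_DOUBLE_PRECISION, &
                     MPI_SUM,comm_mode,ierr)
END SUBROUTINE layer_sum_sketch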
G Extrap_mod
The extrap_mod module holds data and routines for extrapolating solution fields to the new
time level, providing the initial guess for an iterative linear system solve. The amount of
data saved depends on the extrap_order input parameter.
Code development rarely requires modification of these routines, except extrap_init.
This routine decides how much storage is required, and it creates an index for locating the
data for each advance. Therefore, when adding a new equation, increase the dimensions of
extrap_q, extrap_nq, and extrap_int, and define the new values in extrap_init. Again,
using existing code for examples is helpful.
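The extrapolation itself is simple. A sketch with generic storage, assuming a fixed ∆t between the saved levels, is:

SUBROUTINE extrap_guess_sketch(n,nsave,qsave,guess)
  ! Form the initial guess for the iterative solve from saved solutions.
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: n,nsave
  REAL(KIND(1.d0)), INTENT(IN) :: qsave(n,nsave)   ! qsave(:,1) is the newest level
  REAL(KIND(1.d0)), INTENT(OUT) :: guess(n)

  SELECT CASE(nsave)
  CASE(0)
    guess=0.d0                           ! no history saved
  CASE(1)
    guess=qsave(:,1)                     ! reuse the previous solution
  CASE DEFAULT
    guess=2.d0*qsave(:,1)-qsave(:,2)     ! linear extrapolation in time
  END SELECT
END SUBROUTINE extrap_guess_sketch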
I I/O
Coding new input parameters requires two changes to input.f (in the nimset directory)
and an addition to parallel.f. The first part of the input module defines variables and
default values. A description of the parameters should be provided if other users will have
access to the modified code. A new parameter should be defined near related parameters
in the file. In addition, the parameter must be added to the appropriate namelist in the
read_namelist subroutine, so that it can be read from a nimrod.in file. The change required
in parallel.f is just to add the new parameter to the list in broadcast_input. This sends
the read values from the single processor that reads nimrod.in to all others. Be sure to use
another mpi_bcast or bcast_str call with the same data type as an example.
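The two pieces, a namelist read on one process and a broadcast to the rest, are sketched below in a single hypothetical routine. In NIMROD the default and namelist belong in the input module and the mpi_bcast call belongs in broadcast_input; the namelist name, unit number, and the assumption that process 0 does the reading are illustrative.

SUBROUTINE new_param_io_sketch(node,new_param)
  ! Read a new parameter from nimrod.in on one process and broadcast it.
  IMPLICIT NONE
  INCLUDE 'mpif.h'
  INTEGER, INTENT(IN) :: node                    ! this process's rank
  REAL(KIND(1.d0)), INTENT(INOUT) :: new_param   ! default set by the caller
  NAMELIST /misc_input/ new_param
  INTEGER :: ierr

  IF (node==0) THEN                              ! single reader of nimrod.in
    OPEN(UNIT=8,FILE='nimrod.in',STATUS='OLD')
    READ(UNIT=8,NML=misc_input)
    CLOSE(UNIT=8)
  ENDIF
  CALL mpi_bcast(new_param,1,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
END SUBROUTINE new_param_io_sketch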
Changes to dump files are made infrequently to maintain compatibility across code ver-
sions to the greatest extent possible. However, data structures are written and read with
generic subroutines, which makes it easy to add fields. Within the rblock and tblock dump
routines are calls to structure-specific subroutines. These
calls can be copied and modified to add new fields. [But, don’t expect compatibility with
other dump files.]
Other dump notes:
• The read routines allocate data structures to appropriate sizes just before a record is
read.
• All data is written as 8-byte real to improve portability of the file while using FORTRAN
binary I/O.
• Only one processor reads and writes dump files. Data is transferred to other processors
via MPI communication coded in parallel_io.f.
• The NIMROD and NIMSET dump.f files are different to avoid parallel data requirements
in NIMSET. Any changes must be made to both files.
Chapter III
Parallelization
process. This need is met by calling mpi_allreduce within grid_block do loops. Here an
array of data is sent by each process, and the MPI library routine sums the data element by
element with data from other processes that have different Fourier components for the same
block. The resulting array is returned to all processes that participate in the communication.
Then, each process proceeds independently until its next MPI call.
1. Issue a mpi_irecv to every process in the exchange, which indicates readiness to accept
data.
2. Issue an mpi_send to every process involved, which sends data from the local process.
3. Perform some serial work, like seaming among different blocks of the local process to
avoid wasting CPU time while waiting for data to arrive.
4. Use mpi_waitany to verify that expected data from each participating process has
arrived. Once verification is complete, the process can continue on to other operations.
Caution: calls to nonblocking communication (mpi_isend, mpi_irecv, ...) seem to have
difficulty with buffers that are part of F90 structures when the mpich implementation is used.
The difficulty is overcome by passing the first element, e.g.
CALL mpi_irecv(recv(irecv)%data(1),...)
instead of the entire array, %data, as in the sketch below.
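The four steps and the first-element workaround are sketched here with an illustrative buffer type standing in for NIMROD's seam structures; neighbor counts, tags, and member names are assumptions.

MODULE border_exchange_sketch
  IMPLICIT NONE
  TYPE :: buf_type                          ! stand-in for seam communication buffers
    REAL(KIND(1.d0)), POINTER :: data(:)
    INTEGER :: node,count
  END TYPE buf_type
CONTAINS
  SUBROUTINE exchange(nrecv,nsend,recv,send,comm)
    INCLUDE 'mpif.h'
    INTEGER, INTENT(IN) :: nrecv,nsend,comm
    TYPE(buf_type), INTENT(INOUT) :: recv(nrecv),send(nsend)
    INTEGER :: ir,is,idum,ierr
    INTEGER :: req(nrecv),status(MPI_STATUS_SIZE)

    DO ir=1,nrecv                           ! 1. post receives first
      CALL mpi_irecv(recv(ir)%data(1),recv(ir)%count,MPI_DOUBLE_PRECISION, &
                     recv(ir)%node,0,comm,req(ir),ierr)
    ENDDO
    DO is=1,nsend                           ! 2. send the local border data
      CALL mpi_send(send(is)%data(1),send(is)%count,MPI_DOUBLE_PRECISION, &
                    send(is)%node,0,comm,ierr)
    ENDDO
    ! 3. serial work, e.g. seaming among blocks of the local process, goes here
    DO ir=1,nrecv                           ! 4. wait for all expected data
      CALL mpi_waitany(nrecv,req,idum,status,ierr)
    ENDDO
  END SUBROUTINE exchange
END MODULE border_exchange_sketch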
As a prelude to the next section, block border communication only involves processes
assigned to the same Fourier layer. This issue is addressed in the block2proc mapping
array. An example statement

inode=block2proc(ibl_global)

assigns the node label for the process that owns global block number ibl_global, for the
same layer as the local process, to the variable inode.
Separating processes by some condition (e.g. same layer) is an important tool for NIM-
ROD due to the two distinct types of decomposition. Process groups are established by
the mpi_comm_split routine. There are two in parallel_block_init. One creates a com-
munication tag grouping all processes in a layer, comm_layer, and the other creates a tag
grouping all processes with the same subset of blocks, but different layers, comm_mode. These
tags are particularly useful for collective communication within the groups of processes. The
find_bb mpi_allreduce was one example. Others appear after scalar products of vectors in
2D iterative solves: different layers solve different systems simultaneously. The sum across
blocks involves processes in the same layer only, hence calls to mpi_allreduce with the
comm_layer tag. Where all processes are involved, or where ‘point to point’ communication
(which references the default node index) is used, the mpi_comm_world tag appears.
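A sketch of the two splits follows. The names ilayer and iblock_set are illustrative stand-ins for the integers that identify this process's layer and its block subset; they serve as the mpi_comm_split 'color' arguments.

SUBROUTINE make_group_comms(ilayer,iblock_set,comm_layer,comm_mode)
  ! Build the two process groups described above with mpi_comm_split.
  IMPLICIT NONE
  INCLUDE 'mpif.h'
  INTEGER, INTENT(IN) :: ilayer,iblock_set
  INTEGER, INTENT(OUT) :: comm_layer,comm_mode
  INTEGER :: node,ierr

  CALL mpi_comm_rank(MPI_COMM_WORLD,node,ierr)
  ! processes with the same layer index share comm_layer
  CALL mpi_comm_split(MPI_COMM_WORLD,ilayer,node,comm_layer,ierr)
  ! processes with the same block subset but different layers share comm_mode
  CALL mpi_comm_split(MPI_COMM_WORLD,iblock_set,node,comm_mode,ierr)
END SUBROUTINE make_group_comms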
Figure III.1: Normal Decomp: nmodes=2 on il=0 and nmodes=1 on il=1, nf=mx×my=9.
Config-space Decomp: nphi=2^lphi = 8, mpseudo=5 on il=0 and mpseudo=4 on il=1.
Parameters are mx=my=3, nlayers=2, lphi=3.
Apart from the basic mpi_allreduce exchanges, communication among layers is required
to multiply functions of toroidal angle (see B.1). Furthermore, the decomposition of the
domain is changed before an “inverse” FFT (going from Fourier coefficients to values at
toroidal positions) and after a “forward” FFT. The scheme lets us use serial FFT routines,
and it maintains a balanced workload among the processes, without redundancy.
For illustrative purposes, consider a poorly balanced choice of lphi = 3, 0 ≤ n ≤ nmodes_total−1
with nmodes_total = 3, and nlayers = 2. (The number of processes must be divisible by nlayers.) Since
pseudo-spectral operations are performed one block at a time, we can consider a single-block
problem without loss of generality (see Figure III.1).
Notes:
• Communication is carried out inside fft_nim, isolating it from the integrand routines.
• Calls to fft_nim with nr = nf duplicate the entire block’s config-space data on every
layer.
D Global-line preconditioning
That ill-conditioned matrices arise at large ∆t implies global propagation of perturbations
within a single time advance. Krylov-space iterative methods require very many iterations
to reach convergence in these conditions, unless the preconditioning step provides global “relax-
ation”. The most effective method we have found for cg on very ill-conditioned matrices is a
line-Jacobi approach. We actually invert two approximate matrices and average the result.
They are defined by eliminating off-diagonal connections in the logical s-direction for one
approximation and by eliminating in the n-direction for the other. Each approximate system
Ãz = r is solved by inverting 1D matrices.
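When only nearest-neighbor couplings along a line are retained, each 1D inversion reduces to a banded solve. The tridiagonal (Thomas algorithm) sketch below shows the idea for a single line; it ignores the periodic, degenerate-point, and cross-block complications discussed here and in Section D.1 of Chapter II.

SUBROUTINE tridiag_solve(n,a,b,c,r,x)
  ! Thomas algorithm for a single tridiagonal line system.
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: n
  REAL(KIND(1.d0)), INTENT(IN) :: a(n),b(n),c(n),r(n)  ! sub, diag, super, rhs
  REAL(KIND(1.d0)), INTENT(OUT) :: x(n)
  REAL(KIND(1.d0)) :: cp(n),rp(n),den
  INTEGER :: i

  cp(1)=c(1)/b(1)
  rp(1)=r(1)/b(1)
  DO i=2,n                              ! forward elimination
    den=b(i)-a(i)*cp(i-1)
    cp(i)=c(i)/den
    rp(i)=(r(i)-a(i)*rp(i-1))/den
  ENDDO
  x(n)=rp(n)
  DO i=n-1,1,-1                         ! back substitution
    x(i)=rp(i)-cp(i)*x(i+1)
  ENDDO
END SUBROUTINE tridiag_solve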
To achieve global relaxation, lines are extended across block borders. Factoring and solv-
ing these systems use a domain swap, not unlike that used before pseudospectral compu-
tations. However, point-to-point communication is called (from parallel_line_* routines)
instead of collective communication. (See our 1999 APS poster, “Recent algorithmic ...”, on
the web site.)
Bibliography
[1] N. A. Krall and A. W. Trivelpiece. Principles of Plasma Physics. San Francisco Press, 1986.
[2] G. Strang and G. J. Fix. An Analysis of the Finite Element Method. Wellesley-Cambridge
Press, 1987.
[5] K. Lerbinger and J. F. Luciani. A new semi-implicit approach for MHD computation.
J. Comput. Phys., 97:444, 1991.
[6] D. S. Harned and Z. Mikić. Accurate semi-implicit treatment of the Hall effect in MHD
computations. J. Comput. Phys., 83:1, 1989.
[8] B. Marder. A method for incorporating Gauss’ law into electromagnetic PIC codes. J.
Comput. Phys., 68:48, 1987.