Algorithmic
Differentiation of
Pragma-Defined
Parallel Regions
Differentiating Computer Programs
Containing OpenMP
Michael Förster
RWTH Aachen University
Aachen, Germany
Springer Vieweg
© Springer Fachmedien Wiesbaden 2014
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole
or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical
way, and transmission or information storage and retrieval, electronic adaptation, compu-
ter software, or by similar or dissimilar methodology now known or hereafter developed.
Exempted from this legal reservation are brief excerpts in connection with reviews or schol-
arly analysis or material supplied specifically for the purpose of being entered and executed
on a computer system, for exclusive use by the purchaser of the work. Duplication of this
publication or parts thereof is permitted only under the provisions of the Copyright Law of
the Publisher’s location, in its current version, and permission for use must always be obtained
from Springer. Permissions for use may be obtained through RightsLink at the Copyright
Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are
exempt from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date
of publication, neither the authors nor the editors nor the publisher can accept any legal re-
sponsibility for any errors or omissions that may be made. The publisher makes no warranty,
express or implied, with respect to the material contained herein.
1 Motivation and Introduction
The pragma declares the counting loop as parallelizable, and the OpenMP-enabled
compiler distributes the iterations among the group of threads executing the pro-
gram. To obtain a corresponding Pthreads implementation, one would have to write the
code completely differently: the loop body would be put into an external function,
and the distribution among the threads would have to be implemented by the developer.
The only thing that differentiates the above code from its sequential version is the
pragma. The developer has to be aware whether or not a loop is parallelizable.
With OpenMP, how the iterations are distributed among the threads can
be determined by the backend compiler since it generates the associated code.
Some successful approaches where computationally intensive numerical kernels
have been parallelized with the help of OpenMP can be found in [2], [41], or [40].
In addition, the coming subsections introduce some numerical problems
where OpenMP can be used to allow simultaneous computation on a multicore
machine.
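
As an illustration, a minimal sketch of such a pragma-annotated counting loop is shown below; the array names and the loop bound are illustrative and not taken from the code discussed above.

/* The pragma declares the loop as parallelizable; the OpenMP-enabled compiler
   distributes the iterations among the team of threads. */
#pragma omp parallel for
for (int i = 0; i < n; i++) {
  y[i] = 2 * x[i] * x[i];
}

The corresponding Pthreads version would require an explicit thread-start function and a hand-written distribution of the index range.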
A development similar to the one from Pthreads to OpenMP can be observed in the do-
main of high performance computing on GPUs. High performance computing
on GPUs started around the year 2002. At first this was more or less a work-
around, since the typical graphics-related operations were exploited for doing linear
algebra computations such as the dot product. In order to allow general purpose
programming on their GPUs, NVIDIA published the CUDA SDK [61, 69] in 2007.
In 2008, the Khronos group published the OpenCL framework [44, 25], which was
meant for writing programs for heterogeneous platforms such as, for example, a system
with a CPU and an FPGA. In November 2011, a new API standard called OpenACC [62]
was published. OpenACC makes extensive use of pragmas to abstract from the lower-level
programming with CUDA and OpenCL. Whether the OpenACC stan-
dard will have the same success as the OpenMP standard is uncertain at this time,
in particular because the recent OpenMP 4.0 standard also introduces a couple of
new pragmas to support accelerators and vectorization.
A typical application in numerical simulations is, in addition to just simulating
a certain problem, finding input parameters that minimize or maximize a certain
output of the simulation. Since the underlying problems are typically of nonlinear
nature, the domain of finding solutions for these kinds of optimization problems
is called nonlinear optimization. In nonlinear optimization, derivative values play
an important role. As pointed out in [60, ch. 8], these derivative values can either
be approximated by finite differences or they can be computed with algorithmic
differentiation (AD) [29, 57]. We will introduce these two methods in Section 1.2.
AD has the advantage that it provides the derivative values with an accuracy up
to machine precision. The manual differentiation of a given code is error prone
and often infeasible, for example, because of the code's size or because each change
to the original code implies a change in the derivative code, which makes the code's
maintenance difficult.
The two main methods of AD are source transformation and overloading. The
overloading method is a runtime solution in the sense that a basic variable type is
substituted by a customized type that propagates the derivative information along
the dataflow. The source transformation method, on the other hand, is a compile time solu-
tion since a function implementation is transformed into another implementation
that can be used to compute projections of the Jacobian of this function. Since the
pragma-based compiler directives are only available at compile time, the overload-
ing approach cannot take these pragmas into account, and valuable information
about the code is neglected. The source transformation method has the down-
side that the input code must be read by a parser that fully understands the code's
programming language. Hence, the parser must support a sufficiently large subset of the
programming language to cover most of the codes written as numerical kernels.
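
To make the overloading idea concrete, the following sketch shows a customized type that carries a value together with one tangent component and propagates it along the dataflow. In C++ the arithmetic operators would be overloaded; the plain C sketch below uses explicit helper functions instead, and all names are illustrative rather than taken from an existing tool.

#include <math.h>

typedef struct { double v; double d; } tl_double;  /* value and tangent component */

tl_double tl_mul(tl_double a, tl_double b) {
  tl_double r = { a.v * b.v, a.d * b.v + a.v * b.d };  /* product rule */
  return r;
}

tl_double tl_sin(tl_double a) {
  tl_double r = { sin(a.v), cos(a.v) * a.d };          /* chain rule */
  return r;
}

Replacing the basic double type by such a customized type turns every evaluation of the original code into an evaluation that additionally carries one directional derivative.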
Currently, AD source transformation tools do not have satisfactory OpenMP
support. Hence, the knowledge about the inherent parallelism is lost. The fact
that the derivative codes then only use a sequential execution can be a huge disad-
vantage because in nonlinear optimization almost all algorithms consist of an
iterative search method where the derivative values are recomputed in each iteration.
Especially considering the increasing number of cores in recent computer
architectures, the exploitation of this parallelism is crucial. Hence, we present an
approach for an AD source transformation for pure parallel regions. This means
that a certain code region is declared as parallel by a pragma but inside the par-
allel region there are no further pragmas. This approach allows us to examine the
source transformation requirements for a common parallel region without any API
dependent features. Subsequently, we focus more on OpenMP and examine the
source transformation of parallel regions containing various OpenMP constructs.
To motivate this approach with two practical examples, we present two typical
problems in numerical optimization. The first is a nonlinear least-squares problem
which makes use of the first derivative code of an implementation that contains
an OpenMP parallel region. Subsequently, we introduce algorithmic differenti-
ation together with an application from the domain of biotechnology where our
derivative code compiler dcc plays an important role. In this area, the second
derivative code is crucial since using the second derivative values instead
of approximating them by a quasi-Newton method often leads to a more
robust behavior and a faster convergence of the optimization algorithm. Since the
examples from biotechnology are beyond the scope of this work, we simplify
this example to a common nonlinear optimization problem with constraints. The
associated first and second derivative codes are used to calculate a solution for
this constrained optimization problem using the open source software Ipopt [82].
with
$$\varphi(x) = \frac{1}{2}\,\| f(x) \|_2^2 = \frac{1}{2}\, f(x)^T f(x) = \frac{1}{2} \sum_{i=1}^{m} \big( y(t_i; x) - b_i \big)^2 .$$
At this point we enter the domain of numerical optimization where most of the
algorithms require the knowledge of derivatives. Therefore, we have to introduce
some closely connected notions.
Theorem 1 (Chain Rule of Differential Calculus). Suppose that $g_1 : \mathbb{R}^n \to \mathbb{R}^k$ with
$z = g_1(x)$ is differentiable at $x$ and $g_2 : \mathbb{R}^k \to \mathbb{R}^m$ with $y = g_2(z)$ is differentiable
at $z$. Then $f(x) = g_2(g_1(x))$ is differentiable at $x$ with
$$\frac{\partial f}{\partial x} = \frac{\partial g_2}{\partial z} \cdot \frac{\partial g_1}{\partial x} = \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial x} .$$
Proof. See, for example, [3].
$$f_{x_j}(x_0) \equiv \frac{\partial f}{\partial x_j}(x_0) .$$
The gradient of f at point x0 is defined by
$$\nabla f(x_0) \equiv \begin{pmatrix} f_{x_1}(x_0) \\ \vdots \\ f_{x_n}(x_0) \end{pmatrix} \in \mathbb{R}^n .$$
$$\nabla F(x_0) \equiv \begin{pmatrix} \nabla F_1(x_0)^T \\ \vdots \\ \nabla F_m(x_0)^T \end{pmatrix}$$
is called the Jacobian of F at point x0 . The Hessian of f at the point x0 is defined
as
$$\nabla^2 f(x_0) := \big( f_{x_i x_j}(x_0) \big) = \begin{pmatrix} f_{x_1 x_1}(x_0) & f_{x_1 x_2}(x_0) & \cdots & f_{x_1 x_n}(x_0) \\ \vdots & \vdots & \ddots & \vdots \\ f_{x_n x_1}(x_0) & f_{x_n x_2}(x_0) & \cdots & f_{x_n x_n}(x_0) \end{pmatrix}$$
where ∇ f (x) is the Jacobian matrix of f evaluated at point x. The Hessian matrix
of $\varphi$ is
$$\nabla^2 \varphi(x) = (\nabla f(x))^T \nabla f(x) + \sum_{i=1}^{m} \nabla^2 f_i(x)\, f_i(x) . \qquad (1.3)$$
The following algorithm indicates how a solution x∗ can be computed with the
help of the Gauss-Newton or the Levenberg-Marquardt method [50, 53, 51, 60].
As pointed out in Algorithm 1, the Gauss-Newton method only approximates the
second derivative of φ by omitting the second derivative information of f in (1.3).
Matrix A represents this second-order approximation of φ in Algorithm 1.
The Levenberg-Marquardt method, also known as the damped least-squares
(DLS) method, is similar to the Gauss-Newton method, but it ensures that ma-
trix A has full rank 4n by adding the identity matrix multiplied by a factor
µ² to the matrix A. This full-rank property is not necessarily given when using
the Gauss-Newton method. The values of µ are adjusted during the optimization
process as indicated in Algorithm 1. The actual value or adjustment method for
µ is not part of this work; it only influences the convergence properties of
Algorithm 1.
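
A minimal sketch of this damping step is given below; the dense row-major storage of A is an assumption made only for this illustration.

/* Add mu^2 on the diagonal of the dense (dim x dim) matrix A,
   as in the Levenberg-Marquardt damping described above. */
void damp_diagonal(double* A, int dim, double mu) {
  for (int i = 0; i < dim; i++) {
    A[i * dim + i] += mu * mu;
  }
}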
Next, we extend the above system from one equation to n equations. This means
that instead of considering only a single spring-mass system, we approximate param-
eters for n independent spring-mass systems. All n equations are of the form
(1.1). As in the single equation case, the four unknown parameters of each system
will be approximated with the help of m measurement values. The residual func-
tion maps 4n parameter values to nm residual values. The objective function
$\Phi : \mathbb{R}^{4n} \to \mathbb{R}$ becomes
$$\Phi(x) = \frac{1}{2}\, \| F(x_{1,1}, x_{1,2}, x_{1,3}, x_{1,4}, x_{2,1}, \ldots, x_{n,4}) \|_2^2 = \frac{1}{2} \sum_{j=1}^{n} \sum_{i=1}^{m} \big( x_{j,1}\, e^{-x_{j,2} t_i} \sin(x_{j,3} t_i + x_{j,4}) - b_{j,i} \big)^2 . \qquad (1.4)$$
The goal is to find a solution x∗ that satisfies ∇Φ(x∗ ) = 0 and for which ∇2 Φ(x∗ ) ∈
R4n×4n is symmetric positive definite. As in the single equation case, we approxi-
mate the Hessian by omitting the second-order information.
15   xbase=j*4;
16   x0=x[xbase+0]; x1=x[xbase+1];
17   x2=x[xbase+2]; x3=x[xbase+3];
18   while (i<m) {
19     mbase=j*m;
20     t_i=t[mbase+i]; b_i=b[mbase+i];
21     yy=x0*exp(0.-x1*t_i)*sin(x2*t_i+x3);
22     y[mbase+i]=yy-b_i;
23     i=i+1;
24   }
25 }
26 }
Listing 1.1: Residual function F.
For solving the optimization problem (1.6), we also need the values of the Ja-
cobian ∇F at point x. One could use finite differences or AD. Finite differences
would need 4(n + 1) evaluations of the code shown in Listing 1.1. One major ben-
efit of AD is that truncation error is avoided. This can make a big difference in the area
of ODEs, in particular when we consider stiff ODE systems [18]. Therefore, we
show how the least-squares problem can be solved by using AD.
The next section presents the basics of AD. Afterwards, we are able to refor-
mulate Algorithm 1 in a way that uses AD for calculating b and A. We will show
how we obtain the first derivative code of Listing 1.1 with dcc, assuming that we
omit the OpenMP pragma in line 10 because dcc does not support OpenMP at the
moment. In addition, we will present an application of dcc in biotechnology.
$$\frac{\partial F_j}{\partial x_i}(x) \approx \frac{F_j(x + h\, e_i) - F_j(x)}{h}, \qquad i = 1, 2, \ldots, n \qquad (1.9)$$
where $j \in \{1, 2, \ldots, m\}$, $h$ is a small scalar, and $e_i$ is the $i$-th unit vector. We can
approximate the $i$-th column of the Jacobian by applying (1.9) for $j = 1, 2, \ldots, m$.
With
$$\frac{\partial F}{\partial x_i}(x) \approx \frac{F(x + h\, e_i) - F(x)}{h}$$
at hand, the whole Jacobian matrix can be calculated by applying this formula for
$i = 1, 2, \ldots, n$.
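
A sketch of this column-wise approximation is given below. The routine F_eval that evaluates y = F(x), as well as the array dimensions, are assumptions of this illustration only.

#include <stdlib.h>

/* Approximate the i-th column of the Jacobian of F : R^n -> R^m by
   one-sided finite differences. */
void jacobian_column_fd(void (*F_eval)(const double*, double*),
                        const double* x, int n, int m, int i, double h,
                        double* column /* length m */) {
  double* xp = (double*) malloc(n * sizeof(double));
  double* y  = (double*) malloc(m * sizeof(double));
  double* yp = (double*) malloc(m * sizeof(double));
  for (int k = 0; k < n; k++) xp[k] = x[k];
  xp[i] += h;                       /* x + h * e_i   */
  F_eval(x, y);                     /* F(x)          */
  F_eval(xp, yp);                   /* F(x + h*e_i)  */
  for (int j = 0; j < m; j++) column[j] = (yp[j] - y[j]) / h;
  free(xp); free(y); free(yp);
}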
Another method for computing the derivative values of F is symbolic differen-
tiation. This differentiation can be done by hand or with a computer algebra system like
Mathematica [81] or Maple [16]. Since this method uses a much larger amount
of computer resources than AD and is therefore of limited use for calculating
higher derivatives, we do not discuss it here.
We focus on calculating the derivative values with the third main method, namely
algorithmic differentiation (AD) [29, 57]. AD exploits the fact that the chain rule
of differential calculus holds and takes the view that a given implementation of F
as computer code can be seen as a composition of elementary arithmetic opera-
tions. As an example, let us consider the assignment
y=sin(x2*t_i+x3);
which occurred in similar form in Listing 1.1, line 21.
The computation of the right-hand side can be displayed by a directed acyclic
graph (DAG) as illustrated in Figure 1.1a.
We associate each node of the DAG with an auxiliary variable v, as shown in
Figure 1.1b. The DAG in Figure 1.1b is called linearized, because each edge is
labeled with the local partial derivative of the edge’s target node with respect to
its predecessor. For example, node v5 represents the value sin(v4 ), and the partial
derivative of v5 with respect to its predecessor v4 is cos(v4 ). Therefore, the edge
from v4 to v5 is labeled with cos(v4 ).
Figure 1.1b induces a semantically equivalent representation of the assignment
y=sin(x2*t_i+x3) as single assignment code (SAC). Each node v in Figure 1.1b
can be written as an assignment with the auxiliary variable v on the left-hand side.
The predecessors of node v in the linearized graph represent the right-hand side in
form of an expression. The following SAC and the assignment y=sin(x2*t_i+x3)
are semantically equivalent.
v0 = x2;
v1 = t_i;
v2 = x3;
v3 = v0*v1;
v4 = v3+v2;
v5 = sin(v4);
y = v5;
Listing 1.2: The assignment y=sin(x2*t_i+x3); represented as semantically
equivalent single assignment code (SAC).
AD has two modes, the forward mode and the reverse mode. An application
of the forward mode to a given implementation P of a mathematical function F :
Rn → Rm builds a tangent-linear model of P that is defined in Definition 3. The
reverse mode provides the adjoint model of P which is defined in Definition 6.
The forward mode thus implements the mapping $x^{(1)} \mapsto \nabla F \cdot x^{(1)}$.
The tangent-linear model F (1) (x, x(1) ) augments each assignment of a given
SAC with an additional assignment where the partial derivative of the augmented
assignment is computed.

Figure 1.1: The expression $\sin(x_2 \cdot t_i + x_3)$ can be represented by a DAG (Figure 1.1a). The
linearized DAG of this expression is shown in Figure 1.1b.

The computational complexity of computing the first
derivative with the tangent-linear model is O(n) · Cost(F), where Cost(F) denotes
the cost for evaluating the implementation of F. As an example, we apply the for-
ward mode locally to the SAC shown in Listing 1.2. Each assignment can be seen
as a scalar function g : R2 → R. According to (1.10), the tangent-linear model
of g is g(1) : R4 → R, where g(1) (x, x(1) ) ≡ ∇g · x(1) . The components of x(1) are
represented in the code by the prefix ’t1_’. For example, the variable x2 has an as-
sociated derivative component t1_x2. Listing 1.3 shows the tangent-linear model
of Listing 1.2.
t1_v0 = t1_x2;
v0 = x2;
t1_v1 = t1_t_i;
v1 = t_i;
t1_v2 = t1_x3;
v2 = x3;
t1_v3 = v1*t1_v0+v0*t1_v1;
v3 = v0*v1;
t1_v4 = t1_v3+t1_v2;
v4 = v3+v2;
t1_v5 = cos(v4)*t1_v4;
v5 = sin(v4);
t1_y = t1_v5;
y = v5;
Listing 1.3: Tangent-linear model of Listing 1.2.
and where $\langle \cdot, \cdot \rangle_{\mathbb{R}^n}$ and $\langle \cdot, \cdot \rangle_{\mathbb{R}^m}$ denote the scalar products in $\mathbb{R}^n$ and $\mathbb{R}^m$, respec-
tively.
The adjoint model $F_{(1)}(x, y_{(1)})$ in AD yields an adjoint code with the important
property that its computational complexity O(m) · Cost(F) grows with the number
of dependent variables m, in contrast to the tangent-linear model, whose computa-
tional complexity grows with the number of independent variables n. Mathemat-
ical models in physics, chemistry, or economics often have thousands of
inputs and only a few values or even only one value as output. Hence, the adjoint
code is frequently the only possibility to compute the derivative values of such mathe-
matical models in feasible time. A downside of the adjoint code is that the adjoint
values are computed during a complete dataflow reversal of the original function
evaluation. This dataflow reversal is often connected to a high usage of memory
resources.
We apply the reverse mode to our example SAC from Listing 1.2. An assign-
ment from the SAC is again represented as $g : \mathbb{R}^2 \to \mathbb{R}$. The adjoint model of $g$ is
$g_{(1)} : \mathbb{R}^3 \to \mathbb{R}^2$ with $g_{(1)}(x, y_{(1)}) \equiv (\nabla g)^T \cdot y_{(1)}$.
1  v0 = x2;
2  v1 = t_i;
3  v2 = x3;
4  v3 = v0*v1;
5  v4 = v3+v2;
6  v5 = sin(v4);
7  y = v5;
8  a1_v5 = a1_y;
9  a1_y = 0;
10 a1_v4 = cos(v4)*a1_v5;
11 a1_v3 = a1_v4;
12 a1_v2 = a1_v4;
13 a1_v0 = v1*a1_v3;
14 a1_v1 = v0*a1_v3;
15 a1_x3 += a1_v2;
16 a1_t_i += a1_v1;
17 a1_x2 += a1_v0;
Listing 1.4: Adjoint model of Listing 1.2.
When we apply (1.12) to each assignment in Listing 1.2, we obtain the code shown
in Listing 1.4, which contains two distinct sections. The forward section comprises
lines 1 to 7 and is, for this example, equivalent to the original code
from Listing 1.2. The adjoint computation is done during the reverse section (line
8 to line 17). The adjoint components carry the prefix 'a1_'. These list-
ings already indicate that differentiating a given code by hand is error-prone, particularly
in the adjoint case. But the source transformation can be undertaken by a
tool, and there are already several tools providing AD by source transformation, for
example, OpenAD3, Tapenade4, and dcc5. Unfortunately, none of these tools pro-
vides an AD source transformation support for OpenMP at this time. It is ongoing
work to enable dcc to support OpenMP [23].
With this information about AD at hand we are able to present how AD can
be utilized to solve the previously introduced least-squares problem from Section
1.1.1.
However, we do not compute the whole Jacobian $\nabla F(x) \in \mathbb{R}^{nm \times 4n}$, since the ad-
joint model (1.12) provides exactly what we need for acquiring b. Therefore, we
rewrite (1.13) in order to use the adjoint model:
y ← F(x)
b ← $F_{(1)}$(x, y)
We use a combination of the tangent-linear model with the adjoint model to compute the matrix A.
3 http://www.mcs.anl.gov/OpenAD/
4 http://tapenade.inria.fr:8080/tapenade/index.jsp
5 http://www.stce.rwth-aachen.de/software/dcc.html
First, we compute the i-th column of the Jacobian ∇F(x) by calling the tangent-
linear model $F^{(1)}$ with x and the i-th Euclidean basis vector $e_i$ as arguments. Af-
terwards, we use this result to obtain the i-th row of A by calling the adjoint model
$F_{(1)}$. Algorithm 2 shows the adjusted version of Algorithm 1.
$$b = \langle A, v \rangle \in \mathbb{R}^m .$$
The projection
$$c = \langle w, A \rangle \in \mathbb{R}^n$$
of A in direction $w \in \mathbb{R}^m$ is defined as
$$c = w^T \cdot A .$$
$$\langle A, u, v \rangle \equiv \langle \langle A, u \rangle, v \rangle$$
$$\langle A, u, v \rangle = \langle A, v, u \rangle$$
$$\langle v, w, A \rangle = \langle w, A, v \rangle$$
$$\langle w, A, v \rangle = \langle \langle w, A \rangle, v \rangle = \langle w, \langle A, v \rangle \rangle$$
$$(u, v) \mapsto \langle \nabla^2 F, u, v \rangle .$$
$$(v, w) \mapsto \langle w, \nabla^2 F, v \rangle .$$
Without going into further details, we give the theorems from [57] stating that this
source transformation is actually correct. Further details about higher derivatives
can be found in that textbook.
Theorem 12 ([57], p. 111). The application of forward mode AD to the tangent-
linear model yields the second-order tangent-linear model.
Theorem 13 ([57], p. 119). The application of forward mode AD to the adjoint
model yields an implementation of the second-order adjoint model.
Theorem 14 ([57], p. 120). The application of reverse mode AD to the tangent-
linear model yields an implementation of the second-order adjoint model.
Theorem 15 ([57], p. 123). The application of reverse mode AD to the adjoint
model yields an implementation of the second-order adjoint model.
Readers interested in more information about AD are referred to the textbooks
[29, 57]. The next section introduces the software tool dcc
that is able to generate first- and higher-order derivative codes by source transformation.
The results of this dissertation will be integrated into this tool.
Assuming that file F.c contains the adjusted implementation of F, we apply the
forward mode by calling dcc as follows:
dcc F.c -t
The result is a file t1_F.c which contains the implementation of $F^{(1)}$. Analo-
gously, the adjoint model $F_{(1)}$ is obtained by the command line
dcc F.c -a
6 http://www.stce.rwth-aachen.de/software/dcc.html
which results in a generated file named a1_F.c. When speaking about the im-
plementations of the tangent-linear or the adjoint model, we often speak about the
tangent-linear and the adjoint code, respectively. We only list their signatures here since
we already showed brief examples of tangent-linear and adjoint code. The imple-
mentation of $F^{(1)}$ can be called by using the following signature
void t1_F(const int n, const int m,
          double* t, double* t1_t, double* x, double* t1_x,
          double* b, double* t1_b, double* y, double* t1_y)
while
void a1_F(int bmode_1, const int n, const int m,
          double* t, double* a1_t, double* x, double* a1_x,
          double* b, double* a1_b, double* y, double* a1_y)
implements $F_{(1)}$. The first parameter of a1_F, namely bmode_1, is not of im-
portance at this point. The remaining list of parameters in both signatures reveals
that each floating-point parameter of F is augmented by an additional parame-
ter. This additional parameter serves as the derivative association. The tangent-linear
associates $x^{(1)}$ are named t1_x in the tangent-linear code, whereas the adjoint
associates $x_{(1)}$ are referred to as a1_x in the adjoint code.
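
For illustration, a single Jacobian column of F can be obtained from the tangent-linear code by zeroing all tangent inputs and seeding t1_x with a Cartesian basis vector. The following driver is only a sketch; the array lengths passed as len_* are assumptions about the calling context, not part of the generated interface.

#include <stdlib.h>

void t1_F(const int n, const int m,
          double* t, double* t1_t, double* x, double* t1_x,
          double* b, double* t1_b, double* y, double* t1_y);

/* Compute the i-th column of dF/dx by one call to t1_F. */
void jacobian_column_tl(const int n, const int m,
                        double* t, double* x, double* b, double* y,
                        size_t len_t, size_t len_x, size_t len_b, size_t len_y,
                        int i, double* column /* length len_y */) {
  double* t1_t = (double*) calloc(len_t, sizeof(double));
  double* t1_x = (double*) calloc(len_x, sizeof(double));
  double* t1_b = (double*) calloc(len_b, sizeof(double));
  double* t1_y = (double*) calloc(len_y, sizeof(double));
  t1_x[i] = 1.0;                                   /* seed x^(1) = e_i */
  t1_F(n, m, t, t1_t, x, t1_x, b, t1_b, y, t1_y);
  for (size_t j = 0; j < len_y; j++) column[j] = t1_y[j];
  free(t1_t); free(t1_x); free(t1_b); free(t1_y);
}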
As mentioned in Section 1.2.1, there are different possibilities to obtain an implementation
of the second derivative code. Depending on which AD mode we apply
to F and which mode we apply afterwards to achieve the second
derivative code, we obtain different combinations for the second-order implemen-
tation. As an example, we generate a second-order adjoint code for F in reverse-
over-forward mode. This code is referred to as $F^{(1)}_{(2)}$ and it is obtained by applying
dcc two times:
dcc F.c -t
dcc t1_F.c -a -d 2
Without going into details, the reader recognizes again that each floating-point
parameter is augmented by another derivative component. $x^{(1)}_{(2)}$ (alias a2_t1_x) is,
for example, the corresponding component associated with $x^{(1)}$ (alias t1_x).
Figure 1.2: The work flow of modeling, simulating, and optimizing a natural process with
JADE.
The important fact about this code is that the function is structured in a way that it
contains 1488 equations, where the start and the end of each equation are marked
by a pragma. In case the equations do not have any data dependencies among each other, they
can be computed simultaneously. A possible OpenMP implementation of this is
that we transform, for example,
9 The DyOS website is http://www.avt.rwth-aachen.de/AVT/index.php?id=484&L=1.
10 The MEXA website is http://www.avt.rwth-aachen.de/AVT/index.php?id=891&L=1.
#pragma ad equation
{ /* equation 1 */ }
#pragma ad equation
{ /* equation 2 */ }
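into code that lets these equations execute simultaneously. One conceivable target, sketched here purely for illustration (this is an assumption, not the transformation rules defined later in this work), is the sections construct, where each equation becomes one section:

#pragma omp parallel sections
{
  #pragma omp section
  { /* equation 1 */ }
  #pragma omp section
  { /* equation 2 */ }
}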
This work will define transformation rules for OpenMP codes such as the above, and
therefore the parallelism is not lost during the derivative code creation. Once the AD
source transformation for OpenMP pragmas is well defined, one can think about
an approach where we transform the ad equation structure into code that uses het-
erogeneous parallel programming. For example, a solution would be to use
distributed memory parallel programming on an outer level and shared memory
parallel programming on an inner level. The number of equations is, for example,
divided into halves and each half is processed by one MPI process running on a dif-
ferent cluster node. Inside the two MPI processes, we continue to exploit the
parallelism of the equations by using shared memory parallel programming with
OpenMP or OpenACC. As a result, this heterogeneous approach allows us to use
the maximum number of cores that a certain cluster configuration has available,
and the number of equations is not limited by a single machine's hardware constraints
because the equations are distributed among cluster nodes with the help of MPI.
Hybrid approaches with MPI/OpenMP usually show good scaling properties, as
presented for example in [49].
Mathematical models in process engineering become more and more complex,
and with increasing complexity the number of equations usually grows as well.
With an increasing number of equations, there is no way around letting
the equations be calculated simultaneously. To emphasize this, we con-
sider the sizes of the derivative codes from the above listing. Please note that
we created the derivative codes of res by ignoring all ad equation pragmas. The
function res has 19,000 lines of code. The application of dcc to the function res
in tangent-linear mode results in 73,000 lines of code, whereas the application of
the adjoint model gives us 78,000 lines of code. The second derivative code is
obtained by reapplying dcc to the first derivative code.
$$\min_{x \in \mathbb{R}^4} \; x_1 x_4 (x_1 + x_2 + x_3) + x_3$$
subject to
$$x_1 \cdot x_2 \cdot x_3 \cdot x_4 \ge 25$$
$$x_1^2 + x_2^2 + x_3^2 + x_4^2 = 40 \qquad (1.16)$$
$$1 \le x_1, x_2, x_3, x_4 \le 5 ,$$
with the starting point $x_0 = (1, 5, 5, 1)$.
For this small example, the gradient of the objective function f (x) and the Ja-
cobian of the constraints g(x) can be deduced by symbolic differentiation. Never-
theless, as we saw in the previous section, it is often the case that the constrained
optimization problem consists of many independent equations and all these equa-
tions must be differentiated. The symbolic differentiation approach is therefore
often not applicable due to the size of the implementation of the objective func-
tion. Hence, we apply AD to obtain the first- and second derivative values of the
objective function f and the corresponding constraints g. The Lagrangian function
of (1.16) is defined as $f(x) + g(x)^T \lambda$ and the Hessian of the Lagrangian function
is $\nabla^2 f(x) + \sum_{i=1}^{m} \lambda_i \nabla^2 g_i(x)$, where m is the number of constraints.
As we did for the least-squares problem, we assume that we are given a con-
strained optimization problem which consists of N independent problems. In prac-
tice, the independent problems often vary in the number of variables and in the
computational complexity. This would skew the runtime results because some
threads would still be processing their work while the remaining threads had already
finished. Since we are only interested in scalability re-
sults for this work's approach, we give all the threads a similar amount of work to
process. This is achieved by defining the N subproblems as given in (1.16). The
overall optimization problem can be described as
double y←0.;
tid←omp_get_thread_num();
nt←omp_get_num_threads();
c←n/nt;
lb←tid*c;
ub←(tid+1)*c-1;
i←lb;
while (i≤ub) {
  j←i*4;
  y+←(x[j]*x[j+3]*(x[j]+x[j+1]+x[j+2])+x[j+2]);
  i←i+1;
}
thread_result[tid]←y;
}
for (i←0; i<omp_get_max_threads(); i←i+1) {
  obj+←thread_result[i];
}
The Ipopt API expects the implementation of the objective function in the C
function eval_f, and Listing 1.5 would therefore be part of this code. The user of
Ipopt needs to provide an implementation which computes the first derivative val-
ues of f, and this code belongs into eval_grad_f. Analogously, the values of the
constraints are computed in eval_g, while the Jacobian values of the constraints
are computed in eval_jac_g. The values of the Hessian of the Lagrangian func-
tion are computed in eval_h. All these eval_* functions must be provided by the
user of Ipopt since it does not use any AD features. Hence, we need an AD source
transformation tool capable of transforming OpenMP parallel regions such
as the one given in Listing 1.5. Once we have these transformations, we simply insert them
into the predefined framework of Ipopt.
This concludes the motivation part of this introduction. We saw some applica-
tions where AD is very useful, but as soon as pragma-based parallel regions are
used, it becomes impossible to use AD since the required transformation rules are miss-
ing at this time. This dissertation fills this gap, and since we mainly focus on the
OpenMP standard 3.1, we now introduce the parts of the standard that are treated
in this work.
1.3 OpenMP Standard 3.1
of threads that could be used to form a new team using a parallel construct.
The most important construct is the parallel construct, which starts the parallel
execution. The parallel construct is associated with a sequence of statements that
is executed by a team of threads. Each thread in this team has a unique identifier
number starting with zero. The thread with the identifier zero is called the master
or initial thread. Unless otherwise specified, we assume the size of the group of
threads to be p.
parallel Construct
In order to be as close as possible to the description in the standard, we cite the
standard instead of paraphrasing it. The first citation
describes the parallel construct.
1. if(scalar-expression)
2. num_threads(scalar-expression)
3. default(shared|none)
4. private(list)
5. firstprivate(list)
6. shared(list)
7. copyin(list)
8. reduction(operator: list)
[. . .]
When a thread encounters a parallel construct, a team of threads is created to ex-
ecute the parallel region [. . .] . The thread that encountered the parallel construct
becomes the master thread of the new team, with a thread number of zero for the
duration of the new parallel region. All threads in the new team, including the
master thread, execute the region. Once the team is created, the number of threads
in the team remains constant for the duration of that parallel region.
Within a parallel region, thread numbers uniquely identify each thread. Thread
numbers are consecutive whole numbers ranging from zero for the master thread
up to one less than the number of threads in the team. A thread may obtain its own
thread number by a call to the omp_get_thread_num library routine.
A set of implicit tasks, equal in number to the number of threads in the team,
is generated by the encountering thread. The structured block of the parallel con-
struct determines the code that will be executed in each implicit task. Each task
is assigned to a different thread in the team and becomes tied. The task region of
the task being executed by the encountering thread is suspended and each thread
in the team executes its implicit task. Each thread can execute a path of statements
that is different from that of the other threads.
The implementation may cause any thread to suspend execution of its implicit
task at a task scheduling point, and switch to execute any explicit task generated by
any of the threads in the team, before eventually resuming execution of the implicit
task [. . .] .
There is an implied barrier at the end of a parallel region. After the end of
a parallel region, only the master thread of the team resumes execution of the
enclosing task region.
If a thread in a team executing a parallel region encounters another parallel di-
rective, it creates a new team, according to the rules in [. . .], and it becomes the
master of that new team.
If execution of a thread terminates while inside a parallel region, execution of
all threads in all teams terminates. The order of termination of threads is unspeci-
fied. All work done by a team prior to any barrier that the team has passed in the
program is guaranteed to be complete. The amount of work done by each thread
after the last barrier that it passed and before it terminates is unspecified."
S0
#pragma omp parallel
{
S1
}
S2
Shown above are three structured blocks S0 , S1 , S2 . These blocks are sequences of statements,
whereby the specific kind of statement is not of importance. S1 is declared to
be evaluated in parallel by a team of threads. Consider the code above and assume that
S0 and S2 contain only one statement. The parallel region is represented by S1 =
(s1 ; . . . ; sq ) comprising q statements. The evaluation of this code is illustrated
in Figure 1.3. The q statements are executed by a team of p threads. The following
notation is used for expressing that the i-th statement in a sequence of statements
is executed by thread t: $s_i^t$.
For example, $s_1^0$ is the first statement in a sequence and it is evaluated by thread 0.
$s_1^3$ denotes an instance of $s_1$ that is executed by thread 3. All the p instances of the
first statement that are executed by the team of threads are denoted by $s_1^0, \ldots, s_1^{p-1}$.
These instances are shown on the same horizontal level in Figure 1.3. This ex-
presses that they are theoretically executed simultaneously.
The topmost node in Figure 1.3 is only executed by the master thread, which is
indicated with the superscript 0. Then, following the arrows in the figure from top
to bottom, a team of p threads is created and each thread processes the statements
of sequence S1 , namely (s1 ; . . . ; sq ). After each thread has encountered the end
of the parallel region by finishing statement sq , the master thread continues the
sequential execution with the statement sq+1 . An implicit barrier at the end of the
parallel region ensures that before statement sq+1 is processed, each thread has
finished its execution of the parallel region. This is called a join operation.
Example 3. (Memory model of OpenMP)
This example displays the use of private and shared variables inside a parallel
region. We define two floating-point variables x and y outside the parallel region.
In OpenMP, these two variables are by default shared variables inside the scope
of the parallel region. Therefore, each thread can read from and write to the memory
locations of x and y. The parallel region is executed by two threads as defined by
omp_set_num_threads(2). Each thread prints its thread identifier number and
the addresses of x and y on the screen through a function call to fprintf. After the
parallel region the master thread prints its memory addresses of x and y through
another call to fprintf.
fprintf(stderr, "Inside/Outside of\n");
fprintf(stderr,
  "parallel region            &x                &y\n");
fprintf(stderr,
  "----------------------------------------------------\n");
float x;
float y;
omp_set_num_threads(2);
#pragma omp parallel
{
  float x;
  fprintf(stderr,
    "Inside (thread %d)  %15p  %15p\n", omp_get_thread_num(), &x, &y);
}
fprintf(stderr,
  "Outside (thread %d) %15p  %15p\n", omp_get_thread_num(), &x, &y);
We have two lines showing the addresses from inside the parallel region, one is for
the thread with the ID zero, the other one is for thread one. The third line shows
the addresses, which the variables have outside of the parallel region.
Variable x is defined twice: once outside the parallel region and once inside the
parallel region. The memory model of OpenMP determines that all variables
defined inside the parallel region are private and only visible to each
individual thread.
The column with the header '&x' displays that the code has
three instances of x: one for each thread and one that is accessed only from outside
of the parallel region. The latter is not accessible from inside the parallel
region since all references to x there are linked to the private instance of x.
The variable y is defined only once, outside the parallel region. Without any
additional clauses defining the status of y, this means that y is shared and accessible
for the whole group of threads which execute the parallel region. This means that
we only have one instance of y. The output of this code displays this in the column
with the header '&y' by showing three times the same address, in contrast to the
column for '&x'.
Please note that the memory location where the private variables are placed is
compiler dependent and is not determined by the OpenMP standard. For example,
for those who are familiar with the memory model of Linux, both instances of x of
thread 0 are placed on the stack, whereas the instance of thread 1 is located
inside the data segment of the executing process. This may be different for distinct
compilers and the developer cannot rely on it.
$y_i = 2 \cdot x_i \cdot x_i$, where $i = 0, \ldots, n-1$.
The code from line 3 to line 15 defines a lower bound (lb) and an upper bound
(ub). These boundaries are unique for each thread and define a range of the input
data that is processed by a certain thread. The lines 12 and 15 are for adjusting the
chunk size and the upper bound for the case that the data cannot be partitioned into
equal parts. This is the case when the data size n is not divisible by the number of
threads p.
1  #pragma omp parallel
2  {
3    int i←0;
4    int tid←0;
5    int lb←0;
6    int ub←0;
7    int chunk_size←0;
8    tid←omp_get_thread_num();
9    p←omp_get_num_threads();
10   chunk_size←n/p;
11   i←chunk_size*p;
12   if (i≠n) { chunk_size←chunk_size+1; }
13   lb←tid*chunk_size;
14   ub←(tid+1)*chunk_size-1;
15   if (ub≥n) { ub←n-1; }
16   for (i←lb; i≤ub; i++) {
17     y[i]←2*x[i]*x[i];
18     x[i]←0.;
19   }
20 }
Listing 1.6: Data decomposition for a typical SPMD problem.
The method shown in Listing 1.6 is called data decomposition and allows an exe-
cution of the program where each processor fetches its own instructions and oper-
ates on its own data. The underlying model is called the Single Program Multiple
Data (SPMD) model which is very often used in parallel programming [66].
Each thread gets its own chunk of data but the whole group of threads execute the
same set of instructions only with different offsets for accessing the data. However,
OpenMP could not call itself an API for abstracting low-level thread programming
if it did not provide mechanisms to do this data decomposition implicitly. The loop
construct that is introduced in the section about worksharing constructs is a typical
example where implicit data decomposition is used.
The next example illustrates that knowledge about data dependencies is still
necessary when using OpenMP. One can easily write wrong code when the data
dependencies are ignored by the software engineer. Another reason can be that
the developer simply forgets to share the work, as shown in Example 5. The compiler
will not abort the compilation with an error when the code is not parallelizable;
it will just create code according to the pragmas that the software developer has
provided. In the best case, the compiler gives a warning, but this presupposes that
the compiler can guess what the developer probably wanted to do, which is often
not the case.
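
A minimal sketch of the situation described in Example 5 is shown below; the arrays x and y and the size n are assumed to be the same as in Listing 1.6 (the original listing of Example 5 is not reproduced here).

#pragma omp parallel
{
  /* No worksharing construct: every thread executes all n iterations. */
  for (int i = 0; i < n; i++) {
    y[i] = 2 * x[i] * x[i];  /* reads x[i], which another thread may already have zeroed */
    x[i] = 0.;               /* the critical reference written by all threads */
  }
}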
All threads simultaneously write to the memory locations y+sizeof(y_0)*i, where
i ∈ {0, 1, . . . , n − 1}. In the previous example, the range 0, 1, . . . , n − 1 was parti-
tioned by an explicit data decomposition and therefore each thread in the team was
responsible for only one partition. Here, each thread processes all n iterations of
the loop. In iteration i, each thread first sets the value of y_i and then it sets the
component x_i to zero. The value of y_i depends on the value of x_i and therefore
it depends on whether or not the assignment that sets x_i to zero has already
been executed by another thread. The result in y_i is decided by a race between
read and store operations from different threads. Therefore, this situation at run-
time is called a race condition. The reason for this race condition is the critical
reference x_i that is read and stored by multiple threads.
Worksharing Constructs
The developer of Example 5 probably wanted to share the store operations among
the threads. This is done by defining a worksharing directive as explained in
this section.
The next three paragraphs introduce the worksharing constructs of OpenMP.
The first, the loop construct, is applicable to a counting loop; the sections construct is for defining dif-
ferent code sections which share no data dependencies with one another. Finally, the
single construct is for defining a sequence of statements to be executed by only one
thread; the remaining threads jump over this sequence of statements.
Loop construct
OpenMP 3.1 Citation 18. Loop Construct, p. 39
"The loop construct specifies that the iterations of one or more associated loops
will be executed in parallel by threads in the team in the context of their implicit
tasks. The iterations are distributed across threads that already exist in the team
executing the parallel region to which the loop region binds.
The syntax of the loop construct is as follows:
#pragma omp for [clause[[,] clause] [. . .]] new-line
for-loops
1. private(list)
2. firstprivate(list)
3. lastprivate(list)
4. reduction(operator: list)
5. schedule(kind[, chunk_size])
6. collapse(n)
7. ordered
8. nowait
The for directive places restrictions on the structure of all associated for-loops.
[. . .]
The canonical form allows the iteration count of all associated loops to be com-
puted before executing the outermost loop. [. . .]
The loop construct is associated with a loop nest consisting of one or more loops
that follow the directive.
There is an implicit barrier at the end of a loop construct unless a nowait clause
is specified. [. . .]
A worksharing loop has logical iterations numbered 0,1,[. . .],N-1 where N is the
number of loop iterations, and the logical numbering denotes the sequence in
which the iterations would be executed if the associated loop(s) were executed
by a single thread."
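
For comparison with the explicit decomposition in Listing 1.6, the same computation can be written with the loop construct; a minimal sketch (assuming the arrays x and y from before) looks as follows:

#pragma omp parallel
{
  #pragma omp for
  for (int i = 0; i < n; i++) {
    y[i] = 2 * x[i] * x[i];
    x[i] = 0.;
  }
}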
Obviously, the implicit data decomposition makes the code much easier to read,
which was one big objective of developing the OpenMP standard.
sections Construct
This construct is more suitable in cases where the threads should execute code
whose individual control flow differs from thread to thread. The
easiest example is that two procedure calls can be executed simultaneously since
the two subroutines do not have any data dependencies. In this case each section
contains one subroutine call.
1. private(list)
2. firstprivate(list)
3. lastprivate(list)
4. reduction(operator: list)
5. nowait
$y_i \leftarrow x_i \cdot x_i$
$z_i \leftarrow \sin(x_i) \cdot x_i$, where $i \in \{1, \ldots, n\}$
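
The two result vectors do not depend on each other, so the two loops can be placed in separate sections. A minimal sketch, assuming arrays x, y, and z of length n (and <math.h> for sin):

#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section
    for (int i = 0; i < n; i++)
      y[i] = x[i] * x[i];

    #pragma omp section
    for (int i = 0; i < n; i++)
      z[i] = sin(x[i]) * x[i];
  }
}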
single Construct
In case we define a structured block that should be executed by only one thread and
not the whole group of threads, the single construct can be used. This can, for
example, be useful for changing shared scalar data.
OpenMP 3.1 Citation 20. p. 50
"The single construct specifies that the associated structured block is executed
by only one of the threads in the team (not necessarily the master thread), in the
context of its implicit task. The other threads in the team, which do not execute the
block, wait at an implicit barrier at the end of the single construct unless a nowait
clause is specified. The syntax of the single construct is as follows:
#pragma omp single [clause[[,] clause] [. . .]] new-line
structured-block
There are two further worksharing constructs which are a combination of the
parallel construct and a worksharing construct. The reason for defining a separate
construct is that it often occurs that, for example, a parallel region contains nothing
but one parallelizable loop. Without the combined version, the developer would have
to use the parallel construct first, and the loop construct would have to be placed
inside of the associated structured code block. To make this more compact and
readable, OpenMP provides the combined parallel worksharing constructs.
Example 8. In this example we use the combined parallel loop construct from
OpenMP.
#pragma omp parallel for
for (i←0; i<n; i++) {
  y[i]←2*x[i]*x[i];
  x[i]←0.;
}
It is semantically equivalent to Example 6 and Example 4.
where clause can be any of the clauses accepted by the parallel or for directives,
except the nowait clause, with identical meanings and restrictions."
10   for (int i←0; i<n; i++)
11     z[i]←sin(x[i]);
12   }
13 }
This code is obviously more compact through the usage of the combined parallel
sections construct.
master Construct
OpenMP 3.1 Citation 24. p. 67
"The master construct specifies a structured block that is executed by the master
thread of the team.
#pragma omp master new-line
structured-block
Other threads in the team do not execute the associated structured block. There is
no implied barrier either on entry to, or exit from, the master construct."
critical Construct
In shared memory parallel programming, multiple threads may access the same
shared variable with read or write operations. Unsynchronized simultaneous read and write oper-
ations by multiple threads must be avoided since they lead to inconsistent values.
One speaks of a race condition since the concurrent threads compete with each other
by reading from and writing to the memory. The winning thread, or rather
the thread that determines the result, is the one that performs the last write opera-
tion.
In other words, the result depends on the order of the read and write operations,
which in turn depends on the nondeterministic context switches between the threads cur-
rently executing the parallel region. A race condition must be prevented and can,
for example, be avoided by putting the part of the code that causes the race condi-
tion into a critical construct. This construct ensures that at each point in time only
one thread is inside the critical section. The critical-section problem is a famous
problem in computer science and can be solved by a construct satisfying the three
requirements mutual exclusion, progress, and bounded waiting [74, 75]. These
requirements are fulfilled by the OpenMP critical construct.
OpenMP 3.1 Citation 25 (page 68). "The critical construct restricts execution
of the associated structured block to a single thread at a time.
#pragma omp critical [(name)] new-line
structured-block
An optional name may be used to identify the critical construct. All critical con-
structs without a name are considered to have the same unspecified name. A thread
waits at the beginning of a critical region until no thread is executing a critical re-
gion with the same name. The critical construct enforces exclusive access with
respect to all critical constructs with the same name in all threads, not just those
threads in the current team."
9      j←j+1;
10   }
11   #pragma omp critical
12   {
13     thread_result[0]←sin(thread_result[0])*sin(y);
14   }
15   i←i+1;
16 }
17 }
The data decomposition is here only indicated by dots. A thread local result is
computed using the private variable y. Afterwards, all results from the team of
threads flow into a computation where the reference thread_result[0] is succes-
sively updated by the assignment inside the critical construct.
barrier Construct
OpenMP 3.1 Citation 26 (pages 70). "The barrier construct specifies an explicit
barrier at the point at which the construct appears.
Example 11. In case a reduction is necessary, one can use the master construct
as displayed in the following code. The dots hide the part where a data decomposition
takes place.
#pragma omp parallel
{
  ...
  while (i≤ub) {
    j←0;
    while (j<n) {
      y←y*sin(A[i*n+j]*x[j])*cos(A[i*n+j]*x[j]);
      j←j+1;
    }
    i←i+1;
  }
  thread_result[tid]←y;
  #pragma omp barrier
  #pragma omp master
  {
    i←1;
    while (i<p) {
      thread_result[0]←thread_result[0]*thread_result[i];
      i←i+1;
    }
  }
}
A loop computes the thread local result for the reference thread_result[tid], where
tid is the unique thread number. Subsequently, the barrier construct defines a meet-
ing point in the parallel region that all threads have to reach before the execution
of any thread may continue. This is necessary to ensure that all threads have fin-
ished their work before the reduction operation summarizes the results of each
thread. After all threads have encountered the barrier, the master thread enters the
corresponding construct and executes the reduction.
atomic Construct
The atomic construct can, like the critical construct, be used for avoiding a
race condition. The difference between these constructs is that the atomic con-
struct applies only to the single assignment that follows it instead of a whole subse-
quence of statements.
OpenMP 3.1 Citation 27 (pages 73-77). "The atomic construct ensures that a
specific storage location is accessed atomically, rather than exposing it to the pos-
sibility of multiple, simultaneous reading and writing threads that may result in
indeterminate values.
#pragma omp atomic [read | write | update | capture] new-line
expression-stmt
or
#pragma omp atomic capture new-line
structured-block
1.3 OpenMP Standard 3.1 47
[. . .]
and where structured-block is a structured block with one of the following forms:
[. . .]
The atomic construct with the update clause forces an atomic update of the loca-
tion designated by x using the designated operator or intrinsic. Note that when
no clause is present, the semantics are equivalent to atomic update. Only the read
and write of the location designated by x are performed mutually atomically. The
evaluation of expr or expr_list need not be atomic with respect to the read or write
of the location designated by x. No task scheduling points are allowed between
the read and the write of the location designated by x."
We only consider the atomic construct with no additional clauses. This is ac-
tually the definition of the atomic construct up to OpenMP 3.0. The additional
clauses were first defined with the publication of version 3.1 and will be neglected in
this work.
Example 12. Similar to Example 10, the following code contains an assignment
where the memory location thread_result[0] is updated by all
threads.
#pragma omp parallel
{
  ...
  while (i≤ub) {
    j←0;
    y←0.;
    while (j<n) {
      y+←A[i*n+j]*x[j];
      j←j+1;
    }
    #pragma omp atomic
    thread_result[0]+←y;
    i←i+1;
  }
}
The atomic construct synchronizes these concurrent updates so that the critical reference
is no longer used in a nondeterministic way.
The first pragma introduced here is the threadprivate directive. This directive
defines static data to be replicated for each thread.
OpenMP 3.1 Citation 29 (p. 88). "The threadprivate directive specifies that vari-
ables are replicated, with each thread having its own copy. The syntax of the
threadprivate directive is as follows:
#pragma omp threadprivate(list) new-line
• The number of threads used to execute both parallel regions is the same.
private clause
To declare thread local data one can use the private clause.
OpenMP 3.1 Citation 30 (p. 96). "The private clause declares one or more list
items to be private to a task. The syntax of the private clause is as follows:
private(list)
Each task that references a list item that appears in a private clause in any state-
ment in the construct receives a new list item whose language-specific attributes
are derived from the original list item. Inside the construct, all references to the
original list item are replaced by references to the new list item. In the rest of the
region, it is unspecified whether references are to the new list item or the original
list item. Therefore, if an attempt is made to reference the original item, its value
after the region is also unspecified. If a task does not reference a list item that
appears in a private clause, it is unspecified whether that task receives a new list
item.
The value and/or allocation status of the original list item will change only:
firstprivate clause
To declare thread local data that is initialized with the value of the global instance
one can use the firstprivate clause.
OpenMP 3.1 Citation 31 (p. 98). "The firstprivate clause declares one or more
list items to be private to a task, and initializes each of them with the value that the
corresponding original item has when the construct is encountered. The syntax of
the firstprivate clause is as follows:
firstprivate(list)
clause on a worksharing construct, the initial value of the new list item for each
implicit task of the threads that execute the worksharing construct is the value of
the original list item that exists in the implicit task immediately prior to the point
in time that the worksharing construct is encountered.
To avoid race conditions, concurrent updates of the original list item must be
synchronized with the read of the original list item that occurs as a result of the
firstprivate clause. If a list item appears in both firstprivate and lastprivate clauses,
the update required for lastprivate occurs after all the initializations for firstpri-
vate."
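
A minimal sketch of the firstprivate clause inside some enclosing function (the variable name is illustrative; <omp.h> is assumed to be included):

double scale = 2.0;
#pragma omp parallel firstprivate(scale)
{
  /* Each thread works on its own copy of scale, initialized to 2.0;
     the original scale outside the region is not modified. */
  scale = scale + omp_get_thread_num();
}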
lastprivate clause
OpenMP 3.1 Citation 32 (p. 101). "The lastprivate clause declares one or more
list items to be private to an implicit task, and causes the corresponding original
list item to be updated after the end of the region.
The syntax of the lastprivate clause is as follows:
lastprivate(list)
reduction clause
OpenMP 3.1 Citation 33 (p. 103). "The reduction clause specifies an operator
and one or more list items. For each list item, a private copy is created in each
implicit task, and is initialized appropriately for the operator. After the end of the
region, the original list item is updated with the values of the private copies using
the specified operator.
The syntax of the clause reduction is:
reduction(operator: list)
The following table lists the operators that are valid and their initialization values.
The actual initialization value depends on the data type of the reduction list item.
Operator   Initialization value
+          0
*          1
-          0
...        ...
[. . .] A private copy of each list item is created, one for each implicit task, as
if the private clause had been used. The private copy is then initialized to the
initialization value for the operator, as specified above. At the end of the region
for which the reduction clause was specified, the original list item is updated by
combining its original value with the final value of each of the private copies, using
the operator specified. (The partial results of a subtraction reduction are added to
form the final value)."
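To illustrate how these clauses typically appear together, the following is a minimal C++/OpenMP sketch of our own (not taken from the OpenMP document); the kernel and all variable names (example, x, n, scale, tmp, last, sum) are purely illustrative.

#include <omp.h>

// Minimal sketch: tmp is private to each thread, scale is firstprivate
// (each thread starts from the original value), last is lastprivate (the
// value from the sequentially last iteration survives the loop), and the
// partial sums are combined with the + reduction operator.
double example(const double* x, int n, double scale) {
    double sum = 0.0, last = 0.0, tmp = 0.0;
    #pragma omp parallel for private(tmp) firstprivate(scale) \
        lastprivate(last) reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        tmp = scale * x[i];
        sum += tmp;
        last = tmp;
    }
    return sum + last;
}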
This was a brief introduction to the OpenMP 3.1 standard; the original document
spans several hundred pages and contains much more information. However, we
have cited those parts of the original document that are of importance for this
work. The next section describes related work.
product as we defined it in (1.10); instead, the derivative code computes a Jacobian
matrix product.
The forward mode in non-vector (scalar) mode augments each assignment in the
original code by another assignment that computes one directional derivative value
corresponding to the original assignment. The vector mode, on the other hand,
augments the original assignment not by a single assignment but by a loop that
computes the whole gradient of the original assignment. We do not cover the vector
mode in this work since it involves a large amount of memory, especially in the
adjoint case. However, since the computations of the individual gradient components
are independent of one another, the corresponding loop is parallelizable. The
authors of [10] suggest using the OpenMP loop construct for declaring the loops
of the tangent-linear vector mode as parallelizable. To avoid side effects during
the execution of the statements coming from the original code Q, they recommend
using the master construct. This implies the need for synchronization with the
barrier directive.
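A minimal sketch of how such a construction might look is shown below (our own illustration, not code from [10]); the names vector_mode_assignment, g_y, g_x1, and g_x2 are hypothetical, and the function is assumed to be called from inside an enclosing parallel region.

#include <omp.h>

// Sketch: the gradient components are independent, so the loop over the n
// directions is a worksharing loop; the original statement is executed by
// the master thread only, and synchronization keeps both parts in order.
void vector_mode_assignment(double& y, double x1, double x2,
                            double* g_y, const double* g_x1,
                            const double* g_x2, int n) {
    #pragma omp for
    for (int j = 0; j < n; ++j)          // whole gradient of y = x1*x2
        g_y[j] = x2 * g_x1[j] + x1 * g_x2[j];
    // implicit barrier at the end of the worksharing loop
    #pragma omp master
    { y = x1 * x2; }                      // original statement, master only
    #pragma omp barrier                   // synchronize before the next statement
}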
To prevent the synchronization overhead of the previous approach, [11] describes
an extension with better scaling properties. This paper suggests avoiding the
master construct, which lets only the master thread compute the original statements,
and instead computing the original statements redundantly on the whole team of
threads. Obviously, this requires additional thread-local memory so that the
different computations do not interfere with each other. The synchronization is
avoided, but the price is redundant computations and a higher memory usage.
Another extension, recommended by [11], is to use preprocessed loop bounding
scheduling. This is possible here because all loops corresponding to a gradient
computation have exactly the same number of iterations.
In case the original code Q already contains OpenMP directives, the above
approach can still be applied by using nested parallelism. This is discussed in [12]
and leads to two levels of parallel execution. Each thread attending the parallel
execution on the first level creates another team of threads each time it encounters
an OpenMP #pragma omp parallel directive on the second level of the parallel
execution. The implicit barrier at the end of each parallel region suggests that this
approach causes a lot of synchronization overhead on the second level. However,
since the authors only present a potential speedup formula instead of runtime
results, we can only speculate that the overhead of this approach is a major factor.
The papers mentioned above mainly focus on the tangent-linear vector mode.
The main advantage of the vector mode is that it provides the whole Jacobian
matrix after one evaluation of the derivative code. Due to the high amount of
memory necessary for the execution, we do not use the vector mode of AD. This
fact immediately prohibits the above approaches of computing the derivatives with
an OpenMP loop.
Our work differs from the approaches above in that we do not assume that a
derivative code is given which should be tuned using OpenMP. Instead, we
assume that an original code Q is given that contains an OpenMP parallel region
P. We examine how to achieve the generation of the tangent-linear code and the
adjoint code of P while assuming that the derivative code of Q without the parallel
region P is known. Therefore, we focus on transforming P correctly into the
corresponding derivative code, depending on which AD mode we apply.
As mentioned in the AD introduction, besides the source transformation method
there is also the overloading method for obtaining derivative values. If the
overloading method is to be applied to a code containing a parallel region P, then
P can be adjusted such that certain information about the parallel execution is
provided at runtime. This approach was introduced in [6, 48]. The adjustment of
P is that a firstprivate clause is used to inform the AD overloading tool ADOL-C13
at runtime that a parallel execution is underway. Since the overloading method is
a runtime solution, the OpenMP pragma information cannot be exploited without
any adjustments of the pragmas in the original code.
Applications of AD by source transformation applied to Fortran code are de-
scribed in [34, 35, 27, 26, 39, 28]. The source transformation in [27] uses OpenMP
1.1. Nevertheless, it is not mentioned which OpenMP constructs are known to the
AD tool, and [34] states that adjoint support routines were written by hand to offer
a basic support of OpenMP.
The computation of derivatives in the tangent-linear and in the adjoint model while
using distributed memory parallel programming with MPI is examined in [71, 72,
77]. In order to prove the correctness of MPI adjoint code, [58] proposes a framework
where the different possible interleavings of a parallel execution are considered.
This work was taken as a starting point for studying how to prove the correctness
of the source transformation results. It turns out that the distributed memory
model has a fundamental advantage compared to the shared memory model. When
considering all the possible interleavings of a parallel execution, the usage of the
distributed memory model reduces the number of possibilities considerably.
This originates from the fact that each process only sees its own local memory.
Hence, a combination of statements from two different processes can be executed
in arbitrary order. The result is the same since the store operations to the
distributed memory locations cannot interfere with each other.
This dissertation is about shared memory parallel programming and therefore
each and every possible interleaving must be examined to reveal whether there is a
possible problem in the parallel execution or not. Hence, we will keep the number
of possibilities small by reducing the set of language statements to a subset that
covers most of the occurring numerical kernels.
13 https://projects.coin-or.org/ADOL-C
1.5 Contributions
This dissertation makes the following contributions:
1. A methodology to show the correctness of a parallel execution. The for-
malism of this methodology is based on the approach of using interleavings as a
mathematical abstraction for a parallel execution. We present a formalism that is
applicable to a parallel region with a sequence of statements. This parallel region
is assumed to be executed by a team of threads. We purposely keep this formalism
at an abstract level so that it can, in principle, be applied to different parallel
programming models. Our formalism is sophisticated enough to cover the effects
of specific constructs of a parallel programming API such as synchronization.
Although we only consider shared memory parallel programming, our formalism
can also be used for distributed memory (MPI) or hybrid machine architectures
(GPU, FPGA, Intel Phi).
3. A formal proof that our AD source transformation has the closure prop-
erty. We assume the code that serves as input for our source transformation is cor-
rect in the sense that it can be executed concurrently without any race conditions
or side effects. The closure property is fulfilled if, when we apply the source
transformation to a given input code P, the result is again a code contained in SPL
and, in addition, the parallel execution of this code is correct, for example free of
race conditions. In case of the tangent-linear source transformation we will see
that this property is fulfilled without any restrictions on the input code except that
its correctness must be given. The adjoint source transformation, on the other
hand, fulfills the closure property only for an input code which fulfills the
so-called exclusive read property. This property is fulfilled by a parallel region P
if it is ensured that during the execution of P each memory location that is used
on a right-hand side of a floating-point assignment is used by one thread
exclusively and not by multiple threads concurrently.
The formal proofs which show the above properties distinguish between possible
combinations of statements. In case of the tangent-linear code we have to consider
10 cases; the proof for the adjoint code contains 36 different cases.
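As an illustration of the exclusive read property, consider the following sketch (ours, not taken from the formal definitions later in this work; the function and variable names are hypothetical): in the first loop each thread reads only memory locations of its own iterations, whereas in the second loop all threads read x[0] concurrently, so the adjoint of x[0] would have to be updated by all threads in the reverse section.

#include <omp.h>

void exclusive_read(const double* x, double* y, int n) {
    // Exclusive read: the right-hand side of iteration i reads only x[i],
    // so no memory location is read by two threads concurrently.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = x[i] * x[i];
}

void non_exclusive_read(const double* x, double* y, int n) {
    // Violates the exclusive read property: every iteration also reads
    // x[0]; in the adjoint code all threads would increment the adjoint
    // of x[0], which requires synchronization (e.g., an atomic update).
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = x[0] * x[i];
}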
4. A static program analysis for recognizing the exclusive read property of
a given SPL code. If P does not fulfill the exclusive read property, we have to
use synchronization methods for the reverse section of the adjoint code in order to
allow a correct concurrent execution. One could be conservative and let the
adjoint source transformation introduce a synchronization mechanism for each
adjoint assignment. This provides a correct adjoint code, but from a certain
number of threads on, the execution effectively becomes sequential since all
threads are busy waiting on each other. This conservatism is expected to be very
expensive in terms of runtime performance. Therefore, we present a static program
analysis that provides the information whether or not synchronization is necessary
for a certain adjoint assignment. The static program analysis that we introduce is
called the exclusive read analysis (ERA).
We present experimental results where we compare the runtime results of an ad-
joint code obtained without using the ERA with the results of an adjoint code
created using the ERA. These results clearly show that a static analysis of the
original code is crucial. The adjoint code generated with the help of information
from the ERA is on average three times faster than the conservative adjoint code.
The runtime improvements lie between a factor of 1.5 and 7.74, which means that
the least improvement is 50% and the biggest provides a code that is almost eight
times faster than the adjoint code obtained without using the ERA. This improvement
can also be expressed in million floating-point operations per second (MFLOPS).
For example, the execution of the adjoint code with 32 threads without using the
ERA achieves 4622 MFLOPS, while the execution of the code using the ERA
reaches 16378 MFLOPS, an improvement factor of about 3.5.
5. Implementation of the defined source transformation rules in a tool called
SPLc. The source transformation rules from Chapter 2 and Section 4.2 have been
implemented in SPLc (see Appendix A). This means that SPLc implements the
source transformations that are necessary to fulfill the closure property. This makes
it possible to generate derivative codes of arbitrarily high order. SPLc performs
the source transformation either interactively on the console, or the user can
provide a filename that contains SPL code.
To give the reader an impression of the code sizes that SPLc generates, we take an
OpenMP code, called 'plain-parallel', as example. This example code is contained
together with the derivative codes in Appendix B.1. The source has 33 lines of
code and is referred to as the original code. The corresponding first-order tangent-
linear code has 77 lines, the first-order adjoint code has 148 lines of code. The
reapplication of SPLc to the first derivative code provides the second derivative
code. The second-order tangent-linear code has 291 lines of code, the second-order
adjoint code in forward-over-reverse mode has 395 lines of code, and the
second-order adjoint code in reverse-over-forward mode has 507 lines of code.
6. Extensive tests showing scalability and successful application. We
developed a test suite to show the scalability of this work. This test suite consists
of seven different OpenMP codes, starting with a code containing only a parallel
region, then codes with synchronization constructs of OpenMP, and finally a test
case where we examine the properties of the second-order derivative code. In each
test case we show the results in terms of runtime, MFLOPS, and the stack sizes
used during the adjoint computation. We use two compilers, the Intel compiler
and the g++ compiler, to produce the binary executables. We first use the compilers
without any code optimization; afterwards, we show the impact of enabling the
code optimization of the Intel and the g++ compiler. We show that these compilers
supply very different results in terms of speedup and MFLOPS.
Another result is that the adjoint code is not well suited for the typical code
optimization provided by the Intel and the g++ compiler. The experimental results
of some adjoint codes reveal that the speedup value is higher than the one of the
original code. This fact shows that it pays off substantially to use concurrency for
computing adjoint values.
To show the application of our approach we implemented the least-squares prob-
lem and the constrained optimization problem which we introduced in this chapter.
The corresponding first and second derivative codes are provided by SPLc.
where the appearance of the code inside of the parallel region is irrelevant at this
point. The term ’pure’ means in this context that there are no additional pragmas
inside the parallel region. However, we try to stay at a level that allows us to apply
the results of this chapter to a non-OpenMP parallel region.
[Figure omitted: remaining statement queues of the three threads after six executed statements, e.g. thread 0 still holds s4 s5 s6 . . . sq and thread 2 still holds s2 s3 s4 . . . sq.]
Figure 2.1: This figure shows one possible status of the program execution after processing
six statements with three threads. Each thread has q statements (s1; . . . ; sq) at the beginning
of the execution. Then the scheduler chooses some statements from the three threads and
puts them into the interleaving. After the scheduler has taken six statements, the status of
the execution can appear as in this figure. At this point in the execution, the first three
statements of thread 0 have already been executed. The next time the scheduler decides
to choose thread 0 to be the execution thread, statement s^0_4 is put into the interleaving.
Thread 1 is waiting for statement s^1_3 being put into the interleaving. For thread 2, this
holds analogously for statement s^2_2. Because all statements have a fixed order inside their
respective executing thread, statement s^t_{i+1} cannot be executed before statement s^t_i. In the
figure we see that statement s^0_2 occurs before statement s^0_3.
In OpenMP one can define certain assignments to be atomic by using the di-
rective atomic. We do not mean this atomicity when we write about an atomic
statement. Nevertheless, the atomic construct can be associated with a statement s
such that s is executed atomically as we will see in Chapter 4. This is very helpful
in case that statement s contains a critical reference. We cite the definition of a
critical reference that Ben-Ari presents in his textbook.
The author Ben-Ari presents an example as justification for the abstraction with
interleavings of atomic statements, see [5, p. 16]. The example describes the
following situation where two threads want to write to the same memory location on
a multiprocessor machine. One thread wants to write the value 1 and the other
thread wants to write the value 2. Thus, both threads write different bit values. The
question is what the result of the store operation is. The value might be undefined
if both bit values were combined by a logical OR operation. In this case we would
have the value 3 inside the memory location. This would reveal the abstraction as
unqualified, but in practice this case does not occur because the memory hardware
writes values not bitwise but cell-wise. A cell is, for example, 32 bits on an 80386
processor or 64 bits on modern processors. The bits that the threads want to write
are contained in the same cell. Therefore, the memory hardware completes one
access before the other takes effect. In this case the memory hardware provides
the atomicity assumed by the interleaving abstraction.
For our approach it is important to understand that the atomic statements of each
individual thread are interleaved arbitrarily. For example, let us consider a parallel
region P that consists of the sequence of statements S = (s1; . . . ; sq). The atomic
statements considered during the execution are (s^0_1; . . . ; s^0_q) in case of thread 0,
(s^1_1; . . . ; s^1_q) for thread 1, and so on. Permutations in which the order does
not reflect the order of the statements inside a particular thread are not valid.
For example, an interleaving beginning with s^1_1 s^0_2 s^1_2 s^0_1 s^2_1 is not possible, since the
second statement s^0_2 of thread 0 would appear in the interleaving before the first
statement s^0_1 of thread 0.
The set of variables used in a parallel region P is denoted by VAR_P, the set of
variables occurring in statement s is denoted by VAR_s. Most of the time we will
talk about comparing references instead of comparing variables. Therefore, we
need a mapping from the set of variables to the memory address space. This is
done by the operator &, which maps a variable v to its memory reference &v.
The following example uses a statement where multiple threads increment a shared
variable to clarify the notion of a critical reference and the LCR property.
Example 13. Definition 34 can be explained by a source statement s represented
by
s = n ← n + 1 ,
where n is a shared variable. This example is similar to the example in [5, p. 27].
Depending on the set of statements contained in the assembly language of a given
architecture, this statement can be compiled into an atomic increment instruction
or into a sequence of assembly statements whose semantics is the same as that of
the source statement. What one can expect when considering Figure 2.1 is that the
number of interleavings is different when we consider, on the one hand, a program
consisting of one atomic increment instruction, or on the other hand, a program
that consists of a sequence of assembly statements. The behavior can differ from
the case in which the sequence of assembly statements is evaluated atomically.
Nevertheless, the abstraction of atomic source statements can be used when each
statement fulfills the LCR property. To clarify this, we split the statement s into

s1 = a ← n
s2 = n ← a + 1 ,
s^t = n^t ← n^t + 1

where we use a superscript index to indicate the thread that accesses the variable n.
The memory location of n for thread t is &n^t, thread t' uses &n^{t'}. Since n
is a shared variable, the reference of both threads is the same, namely &n^t = &n^{t'}.
Now, we consider the code where we split s into s1 and s2. This code is semantically
equivalent to s but differs according to the number of critical references.

s^t_1 = a^t ← n^t              s^{t'}_1 = a^{t'} ← n^{t'}
s^t_2 = n^t ← a^t + 1          s^{t'}_2 = n^{t'} ← a^{t'} + 1
Just to be clear, the LCR property does not ensure the correct evaluation of
the sequence; it only ensures that this abstraction shows the same behavior as an
architecture with an atomic load and store. If we abstract s1; s2 by interleavings,
then we get different results for n due to the fact that n is critical. The following
table shows that only two of the six possible interleavings result in the correct
outcome of two increment operations, namely n + 2. The remaining four
interleavings all end up setting n to the value n + 1. The column titled a^0 shows the
value of the variable a that belongs to thread 0, whereas a^1 belongs to thread 1.
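For completeness, the six interleavings can be tabulated as follows (our reconstruction, assuming the shared variable holds the value n before both increments):

Interleaving                      a^0     a^1     final n
s^0_1 s^0_2 s^1_1 s^1_2           n       n+1     n+2
s^0_1 s^1_1 s^0_2 s^1_2           n       n       n+1
s^0_1 s^1_1 s^1_2 s^0_2           n       n       n+1
s^1_1 s^0_1 s^0_2 s^1_2           n       n       n+1
s^1_1 s^0_1 s^1_2 s^0_2           n       n       n+1
s^1_1 s^1_2 s^0_1 s^0_2           n+1     n       n+2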
s^0_1 ≺ s^0_8 ,   s^0_1 ≺ s^1_8 ,   and   s^0_8 ≺ s^1_1
I(P, q, p) = { I | I = (s^t_i) with i ∈ {1, . . . , q}, t ∈ {0, . . . , p−1}, and s^t_i ≺ s^t_j for i < j }

|I(P, q, p)| := (q · p)! / (q!)^p
Proof. We show this by induction over q, the number of straight-line code state-
ments. Assume we only have one statement in the parallel region. The parallel
region is executed by p threads resulting in p statements inside of a possible inter-
leaving. The p statements can be scheduled in any order. Therefore, we have p!
possible interleavings.
Assuming that the number of possible interleavings for (q − 1) statements in a
parallel region executed by p threads is

((q − 1) · p)! / ((q − 1)!)^p ,        (2.1)

then we have to show that the number of possible interleavings is

(q · p)! / (q!)^p
when the number of statements in the parallel region is q. A parallel region with q
statements executed by p threads results in q · p statements. We can illustrate this
by an urn problem with q·p balls, where there are q balls of each of the p colors.
Therefore, we have qp possibilities to choose the first statement in the interleaving,
qp − 1 possibilities to choose the second statement, and so forth. For the p-th
statement we have qp − p + 1 possibilities. The fact that the statements come from
a group of p threads, or, in terms of the urn problem, that there are p different
colors, must be taken into account. Hence, the number of possibilities to choose
the first p statements is

( qp · (qp − 1) · . . . · (qp − p + 1) ) / ( q · q · . . . · q )        (p factors of q in the denominator).
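As a quick plausibility check (our own remark, not part of the original proof): for q = 2 statements and p = 2 threads the formula yields (2 · 2)!/(2!)^2 = 24/4 = 6, which matches the six interleavings of the two-statement, two-thread increment example discussed in Example 13.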
To get familiar with the above formalism, we present in the following several
examples where we apply our notation to some OpenMP code examples. These
examples contain different OpenMP constructs such as the barrier or the master
construct. However, these examples serve only as a proof of concept; from the
next section on, we consider parallel regions that consist only of code without
further pragmas. The barrier and master constructs are the topic of Chapter 4.
The OpenMP standard provides a loop construct for sharing loop iterations
among a group of threads, see Section 1.3. To see the difference between a regu-
lar loop statement and a worksharing loop, let us consider the following example
where a loop consists of three iterations.
s2 s2 s2 s3   (statements of thread 0)
s^0_1 s^1_1   (statements already executed)
s2 s2 s2 s3   (statements of thread 1)

Above we see that thread 0 and thread 1 each have three instances of statement s2
waiting for the scheduler to be put into execution. The first statement of each
thread (s^0_1 and s^1_1) has already been executed at the current point in time. Next,
we show the same code structure but with a preceding worksharing construct.
the same code structure but with a preceding worksharing construct.
omp_set_num_threads ( 2 ) ;
#pragma omp p a r a l l e l
{
s1
#pragma omp f o r
f o r ( i ← 0 ; i <3; i ← i +1)
s2
s3
}
s2 s2 s3   (statements of thread 0)
s^0_1 s^1_1   (statements already executed)
s2 s3   (statements of thread 1)
The worksharing construct instructs the runtime environment to distribute all the
loop iterations among the two threads. Therefore, the three iterations are dis-
tributed to the two threads in chunks of a certain size. This size can be influenced
via the schedule clause, but we do not consider this possibility here. Without the
schedule clause, the standard procedure is static, which means that the chunks are
approximately equal in size. For our example the chunks are such that thread 0
gets two instances of s2 and thread 1 gets the remaining third instance.
are not possible because s1 is executed before the other thread has finished s0, and
this is a contradiction to the definition of a barrier, which ensures that all threads
have finished their statements preceding the barrier. Formally, we express this
by writing s^1_1 ≺ s^0_0 for the situation in (2.2) and analogously s^0_1 ≺ s^1_0 for (2.3).
This shows that synchronization reduces the number of possible interleavings. In
practice this fact may lead to lower performance at runtime.
Example 16 (Master Synchronizing). In Section 1.3 we saw that one can define
a region that is only executed by the master thread. This is the thread with the
identification number 0 in the current group of threads. Consider the following
code as an example where we again have two straight-line code statements.
...
s0
#pragma omp master
{
  s1
}
...
As mentioned in OpenMP 3.1 [63, p. 68], the master construct does not have an
implicit barrier. This is reflected in (2.4) where thread 1 executes s0 after the
master thread finishes s1. This is formally expressed by s^0_1 ≺ s^1_0 and would not be
possible if an implicit barrier concluded the block associated with the master
construct.
For example, let us consider two instances of a statement s that are executed by
thread t and thread t'. To express that the left-hand side reference in thread t does
not occur on the right-hand side of t', we write

LHSREF(s^t) ∩ RHSREF(s^{t'}) = ∅ .
The set of integer variables of P is denoted by INT_P, the set of floating-point
variables by FLOAT_P. A comma-separated list of variables is denoted by list, while
v ∈ list means that v occurs in the comma-separated list, despite the fact that this
is actually not a set of variables. For example, the notation FLOAT_list stands for
the set of floating-point variables occurring in the comma-separated list.
A parallel region P that serves as input for a source transformation is assumed to
be correct in the sense that for the same given input values all possible interleavings
result in the same output values. This behavior must be proved for P', the result
of the source transformation of P. We will prove this by showing the absence
of any data dependencies inside the set of possible interleavings I(P', q, p)
while assuming that P has no data dependencies. In other words, we show the
correctness of the code P' by assuming that the parallel region P is correct. More
formally, this means we choose an arbitrary interleaving I ∈ I(P', q, p). Each
consecutive pair of statements (s^t_i, s^{t'}_j) ∈ I with t ≠ t' and i, j ∈ {1, . . . , q} must
be examined for whether there is a data dependence or not. Since we know that
P does not have any data dependencies, we only have to show that our source
transformation does not introduce a new data dependence. But we will see that the
adjoint source transformation in certain cases introduces a new data dependence
that leads to a race condition at runtime. This race condition must be resolved
appropriately as shown later in this work.
In this context it must be guaranteed that assignments as well as evaluations of
conditions in control statements do not contain any critical references. Otherwise,
a race condition is given at runtime which leads to a nondeterministic result. We
want to define the absence of a critical reference as noncritical. According to
Definition 34 we define a critical parallel region as follows.
P is critical if there exists a pair (s^t_i, s^{t'}_j) of consecutive statements with t ≠ t' that
is part of an interleaving I ∈ I(P, q, p) and for this consecutive pair of statements
in I it holds that

&v ∈ LHSREF(s^t_i) ∧ &v ∈ REF(s^{t'}_j).        (2.5)
The condition (2.5) can be expressed in words by saying that statement s^t_i performs
a store operation to a memory location that is simultaneously used by thread
t' in s^{t'}_j. Please note that neither the term critical parallel region nor the term
critical reference &v implies that the concurrent execution is wrong. A
nondeterministic behavior only occurs when there is a situation as described in
Definition 37 and, besides the interleaving I ∈ I(P, q, p) with (s^t_i, s^{t'}_j) ∈ I, there
exists another interleaving I' ∈ I(P, q, p) with (s^{t'}_j, s^t_i) ∈ I'.
Obviously, we do not want to obtain a critical parallel region P' when transforming
the source code of a noncritical parallel region P. P is called noncritical if
for all consecutive pairs of statements (s^t_i, s^{t'}_j), executed by the two different
threads t and t', and contained in an arbitrary interleaving I out of the set of
possible interleavings I(P, q, p), the following holds: an occurrence of reference &v
on the left-hand side of s^t_i implies that this reference is not contained in s^{t'}_j.

∀ I ∈ I(P, q, p)  ∀ (s^t_i; s^{t'}_j) ∈ I, t ≠ t' :
&v ∈ LHSREF(s^t_i)  ⟹  &v ∉ REF(s^{t'}_j),        (2.6)
When we negate the condition in (2.5), we can rewrite it as an implication:

¬( &v ∈ LHSREF(s^t_i) ∧ &v ∈ REF(s^{t'}_j) )
⇔ ¬( &v ∈ LHSREF(s^t_i) ) ∨ ¬( &v ∈ REF(s^{t'}_j) )
⇔ ¬( &v ∈ LHSREF(s^t_i) ) ∨ &v ∉ REF(s^{t'}_j)
⇔ &v ∈ LHSREF(s^t_i) ⟹ &v ∉ REF(s^{t'}_j).
Example 18. We show that the code in Example 4 on page 35 is noncritical.
According to Lemma 38, we have to consider all possible pairs (s^t_i, s^{t'}_j) of
consecutive statements in all possible interleavings I ∈ I(P, q, p). Since the only
shared variables are the pointers x and y, we only consider the two floating-point
assignments in line 17 and line 18:

17   y_i ← 2*x_i*x_i;
18   x_i ← 0.;

We label these two statements with the corresponding line number in the code
sample. Therefore, we consider pairs of possible combinations of s17 and s18. An
important fact is that i has a unique value range due to the data decomposition
that takes place before the loop.

1. Case (s^t_17, s^{t'}_18):
Since s^{t'}_18 has a constant on its right-hand side, there is no conflict with shared
memory locations.

2. Case (s^t_18, s^{t'}_17):
Since i has different values in thread t and in thread t', we conclude that
&x^t_i ≠ &y^{t'}_i and &x^t_i ≠ &x^{t'}_i holds. Please note that this even holds for x and
y pointing to the same address.

3. Case (s^t_17, s^{t'}_17):
&y^t_i ∉ {&y^{t'}_i, &x^{t'}_i} is obtained by reasoning analogously to case 2.

4. Case (s^t_18, s^{t'}_18):
Analogously, we obtain &x^t_i ≠ &x^{t'}_i because i has different values in thread
t and in t'.

Since none of the above combinations leads to a critical reference, we conclude
that the parallel region in Example 4 is noncritical.
Example 19. We show that the code in Example 5 is critical. As in the previous
example, we consider only the floating-point statements

5   y_i ← 2*x_i*x_i;
6   x_i ← 0.;

since these statements contain the shared references x and y. In contrast to Example
4, i is not unique since there is no data decomposition preceding the loop.
Instead, i has the same value range in all threads. This indicates that both
statements are critical, which can be shown by considering the combinations where two
different threads execute the statements simultaneously. In our context, we show
this with the help of the possible interleavings I ∈ I(P, q, p).

1. Case (s^t_5, s^{t'}_6):
s^t_5 has the left-hand side reference &y^t_i. This reference is only used in s^{t'}_6 if
the pointers x and y point to the same memory location (&y^t_i = &x^{t'}_i). If this is
the case, then &x^{t'}_i is a critical reference.

2. Case (s^t_6, s^{t'}_5):
The left-hand side of s^t_6, namely &x^t_i, is contained in s^{t'}_5 since thread t and
thread t' use the same offset value i. Therefore, &x^t_i is a critical reference.

3. Case (s^t_5, s^{t'}_5):
&y^t_i is critical since it occurs on the left-hand side of s^t_5 and also in s^{t'}_5.

4. Case (s^t_6, s^{t'}_6):
Even this situation reveals a critical reference &x^t_i, but there is no wrong
effect visible since both threads assign the constant value zero to the same
memory location.

This shows that the parallel region in Example 5 is critical.
2.2 SPL - A Simple Language for Parallel Regions
1. α : SPL → SPL holds such that the code α(P) is again in SPL, and
In the following we will introduce the SPL language and its syntax. Subsequently,
Section 2.3 presents the source transformations that provide the tangent-linear and
the adjoint model of a given parallel region P written in SPL code.
SPL provides support for stack operations, where a stack is a data structure
implemented as usual with a last-in, first-out (LIFO) scheme. The common
methods push(), pop(), and top() can be used to insert, remove, or access the data
on the top of the stack. SPL is an extension of SL [57, ch. 4], which is the input
and output language of dcc, our derivative code compiler. One extension is that
statements associated with a stack operation are represented by an abstract
statement instead of using a static C/C++ array as the stack implementation. This
later allows a customized implementation of these stacks. The second extension is that
we allow assignments shaped like y +← e instead of y ← y + e, where y is a scalar
variable and e is an expression. This extension becomes clear as soon as we know
that we need the atomic construct to achieve the closure property of the adjoint
source transformation of a general OpenMP parallel region. Until OpenMP 3.0,
the atomic construct was only defined for assignments shaped as y +← e and we
do not want to exclude potential codes that are written in an OpenMP version lower
than 3.1.
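As an illustration of this restricted form (our own C++ sketch, not from the OpenMP document; the function and variable names accumulate, y, e, and n are hypothetical), the atomic construct protects an update of the shape y +← e:

#include <omp.h>

// Minimal sketch: the atomic construct accepts the restricted update form
// '*y += e[i]' (y +<- e in SPL notation), which is why SPL models such
// assignments explicitly.
void accumulate(double* y, const double* e, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        #pragma omp atomic
        *y += e[i];
    }
}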
Table 2.1 gives an overview of the set of terminal symbols contained in SPL.
Please note that we have to encode variables such as x^(1) or x_(1) in some way
that makes them representable with ASCII characters only. Our AD tool does
this by defining a prefix 't1_' or 'a1_'. This means a variable called a1_x
represents the adjoint component x_(1) of a variable x.
| if (b) { S }
| while (b) { S }
}
The following example shows the code from Example 4 rewritten into SPL code.
Example 20. In Example 4 we saw a typical pattern in parallel programming
where we partition a data set into chunks and let each thread process a certain
chunk. This code is shown here expressed as SPL code.
1  #pragma omp parallel
2  {
3    int i ← 0;
4    int tid ← 0;
5    int lb ← 0;
6    int ub ← 0;
7    int chunk_size ← 0;
8    tid ← omp_get_thread_num();
9    p ← omp_get_num_threads();
10   chunk_size ← n/p;
11   i ← chunk_size*p;
12   if (i ≠ n)
13   { chunk_size ← chunk_size+1; }
14   lb ← tid*chunk_size;
15   ub ← (tid+1)*chunk_size−1;
16   if (ub ≥ n)
17   { ub ← n−1; }
18   i ← lb;
19   while (i ≤ ub) {
20     y_i ← 2*x_i*x_i;
21     x_i ← 0;
22     i ← i+1;
23   }
24 }
The difference between this code and the one from Example 4 is quite small.
Instead of the counting loop, the current implementation uses a while-loop for
processing the array.
Since we will very often speak about instances of assignments that are part of
a concurrent execution, we display an example to present the notation. Let us
consider two assignments that are executed by thread t and thread t'. We note this
by writing

s^t_i = y^t_i ← φ(x^t_{i,1}, . . . , x^t_{i,n})

and

s^{t'}_j = y^{t'}_j ← φ(x^{t'}_{j,1}, . . . , x^{t'}_{j,n}),

which means that both right-hand sides consist of an intrinsic function φ with n
arguments. The symbol φ represents some intrinsic function, not necessarily the
same in both statements. The evaluated value of φ is assigned to the left-hand side
reference y. The identifiers x and y should only be considered as symbol names
declaring on which side of an assignment these symbols occur. The reference y^t_i
need not be connected with y^{t'}_j just because they have the same name. This kind
of relation is expressed by an equation that defines the memory locations of two
references as equal. For example, to express that the left-hand side reference of s^t_i
also occurs on the right-hand side of s^{t'}_j, we write

&y^t_i = &x^{t'}_{j,k},   k ∈ {1, . . . , n}.

Each read access of a memory reference is assumed to be defined in the sense that
each memory reference is initialized by an assignment before the first read access
happens.
We introduced the mappings of a statement to its left-hand side, right-hand side,
and conditional references in the previous section. Now, after defining SPL, we
show the actual definitions of these mappings. Let us consider an instance s^t of
statement s ∈ STMTS that is executed by thread t. The set that contains the left-
hand side reference is defined as

LHSREF(s^t) :=  { &y^t }   if s = y ← φ(x1, . . . , xn)
                ∅          otherwise
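For instance (our illustration, not an example from the original definitions), for an assignment s given by y ← sin(x1) · x2 and executed by thread t, we have LHSREF(s^t) = { &y^t }, whereas for a statement without such a left-hand side, for example a stack operation, LHSREF yields the empty set.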
where ⟨∇φ(x), x^(1)⟩ is the inner product ∇φ(x) · x^(1), and x, x^(1) ∈ R^n.
In [29] as well as in [57] the authors assume that the numerical program is given
by a single assignment code (SAC), and they show the corresponding tangent-linear
model [57, ch. 2]. In our case, we similarly have to show that the result
of the transformation τ(P) implements the tangent-linear model of F.
The following proposition shows that the source transformation τ (P) provides
the tangent-linear model of P.
Proposition 41. Given a parallel region code P ∈ SPL that implements a
mathematical function F : R^n → R^m, then the source transformation τ(P) implements
F^(1).
Proof. We have to show that τ(P) implements the tangent-linear model F^(1) of F
with
F^(1)(x, x^(1)) ≡ ∇F(x) · x^(1) ∈ R^m.
This means the sequence of r statements provides the m output values of F(x).
Since the interleaving does not only contain assignments but also statements where
a test expression is evaluated, we define r' to be the number of assignments
contained in the interleaving I, where r' ≤ r. In general there are more than m
assignments involved in the computation of the m output values; therefore it holds
that r' ≥ m.
We conclude that there is a sequence of r' floating-point assignments that
determines the m output values F(x). Suppose that the following sequence of
assignments y_1, . . . , y_r' ∈ SLC reflects the computation of the outputs. Which
variable y_j, j ∈ {1, . . . , r'}, determines which output value is defined by the
implementation. The computation expressed in r' assignments can be expressed
as a SAC:

y_1 ← φ^1(x_1, . . . , x_n)
y_2 ← φ^2(x_1, . . . , x_n)
...
y_r' ← φ^r'(x_1, . . . , x_n),

The φ^1, . . . , φ^r' are intrinsic scalar functions.
In the following we express the value propagation from the input values to the
output values by a program state transformation. Each program state consists of
n + r' components where the first n components represent the n input values. The
remaining r' components represent the results of the r' assignments y_1, y_2, . . . , y_r'.
The first program state is the one where only the first n input values are set. The
program state transformation is performed by Φ^i : R^{n+r'} → R^{n+r'} with

(Φ^i(z))_j :=  φ^i(x_1, x_2, . . . , x_n)   if i = j
               z_j                          otherwise,        (2.28)

where j = 1, . . . , n + r' and z = (x_1, . . . , x_n, y_1, . . . , y_r') ∈ R^{n+r'}. The function Φ^i
takes the current program state z_{i−1} and provides the new program state z_i, where
the state only differs in the i-th component. Hence, the computation displayed as a
sequence of state transformations is

z_1 ← Φ^1(z_0)
z_2 ← Φ^2(z_1)
...
z_r' ← Φ^r'(z_{r'−1}).

The state z_0 is the initial state where only the input values are set. The final state z_r'
reflects all the output values after executing the whole computation. Therefore, all
the output values of F(x) are contained in the vector z_r', which we denote by F(x) ⊂
z_r'. All the intermediate states of the computation are indicated by the sequence
z_1, z_2, . . . , z_{r'−1}. When Φ^i ∘ Φ^{i−1} denotes the consecutive application of Φ^i after
Φ^{i−1}, we can write the computation as the consecutive application of Φ^{r'}, . . . , Φ^1 and
where i ∈ {1, . . . , q}, t ∈ {0, . . . , p − 1}, that represents the same order of
computations of y_1, . . . , y_r' as in the interleaving I. The number of statements q' in the
tangent-linear model is obviously bigger than the number of statements q in the
parallel region P. By applying the source transformation rule (2.17) to I', we obtain

y^(1)_1 ← ⟨ ∇φ^1(x_1, . . . , x_n), (x^(1)_1, . . . , x^(1)_n)^T ⟩
y_1 ← φ^1(x_1, . . . , x_n)
y^(1)_2 ← ⟨ ∇φ^2(x_1, . . . , x_n), (x^(1)_1, . . . , x^(1)_n)^T ⟩
y_2 ← φ^2(x_1, . . . , x_n)
...
y^(1)_r' ← ⟨ ∇φ^r'(x_1, . . . , x_n), (x^(1)_1, . . . , x^(1)_n)^T ⟩
y_r' ← φ^r'(x_1, . . . , x_n) .
where (x^(1)_1, . . . , x^(1)_n)^T denotes the column vector with the components
x^(1)_1, x^(1)_2, . . . , x^(1)_n.
0 0
results in a matrix A ∈ R(n+r )×(n+r ) where we only consider the lines which define
the m output values. Without loss of generality, we assume that the last m assign-
ments yr0 −m+1 , yr0 −m+2 , . . . , yr0 define the m output values. Therefore, we consider
a submatrix inside of A, namely the last m lines and the first n columns. This sub-
matrix contain exactly the components of the Jacobian matrix ∇F(x). The first n
(1) (1) (1)
components in z0 are x1 , . . . , xn . Thus, the submatrix and the first n compo-
(1) (1) (1) (1)
nents of z0 define the tangent-linear values yr0 −m+1 , yr0 −m+2 , . . . , yr0 where
(1)
yr0 −m+1
(1)
yr0 −m+2
(1)
.. ≡ ∇F(x) · x .
.
(1)
yr0
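In practice (our remark, not part of the proof), the full Jacobian ∇F(x) can be accumulated with this scalar tangent-linear code by evaluating it n times, seeding x^(1) with the Cartesian basis vectors e_1, . . . , e_n; each evaluation yields one column ∇F(x) · e_j of the Jacobian.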
Example 21. We apply the source transformation τ to the code from Example 20.
The resulting code is the following.
1  #pragma omp parallel
2  {
3    int i ← 0;
4    int tid ← 0;
5    int lb ← 0;
6    int ub ← 0;
7    int chunk_size ← 0;
8    tid ← omp_get_thread_num();
9    p ← omp_get_num_threads();
10   chunk_size ← n/p;
11   i ← chunk_size*p;
12   if (i ≠ n)
13   { chunk_size ← chunk_size+1; }
14   lb ← tid*chunk_size;
15   ub ← (tid+1)*chunk_size−1;
16   if (ub ≥ n)
17   { ub ← n−1; }
18   i ← lb;
19   while (i ≤ ub) {
20     y^(1)_i ← 2*x^(1)_i*x_i + 2*x_i*x^(1)_i;
21     y_i ← 2*x_i*x_i;
22     x^(1)_i ← 0;
23     x_i ← 0;
24     i ← i+1;
25   }
26 }
The derivative code looks quite similar to the input code except for the two
additional assignments in line 20 and line 22. The assignment in line 20 represents the
tangent-linear model of the assignment in line 21. Analogously, the tangent-linear
model of the assignment in line 23 is shown in line 22.
We finish this section with a lemma that shows that the source transformation τ (P)
fulfills the interface requirements that we formulated in (2.8). This equivalence
will play an important role when we prove that the parallel region of the tangent-
linear model is noncritical.
Lemma 42. Suppose u, v ∈ FLOAT_P are two variables occurring in a parallel
region P ∈ SPL, and τ(P) is the tangent-linear model of P. Then it holds for the
variables u, v, u^(1), v^(1) ∈ τ(P) that

&u = &v  ⟺  &u^(1) = &v^(1).        (2.31)
Proof. In case that u, v are defined outside of P, the equivalence is given because
we determined the interface in such a way, see (2.8). Otherwise, the definitions of
u, v are contained in P and the association by name is achieved by the transforma-
tion rules (2.14) and (2.15). Assume that a variable v ∈ FLOAT_{τ(P)} is defined in
the sequence of definitions D contained in τ(P). This definition is unique since
in SPL we only allow definitions on the scope level of the parallel region and not
below, as it is possible in C/C++. However, the definition of v inside of the tangent-
linear code is always connected with the definition of the corresponding derivative
component v^(1). Thus, the equivalence in (2.31) holds.
This section introduced the source transformation τ (P) of a given parallel region
P. τ (P) represents the tangent-linear model of P as we showed in the proof of
Proposition 41. The next section introduces the source transformation σ (P) that
provides the adjoint model of P.
σ( double x ← R ) :=  double x ← R;
                      double x_(1) ← 0.;        (2.37)
The forward section or augmented forward section of the adjoint code is provided
by the transformation σ(S) in (2.32). As you will see, the forward section is
quite similar to the code in P, but augmented by statements for storing the control
flow of the current execution and by statements for storing values that are being
overwritten. The data structure for storing these values is the stack STACK_(1)c in
case of the control flow; floating-point values are stored in STACK_(1)f, and integer
values are put into STACK_(1)i. The reversal of the control flow during the reverse
section is done by consuming all the labels contained in STACK_(1)c. The floating-point
as well as the integer values that are overwritten during the forward section must be
stored in order to allow the reverse section to regain these values. In (2.32), the
reverse section is represented as

while not STACK_(1)c.empty() { ρ( S ) } ,

where the loop is responsible for consuming all the labels from the control flow
stack STACK_(1)c. Inside the loop, ρ( S ) emits a sequence of branching statements
in order to execute the adjoint statements that correspond to the current label.
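For a concrete picture of these data structures, the following is a minimal C++ sketch (ours, not part of SPLc); the class and instance names are illustrative, and in the OpenMP setting each thread would hold its own private instances so that the forward and reverse sections of different threads do not interfere.

#include <vector>

// LIFO stack with the push/pop/top/empty interface used by the
// transformation rules of the augmented forward and the reverse section.
template <typename T>
class Stack {
    std::vector<T> data_;
public:
    void push(const T& v) { data_.push_back(v); }
    void pop()            { data_.pop_back(); }
    const T& top() const  { return data_.back(); }
    bool empty() const    { return data_.empty(); }
};

// Illustrative per-thread instances:
// Stack<int>    stack_1c;  // control flow labels      (STACK_(1)c)
// Stack<double> stack_1f;  // overwritten float values (STACK_(1)f)
// Stack<int>    stack_1i;  // overwritten int values   (STACK_(1)i)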
For transforming a sequence of statements, we distinguish between the cases
where the whole sequence is SLC code or not. In case that there is a conditional
statement or a loop statement, the sequence is split up into parts:
1. If S = (s1; . . . ; s_j; . . . ; s_q) with s_j being a control flow statement:

σ( s1; . . . ; s_q ) :=  σ( s1; . . . ; s_{j−1} );
                         σ( s_j );
                         σ( s_{j+1}; . . . ; s_q );        (2.38)

The subsequences σ( s1; . . . ; s_{j−1} ) and σ( s_{j+1}; . . . ; s_q ) are provided as a
sequence, unlike the argument of σ( s_j ), which is a statement. This fact is
important even when only one statement succeeds the control flow statement. For
instance, in case of j + 1 = q, only one SLC statement follows the control flow
statement. Nevertheless, this final statement is first transformed as a sequence with
σ( (s_q) ), and through the recursive definition, σ is applied to the single statement
per σ( s_q ).
2. If S = (s1; . . . ; s_q) ∈ SLC, this means there is no control flow statement
present:

σ( s1; . . . ; s_q ) :=  STACK_(1)c.push(l);
                         σ(s1);
                         ...
                         σ(s_q);                            (2.39)

where l = LABEL(s1).
The individual statement transformations are:

σ( y ← φ(x1, . . . , xn) )   :=  STACK_(1)f.push(y);
                                 y ← φ(x1, . . . , xn);                     (2.40)

σ( y +← φ(x1, . . . , xn) )  :=  STACK_(1)f.push(y);
                                 y +← φ(x1, . . . , xn);                    (2.41)

σ( y ← STACKf.top() )        :=  STACK_(1)f.push(y);
                                 y ← STACKf.top();                          (2.42)

σ( if (b) { S' } )           :=  if (b) { σ( S' ) }                         (2.43)

σ( while (b) { S' } )        :=  while (b) { σ( S' ) }                      (2.44)

σ( a ← e )                   :=  STACK_(1)i.push(a);
                                 a ← e                                      (2.45)

σ( STACK(f|i|c).push(v) )    :=  STACK(f|i|c).push(v),   v ∈ N ∪ VAR_P      (2.46)

σ( STACK(f|i|c).pop() )      :=  STACK(f|i|c).pop()                         (2.47)
Let us now switch to the transformation ρ that generates the code for computing
the adjoint values. As with the transformation σ , we have two different cases
depending on whether or not the code contains control flow statements.
1. If S = (s1; . . . ; s_j; . . . ; s_q), with s_j being a control flow statement:

ρ( s1; . . . ; s_q ) :=  ρ( s1; . . . ; s_{j−1} );
                         ρ( s_j );                                          (2.48)
                         ρ( s_{j+1}; . . . ; s_q );

2. If S = (s1; . . . ; s_q) ∈ SLC:

ρ( s1; . . . ; s_q ) :=  if STACK_(1)c.top() = l {
                           STACK_(1)c.pop();
                           ρ(s_q);
                           ...
                           ρ(s1);
                         }                                                  (2.49)

where l = LABEL(s1). The reader should notice the reversed order of the
statements³ as opposed to the order used in the argument.
We conclude the definition of the transformation for the reverse section with the
transformations of the individual statements.
3 These statements are actually assignments or stack operations due to the SLC characteristics of this
sequence.
ρ( y +← φ(x1, . . . , xn) )  :=  y ← STACK_(1)f.top();
                                 STACK_(1)f.pop();
                                 x_(1)1 +← φ_x1(x1, . . . , xn) · y_(1);
                                 ...
                                 x_(1)n +← φ_xn(x1, . . . , xn) · y_(1);    (2.51)

ρ( y ← STACKf.top() )        :=  y ← STACK_(1)f.top();
                                 STACK_(1)f.pop();
                                 y_(1) ← 0;                                 (2.52)

ρ( if (b) { S' } )           :=  ρ( S' )                                    (2.53)

ρ( while (b) { S' } )        :=  ρ( S' )                                    (2.54)

ρ( a ← e )                   :=  a ← STACK_(1)i.top();
                                 STACK_(1)i.pop();                          (2.55)

ρ( STACK(f|i|c).push(c) )    :=  ε                                          (2.56)

ρ( STACK(f|i|c).pop() )      :=  ε,                                         (2.57)
With ε we denote the empty word, which means that the rules (2.56) and
(2.57) do not emit any code. The next proposition shows that the above source
transformation provides code that implements the adjoint model.
Proposition 43. Given a parallel region code P ∈ SPL that implements a
mathematical function F : R^n → R^m, then the source transformation σ(P) implements
the adjoint model F_(1).
where i ∈ {1, . . . , q} and t ∈ {0, . . . , p − 1}. This computation supplies the m
output values of F(x) with r' < r assignments. These r' assignments can be expressed
as a SAC:

y_1 ← φ^1(x_1, . . . , x_n)
y_2 ← φ^2(x_1, . . . , x_n)
...
y_r' ← φ^r'(x_1, . . . , x_n) .
Each assignment belongs to a certain basic block contained in the parallel region
P. We assume without loss of generality that each assignment is contained in a
different basic block. Therefore, each assignment is tracked by a push operation
to the control flow stack during the forward section. In practice this usually is not
the case, and we only need one push operation per basic block to trace the flow of
control.
STACK_(1)c.push(1);
STACK_(1)f.push(y1);
y1 ← φ^1(x1, . . . , xn);
STACK_(1)c.push(2);
STACK_(1)f.push(y2);
y2 ← φ^2(x1, . . . , xn);
...
STACK_(1)c.push(r');
STACK_(1)f.push(y_r');
y_r' ← φ^r'(x1, . . . , xn);

The stack for the control flow contains at the end of the forward section the labels
r', r' − 1, . . . , 1 (r' at the top of the stack) and the floating-point stack contains the
values y_r', y_{r'−1}, . . . , y1 (y_r' at the top of the stack). The reverse section looks as
follows.
    y2 ← STACK_(1)f.top();
    STACK_(1)f.pop();
    x_(1)1 +← φ_x1(x1, . . . , xn) · y_(1)2;
    ...
    x_(1)n +← φ_xn(x1, . . . , xn) · y_(1)2;
    y_(1)2 ← 0;
  }
  ...
  if (STACK_(1)c.top() = r') {
    STACK_(1)c.pop();
    y_r' ← STACK_(1)f.top();
    STACK_(1)f.pop();
    x_(1)1 +← φ_x1(x1, . . . , xn) · y_(1)r';
    ...
    x_(1)n +← φ_xn(x1, . . . , xn) · y_(1)r';
    y_(1)r' ← 0;
  }
}
The n adjoint assignments in branch i with x_(1)1, . . . , x_(1)n on their left-hand side
can be interpreted as a product of the transposed gradient of φ(x1, . . . , xn) with the
scalar y_(1)i. Therefore, we rewrite the n assignments into one assignment using a
vector notation. We obtain

  ( x_(1)1, x_(1)2, . . . , x_(1)n )^T ← ( x_(1)1, x_(1)2, . . . , x_(1)n )^T + (∇φ(x1, . . . , xn))^T · y_(1)1
  y_(1)1 ← 0;
}
if (STACK_(1)c.top() = 2) {
  STACK_(1)c.pop();
  y2 ← STACK_(1)f.top();
  STACK_(1)f.pop();
  ( x_(1)1, . . . , x_(1)n )^T ← ( x_(1)1, . . . , x_(1)n )^T + (∇φ(x1, . . . , xn))^T · y_(1)2
  y_(1)2 ← 0;
}
...
if (STACK_(1)c.top() = r') {
  STACK_(1)c.pop();
  y_r' ← STACK_(1)f.top();
  STACK_(1)f.pop();
  ( x_(1)1, . . . , x_(1)n )^T ← ( x_(1)1, . . . , x_(1)n )^T + (∇φ(x1, . . . , xn))^T · y_(1)r'
  y_(1)r' ← 0;
}
}
As in the proof of Proposition 41, we use the extended function Φ to express the
computation as a sequence of program state transformations:
z_1 ← Φ^1(z_0)
z_2 ← Φ^2(z_1)
...
z_r' ← Φ^r'(z_{r'−1})        (2.58)

The forward section ends up in the same program state as the original code, but in
addition the stacks contain the following information, where the topmost element
on the stack is the leftmost element in the tuple notation:

STACK_(1)c = (r', r' − 1, . . . , 2, 1)
STACK_(1)f = (y_r', y_{r'−1}, . . . , y2, y1).
This means r' is the label on the top of the control flow stack. Therefore, the test
expression that is valid during the first iteration of the reverse section is the one
that corresponds to the r'-th assignment. The second iteration of the reverse section
enters the branch where the test for r' − 1 is valid because this label is on top of
the stack STACK_(1)c, and so forth. Therefore, we can express the computation of
the reverse section by the following program transformations where we again use
the notation with the transposed Jacobian.

z_(1)r'−1 ← (∇Φ^r'(z_{r'−1}))^T · z_(1)r'        (2.59)
z_(1)r'−2 ← (∇Φ^{r'−1}(z_{r'−2}))^T · z_(1)r'−1
...
z_(1)0 ← (∇Φ^1(z_0))^T · z_(1)1
Please note that the transposed Jacobian in the i-th adjoint assignment is evaluated
with the program state z_{r'−i} ≡ Φ^{r'−i} ∘ Φ^{r'−i−1} ∘ . . . ∘ Φ^1(z_0), but the current program
state at this adjoint assignment is actually z_{r'−i+1}. For example, when the execution
reaches the end of the forward section and the assignment (2.58) is performed,
this leads to the program state z_r'. Afterwards, the reverse section starts with
executing (2.59). There, the current program state is still z_r', not z_{r'−1}. The two states
differ only in the component for y_r'. Hence, we restore the program state z_{r'−1} with
the help of STACK_(1)f.
By recursively replacing the adjoint program states we obtain

z_(1)0 ≡ (∇Φ^1(z_0))^T · (∇Φ^2(z_1))^T · . . . · (∇Φ^r'(z_{r'−1}))^T · z_(1)r'.        (2.60)
The chain of matrix multiplications can be evaluated where the result is denoted
by A:

A ≡ (∇Φ^1(z_0))^T · (∇Φ^2(z_1))^T · . . . · (∇Φ^r'(z_{r'−1}))^T.

Without loss of generality we assume that the m values y_{r'−m+1}, y_{r'−m+2}, . . . , y_{r'}
determine the output values of the computation. We extract a submatrix of A by
only considering the first n lines and the last m columns of A. This submatrix
represents the values of the Jacobian ∇F(x) at point x. The last m values in z_(1)r',
namely y_(1)r',r'−m+1, y_(1)r',r'−m+2, . . . , y_(1)r',r', are the initial adjoint values defined
by the user of the adjoint code, see (2.59). The first n values in z_(1)0, namely
x_(1)0,1, x_(1)0,2, . . . , x_(1)0,n, are the adjoint values that the adjoint code provides as
output, see (2.60). Therefore, we can rewrite the adjoint computation as the
following matrix-vector product

( x_(1)0,1, x_(1)0,2, . . . , x_(1)0,n )^T ≡ ∇F(x)^T · ( y_(1)r',r'−m+1, y_(1)r',r'−m+2, . . . , y_(1)r',r' )^T
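Analogously to the tangent-linear case (our remark, not part of the proof), seeding the adjoint outputs y_(1) with the Cartesian basis vectors e_1, . . . , e_m and evaluating the adjoint code m times yields the rows of ∇F(x); this is particularly attractive when m is much smaller than n, for example for the gradient of a scalar-valued function (m = 1).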
To get familiar with the adjoint source transformation, we show the adjoint
model of the same example code that we used for explaining the tangent-linear source
transformation.
Example 22. This example shows the application of the reverse mode to the code
of Example 20. The transformation by σ is as follows, where the indices of the
statements correspond to the line numbers of the code in Example 20.
       (2.38)
σ(S)   =   σ( (s8; s9; s10; s11) )
           σ( s12 )
           σ( (s14; s15) )
           σ( s16 )                                        (2.62)
           σ( (s18) )
           σ( s19 )
To avoid blowing up the example, we show only the first two transformations,
which represent the transformation of an SLC code and of a statement from CFSTMT.
The transformation of the SLC sequence (s8; s9; s10; s11) is:
                          (2.39)
σ( (s8; s9; s10; s11) )   =   STACK_(1)c.push(8);
                              σ(s8);
                              σ(s9);
                              σ(s10);
                              σ(s11);

                          (2.45)
                          =   STACK_(1)c.push(8);
                              STACK_(1)i.push(tid);
                              tid ← omp_get_thread_num();
                              STACK_(1)i.push(p);
                              p ← omp_get_num_threads();
                              STACK_(1)i.push(chunk_size);
                              chunk_size ← n/p;
                              STACK_(1)i.push(i);
                              i ← chunk_size*p;

The transformation of the control flow statement s12 yields:

                          (2.45)
σ( s12 )                  =   if (i ≠ n) {
                                STACK_(1)c.push(13);
                                STACK_(1)i.push(chunk_size);
                                chunk_size ← chunk_size + 1;
                              }
The remaining transformation steps are left to the reader. The forward section
emitted by σ ( S ) looks as follows
1   #pragma omp parallel
2   {
3     int i ← 0;
4     int tid ← 0;
5     int lb ← 0;
6     int ub ← 0;
7     int chunk_size ← 0;
8     STACKc.push(8);
9     STACKi.push(tid);
10    tid ← omp_get_thread_num();
11    STACKi.push(p);
12    p ← omp_get_num_threads();
13    STACKi.push(chunk_size);
14    chunk_size ← n/p;
15    STACKi.push(i);
16    i ← chunk_size*p;
17    if (i ≠ n)
18    {
19      STACKc.push(13);
20      STACKi.push(chunk_size);
21      chunk_size ← chunk_size+1;
22    }
23    STACKc.push(14);
24    STACKi.push(lb);
25    lb ← tid*chunk_size;
26    STACKi.push(ub);
27    ub ← (tid+1)*chunk_size−1;
28    if (ub ≥ n) {
29      STACKc.push(17);
30      STACKi.push(ub);
31      ub ← n−1;
32    }
33    STACKc.push(18);
34    STACKi.push(i);
35    i ← lb;
36    while (i ≤ ub) {
37      STACKc.push(20);
38      STACKf.push(y_i);
39      y_i ← 2*x_i*x_i;
40      STACKf.push(x_i);
41      x_i ← 0;
42      STACKi.push(i);
43      i ← i+1;
44    }
Listing 2.2: The first half of the parallel region consists of the forward section of the
adjoint code where we take the code from Example 20 as input.
       (2.48)
ρ(S)   =   ρ( (s8; s9; s10; s11) )
           ρ( s12 )
           ρ( (s14; s15) )
           ρ( s16 )
           ρ( (s18) )
           ρ( s19 )
                          (2.49)
ρ( (s8; s9; s10; s11) )   =   if STACK_(1)c.top() = 8 {
                                STACK_(1)c.pop();
                                ρ(s11);
                                ρ(s10);
                                ρ(s9);
                                ρ(s8);
                              }

                          (2.55)
                          =   if STACK_(1)c.top() = 8 {
                                STACK_(1)c.pop();
                                i ← STACK_(1)i.top();
                                STACK_(1)i.pop();
                                chunk_size ← STACK_(1)i.top();
                                STACK_(1)i.pop();
                                p ← STACK_(1)i.top();
                                STACK_(1)i.pop();
                                tid ← STACK_(1)i.top();
                                STACK_(1)i.pop();
                              }
We transform ρ(s22) like the previous integer assignments with (2.55). More
interesting are the transformations ρ(s20) and ρ(s21) since the arguments are
floating-point assignments.

          (2.50)
ρ(s21)    =   x_i ← STACKf.top();
              STACKf.pop();
              x_(1)i ← 0;
The remaining transformations are left to the reader and we present the continua-
tion of the adjoint code by displaying the complete reverse section.
  while ( not STACKc.empty() ) {
    if ( STACKc.top() = 8 ) {
      STACKc.pop();
      i ← STACKi.top();
      STACKi.pop();
      chunk_size ← STACKi.top();
      STACKi.pop();
      p ← STACKi.top();
      STACKi.pop();
      tid ← STACKi.top();
      STACKi.pop();
    }
    if ( STACKc.top() = 13 ) {
      STACKc.pop();
      chunk_size ← STACKi.top();
      STACKi.pop();
    }
    if ( STACKc.top() = 14 ) {
      STACKc.pop();
      ub ← STACKi.top();
      STACKi.pop();
      lb ← STACKi.top();
      STACKi.pop();
    }
    if ( STACKc.top() = 17 ) {
      STACKc.pop();
      ub ← STACKi.top();
      STACKi.pop();
    }
    if ( STACKc.top() = 18 ) {
      STACKc.pop();
      i ← STACKi.top();
      STACKi.pop();
    }
    if ( STACKc.top() = 20 ) {
      STACKc.pop();
      i ← STACKi.top();
      STACKi.pop();
      x[i] ← STACKf.top();
      STACKf.pop();
      x(1)[i] ← 0;
      y[i] ← STACKf.top();
      STACKf.pop();
      x(1)[i] +← 2*x[i]*y(1)[i];
      x(1)[i] +← 2*x[i]*y(1)[i];
      y(1)[i] ← 0;
    }
  } /* End of while loop */
} /* End of parallel region */
Listing 2.3: This code shows the second half of the parallel region, namely the reverse
section of the adjoint code.
The complete adjoint code consists of a parallel region that contains Listing 2.2 as the forward section, followed by Listing 2.3, which represents the reverse section.
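The stacks STACKc, STACKi, and STACKf are left abstract in the transformation. A minimal sketch of how per-thread stacks might be realized in C++ with OpenMP is shown below; it is not the book's implementation, and the function name adjoint_region_sketch is purely illustrative. The essential point is that each thread owns its own stack instances, so pushes in the forward section and pops in the reverse section never interfere across threads.

    #include <omp.h>
    #include <stack>

    // One instance per thread: thread_local storage gives every OpenMP thread
    // its own copy of the control, integer, and floating-point stacks.
    thread_local std::stack<int>    STACKc;  // control-flow labels
    thread_local std::stack<int>    STACKi;  // overwritten integer values
    thread_local std::stack<double> STACKf;  // overwritten floating-point values

    void adjoint_region_sketch() {
      #pragma omp parallel
      {
        // forward section: record the flow of control and overwritten values
        STACKc.push(8);
        STACKi.push(omp_get_thread_num());

        // reverse section: branch on the recorded labels and restore values
        // in reverse order of the recording
        while (!STACKc.empty()) {
          if (STACKc.top() == 8) {
            STACKc.pop();
            STACKi.pop();  // here the recorded value of tid would be restored
          }
        }
      }
    }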
The following lemma shows that the σ (P) transformation fulfills the interface
requirements that we defined in (2.9). The requirement was that if two memory
references are equal it must be ensured that their adjoint associates also have the
same reference.
Lemma 44. Let u, v ∈ FLOATP be two variables of a parallel region P ∈ SPL and let σ(P) be the adjoint model of P. Then it holds for the variables u, v, u(1), v(1) ∈ σ(P):

    &u = &v  ⇐⇒  &u(1) = &v(1)
Proof. In the case that u and v are defined outside of P, then this holds due to
(2.9). Otherwise, the variable is defined inside of P and therefore transformed
by rule (2.36) or (2.37). This definition is unique inside the parallel region σ (P)
since we only allow one sequence of declarations inside the parallel region. As the
transformations (2.36) and (2.37) show, the declaration of v is always connected
with the declaration of v(1) inside of σ (P). Hence, the equivalence is fulfilled.
Before we finish this section, we introduce a notation that will improve the read-
ability when we prove the closure property of the adjoint source transformation.
During the proof we will examine different combinations of possible statements.
For example, we have pairs of instances of statements in execution where one statement is from the forward section of thread t and the other statement is from the reverse section of thread t′. We try to prove the correctness of the source transformation by reducing the information about the statements inside the source transformation code to the original statements. This allows us to reason that the source transformation is correct because the original statement is assumed to be correct.
This method requires an association of the interleaving of the original code P with
the interleaving of the code obtained by σ (P). The association is done by a nota-
tion that we introduce in the following.
Suppose the parallel region P has a sequence of statements of length q and is executed by p threads. The set of possible interleavings is I(P, q, p). Let us assume that q′ is the number of statements in σ(P). The set of possible interleavings of the source transformation is therefore I(σ(P), q′, p).
We write s_g ∈ P when the statement s_g is contained in P. In case that s_i is obtained by transforming s_g through σ, we write s_i ∈ σ(s_g). We do not use s_i = σ(s_g) since in general the transformation supplies a sequence of statements. Analogously, we write s_j ∈ ρ(s_h) for the case that s_j is contained in the code that results from applying the transformation ρ to a statement s_h that stems from P.
Let us consider two consecutive statements s_i^t; s_j^{t′} in J ∈ I(σ(P), q′, p). The fact that s_i ∈ σ(s_g) and s_j ∈ ρ(s_h) is denoted by

    (s_g^t, s_h^{t′}) −(σ,ρ)→ (s_i^t, s_j^{t′}).    (2.64)

[Figure 2.2: A consecutive pair (s_g^t; s_h^{t′}) in an interleaving I ∈ I(P, q, p) is related to a consecutive pair (s_i^t; s_j^{t′}) in an interleaving J ∈ I(σ(P), q′, p).]
Figure 2.2: We define a notation that connects the interleaving of the original code P and its adjoint source transformation I(σ(P), q′, p).

Suppose the pair (s_i^t; s_j^{t′}) arose out of the transformations σ(s_g) and ρ(s_h). Since s_g and s_h are contained in P, there exists an interleaving I ∈ I(P, q, p) that contains the consecutive pair (s_g^t; s_h^{t′}). This pair is shown in the upper half of Figure 2.2. To describe this relation, we write it as shown in (2.64).
Please note the important fact that we have to consider four possible combinations of instances inside the interleaving of the adjoint source transformation, namely I(σ(P), q′, p). Either statement s_i^t or s_j^{t′} may be part of the forward section or of the reverse section. When we combine these possibilities, we obtain four different cases that need to be considered. For each of the four possible combinations, we define in (2.65) an association that describes the connection between the statements in the derivative code and the statements from the original code.

    (s_g^t, s_h^{t′}) −(σ,σ)→ (s_i^t, s_j^{t′})   with s_i ∈ σ(s_g) and s_j ∈ σ(s_h)
    (s_g^t, s_h^{t′}) −(σ,ρ)→ (s_i^t, s_j^{t′})   with s_i ∈ σ(s_g) and s_j ∈ ρ(s_h)
    (s_g^t, s_h^{t′}) −(ρ,σ)→ (s_i^t, s_j^{t′})   with s_i ∈ ρ(s_g) and s_j ∈ σ(s_h)     (2.65)
    (s_g^t, s_h^{t′}) −(ρ,ρ)→ (s_i^t, s_j^{t′})   with s_i ∈ ρ(s_g) and s_j ∈ ρ(s_h)
This section introduced the adjoint source transformation σ (P) that takes a par-
allel region P as input and provides the adjoint model of P as output. In order to
implement the reverse section, we defined another source transformation ρ (S) that
takes a sequence of statements S as input and provides the corresponding adjoint
statements as output. The fact that the code provided by the transformation repre-
sents the adjoint model of P was proven in Proposition 43. The following section
explains how the code that we obtain by the source transformations fits into the
derivative code of Q which is the code where the parallel region P is embedded.
S0
#pragma omp parallel
{
  S1
}
S2
Listing 2.4: Example structure of a code Q with an OpenMP parallel region P.
There are three sequences of statements where S0 and S2 are C/C++ code and S1
is SPL code. We assume that the source transformations τ (), σ (), and ρ() are
well-defined, when S0 and S2 serve as input codes. The actual definition does not
play a role in this work, since we only focus on the source transformation of a
parallel region. However, we have to ensure that the interface between our source
transformation result and the remaining derivative code fits together. This interface
is described in (2.8) and (2.9) on page 86.
Suppose that the tangent-linear models of S0 and S2 are given by τ(S0 ) and
τ(S2 ). The next listing shows the transition of the original structure of Q to the
one in τ(Q). The tangent-linear model of Q inherits its structure from Q and passes it on to its derivative counterpart. This corresponds to the fact that the dataflow stays the same in the tangent-linear model. Our contribution to τ(Q) is the parallel region with τ(S1) in it.
The original structure of Q (left) is transformed by τ(Q) into the tangent-linear model of Q (right):

Original structure of Q:
    S0
    #pragma omp parallel
    {
      S1
    }
    S2

The tangent-linear model of Q:
    τ(S0)
    #pragma omp parallel
    {
      τ(S1)
    }
    τ(S2)
The usage of the adjoint model σ (P) needs to be explained in more detail. The
adjoint model computes the derivative values with the help of the forward section
and the reverse section. First, the forward section is executed; it contains the original code augmented by statements that store values before they are potentially overwritten. The augmented code also traces the flow of control.
After the termination of the forward section, the reverse section follows with the
reversal of the control flow and the calculation of the adjoint values. The rigorous
separation of the forward section and the reverse section is called split reversal.
This scheme generally consumes a lot of memory. A remedy for this drawback is
the joint reversal scheme.
In our case, the joint reversal scheme can be applied to the parallel region P.
This means that the forward section of Q does not contain the augmented forward
section of P but instead the original code of P. Before the execution of the original
code P begins, all input values of P are stored, for example, by writing a checkpoint
[47, 30, 78]. A checkpoint, taken at a certain point in the program execution,
contains all values that are necessary to allow a later restart of the computation at
this point. The checkpoint that stores all input data of P during the forward section
is restored in the reverse section of Q just before the execution of the adjoint code
of P starts. Hence, it is ensured that the adjoint code of P has the exact same
input data as the execution of P had during the forward section of Q. In case that
the input data of P is constant or is not changed after P we do not need to store
anything. The above method leads to the following structure of the forward section
of σ (Q).
    σ(S0)
    #include "store_checkpoint.inc"
    #pragma omp parallel
    {
      S1
    }
    σ(S2)
As in the tangent-linear case, we assume that σ(S0) and σ(S2) are given. The forward section σ(S0) is followed by the creation of a checkpoint, which is illustrated by including the code store_checkpoint.inc from an external file. Sequence S1 in the parallel region is present as it was in the original code Q. The reverse section of Q looks as follows.
    ρ(S2)
    #include "restore_checkpoint.inc"
    #pragma omp parallel
    {
      σ(S1)
      while ( not STACK(1)c.empty() ) { ρ(S1) }
    }
    ρ(S0)
The codes ρ (S0 ) and ρ (S2 ) appear in reversed order compared to the forward
section. The checkpoint containing the input data of P is restored by including
the external code file restore_checkpoint.inc. The reason for restoring the input
data becomes clear when considering the code inside the parallel region. First, the augmented forward section σ(S1) precedes the reverse section ρ(S1). This joint computation of the adjoint values inside of the reverse section of Q is called joint reversal. To summarize, we illustrate the transformation of Q with the following listing.
The original program Q (left) is transformed by σ(Q) into the adjoint model of Q (right):

Original program Q:
    S0
    #pragma omp parallel
    {
      S1
    }
    S2

The adjoint model of Q:
    σ(S0)
    #include "store_checkpoint.inc"
    #pragma omp parallel
    {
      S1
    }
    σ(S2)
    ρ(S2)
    #include "restore_checkpoint.inc"
    #pragma omp parallel
    {
      σ(S1)
      while ( not STACK(1)c.empty() )
      { ρ(S1) }
    }
    ρ(S0)
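The store and restore steps are deliberately kept abstract by the two include files. A minimal sketch of what store_checkpoint.inc and restore_checkpoint.inc could amount to, assuming the inputs of P are the n entries of an array x, is given below; the names checkpoint_x, store_checkpoint, and restore_checkpoint are illustrative only.

    #include <vector>

    // Sketch only: the checkpoint keeps every input of P that might be
    // overwritten before the reverse section of Q reaches the adjoint of P.
    static std::vector<double> checkpoint_x;   // illustrative buffer name

    void store_checkpoint(const double* x, int n) {
      checkpoint_x.assign(x, x + n);           // taken in the forward section of Q
    }

    void restore_checkpoint(double* x, int n) {
      for (int k = 0; k < n; ++k)              // re-established in the reverse section
        x[k] = checkpoint_x[k];                // of Q, right before sigma(P) runs
    }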
More details about the split reversal and the joint reversal scheme can be found
in [29, 57]. The next section shows that if P fulfills certain conditions, the code obtained by applying the source transformations from this chapter to P can in fact be executed in parallel.
this pair, there exists an interleaving that represents the parallel execution of P and contains the pair of consecutive statements that are the original statements of s_i^t and s_j^{t′}. Choose I ∈ I(P, q, p) such that (s_g^t; s_h^{t′}) ∈ I with s_i ∈ τ(s_g) and s_j ∈ τ(s_h).
We assume that &v ∈ LHSREF(s_i^t) ≠ ∅, which implies that s_i is an assignment. We have to show that the reference &v cannot occur in s_j^{t′}, no matter which kind of statement s_j is.
    y_h ← φ(x_{h,1}, ..., x_{h,n})

or

    y_h +← φ(x_{h,1}, ..., x_{h,n}).

The fact that &y_i^t is part of the derivative computation means for the left-hand side reference &y_g^t of s_g^t that this reference is also contained in s_h^{t′}. As P is noncritical, this assumption cannot be true.

    s_g = y_g ← φ(x_{g,1}, ..., x_{g,n})

and

    s_h = y_h ← φ(x_{h,1}, ..., x_{h,n}).

The claim that &y_i^{(1)t} occurs in s_j^{t′} means for the original statements that &y_g^{(1)t} occurs in s_h^{t′}. This again contradicts the assumption that P is noncritical.
Case s_i = y_i^{(1)} +← Σ_{k=1}^{n} φ_{x_{i,k}}(x_{i,1}, ..., x_{i,n}) · x_{i,k}^{(1)}: Analogous to the previous case.
Lemma 46. Suppose P ∈ SPL is a noncritical parallel region, then the source
transformation τ (P) is closed.
Proof. We know from Proposition 45 that τ (P) is noncritical. Since P does not
contain any OpenMP constructs besides the parallel directive, the tangent-linear
source transformation is closed.
This section showed the very pleasing fact that our source transformation that
provides the tangent-linear model of a parallel region P fulfills the closure property
in case that P is noncritical. The coming section will introduce the exclusive read
property and we show that the closure property for the adjoint transformation is
only given if the original code P has the exclusive read property.
the adjoint source transformation σ(P). To achieve this, we perform similar steps as in the previous section and consider the possible combinations of pairs in the interleaving that represents the parallel execution of the adjoint model. To anticipate the result, a certain combination will reveal that the closure property is only given in case that the input code P is noncritical and, in addition, never reads a memory location concurrently. The latter requirement differs from the tangent-linear transformation, where the noncritical requirement was sufficient.
In the introduction we saw that a race condition during runtime may lead to a
nondeterministic behavior. Such a race condition appears, for example, when two
threads write to the same memory location at the same time. In contrast to this,
a simultaneous read of the same memory location does not change the memory
location and is therefore noncritical no matter how many threads read this certain
memory location. Thus, a concurrent read is anything but an exception in shared-
memory parallel programming.
The property of a parallel region P that a parallel execution never leads to a
situation where a memory location is read concurrently by more than one thread
is denoted as the exclusive read property. Speaking in terms of an interleaving,
the exclusive read property can be expressed as follows. We consider all possible
interleavings I (P, q, p) of a parallel execution of parallel region P ∈ SPL with q
statements and p threads. We focus on an arbitrary consecutive pair of statements
(s_i^t; s_j^{t′}) contained in an arbitrary interleaving I ∈ I(P, q, p). The exclusive read property is fulfilled if and only if each occurrence of the reference &v on the right-hand side of s_i^t implies that this reference does not occur on the right-hand side of s_j^{t′}. More formally, we write:

    ∀ I ∈ I(P, q, p)  ∀ (s_i^t; s_j^{t′}) ∈ I, t ≠ t′:
        &v ∈ RHSREF(s_i^t)  =⇒  &v ∉ RHSREF(s_j^{t′}).        (2.66)
race condition where the critical reference is &v(1) . In other words, the simultane-
ous read operation in the original code leads to a simultaneous write operation in
the adjoint code.
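To make this concrete, consider the following small hand-written sketch, which is not one of the book's examples: every thread reads the shared value x[0], so the original loop is race free but violates the exclusive read property; in the adjoint loop, that read turns into concurrent increments of x_a[0], which race unless they are synchronized (the atomic anticipates the synchronization constructs discussed later).

    #include <omp.h>

    // Original parallel region: every thread reads x[0], so the exclusive read
    // property (2.66) is violated although the code itself contains no race.
    void f(const double* x, double* y, int n) {
      #pragma omp parallel for
      for (int i = 0; i < n; ++i)
        y[i] = x[i] * x[0];
    }

    // Hand-written adjoint of the loop body (sketch only): the concurrent read
    // of x[0] becomes a concurrent write to x_a[0].
    void f_adj(const double* x, double* x_a, const double* y_a, int n) {
      #pragma omp parallel for
      for (int i = 0; i < n; ++i) {
        x_a[i] += x[0] * y_a[i];   // exclusive: each thread owns its own i
        #pragma omp atomic         // without this, the increments of x_a[0] race
        x_a[0] += x[i] * y_a[i];
      }
    }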
Proposition 47. Suppose that P ∈ SPL is noncritical and fulfills the exclusive read property (2.66); then σ(P) is noncritical.
Proof. P consists of the sequence of statements S = (s1; ...; sq), where the individual statements in S are referred to as original statements. P is noncritical, which is equivalent to the fact that all statements s ∈ S are noncritical. Therefore, it holds:

    ∀ I ∈ I(P, q, p)  ∀ (s_g^t; s_h^{t′}) ∈ I, t ≠ t′:
        &v ∈ LHSREF(s_g^t)  =⇒  &v ∉ REF(s_h^{t′})
t ≠ t′. The statements s_i and s_j are from the sequence (s1, s2, ..., sq′), which means that i and j are indexes out of {1, ..., q′} and not necessarily different. The original statements corresponding to s_i and s_j are referred to as s_g and s_h. Thus, g and h have values between 1 and q.
There is an interleaving I ∈ I(P, q, p) that contains (s_g^t; s_h^{t′}). This fact is important as we trace back the absence of a critical reference in J to the absence of a critical reference in I. We illustrate the correlation between the interleavings I and J in Figure 2.2. We use the notation (2.65) to make clear what kind of transformations have been used to achieve (s_i^t; s_j^{t′}). For example,

    (s_g^t, s_h^{t′}) −(σ,ρ)→ (s_i^t, s_j^{t′})

means that s_i^t is a statement from the forward section (s_i ∈ σ(s_g)) and s_j^{t′} is a statement from the reverse section (s_j ∈ ρ(s_h)). The pair (s_g^t, s_h^{t′}) is from the interleaving I that represents a parallel execution of P, and the pair (s_i^t, s_j^{t′}) is from the interleaving J that represents a parallel execution of σ(P).
A parallel execution of σ(P) means that each thread executes its part of the adjoint model of P independently. At any time, some threads may execute the forward section while other threads execute the reverse section. Thus, we cannot be sure in which phase thread t and thread t′ are when we consider the pair (s_i^t; s_j^{t′}). There are four possible combinations for s_i^t and s_j^{t′} being either in the forward section or in the reverse section. Considering one of these four combinations, the statements themselves have a certain shape, which leads to further case distinctions. Obviously, this results in many possible combinations for (s_i^t; s_j^{t′}).
The reasoning method will be the same for all possible cases. We assume that s_i is an assignment and therefore s_i performs a store operation to the memory location referenced by the left-hand side reference of s_i^t. This reference can be &a_i^t in case of an integer assignment or &y_i^t if the assignment is a floating-point assignment. In either case this left-hand side reference may not be used in REF(s_j^{t′}); otherwise, this reference is critical. In order to show that the consecutive pair of statements (s_i^t; s_j^{t′}) is noncritical, we use a proof by contradiction. Therefore, we suppose that the left-hand side reference of s_i^t is also used by the instance s_j^{t′}. In most of the possible cases this leads to a contradiction because it would mean that P is critical as well. The only case where this does not lead to a contradiction is the case where we consider two adjoint assignments as s_i^t and s_j^{t′}. Thus, the impatient reader may skip the cases where at least one thread executes a forward section and jump to the case

    (s_g^t, s_h^{t′}) −(ρ,ρ)→ (s_i^t, s_j^{t′}).
[Figure 2.3: Case tree for s_i^t, s_j^{t′} with s_i, s_j ∈ σ(S). For s_i = a ← e and for s_i = y ← φ(x1, ..., xn), the statement s_j can have the shape a ← e, STACK(1)i.push(a), y ← φ(x1, ..., xn), if (b) { S' }, while (b) { S' }, or STACK(1)f.push(y).]
Figure 2.3: This tree shows the possibilities for the case that both statements s_i^t and s_j^{t′} are assumed to be from the forward section of the adjoint code. The assumption that the left-hand side reference of s_i^t is critical allows the cases that s_i is an integer assignment or a floating-point assignment. From each of these cases we obtain another four possible cases for the shape of s_j^{t′}.
Case (s_g^t, s_h^{t′}) −(σ,σ)→ (s_i^t, s_j^{t′}) (Figure 2.3): In this case we assume that the threads t and t′ both execute statements from the forward section of the adjoint model σ(P). Suppose that the left-hand side reference of an assignment in thread t also occurs in the execution of thread t′. In terms of instances of statements in execution, we consider the instance s_i^t that is executed by thread t and the instance s_j^{t′} that is executed by thread t′. We assume that the left-hand side reference &v of s_i^t occurs in the instance s_j^{t′}, which means that

    &v ∈ LHSREF(s_i^t)  ∧  &v ∈ REF(s_j^{t′})        (2.67)

is valid. Depending on the different possible shapes of s_i and s_j, we trace this assumption back to the original statements s_g and s_h, where we consider the instances s_g^t and s_h^{t′} from the interleaving I.
1. s_i = a ← e:
From transformation rule (2.45), we know that the original statement s_g has the shape a ← e. Because we assume that (2.67) holds, the reference &a_i^t is assumed to be in REF(s_j^{t′}).

a) s_j = a ← e:
Again, with (2.45) we conclude that the corresponding statement s_h is a ← e. With (2.67) this leads to the fact that &a_i^t occurs either on the left- or on the right-hand side of s_j. This can be represented by the equation

    &a_i^t = &a_j^{t′}  ∨  &a_i^t = &v ∈ RHSREF(s_j^{t′})

e) s_j = STACK(1)f.push(y):
s_j is a result of one of the source transformation rules (2.40), (2.41), (2.42), or (2.46). In all these cases assumption (2.67) means for the interleaving I that &a_g^t is equal to a reference in REF(s_h^{t′}). This cannot be the case since P is noncritical.

2. s_i = y ← φ(x1, ..., xn):
Rule (2.40) shows that the original statement s_g is y ← φ(x1, ..., xn). We assert that (2.67) is true and therefore the reference &y_i^t is assumed to be contained in REF(s_j^{t′}).

a) s_j = y ← φ(x1, ..., xn):
The original statement s_h has the shape y ← φ(x1, ..., xn) due to transformation rule (2.40). The claim (2.67) about interleaving J implies that &y_i^t needs to be either on the left- or on the right-hand side:

    &y_i^t = &y_j^{t′}  ∨  &y_i^t = &x_{j,k}^{t′}  where k ∈ {1, ..., n},

which means that the threads t and t′ either both write to &y_g^t or that one thread writes to and one thread reads from the reference &y_g^t. However, both possibilities imply that P is critical, which is a contradiction.

b) s_j = if (b) { S } or s_j = while (b) { S }:
Analogously to the reasoning in 1d, we conclude that &y_g^t ∈ REF(b_h^{t′}), which means that &y_g^t is a critical reference and therefore P is critical as well.

c) s_j = STACK(1)f.push(y):
Similar to 1e, we obtain the fact that in interleaving I the reference &y_g^t is in REF(s_h^{t′}), which means that this reference from P is critical.

In all the above cases, we obtain a contradiction. Therefore, the claim (2.67) must be wrong for the case (s_g^t, s_h^{t′}) −(σ,σ)→ (s_i^t, s_j^{t′}). This shows that the parallel execution of two augmented forward sections σ(S) in different threads cannot include a critical reference. The next case examines the concurrent execution of a forward section and a reverse section in two different threads t and t′.
[Figure 2.4: Case tree for s_i^t, s_j^{t′} with s_i ∈ σ(S) and s_j ∈ ρ(S). The possible shapes of s_i are a ← e and y ← φ(x1, ..., xn); the possible shapes of s_j are y(1) ← 0, a ← STACK(1)i.top(), and y ← STACK(1)f.top().]
Figure 2.4: This tree shows the possibilities for the case where s_i^t is from the forward section and s_j^{t′} is from the reverse section.
Case (s_g^t, s_h^{t′}) −(σ,ρ)→ (s_i^t, s_j^{t′}) (Figure 2.4): The current case assumes that thread t executes the forward section and thread t′ executes its reverse section code.

1. s_i = a ← e:
We know that the original statement s_g has the shape a ← e because of transformation rule (2.45). We assume that (2.67) holds. Therefore, the set REF(s_j^{t′}) comprises the reference &a_i^t.

a) s_j = y(1) ← 0:
If we have a critical reference in J, then this is only possible when reference &a_i^t is used as an offset in the left-hand side reference &y(1)_j^{t′}. As a reminder concerning the notation, &y(1)_j^{t′} is an adjoint reference used by thread t′ that occurs in the instance s_j^{t′}. The original statement s_h is y ← φ(x1, ..., xn) as shown in rule (2.50). The offsets are transformed by the identity mapping. Therefore, the use of &a_i^t as an offset means for the interleaving I that &a_g^t is used as an offset in &y_h^{t′}, which is the left-hand side reference of s_h^{t′}. Thus, &a_g^t is critical since this reference is used in a store operation in thread t and it is read by thread t′. We infer that this is a contradiction as P is assumed to be noncritical.

b) s_j = a ← STACK(1)i.top():
From (2.67) we obtain

    &a_i^t = &a_j^{t′}.

From the transformation rule (2.55), we know that s_h has the shape a ← e. Thus, the equation

    &a_g^t = &a_h^{t′}

2. s_i = y ← φ(x1, ..., xn):
The original statement s_g is y ← φ(x1, ..., xn) as a result of rule (2.40). By claiming that (2.67) is true, the reference &y_i^t is assumed to be contained in REF(s_j^{t′}).

a) s_j = y ← STACK(1)f.top():
In this case the claim (2.67) implies that

    &y_i^t = &y_j^{t′}
Case (s_g^t, s_h^{t′}) −(ρ,σ)→ (s_i^t, s_j^{t′}) (Figure 2.5): If s_i^t is from the reverse section and s_j^{t′} is from the forward section, we do not need to consider all possible assignments in the reverse section. In case that an assignment has an adjoint reference on its left-hand side, this reference cannot occur in the forward section. Therefore, Figure 2.5 shows only those cases for s_i where a left-hand side reference can possibly occur in REF(s_j^{t′}).

[Figure 2.5: Case tree for s_i^t, s_j^{t′} with s_i ∈ ρ(S) and s_j ∈ σ(S). For s_i = a ← STACK(1)i.top() and for s_i = y ← STACK(1)f.top(), the statement s_j can have the shape a ← e, STACK(1)i.push(a), y ← φ(x1, ..., xn), if (b) { S' }, while (b) { S' }, or STACK(1)f.push(y).]
Figure 2.5: This tree shows the possibilities for the case where s_i^t is from the reverse section and s_j^{t′} is from the forward section. Since no adjoint references are used inside of the forward section, we do not need the case where s_i is an adjoint assignment.

1. s_i = a ← STACK(1)i.top():
If s_i is a stack operation that restores an integer value, then the original statement s_g is a ← e due to rule (2.55). The possible cases where the reference &a_i^t may occur in REF(s_j^{t′}) are

a) s_j = a ← e
b) s_j = STACK(1)i.push(a)
c) s_j = y ← φ(x1, ..., xn)
d) s_j = if (b) { S } or s_j = while (b) { S }
e) s_j = STACK(1)f.push(y)

These cases for the shape of s_j are the same as displayed in Figure 2.3. Since the left-hand side reference of s_i^t is &a_i^t here as well as in Figure 2.3, we refer the reasoning in the current case to the cases (s_g^t, s_h^{t′}) −(σ,σ)→ (s_i^t, s_j^{t′}) 1a to 1e.
2. s_i = y ← STACK(1)f.top():
The original statement s_g may have one of three possible shapes depending on which rule was used to generate statement s_i. In case that the rules (2.50) or (2.51) were used, s_g is y ← φ(x1, ..., xn). Otherwise, rule (2.52) shows that the original statement is s_g = y ← STACKf.top(). The possible cases for the shape of s_j are

a) s_j = y ← φ(x1, ..., xn):
c) s_j = STACK(1)f.push(y):

Since we obtain a contradiction in all the possible cases, we conclude that the execution of a reverse section in thread t and a forward section in thread t′ do not interfere with each other. The next case examines whether or not the execution of a reverse section in both threads can possibly interfere in some way.
Case (s_g^t, s_h^{t′}) −(ρ,ρ)→ (s_i^t, s_j^{t′}) (Figure 2.6): In this case, we assume that both statements s_i^t and s_j^{t′} are from the reverse section. As a reminder, the reverse section has the following appearance:

    while ( not STACK(1)c.empty() ) { ρ(S′) }        (2.68)

The number of branch statements m is the number of SLC sequences that we obtain from applying rule (2.48). The test expressions in (2.68) and (2.69) are noncritical because they only read from a thread-local data structure. Therefore, we only consider the sequences of statements S1 to Sm. These sequences are the result of the source transformation rule (2.49). The pop operation, which is the first statement inside the branch code block, can be neglected since it only affects thread-local memory. The actual statements in these sequences are obtained through the transformation rules (2.50), (2.51), (2.52), and (2.55). Hence, we consider possible combinations of statement pairs out of the following set of statements:

    { y ← STACK(1)f.top(),
      x(1)k +← φ_{x_k}(x1, ..., xn) · y(1),
      y(1) ← 0,
      a ← STACK(1)i.top() },   where k ∈ {1, ..., n}.

Thus, the shape of s_i can be one of the assignments contained in this set. Depending on the kind of the left-hand side reference, there are several cases where this left-hand side reference could occur in s_j^{t′}. The edges from s_i to s_j in Figure 2.6 correspond to the possibility of an occurrence. To prove the absence of a critical reference by contradiction, we assume that (2.67) is true.
1. s_i = a ← STACK(1)i.top():
The original statement s_g corresponding to s_i is a ← e, which can be recognized by rule (2.55).

a) s_j = a ← STACK(1)i.top():
As stated for s_i, due to rule (2.55) we obtain the original statement s_h with the shape a ← e. Claim (2.67) implies that both left-hand side references must be the same:

    &a_i^t = &a_j^{t′}

This equation means for interleaving I that

    &a_g^t = &a_h^{t′},

and leads to the fact that P is critical because both threads write to the same memory location.

b) s_j = y(1) ← 0:
We refer to the case (s_g^t, s_h^{t′}) −(σ,ρ)→ (s_i^t, s_j^{t′}) 1a, see Figure 2.4. There, we investigate the case where the left-hand side reference &a_i^t of s_i occurs in the same shape of s_j as in the current case. Thus, the reasoning is the same as shown there.

c) s_j = y ← STACK(1)f.top():
We conclude that this is a contradiction as shown in the case (s_g^t, s_h^{t′}) −(σ,ρ)→ (s_i^t, s_j^{t′}) 1c, see Figure 2.4.
(s_g^t, s_h^{t′}) −(σ,ρ)→ (s_i^t, s_j^{t′}) 1d, see Figure 2.4.
2. s_i = y(1) ← 0:
Source transformation rule (2.50) shows us that the original statement s_g is y ← φ(x1, ..., xn).

a) s_j = y(1) ← 0:
The assumption (2.67) that there is a critical reference in J leads to

    &y(1)_i^t = &y(1)_j^{t′},

which reveals a critical reference since the threads t and t′ write to the same memory location.
hold also for the current case, and thus we conclude that P is critical.
The left equation is the same as in (2.70) and thus we achieve the contradiction stating that P is critical. Therefore, the only remaining possibility is that

    &x(1)_{k,i}^t = &x(1)_{l,j}^{t′}

is valid. The equivalence from Lemma 44 provides the fact that in interleaving J the equation

    &x_{k,i}^t = &x_{l,j}^{t′}

is true. For the parallel execution of P this means that the threads t and t′ read concurrently from the same memory location. For the first time we do not get a contradiction because this read-read situation is not critical. Nevertheless, our original code P fulfills the exclusive read property (2.66) and therefore this remaining equation cannot be true either.
In all cases except the last, we achieved a contradiction to the assumption that P is noncritical. Hence, we can conclude that there is no critical reference in σ(P) as long as P fulfills the exclusive read property. We cannot relinquish the exclusive read property, as we showed in the case where we obtained equation (2.71).
Fortunately, in shared memory parallel programming there are ways to synchro-
nize such a situation. In Section 4.2, we will extend the language SPL such that it
contains constructs for synchronizing the parallel execution of P. With these syn-
chronization constructs we are able to provide the closure of σ (P) for any given
P ∈ SPL, as we will see in Section 4.2.5. Nevertheless, synchronization of parallel programs often comes with poor scaling behavior since the overhead for the synchronization increases with a growing number of threads. Therefore, we only want to use synchronization when it is unavoidable. This means that our source transformation tool must be capable of recognizing the exclusive read property at compile time. In this context, we introduce a static program analysis in Chapter 3. This static analysis provides for each floating-point assignment in P an approximation of whether or not it fulfills the exclusive read property. In case the analysis cannot decide if the reference is used exclusively, the result is that the reference does not fulfill the exclusive read property.
2.5 Summary
This chapter started by introducing the formalism that we use not only in this chap-
ter but also in the coming chapters. This formalism comprises the mathematical
abstraction of a parallel region as an interleaving. In this approach, the concur-
rent instances of statements in execution are interleaved into one sequence. This
sequence is called interleaving and represents one possible parallel execution. We
showed in Lemma 36 that the number of possible interleavings is quite high even
when we only consider a straight-line code.
We defined the terms of a critical and a noncritical parallel region and the im-
portant property called closure of a source transformation. The closure is defined
in Definition 39 and it states that the source transformation takes a noncritical SPL
code as input and produces a noncritical SPL code as output. The language SPL is
defined as a subset of C/C++. This simplifies the correctness proof by restricting
the set of possible statements in a parallel region. After the definition of SPL, we
defined the source transformations τ(P) and σ(P), which take a parallel region P as input and provide the tangent-linear model (τ(P)) or the adjoint model (σ(P)) as output. The formal proofs that the transformation results are in fact the corresponding models are given in Proposition 41 and Proposition 43.
In this work, we only focus on the source transformation of a parallel region P
and not the transformation of the code Q where P is embedded in. However, the
derivative codes of P and Q have to fit together and we explained the corresponding
interface in Section 2.3.3.
The most important section in this chapter was Section 2.4. In this section we
proved that the closure property for τ (P) and σ (P) holds under certain circum-
stances. The tangent-linear source transformation τ (P) fulfills the closure property
for a noncritical input code P written in SPL. To achieve the closure property of
σ (P), the input code P has to be noncritical and in addition it has to fulfill the
exclusive read property. The exclusive read property was defined in (2.66). It
requires that during the parallel execution of P it never happens that two different threads read from the same memory location concurrently in the context of a right-hand side evaluation of an assignment. In shared-memory parallel programming it is a general pattern to let threads read from a shared memory location.
These codes where such patterns are used must be adjusted by hand such that each
thread uses memory locations exclusively, or the source transformation tool must
handle these codes adequately. However, the exclusive read property for the input
code P is necessary because otherwise the concurrent read access of two threads in
P becomes a concurrent store operation during the execution of the adjoint model
σ (P).
In order to enable the source transformation tool to recognize the exclusive read property, we introduce a static program analysis in Chapter 3. Afterwards, in Chap-
ter 4, we make the transition from a general parallel region to an OpenMP parallel
region. We show how the most important constructs of OpenMP can be trans-
formed into the tangent-linear or the adjoint model.
[Figure 2.6: Case tree for s_i^t, s_j^{t′} with s_i, s_j ∈ ρ(S). The shapes shown for s_i and s_j include y(1) ← 0, y ← STACK(1)f.top(), and x(1)k +← φ_{x_k}(x1, ..., xn) · y(1), with k ∈ {1, ..., n} for s_i and l ∈ {1, ..., n} for s_j.]
Figure 2.6: This tree shows the possibilities for the case where both s_i^t and s_j^{t′} are from the reverse section. We show only those cases where the left-hand side reference of s_i is possibly read or written in statement s_j.
3 Exclusive Read Analysis
The exclusive read property was introduced in (2.66) on page 119. If the parallel
region P fulfills the exclusive read property then it is ensured that it never happens
that two threads read the same memory location concurrently in the context of an
assignment evaluation. Just to be clear, this situation is not a race condition and
the parallel execution of P is correct if two threads read the same memory location.
The only problem with such a code is that the concurrent read access in P becomes
a concurrent store operation in the adjoint model σ (P).
The constraint that the input code for our adjoint transformation has to fulfill the
exclusive read property can be handled in different ways. Ignoring the pragmas, and thus avoiding the parallel execution, solves the problem through a sequential execution, but this is most likely the worst solution. Another option is to put the responsibility
into the hands of the user of the source transformation tool such that the user has to
provide an option to the tool whether the exclusive read property holds or not. De-
pending on the given option the transformation tool implements different versions
of the derivative code. The last possibility is that one could assume conservatively
that all input codes possibly violate the exclusive read property. In order to resolve
possible race conditions, the source transformation tool generates derivative code
that uses synchronization or additional memory to avoid the race condition. Both synchronization and additional memory imply overhead that should be avoided whenever possible.
This motivates a static program analysis that provides as much information as
needed to avoid synchronization constructs. With this information at hand, the
source transformation tool only augments these adjoint assignments with a syn-
chronization construct where the static program analysis cannot ensure that the
original assignment from P fulfills the exclusive read property. The dataflow anal-
ysis that we introduce in this chapter is called exclusive read analysis (ERA). The
target language of the ERA is SPL, which we defined in Section 2.2. The main part of the ERA is essentially an integer interval analysis.
Beginning at the point where the parallel directive is placed, the analysis propa-
gates an approximation of the possible value range of all integer variables through
all the possible computation paths. This is done by solving an equation system it-
eratively. The equations are called transfer functions and represent the dataflow of
the code. The iterative process finishes with a fixed point that hopefully provides
enough information to decide the exclusive read property. For example, when a
statement assigns a constant value to an integer, we propagate this information
further through the transfer functions. The interval of the value range of each in-
teger variable changes depending on what assignments each variable encounters.
Unfortunately, the theorem of Rice [68] shows that we can only expect an approx-
imation of the actual value range at compile time.
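A rough impression of this machinery, under the simplifying assumptions that only a single integer variable i is tracked and that the code consists of an initialization followed by a bounded counting loop, is sketched below; the type Interval and the functions join, increment, and analyze_counting_loop are illustrative only and are not the analysis defined later in this chapter.

    #include <algorithm>

    // Illustrative integer interval [lo, hi]; the +/- infinity bounds of the
    // full lattice are omitted in this sketch.
    struct Interval { long lo, hi; };

    // Least upper bound of two intervals (the join of the lattice).
    Interval join(Interval a, Interval b) {
      return { std::min(a.lo, b.lo), std::max(a.hi, b.hi) };
    }

    // Transfer function of the statement "i <- i + 1".
    Interval increment(Interval a) { return { a.lo + 1, a.hi + 1 }; }

    // Fixed-point iteration for "i <- 0; while (i < n) { i <- i + 1; }",
    // assuming n >= 1. The loop test restricts the body interval to [lo, n-1].
    Interval analyze_counting_loop(long n) {
      Interval at_entry = { 0, 0 };          // effect of "i <- 0"
      Interval at_head  = at_entry;
      for (;;) {
        Interval in_body   = { at_head.lo, std::min(at_head.hi, n - 1) };
        Interval after_inc = increment(in_body);
        Interval next_head = join(at_entry, after_inc);
        if (next_head.lo == at_head.lo && next_head.hi == at_head.hi)
          return at_head;                    // fixed point reached: i in [0, n]
        at_head = next_head;
      }
    }

For large n this naive iteration converges slowly; practical implementations accelerate the convergence with a widening operator.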
For simplification purposes, we assume that the number of threads p ∈ N divides
the number of data elements n ∈ N. This implies that n = k · p for a k ∈ N and
therefore the inequality n ≥ p is valid.
A detailed introduction into static program analysis is beyond the scope of this
thesis. Therefore, we refer to textbooks with a detailed description of dataflow
analyses [59, 43]. A more common introduction into the domain of static program
analyses used in nowadays compiler can be found in [4, 79, 56, 31, 1, 42, 80].
As an example, let us consider a private integer variable, say i. With the infor-
mation about the value range of i for each thread, the source transformation tool
is able to decide whether a read access to a floating-point memory location, say x[i], is exclusive. This is the case if during runtime each thread sees its own range of values in variable i. For example, thread 0 uses values from zero
to four during the runtime whereby thread 1 encounters the values five to nine in
variable i. In this particular case, the memory access to x[i] fulfills the exclusive
read property. In case that both threads use the whole value range from zero to
nine for variable i, the memory access to x[i] is not exclusive since both threads
read all the ten elements of the array x. The difficulty is to decide at compile time
what value range the threads will see at runtime.
18. To minimize the effect of false sharing¹, the sum of the polynomial values is first computed in a private variable and is copied into the shared array thread_result at the end of the parallel region. For readability reasons, the computation of the polynomial value is coded as one would probably calculate it on a sheet of paper. In general, a parallel evaluation of a polynomial is possible in log2 n steps, where n is the polynomial degree [46, 65].

¹ This effect often occurs in the context of cache coherence, see [66, p. 141] or [15, p. 155].
 1  #pragma omp parallel
 2  {
 3    int tid ← 0;
 4    int p ← 0;
 5    int c ← 0;
 6    int lb ← 0;
 7    int ub ← 0;
 8    int i ← 0;
 9    int j ← 0;
10    int k ← 0;
11    double y;
12    double local_sum ← 0.;
13
14    tid ← omp_get_thread_num();
15    p ← omp_get_num_threads();
16    c ← n/p;
17    lb ← tid*c;
18    ub ← (tid+1)*c-1;
19    i ← lb;
20    thread_result[tid] ← 0.;
21    while (i ≤ ub) {
22      y ← 0.;
23      j ← 0;
24      k ← n-5;
25      while (j ≤ k) {
26        y ← A[i*n+j+4]*(x[i]*x[i]*x[i]*x[i]) + A[i*n+j+3]*(x[i]*x[i]*x[i])
27            + A[i*n+j+2]*(x[i]*x[i]) + A[i*n+j+1]*(x[i]) + A[i*n+j];
28        local_sum +← y;
29        j ← j+1;
30      }
31      i ← i+1;
32    }
33    thread_result[tid] +← local_sum;
34  }
Listing 3.1: The array A contains the coefficients of n · 5n polynomial functions of degree
four. A can be considered as a two dimensional array with n rows where each row
contains n values. These values represent the coefficients of the polynomial functions.
Each polynomial function that has its coefficients in row i is evaluated at the point x[i]. The
n rows of A are divided into chunks where each thread gets a chunk of rows. The
polynomial function values of each chunk are evaluated concurrently. Each thread puts
the sum of the polynomial values into the private variable local_sum. At the end of the
parallel region the value of the local sum is put into a component of the shared array
thread_result.
Our focus in this example code is on lines 26 and 27. The different elements inside the arrays A and x are accessed by different offsets. The offsets themselves are composed of integer expressions. Our goal in this chapter is to establish a static program analysis that decides whether these offsets are used only by one thread or possibly by multiple threads.
For example, variable i takes only values between the lower bound lb and the
upper bound ub. As shown in Listing 3.1, these boundaries depend on the individ-
ual thread identifier number provided by omp_get_thread_num() and the size
of the chunk c that each thread should process. The chunk size depends on the
overall number of elements n that should be processed and the number of threads
that is supplied by omp_get_num_threads(). The upper bound ub_t of thread t has the offset value that is one below the lower bound offset value lb_{t+1} of thread t + 1. Therefore, when we use t as a thread ID, p as the number of concurrent threads, and n as the number of data elements, then we can write the range of values that i takes in thread t as:

    t · (n/p) ≤ i ≤ (t + 1) · (n/p) − 1        (3.1)
When we multiply with n we get:

    t · (n/p) · n ≤ i · n ≤ (t + 1) · (n/p) · n − n

By respecting that variable j takes values from zero to (n − 5), we obtain the values of the expression i · n + j for thread t:

    t · (n/p) · n ≤ i · n + j ≤ (t + 1) · (n/p) · n − 5        (3.2)
The expressions in (3.2) depend on the unique thread identification number t, the number of threads p, the size of the data n, and a linear term. If we propagate the information about the possible value range of all integer variables in P along the dataflow, we can perform an interval analysis.
In order to decide the exclusive read property, we examine whether or not the intervals of the threads t and t + 1 overlap. For example, let us consider inequality (3.2). We check that the value of the upper bound of thread t is smaller than the value of the lower bound of thread t + 1:

    (t + 1) · (n/p) · n − 5 < (t + 1) · (n/p) · n        (3.3)
    −5 < 0

This inequality is valid for all values of t, n, and p. Hence, the memory reference A[i*n+j] in line 26 fulfills the exclusive read property. The fact that the memory reference x[i] is read exclusively can be recognized from inequality (3.1).
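A small numerical check of this disjointness argument, not part of the analysis itself, can be written down directly; the helper below simply recomputes lb, ub, and the offset ranges per thread for concrete n and p (chosen such that p divides n) and verifies that the ranges of neighboring threads do not overlap. The function name check_exclusive_read is illustrative only.

    #include <cassert>

    // Sketch: for concrete n and p (with p dividing n), verify that the index
    // ranges used in lines 26-27 of Listing 3.1 are disjoint between threads.
    void check_exclusive_read(int n, int p) {
      int c = n / p;                              // chunk size, as in Listing 3.1
      for (int t = 0; t + 1 < p; ++t) {
        int ub_t  = (t + 1) * c - 1;              // last row index of thread t
        int lb_t1 = (t + 1) * c;                  // first row index of thread t+1
        assert(ub_t < lb_t1);                     // cf. (3.1): x[i] is read exclusively

        int max_offset_t  = ub_t * n + (n - 5) + 4;   // largest A index of thread t
        int min_offset_t1 = lb_t1 * n;                // smallest A index of thread t+1
        assert(max_offset_t < min_offset_t1);         // cf. (3.3) and (3.12): A[...] exclusive
      }
    }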
The following sections contain the base ingredients that are necessary to define
the dataflow analysis called ERA. Section 3.1 displays how the control flow of a
given SPL code is formalized. Our analysis and its formalism are based on the interval analysis presented in the textbook [59]. Therefore, we briefly present the idea
of this integer interval analysis in Section 3.2. Since we need to track the value
propagation from, for example, omp_get_thread_num() to the definition of the
lower and upper boundaries for each thread, we need an extension of this interval
analysis. Therefore, in Section 3.3 we define some order-theoretic foundations
for directed acyclic graphs (DAG). DAGs are mainly used in our implementation
of the source transformation tool. The value range of integer variables in our
SPL code is going to be represented as an interval where the lower and the upper
boundary are formed by a DAG. Hence, Section 3.4 clarifies how we define a
partial order of intervals of DAGs. All these sections serve as introduction for
the dataflow analysis that is defined in Section 3.5. The result of this dataflow
analysis is an approximation of the value range of all integer variables in the given
SPL code. This approximation is necessary to decide whether or not a memory
reference that is addressed by a base address and an offset expression, is accessed
only by one thread or possibly by multiple threads at the same time. This check is made similarly to the one shown in Example 23 and is explained in Section 3.6.
Definition 48 (Initial and final labels). Every statement in SPL has a single entry which is called its initial label. The mapping init : SPL → N maps a given SPL statement to its initial label. Suppose the parallel region P consists of a sequence of declarations D and a sequence of statements S. Then

    init(P) := init(D)    if D ≠ ∅
    init(P) := init(S)    if D = ∅
    init(D) := init(d1)   if D = (d1; d2; ...; dq), q ∈ N
    init(S) := init(s1)   if S = (s1; s2; ...; sq), q ∈ N

    init(l : d) := l      if d is one of
                             unsigned int a
                             unsigned int a ← N
                             int a
                             int a ← Z
                             double x
                             double x ← R

    init(l : s) := l      if s is one of
                             STACK∗.push(a)
                             STACK∗.pop()
                             a ← e
                             y ← φ(x1, ..., xn)
                             y +← φ(x1, ..., xn)
                             if (b) { S }
                             while (b) { S }
                             #pragma omp barrier

The mapping final : SPL → P(N) provides the set of exit labels.

    final(P) := final(S)
    final(D) := final(dq)   if D = (d1; d2; ...; dq), q ∈ N
    final(S) := final(sq)   if S = (s1; s2; ...; sq), q ∈ N

    final(l : d) := {l}     if d is one of
                               unsigned int a
                               unsigned int a ← N
                               int a
                               int a ← Z
                               double x
                               double x ← R

    final(l : s) := {l}     if s is one of
                               STACK∗.push(a)
                               STACK∗.pop()
                               a ← e
                               y ← φ(x1, ..., xn)
                               y +← φ(x1, ..., xn)
                               #pragma omp barrier

    final( l : if (b) { S } ) := {l} ∪ final(S)
    final( l : while (b) { S } ) := {l}
The final labels of a branch statement are either the label of the branch statement itself, in case that the test expression b is not valid, or, if b evaluates to true, the final labels of the sub-block S. The loop statement in our SPL language allows only the test expression b as exit. The break statement in C/C++ would make this definition more complex.
Definition 49 (Flow relation). The relation flow ⊂ N × N returns for a given SPL statement its control flow as a set of pairs of labels. The empty set as flow is defined for the following statements:

    flow(s) := ∅    if s is one of
                       STACK∗.push(a)
                       STACK∗.pop()
                       a ← e
                       y ← φ(x1, ..., xn)
                       y +← φ(x1, ..., xn)
                       #pragma omp barrier
This means that the flow of the parallel region consists of the join of the flow sets
of the sequence of declarations and the sequence of statements.
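For illustration, consider a small loop with the hypothetical labels l1, l2, and l3; this example is not taken from the text and assumes the usual flow rules for sequences and loops from [59]:

    l1 : a ← 0;   l2 : while (b) { l3 : a ← a + 1; }

With the definitions above, init(l1 : a ← 0) = l1 and final(l2 : while (b) { ... }) = {l2}, and the flow relation contains the pairs (l1, l2), (l2, l3), and (l3, l2); the last pair reflects that the loop body hands the control flow back to the test expression.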
 8    tid ← omp_get_thread_num();
 9    y[tid] ← 0.;
10    while (t < 100.) {
11      y[tid] ← y[tid] + t*sin(x[tid]*x[tid]);
12      #pragma omp barrier
13      if (tid = 0) {
14        t ← t+1.;
15      }
16      #pragma omp barrier
17    }
18  }
Listing 3.2: Example code that is represented as a control flow graph in Figure 3.2. The barriers prevent the threads from computing the value of y[tid] during the update of the master thread.
Let us consider the node 59 in Figure 3.2. It represents the branch statement in
line 13 of Listing 3.2.
The init set of node 59 contains only the label 59 because the control flow of
the program first tests the boolean expression before it decides where to continue
the execution. The final set of node 59 is {57, 59}. The label 59 corresponds to
the case that the boolean expression is false and the runtime execution continues
with the code following the conditional statement, here the next code statement
is the barrier construct. In case that the boolean expression is true, the execution
continues with the code labeled with 57 which is also the last statement inside the
body of the conditional statement. Finally, the flow set of the conditional statement
is given by {(59, 57)}, which corresponds to the edge going from node 59 to node 57 in Figure 3.2. The edge going from node 59 to node 60 is evoked by the flow relation of the sequence of statements that represents the body of the loop in line 10.
The empty interval ⊥ is a subset of all intervals I ∈ Interval. Figure 3.1 clarifies the idea behind the partial ordering v on Interval. The figure can be read from the bottom to the top with a "contained in" relation. The ⊥ element is contained in all intervals. The intervals with only one integer number are contained in intervals with two integer numbers, and so forth. The greatest interval in Interval is [−∞, ∞], which contains all integer numbers together with −∞ and ∞. This interval is called the top element and is denoted by the symbol ⊤. We define the least upper bound on sets of intervals as follows: for J ⊆ Interval, ⊔J := [Z1, Z2], where

    Z1 := ⊓ {n1 | [n1, n2] ∈ J}        (3.5)
    Z2 := ⊔ {n2 | [n1, n2] ∈ J}.       (3.6)
[Figure 3.1: Hasse diagram of the complete lattice Interval, with ⊥ at the bottom and [−∞, ∞] = ⊤ at the top.]
Figure 3.1: The complete lattice Interval. The empty interval ⊥ is a subset of all intervals in Interval. The interval [0, 0] consists of the element zero and [1, 1] comprises the value one. Both [0, 0] and [1, 1] are subsets of the interval [0, 1] that covers the values zero and one. Thus, an interval is contained in all intervals that lie on a path from this interval to the interval [−∞, ∞].
Please note that ⊓ in (3.5) is the greatest lower bound of Z ∪ {−∞} and ⊔ in (3.6) is the least upper bound of Z ∪ {∞}. Each interval in J is surrounded by the interval ⊔J. The definitions (3.5) and (3.6) ensure that ⊔J is indeed the least upper bound of J. Since J was chosen arbitrarily, each subset J of Interval has the least upper bound ⊔J. Hence, (Interval, v) is a complete lattice.
F
This concludes the section about the base idea for our static program analysis.
The next section introduces our approach for the exclusive read analysis where we
extend the complete lattice (Interval, v) in such a way that the complete lattice
does not consist of intervals of integer numbers but of intervals of DAGs. An
analysis as shown in the current section would also be possible, but it would lead to relatively coarse-grained information. For example, the lower bound lb in Listing 3.1 is computed by
    tid ← omp_get_thread_num();
    p ← omp_get_num_threads();
    c ← n/p;
    lb ← tid*c;
In case that we use the analysis from this section, we cannot trace back the information that lb depends on the thread identifier number, the number of threads, and the number of elements to process. In addition, intervals that result from expressions with several operations, such as a multiplication, tend to be overestimated such that they reach the top element ⊤ = [−∞, ∞]. This happens since the analysis has to be conservative, and in case that the interval cannot be determined exactly, the analysis information becomes the top element. The information of the top element is obviously not sufficient for our purpose.
For the exclusive read analysis, we need the information whether or not an integer variable has overlapping value ranges in different threads. Therefore, we have to go a step further and track the value propagation of a computation with the help of DAGs. These DAGs are used as lower and upper boundaries of intervals. After a fixed point computation, we obtain intervals of DAGs that can be compared to determine whether certain intervals are used exclusively by individual threads or whether the occurring values are the same for different threads. The first step for extending the current analysis is to show that the set of possible directed acyclic graphs DAG and a relation v form a partial order.
V = { a, Z, +, −, ∗, / | a ∈ INTP , Z ∈ Z ∪ {−∞, ∞} } .
[Figure 3.2: Control flow graph of the code in Listing 3.2. Each node shows its label, the corresponding source line, and the statement; among others, node 62 is the parallel region, node 61 the while loop with test (t < 100.), node 59 the if statement with test (tid == 0), node 57 the assignment t = t+1., and nodes 49 and 60 are the barrier constructs.]
As an example, consider two DAGs d1 and d2, where d1 represents the expression t ∗ (n/p) and d2 represents the expression ((t + 1) ∗ (n/p)) − 1. For simplicity reasons, we use the expression that a DAG represents to describe its appearance instead of showing the DAG as a graph. For example, we simply write the above graphs as

    d1 := t ∗ (n/p)                      (3.7)

and

    d2 := ((t + 1) ∗ (n/p)) − 1.         (3.8)

In case that dl ∈ DAG consists of only one node and this node represents an integer number Z, we write dl = Z.
One of the essential operations on DAGs is the equality operation. We define two DAGs, dl and dr, as equal if and only if they consist of the exact same number of nodes and all the nodes have the same content and the same set of predecessors and successors. We express the equality by writing

    dl = dr.

Please note that this definition does not cover cases where we have algebraic equality due to commutativity. In the simple implementation of the analysis that we want to introduce here, the expressions (t · (n/p)) and ((n/p) · t) are not equal, as shown in Figure 3.3.
[Figure 3.3: (a) DAG for the expression t · (n/p); (b) DAG for the expression (n/p) · t.]
Figure 3.3: Both expressions are algebraically equal due to the commutativity of the multiplication of two natural numbers. However, we define these DAGs as not equal since they have a different structure.
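A compact way to realize this purely structural notion of equality, assuming a DAG node simply stores an operator or leaf label together with pointers to its operand nodes, is sketched below; the class Node and its fields are illustrative only and not the implementation of the source transformation tool.

    #include <cstddef>
    #include <string>
    #include <vector>

    // Illustrative DAG node: a leaf holds a variable name or an integer constant,
    // an inner node holds an operator such as "+", "-", "*", or "/".
    struct Node {
      std::string label;                  // e.g. "t", "4", "*", "/"
      std::vector<const Node*> children;  // operand nodes, empty for leaves
    };

    // Structural equality: same label and, recursively, structurally equal
    // children in the same order. Commutativity is NOT taken into account,
    // so t*(n/p) and (n/p)*t compare as different (Figure 3.3).
    bool equal(const Node* a, const Node* b) {
      if (a == b) return true;            // shared subgraph
      if (a->label != b->label) return false;
      if (a->children.size() != b->children.size()) return false;
      for (std::size_t k = 0; k < a->children.size(); ++k)
        if (!equal(a->children[k], b->children[k])) return false;
      return true;
    }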
    da v db :=
        true        if da = db
        true        if da = 0 and db ∈ N0
        true        if da = −∞ ∈ Z ∪ {−∞}
        true        if db = ∞ ∈ Z ∪ {∞}
        Za ≤ Zb     if da = Za, db = Zb, where Za, Zb ∈ Z
        true        if db = da + d, where d = Z ∈ Z, Z ≥ 0                    (3.9)
        true        if da = d1, db = d2, see (3.7), (3.8)
        true        if da = ((((t + 1) ∗ (n/p)) − 1) ∗ n) + (n − 1)
                    and db = ((t + 1) ∗ (n/p)) ∗ n
        false       otherwise
where the need for these particular expressions becomes clearer at the end of this chapter. The relation provides true in case that both DAGs are equal. When da represents the smallest element in the domain, which is −∞, or when db represents the biggest element ∞, then the relation da v db is true. Another special case is that da represents zero and the other DAG represents some value out of the natural numbers including zero. In this case zero is for sure the lower boundary and the relation also provides the value true. If both DAGs represent integer numbers, say Za and Zb, we use the relation Za ≤ Zb to determine the result, where ≤ ⊆ Z × Z is the general ordering of the integer numbers.
It often happens that an integer variable is incremented and this variable is used as an offset for accessing some data. This means for our relation that we have two DAGs from consecutive iterations with a similar structure. The only difference between them is an increment by an integer number, let us say Z. If db is da plus Z and Z has a positive value or the value zero, then da v db is true. The remaining two cases where the result is true are special cases which reflect the often used SPMD data decomposition code patterns. The first special case is
The first special case is the one where the DAGs are defined as shown in (3.7) and (3.8). The corresponding inequality
t · (n/p) ≤ ((t + 1) · (n/p)) − 1                 (3.10)
can be rewritten as
t · (n/p) ≤ t · (n/p) + (n/p) − 1 ,
which leads to
1 ≤ n/p .                                         (3.11)
As a reminder, n represents here the number of data elements that should be processed by the group of threads, and p is the number of threads that process the n elements. Therefore, n and p are positive natural numbers and we obtain that the inequality (3.10) is true for n ≥ p. We assume for our program analysis that the number of data elements is always bigger than or equal to the number of threads. In practice this is almost always the case, but it has to be tested by an actual implementation.
The second special case reflects the following inequality.
((t + 1) · (n/p) − 1) · n + n − 1 ≤ ((t + 1) · (n/p)) · n
(t + 1) · (n/p) · n − n + n − 1 ≤ (t + 1) · (n/p) · n            (3.12)
−1 ≤ 0
Therefore, the inequality is always true, no matter what the actual values of n, p,
and t are. In order to show that (DAG, v) is a partial order, we have to show that the
relation v ⊆ DAG × DAG from (3.9) is reflexive, transitive, and anti-symmetric.
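Before turning to Lemma 50, a minimal C++ sketch of this ordering, restricted to the cases of (3.9) that involve only integer constants and the special values −∞ and ∞, could look as follows; the type name DagValue is hypothetical and the purely symbolic cases (3.7), (3.8) are deliberately omitted since they would require a structural comparison of DAGs.

// Stand-in for a DAG that evaluates to an extended integer; the symbolic
// shapes of (3.9) are not modelled here.
struct DagValue {
  long value;        // finite integer content
  bool is_neg_inf;   // node represents minus infinity
  bool is_pos_inf;   // node represents plus infinity
};

// Sketch of the ordering relation from (3.9) for constant DAGs only.
bool below_or_equal(const DagValue& a, const DagValue& b) {
  if (a.is_neg_inf) return true;                  // minus infinity is the least element
  if (b.is_pos_inf) return true;                  // plus infinity is the greatest element
  if (a.is_pos_inf || b.is_neg_inf) return false;
  return a.value <= b.value;                      // integer case: use <= on Z
}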
Lemma 50. (DAG, v) is a partial order where DAG is the domain of possible
DAGs and the relation v ⊆ DAG × DAG is an ordering relation. This means for
d1 , d2 , d3 ∈ DAG it holds:
1. Reflexivity: d1 v d1
2. Transitivity: d1 v d2 ∧ d2 v d3 =⇒ d1 v d3
3. Anti-Symmetry: d1 v d2 ∧ d2 v d1 =⇒ d1 = d2
Proof. Let d1 , d2 , d3 ∈ DAG be out of the set of possible DAGs.
1. Reflexivity: d1 v d1 is true since this situation is tested by the first case in
(3.9)
2. Transitivity: Let us assume that
d1 v d2 ∧ d2 v d3
holds. Since d1 v d2 is true, we consider the different cases from (3.9) where
the relation returns the value true.
a) d1 = d2 : With d2 v d3 we conclude d1 = d2 v d3 .
b) d1 = −∞: In this case d1 v d3 is true independent of the value in d3 .
c) d2 = ∞: From d1 v d2 and d2 v d3 we conclude that d3 = ∞ and there-
fore d1 v d3 .
d) d1 = Z1 and d2 = Z2 with Z1 , Z2 ∈ Z and Z1 ≤ Z2 : From d2 v d3 we
infer d1 v d3 with help of the relation ≤ ⊆ Z × Z.
e) Suppose d1 v d1 + d is true where d = Z ∈ Z. Therefore, we get d2 =
d1 + d v d3 and with Z ≥ 0 we conclude d1 v d3 .
f) d1 = t · (n/p), d2 = ((t + 1) · (n/p)) − 1: We know from (3.10) that with n ≥ p it holds
d1 = t · (n/p) v ((t + 1) · (n/p)) − 1 = d2 .
With d2 v d3 we conclude d1 v d3 .
g) d1 = ((((t + 1) · (n/p)) − 1) · n) + (n − 1) and d2 = ((t + 1) · (n/p)) · n: We
showed in (3.12) that d1 v d2 is always valid. Therefore, together with
d2 v d3 , we conclude d1 v d3 .
3. Anti-Symmetry:
a) d1 = d2 : In this case nothing is to show.
b) d1 = −∞: From −∞ v d2 and d2 v −∞ we conclude d2 = −∞.
c) d2 = ∞: With d1 v ∞ and ∞ v d1 we achieve d1 = ∞.
d) d1 = Z1 and d2 = Z2 , where Z1 , Z2 ∈ Z: With Z1 ≤ Z2 and Z2 ≤ Z1 we
conclude Z1 = Z2 .
e) d2 = d1 + d where d = Z ∈ Z: Since d1 v d1 + Z v d1 , and Z ≥ 0 the
only possible value for Z is 0. Thus, it holds d = 0 and d1 = d2 .
f) d1 = t · (n/p) and d2 = ((t + 1) · (n/p)) − 1: From (3.11) we know that this leads to the inequalities p ≤ n and p ≥ n. Hence, it holds n = p and therefore d1 = d2 .
g) d1 = (((t + 1) · (n/p)) − 1) · n + (n − 1) and d2 = ((t + 1) · (n/p)) · n :
The greatest lower bound (GLB) of a subset of DAGs is denoted by ⊓ and we define it recursively. This means that we trace back the GLB of a subset of DAGs to the problem of the GLB of a pair of DAGs.
⊓{d1 , d2 , d3 , . . . , dn } = ⊓{ ⊓{d1 , d2 , d3 , . . . , dn−1 } , dn }
                             ...
                             = ⊓{ ⊓{ . . . ⊓{ ⊓{d1 , d2 } , . . . } , dn−1 } , dn }
Suppose that d1 and d2 are out of DAG and suppose that n ≥ p; the GLB ⊓{d1 , d2 } of such a pair of DAGs is then defined case by case in (3.13).
The first case provides d1 as result when both DAGs are equal. In case that one DAG represents −∞, we return this DAG since −∞ ≤ z for all integer numbers z ∈ Z ∪ {−∞}. This holds analogously for the case where one DAG represents ∞. If both DAGs represent integer numbers Z1 and Z2, we use the relation ≤ ⊆ Z × Z to determine the GLB. Afterwards, we handle the two special cases that we already introduced in (3.10) and (3.12). During the fixed point iteration, a comparison of two DAGs is often necessary where the only difference between them is the addition or subtraction of an integer number. Therefore, this case is also covered in (3.13).
The fact that we define the GLB only for a set of two DAGs can be exploited in the definition of the least upper bound (LUB). The LUB of two DAGs is defined by the result of the GLB of these two DAGs.
⊔{d1 , d2 } :=  d1   if ⊓{d1 , d2 } = d2
                d2   if ⊓{d1 , d2 } = d1                          (3.14)
                ∞    otherwise
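To illustrate how (3.14) reuses an existing GLB implementation, the following C++ sketch implements both bounds for constant DAGs only; the type Dag and the helper equal are hypothetical simplifications of the tool's internal representation.

struct Dag {
  long value = 0;
  bool neg_inf = false;
  bool pos_inf = false;
};

bool equal(const Dag& a, const Dag& b) {
  return a.neg_inf == b.neg_inf && a.pos_inf == b.pos_inf
      && (a.neg_inf || a.pos_inf || a.value == b.value);
}

// Pairwise GLB in the spirit of (3.13), constant cases only.
Dag glb(const Dag& a, const Dag& b) {
  if (equal(a, b) || a.neg_inf || b.pos_inf) return a;   // equal, a = -inf, or b = +inf
  if (b.neg_inf || a.pos_inf) return b;                  // b = -inf or a = +inf
  return (a.value <= b.value) ? a : b;                   // integer case
}

// LUB defined through the GLB exactly as in (3.14).
Dag lub(const Dag& a, const Dag& b) {
  if (equal(glb(a, b), b)) return a;
  if (equal(glb(a, b), a)) return b;
  Dag top; top.pos_inf = true; return top;               // fall back to +infinity
}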
We defined the partial order (DAG, v), which is indeed a partial order as Lemma 50 showed. In addition, we defined the least upper and the greatest lower bound of two DAGs. The next section defines the partial order that has intervals of DAGs as domain.
The ordering v on DAG is the one defined in (3.9). The empty interval is denoted by ⊥. Therefore, we initialize all information with ⊥ at the beginning of the dataflow analysis. The relation v of two intervals is defined by the equivalence
[d1 , d2 ] v [d3 , d4 ]  :⇔  d3 v d1 ∧ d2 v d4 .
The greatest interval in DAGIntervals is [−∞, ∞] where −∞ and ∞ are DAGs with
only one node with the corresponding content. [−∞, ∞] covers all integer numbers.
The expressions represented by DAG can all be evaluated to have a value out of
Z ∪ {−∞, ∞}. This means that [−∞, ∞] ∈ DAGIntervals contains all DAGs that
can be evaluated to have an integer number as a result. The interval [−∞, ∞] ∈
DAGIntervals is called the top element and is denoted by the symbol ⊤.
Proof. Let [d1 , d2 ], [d3 , d4 ], [d5 , d6 ] be out of the domain DAGIntervals with d1 , d2 ,
d3 , d4 , d5 , d6 ∈ DAG.
• Transitivity:
• Anti-Symmetry:
The next step is to show that the partial order (DAGIntervals, v) is a com-
plete lattice. We claim that each subset of DAGIntervals has a least upper bound.
With this claim, we can conclude that (DAGIntervals, v) is a complete lattice [59,
app. A]. Suppose that J is a subset of DAGIntervals. We define the least upper
bound of J in the following way
⊔ J :=  ⊥           if J = ∅ or J = {⊥}
        [D1 , D2 ]   otherwise
where D1 and D2 are defined with help of the GLB and LUB of DAG.
D1 := ⊓ {d1 | [d1 , d2 ] ∈ J}                                     (3.16)
D2 := ⊔ {d2 | [d1 , d2 ] ∈ J} .                                   (3.17)
D1 is the greatest lower bound of the set of all left-sided boundaries of intervals in J. D2 is the least upper bound of all right-sided boundaries of intervals in J. The definitions (3.16) and (3.17) ensure that all intervals in J are contained in ⊔J.
The fact that ⊔J is indeed the least upper bound of J can be understood by assuming that we have another upper bound [l1 , l2 ] with
[l1 , l2 ] v ⊔J = [D1 , D2 ] .
This means
D1 v l1 ∧ l2 v D2 .
Since [l1 , l2 ] is an upper bound of J, every interval [d1 , d2 ] ∈ J satisfies
[d1 , d2 ] v [l1 , l2 ] ,
which is equivalent to
l1 v d1 ∧ d2 v l2 .
Combining these relations yields
D1 v l1 v d1
for the left-sided boundaries. We reason analogously for the right-sided bound-
aries, which brings us to
d2 v l2 v D2
because D2 is defined as LUB in (3.17). The fact that D1 and D2 are boundaries
that indeed occur in intervals [d1 , d2 ] ∈ J leaves the only conclusion that
l1 = D1 ∧ l2 = D2 .
Therefore, ⊔J is indeed the least upper bound of J and, since J was chosen arbitrarily, we conclude that each J ⊆ DAGIntervals has a least upper bound. This shows that
(DAGIntervals, v) is a complete lattice.
With the partial order (DAGIntervals, v) that we defined in this section, we have
the core requirement to define a dataflow analysis with intervals of DAGs. The next
section shows how we propagate the analysis information along the control flow.
code statement. δ is a mapping that provides the connection between integer vari-
ables of the parallel region and its corresponding value range interval. For eval-
uating expressions we have to define how intervals of DAGs should be combined
when the corresponding variables are part of an expression. For example, a mul-
tiplication of two expressions leads necessarily to an operation that combines the
analysis information of these two expressions. With these definitions, we can de-
fine the evaluation function valδ (e) that evaluates the expression e depending on
the analysis information δ .
After the definition of the transfer functions φl we will have to ensure that
our static analysis reaches a fixed point. This problem is solved by introducing
a widening operator in Section 3.5.1. We will see that this method tends to supply
an overestimate, which motivates the definition of a narrowing operator. Because
of the need of very precise interval information, Section 3.5.2 presents a further
improvement of our program analysis by taking the test expressions of loops and
conditional branches into account.
Our analysis is a forward problem which means that the dataflow is analyzed
from the beginning to the end of the control flow. Therefore, the analysis starts
at init(P) and ends at an element of the set final(P). The analysis information at
the entry is initialized such that all integer variables that have a shared status are
assumed to have a certain well defined value. For example, the size of the data
that the threads are going to process can be contained in the integer variable n.
The value of n must be set outside the parallel region and all the threads need this
variable to calculate the size of the chunk that they have to process. Therefore,
variable n is shared among the group of threads and the interval for this variable is
defined to [n, n] at the entry of the parallel region.
The possible value range of an integer variable a is represented by intervals
out of DAGIntervals. The association of a variable a ∈ INTP with its value range
is done by the mapping δ : INTP → DAGIntervals. The complete lattice in our
program analysis is therefore (D, v) where the domain D is defined in (3.18). We define the operations ⊕, ⊖, ⊗, and ⊘ for combining intervals, where the corresponding operations on the variables are +, −, ∗, and /. The DAGs d1 , d2 , d3 ,
and d4 represent the boundaries of the two intervals that are combined by such an
operation. The first operation is the addition of two intervals which is defined as
[d1 , d2 ] ⊕ [d3 , d4 ] :=  ⊥                      if [d1 , d2 ] = ⊥ or [d3 , d4 ] = ⊥
                            [d1 + d3 , d2 + d4 ]   otherwise
The multiplication ⊗ and the division operation ⊘ are defined with the help of the GLB and the LUB.
[d1 , d2 ] ⊗ [d3 , d4 ] :=  ⊥            if [d1 , d2 ] = ⊥ or [d3 , d4 ] = ⊥
                            [D5 , D6 ]   otherwise
where
D5 = ⊓{d1 ∗ d3 , d1 ∗ d4 , d2 ∗ d3 , d2 ∗ d4 } ,
and
D6 = ⊔{d1 ∗ d3 , d1 ∗ d4 , d2 ∗ d3 , d2 ∗ d4 } .
Please note that, for better readability, we use the more common set notation rather than the pairwise notation that we use in the definition of the GLB and LUB in (3.13) and (3.14). The remaining definition is the one for an expression where a division is involved.
[d1 , d2 ] ⊘ [d3 , d4 ] :=  ⊥            if [d1 , d2 ] = ⊥ or [d3 , d4 ] = ⊥
                            [D5 , D6 ]   otherwise
where
D5 = ⊓{d1 /d3 , d1 /d4 , d2 /d3 , d2 /d4 } ,
and
D6 = ⊔{d1 /d3 , d1 /d4 , d2 /d3 , d2 /d4 } .
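For the purely numeric case, the operations ⊕ and ⊗ can be pictured as in the following C++ sketch; the names DagInterval and MaybeInterval are hypothetical, and the DAG-valued endpoints of the actual analysis are replaced by doubles so that the GLB and LUB reduce to min and max.

#include <algorithm>
#include <optional>

struct DagInterval { double lo, hi; };                 // numeric stand-in for DAG endpoints
using MaybeInterval = std::optional<DagInterval>;      // empty optional plays the role of the empty interval

// Interval addition following the definition of the ⊕ operation.
MaybeInterval add(const MaybeInterval& a, const MaybeInterval& b) {
  if (!a || !b) return std::nullopt;                   // the empty interval absorbs
  return DagInterval{a->lo + b->lo, a->hi + b->hi};
}

// Interval multiplication following the definition of ⊗: GLB/LUB over the
// four endpoint products, here simply min/max of plain numbers.
MaybeInterval mul(const MaybeInterval& a, const MaybeInterval& b) {
  if (!a || !b) return std::nullopt;
  double p[4] = {a->lo * b->lo, a->lo * b->hi, a->hi * b->lo, a->hi * b->hi};
  return DagInterval{*std::min_element(p, p + 4), *std::max_element(p, p + 4)};
}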
With the above definitions we can define a mapping val that takes variables,
constants, or expressions as input and provides the corresponding interval from
and
[NUM_OF_THREADS, NUM_OF_THREADS]
value propagation into the stack data structure and back, we return the top element
⊤. This means that the variable can possess all values from [−∞, ∞].
valδ ( STACK(i|c) .top() ) := ⊤
Eventually, we have the ingredients at hand to define the transfer functions
{φl | l ∈ LABEL(P)}
where l is a label associated uniquely with a certain statement in P. The transfer
function φl takes the dataflow information δ that is valid before the evaluation of
the statement [s]l labeled with l and provides the dataflow information as output
that is valid after the evaluation of [s]l .
            δ                     if [s]l = STACK(i|c| f ) .push(a)
            δ                     if [s]l = STACK(i|c| f ) .pop()
            δ                     if [s]l = y ← φ (x1 , . . . , xn )
            δ                     if [s]l = y +← φ (x1 , . . . , xn )
            δ [ a → ⊥ ]           if [s]l = unsigned int a
φl ( δ ) := δ [ a → [N, N] ]      if [s]l = unsigned int a ← N              (3.19)
            δ [ a → ⊥ ]           if [s]l = int a
            δ [ a → [Z, Z] ]      if [s]l = int a ← Z
            δ [ a → valδ (e) ]    if [s]l = a ← e
            δ                     if [s]l = if (b) { S }
            δ                     if [s]l = while (b) { S }
As stated in (3.19), the dataflow information δ is only altered for declarations and assignments of integers; in all other cases δ remains the same as it was before the statement. If the control flow encounters an integer declaration (int a), then the value of the variable a is undefined, which is represented by the ⊥ element. In case that the declaration of a is combined with an initialization with the value Z (int a ← Z), we change the information of a to the interval [Z, Z]. An integer assignment a ← e influences δ in such a way that it resets the information for a to the interval that is given by valδ (e).
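The case distinction of (3.19) might be implemented roughly as in the following sketch; the statement representation (StmtKind, Stmt) and the map type Delta are hypothetical, and the right-hand side of an assignment is assumed to be already evaluated to an interval by valδ.

#include <map>
#include <optional>
#include <string>

struct Interval { long lo, hi; };                                 // stand-in for a DAG interval
using Delta = std::map<std::string, std::optional<Interval>>;     // variable -> interval, nullopt = bottom

enum class StmtKind { IntDecl, IntDeclInit, IntAssign, Other };

struct Stmt {
  StmtKind kind;
  std::string lhs;        // declared or assigned integer variable
  long init = 0;          // initial value for IntDeclInit
  Interval rhs{0, 0};     // result of val_delta(e) for IntAssign
};

// Transfer function following (3.19): only integer declarations and
// assignments change the dataflow information, everything else is the identity.
Delta transfer(Delta delta, const Stmt& s) {
  switch (s.kind) {
    case StmtKind::IntDecl:     delta[s.lhs] = std::nullopt;             break;
    case StmtKind::IntDeclInit: delta[s.lhs] = Interval{s.init, s.init}; break;
    case StmtKind::IntAssign:   delta[s.lhs] = s.rhs;                    break;
    case StmtKind::Other:       /* identity */                           break;
  }
  return delta;
}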
To convince the reader that our static program analysis reaches a fixed point, we
present the theorem of Tarski and Knaster [45, 76]. This includes the introduction
of further terminology. During the fixed point iteration the approximation of the
intervals could possibly take the following chain of intervals:
∅ v [2, 2] v [1, 3] v [0, 4] v [0, 5] v [0, 6] v [0, 7] v . . .            (3.20)
A subset S of the partially ordered set DAGIntervals is called a chain [59, app. A] if for all intervals [l1 , u1 ], [l2 , u2 ] ∈ S it holds that
[l1 , u1 ] v [l2 , u2 ]  or  [l2 , u2 ] v [l1 , u1 ] .
The partial order satisfies the ascending chain condition (ACC) if every ascending chain of intervals eventually becomes stationary. Unfortunately, the complete lattice property and the ACC property are unrelated. Therefore, we have to ensure that (DAGIntervals, v) satisfies the ACC.
In order to avoid a fixed point iteration that alternates between increasing and
decreasing the approximation intervals, we must ensure that the transfer functions
are monotonic. This means that
[l1 , u1 ] v [l2 , u2 ]  =⇒  φ ([l1 , u1 ]) v φ ([l2 , u2 ])
for all [l1 , u1 ], [l2 , u2 ] ∈ (DAGIntervals, v). The definition of φ in (3.19) is the
identity mapping, the constant values ⊥ or [N, N], or the interval is determined by
valδ (e). In case that valδ (e) uses ⊕ and ⊖, the monotonicity is given because of the usage of the linear operations + and − to obtain the resulting interval. On the other hand, the operations ⊗ and ⊘ use the monotonicity of ⊓ and ⊔. The join of two intervals is defined by
⊥ ⊔ d := d ⊔ ⊥ := d ,    d ∈ DAGIntervals
[d1 , d2 ] ⊔ [d3 , d4 ] := [d5 , d6 ]
with d1 , d2 , d3 , d4 , d5 , d6 ∈ DAG and
d5 := ⊓{ d1 , d3 }
d6 := ⊔{ d2 , d4 } .
To guarantee that our interval analysis finds a solution, we have to ensure that the
requirements of the following theorem are fulfilled.
is the least fixed point of φ where φ (d) = d and φ k+1 (d) := φ ◦ φ k (d)
Proof. See [59, app. A.4]
At this point we could implement the interval analysis but we would quickly
determine that there are codes where the static program analysis would not termi-
nate. The analysis iterates over and over again and increases the value range of
the corresponding intervals until the internal representation reaches its limits. An
example chain for this behavior is shown in (3.20). A counting loop is a typical
example code where a fixed point search at this point would fail since the interval
that represents the counting variable would be increased in every iteration of the
fixed point search. The missing requirement of Theorem 52 is the ACC. Hence,
the next section introduces a so called widening operator.
with
⊥ ∇ d := d ∇ ⊥ := d ,    d ∈ DAGIntervals
[d1 , d2 ] ∇ [d3 , d4 ] := [d5 , d6 ]
with d1 , d2 , d3 , d4 , d5 , d6 ∈ DAG and
d5 :=  d1   if d1 v d3
       −∞   otherwise                                              (3.21)
d6 :=  d2   if d4 v d2
       ∞    otherwise
With this widening operator, the ascending chain from (3.20) is transformed into
∅ v [2, 2] v [−∞, ∞] v [−∞, ∞] v . . .
because
d0∇ = ∅
d1∇ = ∅ ∇ [2, 2] = [2, 2]
d2∇ = [2, 2] ∇ [1, 3] = [−∞, ∞]
d3∇ = [−∞, ∞] ∇ [0, 4] = [−∞, ∞] .
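With numeric endpoints, the widening operator (3.21) can be sketched in C++ as follows; the encoding of ±∞ via floating-point infinities is an assumption made only for this illustration. Applied to the chain above, widening [2, 2] with [1, 3] already yields [−∞, ∞], mirroring the computation of d2∇.

#include <limits>
#include <optional>

struct Interval { double lo, hi; };
using MaybeInterval = std::optional<Interval>;   // nullopt plays the role of the empty interval

// Widening as in (3.21): keep a boundary that did not move, otherwise
// jump to the corresponding infinity.
MaybeInterval widen(const MaybeInterval& a, const MaybeInterval& b) {
  if (!a) return b;                              // empty widened with d gives d
  if (!b) return a;                              // d widened with empty gives d
  const double inf = std::numeric_limits<double>::infinity();
  double lo = (a->lo <= b->lo) ? a->lo : -inf;   // d1 if d1 below d3, else minus infinity
  double hi = (b->hi <= a->hi) ? a->hi : inf;    // d2 if d4 below d2, else plus infinity
  return Interval{lo, hi};
}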
Example 25. Listing 3.3 shows the implementation of a data decomposition that
we often use in our example codes. A data set of size n is divided into chunks of
size c where c is n divided by the number of threads p. We assume for simplicity that the data size is divisible by the number of threads. Each chunk consists of a certain index range [lb, ub] and each thread is uniquely associated with certain chunks. The lower bound of this interval is lb, the upper bound is ub.
The thread identifier that is returned by the function omp_get_thread_num()
is put into the variable tid. The code that precedes the loop in line 11 is a typical
pattern in a parallel region that uses the SPMD model. This code is often implicitly
given as, for example, in the case for the OpenMP parallel for directive. Regardless
Algorithm 3 Calculation of the possible value ranges of the integer variables oc-
curring in parallel region P. The fixed point is calculated by using the widening
operator from (3.21).
Require: D (see (3.18)), LABEL(P), init(P) and flow(P)
Ensure: fix∇(φL ), where fix∇(φL ) w fix(φL ) and L ⊆ LABEL(P)
  for all l ∈ L do
    if l ∈ init(P) then
      AIl ← {δ [v → [v, v]] | v ∈ SHAREDP ∩ INTP }
    else
      AIl ← ⊥D
    end if
  end for
  cont ← true
  while cont do
    cont ← false
    for all l ∈ L do
      AInew ← ⊥
      for all (l′ , l) ∈ flow(P) do
        AInew ← AInew ⊔ φl′ (AIl′ )
      end for
      if AIl v AInew and AIl ≠ AInew then
        AIl ← AIl ∇ AInew
        cont ← true
      end if
    end for
  end while
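The loop structure of Algorithm 3 can be pictured with the following self-contained C++ sketch; join, widen, and the trivial transfer function are simplified numeric stand-ins, and the per-variable mapping δ is reduced to a single interval per label.

#include <algorithm>
#include <limits>
#include <map>
#include <optional>
#include <utility>
#include <vector>

struct Interval { double lo, hi; };
using State = std::optional<Interval>;           // nullopt = bottom

static const double INF = std::numeric_limits<double>::infinity();

State join(const State& a, const State& b) {     // interval hull, bottom is neutral
  if (!a) return b;
  if (!b) return a;
  return Interval{std::min(a->lo, b->lo), std::max(a->hi, b->hi)};
}
State widen(const State& a, const State& b) {    // widening in the spirit of (3.21)
  if (!a) return b;
  if (!b) return a;
  return Interval{a->lo <= b->lo ? a->lo : -INF, b->hi <= a->hi ? a->hi : INF};
}
bool below(const State& a, const State& b) {     // interval inclusion
  if (!a) return true;
  if (!b) return false;
  return b->lo <= a->lo && a->hi <= b->hi;
}
State transfer(int /*label*/, const State& in) { // placeholder for a real (3.19) dispatch
  return in;
}

// Driver loop in the spirit of Algorithm 3.
std::map<int, State> fixpoint(const std::vector<int>& labels,
                              const std::vector<std::pair<int, int>>& flow) {
  std::map<int, State> AI;
  for (int l : labels) AI[l] = std::nullopt;     // initialize with bottom
  bool cont = true;
  while (cont) {
    cont = false;
    for (int l : labels) {
      State in = std::nullopt;
      for (auto edge : flow)                     // combine all predecessors of l
        if (edge.second == l) in = join(in, transfer(edge.first, AI[edge.first]));
      if (!below(in, AI[l])) {                   // new information arrived
        AI[l] = widen(AI[l], in);
        cont = true;
      }
    }
  }
  return AI;
}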
Let us consider the dataflow information of the integer assignment in line 14. After
one iteration, we yield the following dataflow information:
n   : [ n, n ]
tid : [ THREAD_ID, THREAD_ID ]
p   : [ NUM_OF_THREADS, NUM_OF_THREADS ]
c   : [ n/NUM_OF_THREADS, n/NUM_OF_THREADS ]
lb  : [ THREAD_ID*(n/NUM_OF_THREADS), THREAD_ID*(n/NUM_OF_THREADS) ]
ub  : [ ((THREAD_ID+1)*(n/NUM_OF_THREADS))-1, ((THREAD_ID+1)*(n/NUM_OF_THREADS))-1 ]
i   : [ THREAD_ID*(n/NUM_OF_THREADS), THREAD_ID*(n/NUM_OF_THREADS) ]
The shared variable n is assumed to be defined from the code outside the par-
allel region and is therefore given through the interval [n, n]. The special value
THREAD_ID is used by the program analysis to represent the unique thread iden-
tifier number that is supplied by the runtime function omp_get_thread_num().
The number of threads is coded through the specific value NUM_OF_THREADS
and is at runtime provided by omp_get_num_threads(). At this point, variable i
has the same value as variable lb since the incrementation statement in line 14 has
no effect yet. The effect of the incrementation in line 14 is visible from the second
iteration forward. Let us assume that only variable i is changed inside the loop.
Hence, the only interval that is changed through the next fixed point iterations is
the one for variable i. In case that we use the common join operation for intervals
rather than using the widening operator, we get
i: [ THREAD_ID*(n/NUM_OF_THREADS), THREAD_ID*(n/NUM_OF_THREADS)+1 ]
as result for variable i after the second iteration. The third iteration yields
i: [ THREAD_ID*(n/NUM_OF_THREADS), THREAD_ID*(n/NUM_OF_THREADS)+2 ]
and so forth. Since the fixed point iteration does not consider the condition of the
loop test, the iteration would never terminate2 .
In the following, we consider what happens if we use the widening operator
from (3.21). After applying the transfer function φ , we also get
i: [ THREAD_ID*(n/NUM_OF_THREADS), THREAD_ID*(n/NUM_OF_THREADS)+1 ]
as result when the computation path that contains the incrementation statement comes into play for the first time. But we do not join the results; instead we apply the widening operator to
i: [ THREAD_ID*(n/NUM_OF_THREADS), THREAD_ID*(n/NUM_OF_THREADS) ]
and
i: [ THREAD_ID*(n/NUM_OF_THREADS), THREAD_ID*(n/NUM_OF_THREADS)+1 ] ,
which, according to (3.21), yields
i: [ THREAD_ID*(n/NUM_OF_THREADS), PLUS_INFINITY ]
2 Or the implementation of the analysis would crash or provide unpredictable results due to an overflow of the internally used value range.
starting from fix∇(φL ). The widening operator fulfills the property that
Since the transfer function φL is monotonic, the sequence (φLk )k=1,2,... is also monotonic. The sequence ( φLk (fix∇(φL )) )k=1,2,... is a descending chain and it holds
Example 26. (Narrowing) Let us consider what the impact of narrowing is. In
order to illustrate this, we apply Algorithm 3 and Algorithm 4 to Listing 3.4.
1  j = lb;
2  while (i <= ub) {
3    // Thread processes data at position i.
4    // ...
5    j = ub;
6    // ...
7    i = i + 1;
8  }
We show a portion of code without the whole parallel region because it contains
only slight changes compared to Listing 3.3. The only difference is that we use
Algorithm 4 After a fixed point fix∇(φL ) has been found by Algorithm 3, we try to improve this approximation by using narrowing. The maximal number of narrowing steps is here defined to be 10.
Require: fix∇(φL ) from Algorithm 3 where L ⊆ LABEL(P)
Ensure: φLk (fix(φL )) with φLk (fix∇(φL )) w φLk (fix(φL )) w fix(φL )
  cont ← true
  counter ← 1
  while cont and counter ≤ 10 do
    counter ← counter + 1
    cont ← false
    for all l ∈ L do
      for all (l′ , l) ∈ flow(P) do
        if AIl ≠ φl′ (AIl′ ) then
          AIl ← AIl ⊔ φl′ (AIl′ )
          cont ← true
        end if
      end for
    end for
  end while
here another integer variable j that is the left-hand side of two assignments in line
1 and line 5. To get a better context between the results and the code, we display
the control flow graph with the internal information of our compiler in Figure 3.4.
In the following, we show firstly the results that the widening algorithm provides
(Algorithm 3) where we only consider variable j because its value range is more
precise after narrowing.
Node 83 (SPL_cond_while):
j: [THREAD_ID*(n/NUM_OF_THREADS), PLUS_INFINITY]
Node 76 (SPL_int_assign):
j: [THREAD_ID*(n/NUM_OF_THREADS), PLUS_INFINITY]
Node 1 (CFG_EXIT):
j: [THREAD_ID*(n/NUM_OF_THREADS), PLUS_INFINITY]
These results serve as input for the narrowing algorithm (Algorithm 4) which pro-
vides the following improved result:
Node 83 (SPL_cond_while):
[Control flow graph extract: node 0 (line 0) CFG_ENTRY; node 70 (line 15) j=lb; node 83 (line 22) SPL_cond_while, test(i<=ub); node 76 (line 19) j=ub; node 82 (line 21) i=i+1; node 1 (line 0) CFG_EXIT.]
Figure 3.4: Extract from the control flow graph of the code shown in Listing 3.4.
j: [THREAD_ID*(n/NUM_OF_THREADS),((THREAD_ID+1)*
(n/NUM_OF_THREADS))-1]
Node 76 (SPL_int_assign):
j: [THREAD_ID*(n/NUM_OF_THREADS),((THREAD_ID+1)*
(n/NUM_OF_THREADS))-1]
Node 1 (CFG_EXIT):
j: [THREAD_ID*(n/NUM_OF_THREADS),((THREAD_ID+1)*
(n/NUM_OF_THREADS))-1]
The upper bound of the value range of j can be reduced from plus infinity because our relation ⊔ from (3.14) can decide the maximum of both possible values lb and ub.
We use here the relation a @ b without definition since it only serves as an abbreviation for a v b ∧ a ≠ b. As this definition shows, we only adjust the upper boundary of the interval. This means, for example, that in case we have a loop that decrements a counting variable, the resulting value range would still have minus infinity as its lower boundary. We only want to illustrate the idea of this analysis and leave further improvements, such as the adjustment of the lower boundary, to the people who implement this.
Definition (3.22) consists of four different cases. The first and the second case are very similar; the only difference is the relation in the test expression. The relation in the first case is a less-than-or-equal relation (≤) and in the second case it is a less-than relation (<). The same holds for cases three and four, which is the reason why we only explain cases one and three.
The first case adjusts the information δ for variable a1 to [d1 , d4 ] in case that
the test expression is a1 ≤ a2 , the current value ranges of a1 is [d1 , d2 ], the value
[Control flow graph extract: node 227 (line 39) SPL_cond_while, test(i<=ub), with outgoing edges labeled i<=ub and i>ub; node 92 (line 26) y=0.; node 101 (line 28) k=n-5; node 221 (line 37) SPL_cond_while, test(j<=k), with outgoing edges labeled j<=k and j>k; node 211 (line 34) y=A[i*n+j+4]*(x[i]*x[i]*x[i]*x[i])+...+A[i*n+j]; node 215 (line 35) local_sum+=y; node 220 (line 36) j=j+1; node 226 (line 38) i=i+1; node 232 (line 40) thread_result[tid]+=local_sum.]
Figure 3.5: Extract from the control flow graph of the code shown in Listing 3.1. The edge from node 227 to node 92 is labeled with i ≤ ub which means that the control flow only takes this path if this condition is true. If the condition is not true then the control flow takes the path from node 227 to 232 (i>ub).
range for a2 is [d3 , d4 ], and the relation d4 @ d2 is valid. This means that the
upper boundary of the value range of variable a2 can be carried over to the value
range of variable a1 . The requirement that d1 v d4 holds ensures that the adjusted interval is well formed, that is, that its lower boundary is lower than or equal to its upper boundary. The third case is similar and covers the situation where variable a1 has infinity as upper boundary and the variable a2 has the upper boundary d4 that is different from infinity. In this case we can be sure that the upper boundary of the value range of a1 can be changed to d4 .
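Restricted to numeric bounds, the adjustment performed by cases one and three of (3.22) might be sketched as follows; Interval and refine_leq are hypothetical names used only for this illustration.

#include <optional>

struct Interval { double lo, hi; };   // hi may be +infinity

// On the branch where the test "a1 <= a2" is true, the upper bound of a1
// can be clipped to the upper bound of a2, provided the adjusted interval
// remains well formed (lower bound not above the new upper bound).
std::optional<Interval> refine_leq(const Interval& a1, const Interval& a2) {
  if (a2.hi < a1.hi && a1.lo <= a2.hi)
    return Interval{a1.lo, a2.hi};    // adjusted interval [d1, d4]
  return std::nullopt;                // no refinement possible
}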
We saw in Example 26 that we gain better analysis information by applying nar-
rowing. However, this does not hold for Listing 3.1 and the next example presents
the results that we obtain when we use the adjusted transfer function φ(l,l 0 ) from
(3.22).
Example 27. (Exclusive-read analysis of Listing 3.1) For the sake of clarity, we
only present the results that differ from the ones that we obtained by using widen-
ing and narrowing. As we can recognize in the output below, the upper boundaries of the intervals are no longer infinity. For example, let us consider node 211. We only list the two variables i and j because the intervals of these two variables can be improved by applying (3.22). The upper boundary of variable j is no longer infinity but rather n-5. This better approximation can be made because we know that the condition j≤k is valid at node 211 and, since k has the upper bound n-5, we refine the upper bound of j to n-5.
Node 92 (SPL_float_assign):
i: [THREAD_ID*(n/NUM_OF_THREADS),((THREAD_ID+1)*
(n/NUM_OF_THREADS))-1]
list of assertions:
1. ’i<=ub’
2. ’j>k’
Node 96 (SPL_int_assign):
i: [THREAD_ID*(n/NUM_OF_THREADS),((THREAD_ID+1)*
(n/NUM_OF_THREADS))-1]
list of assertions:
1. ’i<=ub’
2. ’j>k’
i: [THREAD_ID*(n/NUM_OF_THREADS),((THREAD_ID+1)*
(n/NUM_OF_THREADS))-1]
list of assertions:
1. ’i<=ub’
For all the intervals above holds that they only depends at most on three different
input values:
• the size of the data (here given by the shared variable n),
• the unique thread identifier THREAD_ID,
• and the special value NUM_OF_THREADS that represents the number of
threads that execute the current parallel region.
This shows that we may obtain the actual value range intervals by taking the conditional test expressions into account.
This statement is internally identified by our implementation with node 211. The
result for node 211 that we achieved in Example 27 is:
Let us consider whether or not the memory access through x[i] is exclusive. In order to compare the corresponding intervals for the threads t and t + 1, we replace THREAD_ID by t and t + 1. The internal name NUM_OF_THREADS is replaced by p, which represents the number of threads. The total number of data
elements processed by the group of threads is here n. Variable i used in thread t
has a value range of
[ t · (n/p) , ((t + 1) · (n/p)) − 1 ] .
The thread with identifier t + 1 uses the values
[ (t + 1) · (n/p) , ((t + 2) · (n/p)) − 1 ]
for the variable i. This holds for all threads with the identification numbers between
0 and p − 1. In order to ensure that thread t does not use offset values which are
also used by thread t + 1, we compare the upper bound of the value range of thread
t with the lower bound of thread t + 1. Formally,
((t + 1) · (n/p)) − 1  @  (t + 1) · (n/p)
must be valid. This is obviously true since the two expressions only differ by the subtraction of one. However, as is often the case in the program analysis domain, this is not easy for a compiler tool to decide. Our implementation uses DAGs as internal representation and therefore the implementation of the program analysis must be able to compare these certain kinds of DAGs. For this reason, we defined in (3.13) certain shapes of DAGs that can be compared. These shapes are implemented in our tool and are sophisticated enough to decide the exclusive read property for the examples that we present in this work. For a productive implementation this set of possible shapes would probably have to be extended.
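For concrete values of t, n, and p, the disjointness of the chunks used by thread t and thread t + 1 can of course be checked numerically; the small program below only illustrates the inequality ((t + 1) · n/p) − 1 < (t + 1) · n/p under the assumptions that p divides n and n ≥ p, it is not part of the static analysis itself.

#include <cassert>

// Index range of thread t ends before the range of thread t+1 begins.
bool chunks_disjoint(long t, long n, long p) {
  long ub_t  = (t + 1) * (n / p) - 1;   // upper bound used by thread t
  long lb_t1 = (t + 1) * (n / p);       // lower bound used by thread t+1
  return ub_t < lb_t1;
}

int main() {
  assert(chunks_disjoint(0, 1000, 4));
  assert(chunks_disjoint(2, 1000, 4));
  return 0;
}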
The case we discussed above is the easiest shape that occurs, namely an array
reference [i] where only one variable is used. In case that variable j is used as an
offset, the value range is [0, n − 5]. This value range is the same for all threads
since it does not depend on the unique thread identification number THREAD_ID.
This fact alone is sufficient to imply that a memory reference x[j] cannot fulfill the
exclusive read property.
Another shape from the above example is [i*n+j], which is typically used when a two-dimensional array is referenced. As shown in the example, we use this expression together with an addition of a constant value to reference multiple elements during one iteration. In fact, during one iteration of the inner loop of our example code, the elements referenced by A[i*n+j], A[i*n+j+1], . . ., A[i*n+j+4] are read. The two-dimensional array is A, which consists of n rows and n columns.
To decide whether or not the memory location referenced by A[i*n+j+3] is
read exclusively, we have to take three different intervals into account. We focus
on the valid intervals in node 211 for the variables i, j. The value range of the
shared variable n is [n, n]. With these intervals at hand, we can infer that thread t encounters the following value range with respect to the expression i*n+j+3:
[ t · (n/p) · n + 0 + 3 , ((t + 1) · (n/p) − 1) · n + (n − 5) + 3 ]
Analogously, thread t + 1 uses the following interval to reference the memory location A[i*n+j+3]:
[ (t + 1) · (n/p) · n + 0 + 3 , ((t + 2) · (n/p) − 1) · n + (n − 5) + 3 ]
As before in the simple reference case, we compare the upper bound of thread t with the lower bound of thread t + 1 to ensure that these intervals do not overlap. Formally, the relation
((t + 1) · (n/p) − 1) · n + (n − 5) + 3  @  (t + 1) · (n/p) · n + 0 + 3          (3.23)
must be valid to ensure that thread t is not reading any components that are also accessed by thread t + 1. We rewrite this in order to use the implementation of the greatest lower bound:
⊓ { ((t + 1) · (n/p) − 1) · n + (n − 5) + 3 , (t + 1) · (n/p) · n + 0 + 3 }      (3.24)
Since our tool is only capable of comparing certain kinds of DAGs, we have to rewrite the given DAGs in (3.24) to achieve a shape that is known to our definition of the greatest lower bound. For this reason, we implemented the distributivity rule, which rewrites the first DAG in (3.24) into
(t + 1) · (n/p) · n − n + n + (−5 + 3) .
Subsequently, a normalization procedure looks for adjacent nodes in the DAGs that
can be simplified by simple algebraic rules. For example, the expression (−n + n)
is recognized to be zero and is erased from the DAG. Furthermore, the expression
where two constants are adjacent is simplified to one constant when the connecting
operation is a subtraction or an addition. In our example the expression (−5 +
3) is replaced by the constant −2, and the expression (0 + 3) is replaced by 3.
Eventually, the reformulation leads to
⊓ { (t + 1) · (n/p) · n − 2 , (t + 1) · (n/p) · n + 3 }
where we obtain two DAGs that share an identical subtree and differ only by a constant term. This can be decided by the implementation of the definition of the greatest lower bound in (3.13), and the program analysis can conclude that the memory reference through A[i*n+j+3] is exclusive for each thread.
The illustrated method from above to decide the exclusive read property is de-
scribed in Algorithm 5. It provides a set of labels ExclusiveRead where each
contained label l is associated with a memory reference that fulfills the exclusive
read property. As a starting point, the algorithm takes the result φLk (fix(φL )) that
we obtain by applying Algorithm 4 and the adjustments for taking the conditional
branches into account.
The approach in Algorithm 5 is as follows. Each floating-point assignment [s]l
in P is examined. All the right-hand side references x occurring in [s]l are tested.
There are two distinct shapes of memory references possible. The first kind reads a
memory location addressed by a base address b plus an offset value o. The second
kind is that a memory location is associated with a scalar variable x.
The fairly easy case to decide is the one where we consider a scalar variable. In
this case the exclusive read property depends on the sharing status of x according
to the OpenMP memory model3 . Only when x is a private variable, the associated
memory location is read exclusively by a thread.
3 For information see Section 1.3 on page 48
Algorithm 5 This algorithm provides a set of labels where the associated state-
ments are verified to contain only read accesses that are exclusive for all threads.
Require: φLk (fix(φL )) from Algorithm 4 where L ⊆ LABEL(P)
Ensure: ExclusiveRead ⊆ LABEL(P).
ExclusiveRead ← ∅
for all [s]l ∈ P do
  if [s]l = y ← φ (x1 , . . . , xn ) or [s]l = y +← φ (x1 , . . . , xn ) then
for all &x ∈ RHSREF [s]l do
if &x = b + o then
[lt , ut ] ← valδ (o)
if THREAD_ID ∈ lt ∧ THREAD_ID ∈ ut then
if b ∈ SHAREDP then
[lt+1 , ut+1 ] ← valδ (o)[t/(t + 1)]
if ut @ lt+1 then
ExclusiveRead ← ExclusiveRead ∪ {&x}
else
ExclusiveRead ← ExclusiveRead \ {&v | &v = b + o0 }
end if
else
ExclusiveRead ← ExclusiveRead ∪ {&x}
end if
end if
else
if x ∈ PRIVATEP then
ExclusiveRead ← ExclusiveRead ∪ {&x}
end if
end if
end for
end if
end for
On the other hand, the memory location &x may be addressed by a base address b plus an offset o. The first check is whether or not the base address b is shared. If b is private, the memory access is exclusive for each thread.
In case that b is shared, it must be verified that offset o has a different value
range for each thread. For this important case we need the results from the interval
Obviously, in order to enable the thread to exclusively use a certain value range,
the thread identifier THREAD_ID must be contained in the lower boundary lt as
well as in the upper boundary ut . The value range of offset o with respect to thread
t + 1 is
[lt+1 , ut+1 ] = valδ (o)[t/(t + 1)] ,
where valδ (o)[t/(t + 1)] means the interval provided by valδ (o) in which each occurrence of t is replaced by the expression (t + 1), which is indicated by [t/(t + 1)]. To ensure that the intervals of thread t and thread t + 1 do not overlap, the following relation must be true:
ut @ lt+1
Please note that the relation @ again serves as an abbreviation for
ut v lt+1 ∧ ut ≠ lt+1 .
In case that the memory reference &x is read exclusively, it is inserted into the set ExclusiveRead:
ExclusiveRead ← ExclusiveRead ∪ {&x}
This can be seen as the typical gen step in a dataflow analysis. In contrast to this, the kill step must be performed if a memory reference is not exclusively read:
ExclusiveRead ← ExclusiveRead \ {&v | &v = b + o′}
For example, suppose that the analysis recognizes x[i] as exclusively read. Later in
the dataflow analysis process, it is recognized that the memory reference x[j] is not
exclusively read. This means that j potentially has values that are also used as offsets in another thread and, even worse, j may potentially have the same value as i. Therefore, whenever a memory reference is exposed as not exclusively read, we have to remove all memory references from ExclusiveRead that use the same base address b.
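The gen and kill steps can be kept in a small set keyed by the base address; the following sketch uses hypothetical string identifiers instead of the tool's internal reference representation.

#include <set>
#include <string>
#include <utility>

// A memory reference b + o, identified here by base name and offset text.
using Reference = std::pair<std::string, std::string>;

struct ExclusiveReadSet {
  std::set<Reference> refs;

  // gen step: the reference was proven to be read exclusively.
  void gen(const Reference& r) { refs.insert(r); }

  // kill step: the reference is not exclusive, so every recorded reference
  // with the same base address has to be removed as well.
  void kill(const Reference& r) {
    for (auto it = refs.begin(); it != refs.end(); )
      it = (it->first == r.first) ? refs.erase(it) : ++it;
  }
};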
Read access in node 211 (line 26) A[((i*n)+j)+4]: exclusive read: yes
Read access in node 211 (line 26) x[i] : exclusive read: yes
Read access in node 211 (line 26) x[i] : exclusive read: yes
Read access in node 211 (line 26) x[i] : exclusive read: yes
Read access in node 211 (line 26) x[i] : exclusive read: yes
Read access in node 211 (line 26) A[((i*n)+j)+3]: exclusive read: yes
Read access in node 211 (line 26) x[i] : exclusive read: yes
Read access in node 211 (line 26) x[i] : exclusive read: yes
Read access in node 211 (line 26) x[i] : exclusive read: yes
Read access in node 211 (line 26) A[((i*n)+j)+2]: exclusive read: yes
Read access in node 211 (line 26) x[i] : exclusive read: yes
Read access in node 211 (line 26) x[i] : exclusive read: yes
Read access in node 211 (line 26) A[((i*n)+j)+1]: exclusive read: yes
Read access in node 211 (line 26) x[i] : exclusive read: yes
Read access in node 211 (line 26) A[(i*n)+j] : exclusive read: yes
Read access in node 211 (line 28) y : exclusive read: yes
Read access in node 211 (line 33) local_sum : exclusive read: yes
All the memory references fulfill the exclusive read property which means that the
associated adjoint assignments do not need synchronization. In Section 5.3 we will
compare the runtime results for the resulting adjoint code with synchronization and the one without synchronization.
Example 29. (Kill effect of the ERA) This example illustrates the situation where
we have to remove the results from set ExclusiveRead when it turns out that a
certain memory reference is not exclusively read. For example, take the code from
Listing 3.1 and change one offset of an arbitrary x[i] reference to x[j]. This leads
to the following output from our implementation.
Read access in node 211 (line 26) A[((i*n)+j)+4]: exclusive read: yes
Read access in node 211 (line 26) x[i] : exclusive read: no
Read access in node 211 (line 26) x[i] : exclusive read: no
Read access in node 211 (line 26) x[i] : exclusive read: no
Read access in node 211 (line 26) x[i] : exclusive read: no
Read access in node 211 (line 26) A[((i*n)+j)+3]: exclusive read: yes
Read access in node 211 (line 26) x[i] : exclusive read: no
Read access in node 211 (line 26) x[i] : exclusive read: no
Read access in node 211 (line 26) x[i] : exclusive read: no
Read access in node 211 (line 26) A[((i*n)+j)+2]: exclusive read: yes
Read access in node 211 (line 26) x[i] : exclusive read: no
Read access in node 211 (line 26) x[i] : exclusive read: no
Read access in node 211 (line 26) A[((i*n)+j)+1]: exclusive read: yes
Read access in node 211 (line 26) x[i] : exclusive read: no
Read access in node 211 (line 26) A[(i*n)+j] : exclusive read: yes
Read access in node 211 (line 28) y : exclusive read: yes
Read access in node 211 (line 33) local_sum : exclusive read: yes
First, the memory references x[i] are recognized as exclusive, but x[j] is recognized as not exclusively read since the offset j has the value range [0, n − 5] and this value range holds for all threads. Therefore, we have to assume that the offsets i and j potentially have the same values and x[i] is therefore not exclusive anymore. Hence, we adjust the results such that all memory references where x is the base pointer are removed from the set ExclusiveRead.
This concludes this chapter where we have seen that it is possible to examine at
compile time whether or not memory references are accessed exclusively at run-
time by one thread. With this knowledge, the source transformation tool can decide
which adjoint assignment needs synchronization and which one can be executed
freely.
3.7 Summary
The basic idea of the exclusive read analysis (ERA) is that the very general code
pattern of the data decomposition in a parallel region can be analyzed in such a way
that the result is a fairly precise approximation of possible value ranges of integer
variables. These value ranges are crucial when the task is to decide if a certain
array reference is exclusively read by only one thread. For example, if there is a
floating-point reference x[i], then we have to examine what value range the offset
variable i has in order to decide whether or not x[i] is read exclusively or if it is
possible that more than one thread accesses this memory location. In this chapter,
we considered a static analysis of expressions that are typical for referencing one
or two dimensional arrays.
The typical code pattern of a data decomposition for parallel computations detects the number of threads p, divides the number of data elements n by p, and thus determines the size of the chunk that each thread should process. To achieve this, the lower and upper bound are set depending on the thread identifier number t.
thread accesses the data. In order to decide if a certain memory reference is exclu-
sively read or not, the interval analysis must cover the value propagation from the
statements where the values t and p are determined until the assignments where
certain memory references are used. To simplify the static program analysis, we
assumed that
• the number of threads p divides the number of data elements n, and
• n is bigger or equal to p.
Despite these assumptions, we had to invest some effort to achieve good results. One can imagine that the cases where the above assumptions do not hold require even more effort to obtain practical results. But this is beyond the scope of this work. Our intent was only to introduce a method that serves as a proof of concept.
In the case that the above assumptions do not hold or in case that the array
references inside an original code are more complex than the one we covered here,
it is conceivable that a mixture between a static program analysis and dynamic
information can be used. With help of the dynamic information obtained at runtime
and the value range intervals that we achieved through a static analysis, an interval
analysis can be performed efficiently. However, in this chapter we introduced a
pure static analysis that tries to achieve as much information as possible during
compile time. The hybrid approach of a static analysis together with dynamic
information is a possible part of further research.
We presented an integer interval analysis from [59] that we took as a starting
point. Since this analysis often provides not very precise intervals, we extended
this approach by using DAGs as components of the intervals. The advantage is
that at a certain point in the code, the DAGs can reflect the whole expression of
computation steps which were performed to reach this point. This allows a very
precise analysis.
To convince the reader that this approach can be taken as basis for a static pro-
gram analysis, we had to go through some order-theoretic foundations. We have
shown that the set of possible DAGs and the relation v: DAG × DAG form a partial
order. Afterwards, we clarified that intervals of DAGs (DAGIntervals) also form a
partial order together with the relation
v: DAGIntervals × DAGIntervals.
The actual ERA takes the results from the interval analysis as input and decides
for each right-hand side reference &v of a floating-point assignment in the parallel
region P whether &v is exclusively read or not. The concluding examples illus-
trated that the ERA in fact can provide fruitful information that can be exploited by
the AD source transformation tool. Only these adjoint assignments that really need
synchronization are augmented, for example, with an OpenMP atomic construct.
One could think of ways to strengthen the comparison skills of the program analysis by using methods typically lying in the domain of computer algebra systems. A work that focuses on solving inequalities can be found in [38]. A general way of comparing inequalities would enable the ERA to compare arbitrary expressions that are used as an offset to reference an array. Our approach was only to support the comparison of certain shapes of expressions. It is not part of this work to cover all the possible cases of occurring expressions. However, it would be worth thinking about using an interface to an open-source computer algebra system such as Axiom4 or Maxima5 to examine the possibilities of retrieving information about inequalities. It should be possible to gain useful information since the inequalities we encounter in the context of the data decomposition are not complex.
Another domain worth exploring for connections that can help us with the ERA is that of theorem provers. In recent years there have been more and more projects in which a so-called satisfiability modulo theories (SMT) solver6 is implemented. Two examples are the MathSAT57 and the Z38 solvers. One example application that builds on the Z3 solver is the Verifier for Concurrent C (VCC9). The VCC uses assertions in the preamble of C functions to prove certain properties of the code [22]. This approach of using assert statements at certain points in the parallel region P may improve the ERA results significantly, similar to the approach where we took the conditional branches into account.
4 http://www.axiom-developer.org/
5 http://maxima.sourceforge.net/
6 Overview of SMT solvers: http://smtlib.cs.uiowa.edu/solvers.html
7 http://mathsat.fbk.eu/
8 http://z3.codeplex.com/
9 http://research.microsoft.com/en-us/projects/vcc/
4 Source Transformation of OpenMP
Constructs
In Chapter 2, we used the context-free grammar G SPL = (V, Σ, R, P), defined in
Definition 40, for SPL codes. In this chapter, we extend this grammar in order to
recognize OpenMP constructs inside of the parallel region P. The non-terminal
symbol that is used in the production rules for these constructs is c. The set of
terminal symbols Σ and the start symbol P remain the same as they were in G SPL .
The set of production rules ROMP is at the moment equal to R but will be extended by additional rules in the coming sections. The new context-free grammar is therefore defined as
of portability. The downside of this approach is that the developer has to define a certain size for the static arrays and, in case this size is not sufficient to hold all the values, the execution of the adjoint code ends up with a stack overflow. Thus, a balance between a fast static solution and a flexible dynamic solution must be found. We present an example for both approaches and leave the decision about which method fits best to the software engineer who implements the derivative code.
Suppose you want to use a class STACK that is somehow defined as indi-
cated in Listing 4.1. As pointed out in the definition of the SPL language, we
need three different stacks, two value stacks for integer and floating-point values
(STACK(1)i , and STACK(1) f ), and one stack to store the labels of the control flow
(STACK(1)c ). To declare a globally defined object as thread local, OpenMP provides the threadprivate directive. This means that the definition is placed in the global scope but each thread uses its own stack. This is necessary when the original parallel region uses calls to subroutines and the stacks therefore have to record values from multiple levels of scope. Thus, the different scopes all have to write to the same stack, which is possible by defining the stack in the global scope.
class STACK { ... /* This is left to the user of AD */ };
STACK STACK(1)c ;
STACK STACK(1)i ;
STACK STACK(1)f ;
#pragma omp threadprivate( STACK(1)c , STACK(1)i , STACK(1)f )
Listing 4.1: The global section provides the definition of a C++ class, here named STACK. Since these stacks have to be defined as thread local, they are listed in a threadprivate directive.
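Since the class body in Listing 4.1 is left to the user, a minimal sketch of what such a stack could look like is given below; the template parameter and the initial capacity are arbitrary choices, and one instance per value type (control-flow labels, integers, floating-point values) would be declared and marked threadprivate as in Listing 4.1.

#include <vector>

// Minimal stack usable for the value and control-flow stacks of the adjoint code.
template <typename T>
class STACK {
  std::vector<T> data;
public:
  STACK() { data.reserve(1024); }              // arbitrary initial capacity
  void push(const T& v) { data.push_back(v); }
  void pop()            { data.pop_back(); }
  T    top() const      { return data.back(); }
  bool empty() const    { return data.empty(); }
};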
In case that the parallel region P does not contain subroutine calls, it would probably be wiser to place the definition of the stacks STACK(i| f |c) into the scope of the parallel region. This has the advantage that the memory for the data is placed on the stack that is provided by the operating system (OS). Hence, the data has a certain locality, which can make a huge difference when the target machine has a non-uniform memory access (NUMA) architecture as opposed to a symmetric multiprocessing (SMP) architecture. In case of an SMP machine, all processor cores are connected to a single shared main memory, whereas the NUMA architecture splits the main memory into parts such that each processor core has a local memory and the latency of an access to this local memory is much lower than the latency that is necessary to access the memory of other processors. For example, thread t is associated by the OS with core C0 . Thus, we are well advised to place the stacks for the adjoint code into the memory that is
Example 30. In the following listing there are two consecutive worksharing loops.
The first assigns values to the vectors x and y, the second loop uses these vectors
to compute a value for the matrix A. We have to define a barrier before the second
loop to ensure that each thread has finished the first loop, otherwise the compu-
tation in the second loop may use values that have not been defined by the first
loop.
#pragma omp parallel
{
  int i;
  int j;
  #pragma omp for nowait
  for (i ← 0; i < N; i++) {
    xi ← f(i, x);
    yi ← g(i, y);
  }
  #pragma omp barrier
  #pragma omp for
  for (i ← 0; i < N; i++)
    for (j ← 0; j < N; j++)
      Aij ← h(i, j, x, y);
}
The nowait clause in the first loop construct suppresses the implicit barrier. With-
out this clause, we would not need to define the explicit barrier because there is an implicit barrier by default at the end of each worksharing construct.
from Si is executed before a statement from S1 are not possible. This is formally
expressed by
∀ I ∈ I (P, q, p) :  s^t_k ≺ s^{t'}_l ,                                  (4.1)
where k ∈ {1, . . . , i − 1}, l ∈ {i, . . . , q}, t ≠ t'. For example, suppose that there is a critical reference &v with
&v ∈ LHSREF( s^t_k )  ∧  &v ∈ REF( s^{t'}_l ) .
The execution likely leads to different results because it depends on the order of s^t_k and s^{t'}_l in the interleaving that represents the concurrent execution. We stated earlier that this is called a race condition because the thread that writes last determines the value in &v. This non-deterministic behavior is avoided by restricting the set of possible interleavings I (P, q, p) to those interleavings where s^t_k ≺ s^{t'}_l holds. This can be achieved by defining a barrier between sk and sl . In other words, the developer who puts a barrier between sk and sl defines that only the order ( s^t_k , s^{t'}_l ) may occur in an interleaving, not the other way around.
We extend the set of production rules ROMP by the rule for a barrier construct
where the barrier construct is classified to be a straight-line code statement.
ROMP = ROMP ∪ { c : #pragma omp barrier }
The fact that the barrier construct is defined as straight-line code is a question
of the abstraction level. On our level of abstraction the barrier construct can be
seen as a straight-line code statement because it does not change the control flow
of a certain thread. The barrier only leads to the situation that the thread stops its
execution until all threads have reached the barrier. Then the thread continues with
the statement that follows the barrier. The control flow is not changed. On a lower
level, the implementation of the barrier construct probably changes the control
flow because the method for waiting is usually implemented through a loop.
The tangent-linear as well as the adjoint transformation rules map the barrier construct by the identity mapping:
σ ( #pragma omp barrier ) := #pragma omp barrier                         (4.3)
ρ ( #pragma omp barrier ) := #pragma omp barrier                         (4.4)
In general, it is not clear whether a barrier inside P is encountered by all threads or by none at all. Without any assumption about the original code, this is undecidable.1
assume that this property holds for the parallel region P ∈ SPLOMP 1 that serves
as input for our source transformation. However, we have to show that the source
transformation of P does not invalidate this property.
Proposition 53. Suppose that an OpenMP parallel region P contains a barrier
that is encountered by all threads or none at all. Then this also holds for the
barrier contained in τ (P) and σ (P).
Proof. The tangent-linear code τ (P) as well as the forward section σ (S0 ) do not
change the control flow in comparison to P. Hence, the property that each thread
encounters a given barrier in P is inherited to τ (P) and σ (S0 ).
The reverse section ρ (S0 ) reverses the control flow of P. We show that this
reversal does not disallow any thread from reaching the barrier that is emitted by
ρ. Assume that there is a barrier construct inside of a sequence S contained in
P. As in Chapter 2, we distinguish between the cases that S is straight-line code
(SLC) or not. If a control flow statement (CFSTMT) is contained in S, we split the
sequence until we achieve just SLC sequences or control flow statements.
1. Case S ∈ SLC:
Assume that the barrier in P succeeds statement si ∈ S = (s1 ; . . . ; sq ). We
apply (2.39) to S and afterwards we apply (4.3) to the barrier. This yields
STACK(1)c .push( 1 );
σ (s1 );
...
σ (si );                                                                 (4.5)
#pragma omp barrier
...
σ (sq ); ,
to S by rule (2.49) and then to the barrier by applying rule (4.4), we obtain
the code shown in (4.6). This shows that either each thread encounters the
barrier contained in (4.6) or no thread at all, depending on the content of
stack STACK(1)c .
Please note the special cases where the barrier is the first or the last statement
in sequence S. In case that the barrier is the first statement in σ (S), then
the barrier is the last statement in ρ (S). This means that each thread first
must execute the adjoint code of S before encountering the barrier, which is
exactly what one would expect when reversing the dataflow. In case that the
barrier precedes σ (s1 ), the barrier in the reverse section follows after ρ (s1 ).
if STACK(1)c .top() = 1 {
  STACK(1)c .pop();
  ρ(sq );
  ...
  #pragma omp barrier                                                    (4.6)
  ρ(si );
  ...
  ρ(s1 );
}
2. Case S ∉ SLC:
Split S into S1 , si , Si+1 with
The previous case holds if the barrier is contained in S1 or Si+1 . Thus, let us
assume that the barrier succeeds si . By applying rule (2.38) to S and (4.3)
to the barrier, we get
σ (s1 ; . . . ; si−1 ) ;
σ ( si ); (4.7)
σ (#pragma omp barrier; si+1 ; . . . ; sq ) ;
and the reverse section comprises after applying rule (2.48) and (4.4):
ρ (s1 ; . . . ; si−1 ) ;
ρ( si ); (4.8)
ρ (#pragma omp barrier; si+1 ; . . . ; sq ) ;
Since (#pragma omp barrier; si+1 ; . . . ; sq ) ∈ SLC holds, we can apply the
previous case to this sequence.
The following proposition clarifies that a barrier is in fact necessary in the result
of the source transformation. Since a barrier synchronization in parallel program-
ming is always a very expensive construct in terms of runtime efficiency, it should
be ensured that no barrier is produced where none is needed.
where t ≠ t' .
Proof. In case that there is no data dependence between statements from sequence S1 to statement si ∈ Si , we could move the barrier to the point in front of statement si+1 . If this again would not lead to a data dependence from S1 to Si , we could move the barrier further, until we find a data dependence or until we reach the end of the sequence. Hence, let us assume that there is a data dependence between statements from sequence S1 to statement si ∈ Si . Without loss of generality we
assume that the statement from S1 is s1 . Otherwise, we only need to change the
statement index. The statements are assumed to be assignments, and s1 is executed
with &y^t_1 = &x^{t'}_{i,k}, t ≠ t', k ∈ {1, . . . , n}. This means that a memory location is used by two different threads t and t': thread t uses this location as a left-hand side reference and t' uses it as a right-hand side reference. We have the situation that thread t computes a value that is read by thread t'. Let us consider the derivative code of s1 and si. Thread t processes the code obtained by applying σ(s1) and ρ(s1).
Forward section:
  STACK(1)f.push( y^t_1 );
  y^t_1 ← φ_1( x^t_{1,1} , . . . , x^t_{1,n} );                                  (4.13)
  ...
Reverse section:
  ...
  y^t_1 ← STACK(1)f.top();
  STACK(1)f.pop();
  x^t_{(1)1,1} +← φ_{x_{1,1}}( x^t_{1,1} , . . . , x^t_{1,n} ) · y^t_{(1)1};     (4.14)
  ...
  x^t_{(1)1,n} +← φ_{x_{1,n}}( x^t_{1,1} , . . . , x^t_{1,n} ) · y^t_{(1)1};
  y^t_{(1)1} ← 0;                                                                (4.15)
Another thread t' processes the code obtained by applying σ(si) and ρ(si).
Forward section:
  STACK(1)f.push( y^{t'}_i );
  y^{t'}_i ← φ( x^{t'}_{i,1} , . . . , x^{t'}_{i,n} );                                 (4.16)
  ...
Reverse section:
  ...
  y^{t'}_i ← STACK(1)f.top();
  STACK(1)f.pop();
  x^{t'}_{(1)i,1} +← φ_{x_{i,1}}( x^{t'}_{i,1} , . . . , x^{t'}_{i,n} ) · y^{t'}_{(1)i};   (4.17)
  ...
  x^{t'}_{(1)i,k} +← φ_{x_{i,k}}( x^{t'}_{i,1} , . . . , x^{t'}_{i,n} ) · y^{t'}_{(1)i};   (4.18)
  ...
  x^{t'}_{(1)i,n} +← φ_{x_{i,n}}( x^{t'}_{i,1} , . . . , x^{t'}_{i,n} ) · y^{t'}_{(1)i};   (4.19)
  y^{t'}_{(1)i} ← 0;                                                                       (4.20)
Similarly to the tangent-linear case, the equality &y_1^t = &x_{i,k}^{t′} holds. This
implies that we need a barrier in front of (4.16). Otherwise, thread t′ may
read the reference &x_{i,k}^{t′} in (4.16) before thread t writes this reference in (4.13).
The computation of the partial derivatives in the code lines from (4.17) to
(4.19) may also read a wrong value if we omit this barrier.
During the reverse section we have another data dependence to take into account,
namely the one induced by the implication
    &y_1^t = &x_{i,k}^{t′}  =⇒ (Lem. 42)  &y_{(1)1}^t = &x_{(1)i,k}^{t′} .
Hence, we have a data dependence from (4.18) to all the assignments from
(4.14) to (4.15). To resolve this race condition we need a barrier construct
following ρ(si).
The barrier in P must be placed before si, which leads to a barrier after ρ(si)
in ρ(S). We showed that a data dependence in the original code P leads to
multiple data dependencies in the adjoint code of P. Therefore, we cannot
omit the barrier in the reverse section.
The above shows that in the case of a true dependence between S1 and Si, we cannot
omit the barrier construct. A true dependence [66, p. 98] between the statements
s1 and si means that s1 writes a memory reference that is concurrently read by
statement si. An output dependence means that two assignments write to the same
reference. In case of
    &y_1^t = &y_i^{t′}  =⇒ (Lem. 42)  &y_{(1)1}^t = &y_{(1)i}^{t′} ,
we obtain a race condition between (4.20) and (4.15) in the reverse section.
Summarizing, we showed that a barrier construct in P that does not have its
counterpart in the source transformation result leads to a race condition during the
derivative code execution.
S1 = (s1 ; s2 ; . . . ; si−1 ) ,
Si = (si ; si+1 ; . . . ; s j−1 ) , and
S j = (s j ; s j+1 ; . . . ; sq ).
The statements from sequence Si are only executed by the master thread, which
implies that the statements from Si always carry the thread identifier zero in all
possible interleavings:
Since the master construct does not define an implicit barrier, there is no order
restriction inside of I (P, q, p). The production rule for the master construct is
ROMP = ROMP ∪ { c : #pragma omp master { S } }
where the master construct counts as a control flow statement and is therefore an
element of CFSTMT. Given a master construct statement as input, the forward
section obtained with rule (2.39) has the following form:
#pragma omp master
{
    STACK(1)c.push(l);
    ...
}
We indicate above that label l is the label of the first statement in S. This label
is pushed onto the stack STACK(1)c that belongs to the master thread. All the
other threads do not execute this push statement. Hence, only the master thread
encounters this element during the reverse section, which means that we do not
need a master construct in the reverse section to restrict the execution of
ρ(Si) to the master thread.
ρ( #pragma omp master { S } ) := ρ(S)                                      (4.23)
where the new element is contained in CFSTMT. Although the critical
construct is associated with a subsequence of statements, we consider only the
assignment
    s : y ← y · x
as contained in this subsequence. This statement is sufficient to explain the
approach of our source transformation.
[Figure 4.1: DAGs of the computation inside the critical region for two threads:
(a) interleaving (s^0; s^1), (b) interleaving (s^1; s^0).]
The order in which the threads enter the critical region in P is arbitrary. Hence,
the computation that takes place inside the critical region must be independent of
the order in which the threads enter the region. When we assume
that we use two threads to execute the statement s, we obtain the two possible
interleavings (s0 ; s1 ) and (s1 ; s0 ). We display the computation of both interleavings
by a DAG which is shown in Figure 4.1. No matter which interleaving we consider,
the result in the topmost node is the same since the multiplication is commutative.
y^{0,1} · x^0 · x^1 = y^{0,1} · x^1 · x^0
We indicate the affinity of each variable by a superscript index: x^0 belongs to
thread zero, x^1 to thread one, and y^{0,1} belongs to all threads, which means that
this variable is shared. The result obtained by the computation shown in Figure
4.1 is y^{0,1} · x^0 · x^1. The three derivative values of this result with respect to the
input values are as follows:
∂(y^{0,1} · x^0 · x^1) / ∂y^{0,1} = x^0 · x^1                              (4.24)
∂(y^{0,1} · x^0 · x^1) / ∂x^0 = x^1 · y^{0,1}                              (4.25)
∂(y^{0,1} · x^0 · x^1) / ∂x^1 = y^{0,1} · x^0                              (4.26)
No matter whether we use the tangent-linear or the adjoint model, these derivative
results should be computed. To examine this, we consider the possible interleav-
ings of the derivative codes concerning the example from above. The tangent-
linear model of s is obtained by applying the rule (2.17):
τ(s) = { s1 : y^(1) ← x · y^(1) + y · x^(1);
         s2 : y ← y · x; }
The statement s1 computes the tangent-linear component y(1) and s2 computes the
value for y. If we execute τ (s) with two threads then we obtain according to
Lemma 36 six interleavings. These interleavings are shown in the first column of
Table 4.1. The second column shows the expression that is supplied in the variable
y(1) after the execution of the corresponding interleaving from the first column.
In case that all three derivative values are computed correctly, the third column
contains a ’yes’.
Let us illustrate this by taking the first two lines in Table 4.1 as an example. The
interleaving (s01 ; s02 ; s11 ; s12 ) leads to the expression
x1 · (x0 · y(1) + y · x(1)0 ) + (y · x0 ) · x(1)1
that defines the value for the variable y(1) after the tangent-linear code has been
executed. To verify that the derivative values are correct, we initialize the tangent-
linear vector (y^(1){0,1}, x^(1)0, x^(1)1)^T with (1, 0, 0)^T, (0, 1, 0)^T, and (0, 0, 1)^T in turn.
The results are the correct derivative values shown in (4.24), (4.25), and (4.26)
depending on which Cartesian basis vector was used. Analogously, we obtain the
value y for the derivative with respect to x1 when we use the expression
x1 · (x0 · y(1) + y · x(1)0 ) + y · x(1)1
which represents the value that is computed by the interleaving (s01 ; s11 ; s02 ; s12 ). This
is wrong because the correct value is shown in (4.26). The verification of the
remaining interleavings reveals that only the first and the last interleaving provide
all three derivative values correctly.
The interleavings (s_1^0; s_2^0; s_1^1; s_2^1) and (s_1^1; s_2^1; s_1^0; s_2^0) have in common that the
statements s1 and s2 are executed right after one another by the same thread. This shows
that we have to place the tangent-linear model of s inside one critical region. Hence, we
define the tangent-linear transformation of a critical region by
τ( #pragma omp critical { S } ) := #pragma omp critical { τ(S) }           (4.27)
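As a concrete illustration of rule (4.27), the following C sketch applies the tangent-linear transformation by hand to a critical region containing the assignment y ← y · x. The suffix _t for the tangent-linear components and the fixed thread count are illustrative assumptions, not part of the rule.

#include <stdio.h>
#include <omp.h>

double y = 2.0, y_t = 0.0;   /* shared value and its tangent-linear component */

int main(void) {
    #pragma omp parallel num_threads(2)
    {
        double x   = 1.0 + omp_get_thread_num();  /* thread-local input      */
        double x_t = 1.0;                         /* its tangent-linear seed */

        /* both statements of tau(s) stay inside one critical region */
        #pragma omp critical
        {
            y_t = x * y_t + y * x_t;  /* s1 of tau(s) */
            y   = y * x;              /* s2 of tau(s) */
        }
    }
    printf("y = %f, y_t = %f\n", y, y_t);  /* identical for both entry orders */
    return 0;
}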
The adjoint source transformation requires more effort and is not as straightfor-
ward as the tangent-linear transformation is. As we will explain below, we need to
trace the order in which the threads enter the critical region to ensure the
correct derivative computation. This is implemented by using a counter value that
is put onto the stack STACK(1)c, which is a departure from the common usage of this
stack.
For the explanation, we assume the content of the critical region to be again the
statement s. The forward section of s is obtained by applying (2.40):
σ(s) = { s1 : STACK(1)f.push(y);
         s2 : y ← y · x; }
Table 4.2 displays the different interleavings of the forward section and their impact
on the floating-point stacks STACKf^0 and STACKf^1, which are defined as
thread local. The interleavings where statement s1 is executed by both threads
before the first execution of statement s2 takes place, result in the same stack con-
tent y in both threads. This is wrong because this leads to a wrong value recovery
during the reverse section. Therefore, only the first and the last interleaving are
correct in the sense that they provide different values in the stacks. Similar to the
tangent-linear case, this shows that we have to ensure that σ (s) is executed as
a compound. This is achieved by putting the forward section of s into a critical
region.
ρ(s) = { s1 : y ← STACK(1)f.top(); STACK(1)f.pop();
         s2 : x(1) +← y · y(1);                                            (4.28)
         s3 : y(1) ← x · y(1); }
The code for the reverse section of s is emitted by (2.50) and actually contains
four statements. However, we only consider three statements, as shown in (4.28), to
keep the number of combinations for the interleavings down. We combine the
first two statements, namely the recovery of the value y and the pop statement. This
is possible because the pop statement only has a thread-local effect. Please note
that the statement s3 is not an incremental assignment but a plain assignment. This
is not what the source transformation emits, but it is semantically equivalent as
long as the transformation tool uses temporary variables for intermediate values, as
shown in Section 1.2. The reason for this difference in the adjoint assignment is
the fact that the variable y occurs on both sides of the original assignment s.
To examine the adjoint statements of s, let us assume that the forward section has
already been executed. The interleaving that reflects the execution of the forward
section is s01 ; s02 ; s11 ; s12 (first line in Table 4.2). This means that first thread zero
executes σ (s) and then thread one. Figure 4.1a illustrates this computation. This
results in the following values for the floating-point stacks and for variable y{0,1} :
With this program state in mind we can examine the effect of the different inter-
leavings of the adjoint statements (4.28). According to Lemma 36 we have
(3 · 2)! / (3!)^2 = 20
different interleavings for the three statements, executed by two threads. To make
the result more readable, we only present the result for the different derivative val-
ues and not the expressions that result from the different interleavings. Comparing
the values shown in the third, fourth, and fifth column of Table 4.3 with the correct
No.  Interleaving                          x(1)^0        x(1)^1        y(1)^{0,1}   Correct
1 s01 ; s02 ; s03 ; s11 ; s12 ; s13 ; y y · x0 · x0 x0 · x1 false
2 s01 ; s02 ; s11 ; s03 ; s12 ; s13 ; y y · x0 · x0 x0 · x1 false
3 s01 ; s11 ; s02 ; s03 ; s12 ; s13 ; y · x0 y · x0 · x0 x0 · x1 false
4 s11 ; s01 ; s02 ; s03 ; s12 ; s13 ; y y · x0 x0 · x1 false
5 s01 ; s02 ; s11 ; s12 ; s03 ; s13 ; y y · x0 x0 · x1 false
6 s01 ; s11 ; s02 ; s12 ; s03 ; s13 ; y · x0 y · x0 x0 · x1 false
7 s11 ; s01 ; s02 ; s12 ; s03 ; s13 ; y y x0 · x1 false
8 s01 ; s11 ; s12 ; s02 ; s03 ; s13 ; y · x0 y · x0 x0 · x1 false
9 s11 ; s01 ; s12 ; s02 ; s03 ; s13 ; y y x0 · x1 false
10 s11 ; s12 ; s01 ; s02 ; s03 ; s13 ; y y · x0 x0 · x1 false
11 s01 ; s02 ; s11 ; s12 ; s13 ; s03 ; y y · x0 x0 · x1 false
12 s01 ; s11 ; s02 ; s12 ; s13 ; s03 ; y · x0 y · x0 x0 · x1 false
13 s11 ; s01 ; s02 ; s12 ; s13 ; s03 ; y y x0 · x1 false
14 s01 ; s11 ; s12 ; s02 ; s13 ; s03 ; y · x0 y · x0 x0 · x1 false
15 s11 ; s01 ; s12 ; s02 ; s13 ; s03 ; y y x0 · x1 false
16 s11 ; s12 ; s01 ; s02 ; s13 ; s03 ; y y · x0 x0 · x1 false
17 s01 ; s11 ; s12 ; s13 ; s02 ; s03 ; y · x0 · x1 y · x0 x0 · x1 false
18 s11 ; s01 ; s12 ; s13 ; s02 ; s03 ; y · x1 y x0 · x1 false
19 s11 ; s12 ; s01 ; s13 ; s02 ; s03 ; y · x1 y · x0 x0 · x1 true
20 s11 ; s12 ; s13 ; s01 ; s02 ; s03 ; y · x1 y · x0 x0 · x1 true
Table 4.3: The three statements from (4.28) can be interleaved in 20 different ways when
executed by two threads. The resulting values of the adjoint variables for each interleaving
are shown. The last column indicates whether or not the computed values are correct.
derivative values shown in (4.25), (4.26), and (4.24), we recognize that only the
last two interleavings provide the correct result.
The correct interleaving in line 20 reveals two things. Firstly, the three statements
s1, s2, and s3 have to be executed as a compound, without any context switch
between threads: in line 20 the groups s_1^1; s_2^1; s_3^1 and s_1^0; s_2^0; s_3^0 each consist of
statements with the same thread ID. Secondly, the thread scheduling order is important,
because otherwise the interleaving in the first line of Table 4.3 would have provided the
correct result as well.
The remaining question is why the interleaving in line 19 also provides the correct
result. The only difference between the interleavings in line 19 and line 20 is that
the statements s_1^0 and s_3^1 have switched their positions. The statement s1 restores
the value of y from the floating-point stack and s3 computes the adjoint value y(1).
Therefore, these two statements do not have any data dependence and can be
executed simultaneously.
In our example we obtained the correct derivative results if the forward section
was executed first by thread zero and then by thread one. In the reverse section it is the
other way around: the first thread must be thread one, the second thread must be thread
zero. This means that the order in which the threads enter the critical region in the
forward section determines the order in which the reverse section has to execute the
adjoint statements. In the following we explain how this is implemented through
the adjoint source transformation.
We keep this implementation as simple as possible within the means that
the SPLOMP 1 language provides. For that reason, we use
a counter variable called al, where the values of al are of the same basic type as
the labels from LABEL(P). The index l associates this counter variable with
the critical region in P that has the label l; this prevents name clashes in the
derivative code. The values of al are put onto the control flow stack STACK(1)c,
despite the fact that usually its content consists only of labels.
σ( l : #pragma omp critical { S } ) :=
    STACK(1)c.push(l)
    #pragma omp critical
    {
        σ(S)                                                               (4.29)
        STACK(1)c.push(al)
        al ← al + 1
    }
    STACK(1)c.push(l)
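A minimal C sketch of the forward-section instrumentation defined in (4.29) follows. The per-thread control flow stack is modeled by a plain array with a push helper, σ(S) is represented by a single label push, and the label value L as well as the thread count are illustrative assumptions.

#include <stdio.h>
#include <omp.h>

#define NT 2
#define L  10                    /* label of the critical region; illustrative */

int stack_c[NT][16];             /* per-thread control flow stack STACK(1)c    */
int top[NT];                     /* stack pointers                             */
int a_l = L + 1;                 /* shared counter a_l, initialized to l + 1   */

static void push(int t, int v) { stack_c[t][top[t]++] = v; }

int main(void) {
    #pragma omp parallel num_threads(NT)
    {
        int t = omp_get_thread_num();
        push(t, L);                    /* mark: critical region entered        */
        #pragma omp critical
        {
            push(t, L - 1);            /* stand-in for the labels pushed by sigma(S) */
            push(t, a_l);              /* record this thread's entry order     */
            a_l = a_l + 1;
        }
        push(t, L);                    /* mark: critical region left           */
    }
    for (int t = 0; t < NT; ++t) {     /* print the stacks, cf. Table 4.4      */
        printf("thread %d: ", t);
        for (int i = 0; i < top[t]; ++i) printf("%d ", stack_c[t][i]);
        printf("\n");
    }
    return 0;
}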
Obviously, we have to ensure that the value range of al does not interfere with the
label values on the control flow stack. Otherwise, a misinterpretation of the values
leads to a wrong reversal of the control flow.
To explain transformation (4.29), we consider a critical region inside of P that
is labeled with l. The subsequence S associated with this critical region is labeled
with l − 1. According to our implementation, this fits the practice because we use
a bottom-up parser generated by the tool bison2. In the case of a top-down parser,
one has to pay attention to the initialization value of al. Please note that the choice of the
counter initialization depends on the values of the labels that can arise during the
2 http://www.gnu.org/software/bison/
execution of (4.29). It must be ensured that all the labels occurring in (4.29) are
different from the values occurring as counter values during runtime.
We choose l + 1 as the initial value for the counter al because l is the label of the
critical region and, in our case, the associated subsequence S contains only labels
lower than l. This means that the branch statements emitted by σ(S) in (4.29)
all contain a test expression that checks for a label value lower than l. Therefore,
these label values do not interfere with the values of al.
The first thread that enters the critical region that is shown in (4.29), pushes
the value l + 1 onto STACK(1)c , increments the value of al and leaves the critical
region. The second thread pushes l + 2 to STACK(1)c , increments al and leaves the
critical region, and so on. In case that only two threads enter the critical region,
the corresponding local control flow stacks have the appearance as shown in Table
4.4.
Table 4.4: The portion of STACK(1)c related to the critical region for the two threads.
The order in which the elements were put onto the stack is from left to right.
The first l marks that the thread encountered the critical region. The label l − 1 comes
from the execution of σ(S) because S is assumed to have the label l − 1. The third position
represents the place where the value of the counter variable al is stored; the first thread
puts l + 1 there, the second thread l + 2, and so forth. The concluding label l marks that
the thread has left the critical region.

1st thread   . . .   l   l − 1   l + 1   l
2nd thread   . . .   l   l − 1   l + 2   l
The label l is pushed twice onto STACK(1)c: first before the thread enters the
critical region and second after the thread has left the critical region. These two
labels l enclose the stack region associated with the critical region. The label l − 1
follows, which indicates that the execution of the subsequence σ(S) has started. We
assume in our simple example that only one label is put onto the control flow stack
during the execution of σ(S); in practice there are likely more. The third position
in Table 4.4 represents the position where the values of al are placed. Therefore,
we find the value l + 1 for the first thread and l + 2 for the second thread. This
element must be recognized as a counting number instead of a label identifier.
When the threads leave the critical region they put another l onto STACK(1)c to
indicate that this is the end of the critical region. This final label is important during
the execution of the reverse section.
When the execution of the reverse section encounters the final l which is the
rightmost l in Table 4.4 the consumption of all the values from al starts. This is
shown in (4.30). Firstly, we test if the current top of the control flow stack contains
label l. If this is the case, we pop this label and start a loop which iterates until the
current top of the stack is again label l. This means that the loop iterates as long
as the leftmost label l in Table 4.4 has not been reached. In addition, this clearly
shows why we put the label l twice onto the stack.
After the two threads in our example have popped the final l off their local stack
STACKc , the stacks have the following appearance:
1st thread ... l l −1 l +1
2nd thread ... l l −1 l +2
The whole loop body is a critical region, such that it is ensured that at each point
in time only one thread executes the loop body. The other threads are probably
waiting for being allowed to enter this critical region. Once a thread is allowed
to enter the critical region, it tests the current top of STACKc , which represents a
counter value not a label identifier.
ρ( l : #pragma omp critical { S } ) :=
    if STACK(1)c.top() = l {
        STACK(1)c.pop()
        while STACK(1)c.top() ≠ l {
            #pragma omp critical
            {
                if STACK(1)c.top() = al − 1 {                              (4.30)
                    STACK(1)c.pop()
                    al ← al − 1
                    while STACK(1)c.top() ≠ l {
                        ρ(S)
                    }
                }
            }
        }
        STACK(1)c.pop()
    }
The last thread that entered the critical region during the forward section put
the value al − 1 onto STACK(1)c. Now, during the reverse section, this thread has to
be the first that executes the corresponding adjoint statements ρ(S). In our
example, after the two threads have left the critical region, the current value of
al is l + 3. Therefore, the second thread enters the branch body in (4.30) because it
has the value l + 2 on top of STACK(1)c. This thread removes the value l + 2 from
its stack and decreases the counting variable al. This decrease has to be atomic
since all the threads access this reference; but since we are inside a critical
region, the exclusive access is ensured. The stacks now look as follows:
1st thread ... l l −1 l +1
2nd thread ... l l −1
The value of al is now l + 2, but the second thread has not left the critical region
yet. Hence, we can be sure that the first thread is still waiting for allowance to
enter the critical region. Meanwhile, the second thread executes the code from
ρ (S). In our example this is only the element l − 1. After l − 1 is consumed by the
innermost loop the stacks contain:
1st thread ... l l −1 l +1
2nd thread ... l
The label l marks the end of the part where σ(S) had put labels onto STACK(1)c, and the
thread can leave the critical region. Afterwards, it removes the label l from its
control flow stack:
1st thread ... l l −1 l +1
2nd thread ...
As soon as the second thread leaves the critical region, the first thread is allowed
to enter this region. The following test expression checks whether the value on top of
STACK(1)c is l + 1, which is true. It consumes the label l − 1 by executing ρ(S) and
removes the label l after it has left the critical region. This finishes the reverse
section part.
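The complete interplay of (4.29) and (4.30) can be illustrated by the following self-contained C sketch, which records the entry order of two threads into a critical region computing y ← y · x and then consumes the recorded order in reverse. The stack helpers, the label value L, and the seeding of the adjoints are illustrative assumptions; the busy-waiting outer loop mirrors the structure of (4.30).

#include <stdio.h>
#include <omp.h>

#define NT 2
#define L  10                          /* label of the critical region; illustrative */

double y = 2.0, y_b = 1.0;             /* shared value and its (seeded) adjoint      */
double x[NT]   = {3.0, 5.0};           /* thread-local inputs                        */
double x_b[NT] = {0.0, 0.0};           /* their adjoints                             */

int    stack_c[NT][16], top_c[NT];     /* per-thread control flow stacks             */
double stack_f[NT][16]; int top_f[NT]; /* per-thread floating-point stacks           */
int    a_l = L + 1;                    /* shared counter, initialized to l + 1       */

static void   push_c(int t, int v)   { stack_c[t][top_c[t]++] = v; }
static int    peek_c(int t)          { return stack_c[t][top_c[t] - 1]; }
static void   pop_c (int t)          { --top_c[t]; }
static void   push_f(int t, double v){ stack_f[t][top_f[t]++] = v; }
static double pop_f (int t)          { return stack_f[t][--top_f[t]]; }

int main(void) {
    #pragma omp parallel num_threads(NT)
    {
        int t = omp_get_thread_num();

        /* ---- forward section, cf. (4.29) ---- */
        push_c(t, L);
        #pragma omp critical
        {
            push_f(t, y);              /* sigma(s): record overwritten value of y */
            y = y * x[t];              /*           and execute s                 */
            push_c(t, L - 1);          /* label of s                              */
            push_c(t, a_l);            /* record the entry order                  */
            a_l = a_l + 1;
        }
        push_c(t, L);

        #pragma omp barrier            /* required between forward and reverse    */

        /* ---- reverse section, cf. (4.30) ---- */
        if (peek_c(t) == L) {
            pop_c(t);
            while (peek_c(t) != L) {   /* busy-wait until it is this thread's turn */
                #pragma omp critical
                {
                    if (peek_c(t) == a_l - 1) {   /* I entered the region last     */
                        pop_c(t);
                        a_l = a_l - 1;
                        while (peek_c(t) != L) {  /* rho(S) for the labels of S    */
                            pop_c(t);             /* consume label of s            */
                            y = pop_f(t);         /* restore y                     */
                            x_b[t] += y * y_b;    /* adjoint of y = y * x          */
                            y_b     = x[t] * y_b;
                        }
                    }
                }
            }
            pop_c(t);
        }
    }
    printf("x_b[0] = %f (expected 10), x_b[1] = %f (expected 6), y_b = %f (expected 15)\n",
           x_b[0], x_b[1], y_b);
    return 0;
}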
Please note that there is a data dependence between (4.29) and (4.30). We must
prevent one thread from entering the critical region in the forward section while
another thread is already consuming these counting numbers during the reverse
section. This situation could lead to ambiguous values. Therefore, we have to
place a barrier construct before the reverse section to make sure that each thread
has finished its forward section before any thread enters the reverse section.
Proposition 55. Definition (4.30) implies the need of a barrier between the for-
ward and the reverse section.
Proof. Assume that there is no barrier between the forward and the reverse section.
In addition, thread zero and thread one are about to enter the critical region in the
reverse section. A third thread with ID two is about to enter the critical region in
the forward section. The current value of al is assumed to be c and the control flow
stacks of the threads are as follows:
0 ... l ... c−1 l
1 ... l ... c−2 l
2 ... l
Thread zero enters the critical region after removing the label identifier l from
STACKc. The top of the stack of thread zero shows c − 1, and since this equals al − 1
it continues the execution with ρ(S). After thread zero has left the critical region the value of
al is c − 1.
0 ... l
1 ... l ... c−2 l
2 ... l
Assume that thread two now enters the critical region defined in the forward sec-
tion, see (4.29). Thread two puts the value c − 1 onto STACKc , increases al to
c, leaves the forward section and continues its execution until it reaches the code
from (4.30).
0 ... l
1 ... l ... c−2 l
2 ... l ... c−1 l
At this point, both threads attempt to enter the critical region in the reverse section.
Thread two is allowed to execute ρ(S) because the value on top of its stack
is c − 1 and the current value of al is c. This is wrong because thread one has to
come first. This shows that a wrong reversal of the control flow is possible when
we omit a barrier between the forward and the reverse section.
Another question may arise at this point: would it be possible to put a barrier
after the critical region in (4.30)? This would have the advantage that, while all the
threads are competing with each other to enter the critical region, the wrong candidates
would be kept from competing another time because they are waiting at the barrier.
In general, the answer is no.
The problem is that the barrier has to be encountered by all threads in the group
or by none at all. This is in general not given since we do not know which threads
enter the critical region in P and which do not. In case that it is ensured that
all threads run through the critical region in P we could place a barrier after the
critical region in (4.30).
The control flow of the executing thread is not changed by an atomic construct; therefore we
classify this construct as straight-line code. For the partitioning into maximal
SLC sequences this means that we do not end a sequence when we encounter an
atomic statement.
To explain the source transformation of an atomic construct, we consider the
atomic assignment y +← φ(x) from (4.31).
3 For example the X86 processor family knows the XCHG or the XADD command where atomicity
is ensured, see e.g. [55].
[Figure 4.2: DAGs of the computation for the atomic update with two threads; most edges
carry the partial derivative one: (a) result of interleaving (s^0; s^1), (b) result of
interleaving (s^1; s^0).]
When we apply this implication to (4.32), we see that we have another critical
reference in the tangent-linear model of s:
∃ (s^(1)t, s^(1)t′) ∈ I ∈ I(τ(P), q, p) : &y^(1)t = &y^(1)t′  or  &y^(1)t = &x^(1)t′ .
From this, we conclude that the critical reference of the original assignment is
inherited by the tangent-linear assignment s^(1). Hence, we have to put at least an
atomic construct in front of the two assignments emitted by τ(s) to ensure that
they are executed atomically by the OpenMP runtime system. The remaining
question is whether it is necessary to put both assignments into one critical region.
This can be examined by considering the possible interleavings. We define
s1 := #pragma omp atomic
      y^(1) +← φ_x(x) · x^(1);

s2 := #pragma omp atomic
      y +← φ(x)
and we suppose that s1 and s2 are executed by two threads. This leads to six
different interleavings that we have to consider. The correct derivative values are
∂(y^{0,1} + φ(x^0) + φ(x^1)) / ∂y^{0,1} = 1                                (4.33)
∂(y^{0,1} + φ(x^0) + φ(x^1)) / ∂x^0 = φ_{x^0}                              (4.34)
∂(y^{0,1} + φ(x^0) + φ(x^1)) / ∂x^1 = φ_{x^1}                              (4.35)
The computation of the derivative value by the tangent-linear model is shown in
the following table.
Interleaving y(1){0,1}
s01 ; s02 ; s11 ; s12 ; φx0 (x0 ) · x(1)0 + φx1 (x1 ) · x(1)1
s01 ; s11 ; s02 ; s12 ; φx0 (x0 ) · x(1)0 + φx1 (x1 ) · x(1)1
s11 ; s01 ; s02 ; s12 ; φx1 (x1 ) · x(1)1 + φx0 (x0 ) · x(1)0
s01 ; s11 ; s12 ; s02 ; φx0 (x0 ) · x(1)0 + φx1 (x1 ) · x(1)1
s11 ; s01 ; s12 ; s02 ; φx1 (x1 ) · x(1)1 + φx0 (x0 ) · x(1)0
s11 ; s12 ; s01 ; s02 ; φx1 (x1 ) · x(1)1 + φx0 (x0 ) · x(1)0
Most of the edges in Figure 4.2 are labeled with one. This fact leads to the effect
that we see in the above table: all the expressions are the same, no matter which
interleaving we consider. This shows that we do not need to put both assignments
s1 and s2 into one critical region, as was necessary when transforming the critical
construct into the tangent-linear model. Thus, the tangent-linear transformation of an
atomic construct is:
τ( #pragma omp atomic
   y +← φ(x1, . . . , xn) )
:=
   #pragma omp atomic
   y^(1) +← ∑_{k=1}^{n} φ_{xk}(x1, . . . , xn) · x^(1)_k;                  (4.36)
   #pragma omp atomic
   y +← φ(x1, . . . , xn)
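As an illustration of (4.36), the following C sketch shows the tangent-linear code of an atomic update y +← sin(x); the names y_t and x_t for the tangent-linear components and the thread count are illustrative assumptions.

#include <stdio.h>
#include <math.h>
#include <omp.h>

double y = 0.0, y_t = 0.0;   /* shared result and its tangent-linear component */

int main(void) {
    #pragma omp parallel num_threads(4)
    {
        double x   = 0.1 * (1 + omp_get_thread_num()); /* thread-local input  */
        double x_t = 1.0;                              /* tangent-linear seed */

        /* two separate atomic constructs suffice; no critical region needed  */
        #pragma omp atomic
        y_t += cos(x) * x_t;
        #pragma omp atomic
        y += sin(x);
    }
    printf("y = %f, y_t = %f\n", y, y_t);
    return 0;
}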
The linearity of the atomic assignment creates the impression that it also makes
life easier for the adjoint source transformation. We examine this in the following.
Let us consider the transformation of the assignment in (4.31). By applying the
transformation rule (2.41) we obtain:
σ( y +← φ(x) ) = { s1 : STACK(1)f.push(y);
                   s2 : y +← φ(x); }                                       (4.37)
The possible interleavings for executing s1 and s2 with two threads are shown in
Table 4.5, together with the content of the floating-point value stack. We recognize
the same effect that we saw earlier in Section 4.2.3: in case that both threads
execute their push statement before one thread executes the statement s2, the
stacks of both threads contain the same value. For the example
of the critical region this was wrong. Here, we claim that this effect does not
matter since the value from the stacks is not needed for the adjoint statements
in ρ( y +← φ(x) ). This becomes clear by considering (4.38): the value that is
recovered by the first statement in ρ( y +← φ(x) ) is not used in the adjoint assign-
ment that follows. Compare (4.38) to the situation that we had in (4.28). There
was a direct data dependence between the recovery statement, with y on the left-hand
side, and the following adjoint assignment, with y on the right-hand side. This data
dependence does not exist here.
ρ( y +← φ(x) ) =  y ← STACK(1)f.top();
                  STACK(1)f.pop();                                         (4.38)
                  x(1) +← φ_x(x) · y(1);
This brings up the question why one should store all the intermediate values
when they are not needed anyway. The only value we have to store is the value
that the variable y has before any thread executes the transformation σ(y +← φ(x)).
For example, we assume that the value of y is c before the first thread executes
σ(s). This value c is stored somehow, but the other intermediate values of the
variable y are of no interest while the threads execute the forward section.
Once the threads enter the reverse section, they encounter at some point the code
of ρ(s). The value of y is not important at this time since it does not affect the
adjoint information in ρ(s). As soon as the last thread has finished the code of ρ(s),
we have to ensure that the value of y is again c, as it was before executing σ(s).
Therefore, we restore the value c at some point during the reverse section. Where
exactly we perform this is not important, but after all threads have finished
ρ(s) the value must be recovered.
This is implemented with the help of two auxiliary variables. The first auxiliary
variable al is an integer variable, and the second auxiliary variable zl is a floating-point
variable that hosts the value that we denoted by c in the above example. The
label l is the label of the atomic statement s in the parallel region P. The variable
al serves as a status variable that can take the values zero, one, and two. Since both
auxiliary variables have to be shared, the code for storing to them has to be in a
critical region. The transformation of an atomic statement for the forward section
has the following appearance:
σ( #pragma omp atomic
   y +← φ(x1, . . . , xn) )
:=
   #pragma omp critical
   {
       if (al = 0) {
           al ← 1;                                                         (4.39)
           zl ← y;
       }
       y +← φ(x1, . . . , xn)
   }
Initially, al has the status zero, which means that no thread has reached the code in
(4.39) yet. As soon as the first thread enters the critical region, the value of y is
stored in zl and the status variable al is set to one, which prevents another thread
from assigning to zl. From this point on, all threads execute only the assignment
y +← φ(x1, . . . , xn) in (4.39), and all the intermediate values of y are lost. The code
for the reverse section is defined as shown in (4.40).
The first thread that reaches (4.40), finishes the adjoint assignments, and enters
the critical region sets the status in al to two, which represents the state that the
value of y has been restored. Since the current thread has reached the reverse
section, it is ensured that at least one thread has finished the forward section and
therefore the status in al is one. The new status of al is set to two and the variable
y is set to its previous value, which was stored in zl.
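The following C sketch illustrates the forward-section rule (4.39) together with the reverse-section behavior just described, for the atomic update y +← x·x. The status variable a_l, the checkpoint z_l, and the initialization of a_l outside the parallel region follow the text; the _b suffix for adjoints and all concrete values are illustrative assumptions, and the exact form of (4.40) may differ from this hand-written version.

#include <stdio.h>
#include <omp.h>

#define NT 4

double y = 7.0, y_b = 1.0;         /* shared value, seeded adjoint                       */
double x[NT], x_b[NT];             /* thread-local inputs and their adjoints             */

int    a_l = 0;                    /* status: 0 untouched, 1 stored, 2 restored          */
double z_l;                        /* saved value of y before any atomic update          */

int main(void) {
    for (int t = 0; t < NT; ++t) { x[t] = 1.0 + t; x_b[t] = 0.0; }

    #pragma omp parallel num_threads(NT)
    {
        int t = omp_get_thread_num();

        /* ---- forward section, cf. (4.39) ---- */
        #pragma omp critical
        {
            if (a_l == 0) {        /* first thread saves the initial value of y  */
                a_l = 1;
                z_l = y;
            }
            y += x[t] * x[t];      /* original atomic update with phi(x) = x*x   */
        }

        #pragma omp barrier        /* forward and reverse section must be separated */

        /* ---- reverse section, following the description of (4.40) ---- */
        x_b[t] += 2.0 * x[t] * y_b;  /* adjoint increment; x_b[t] is owned by thread t */

        #pragma omp critical
        {
            if (a_l == 1) {        /* first thread to get here restores y        */
                a_l = 2;
                y = z_l;
            }
        }
    }
    printf("y restored to %f, x_b = %f %f %f %f\n", y, x_b[0], x_b[1], x_b[2], x_b[3]);
    return 0;
}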
Before the execution of the adjoint code reaches (4.39), it must be ensured that
al has the status zero. This can be done outside the parallel region or, when we
want to be sure that al is set correctly by our source transformation tool, we can
implement the initialization of al to zero inside a single construct. Another possibility
is the initialization inside a master construct followed by a barrier. This barrier is
implicitly given with the single construct.
Proposition 56. When the initial status of al is set inside the parallel region, a
barrier must be defined after the initialization assignment to al and before (4.39).
Proof. Suppose that s = al ← 0 is the assignment that sets the initial status of al
inside of the parallel region. The subsequence of the critical region of (4.39) is
denoted by S. Inside of an interleaving, S can be considered as one statement as S
is contained in a critical region. Without a barrier between s and S the following
interleaving is possible, where t and t′ are two distinct threads:
    S^t ; s^{t′} ; S^{t′} ;
This means, the assignment y +← φ (x1 , . . . , xn ) in S is executed by thread t before
s has been executed by thread t′. The initial value of y is lost. A barrier between s
and S ensures s^{t′} ≺ S^t and therefore that the status variable al is initialized before
any thread encounters S.
Proposition 57. The definitions (4.39) and (4.40) require a barrier between them.
Proof. The subsequence of the critical region in (4.39) is denoted by S1 , the one
in (4.40) is denoted by S2 . Since S1 and S2 are inside a critical region, we consider
them as one statement that is executed atomically. Therefore, the following inter-
leaving is possible where the critical regions are executed by two threads, t and t 0 .
The dots represent a part of the interleaving that is not of importance here.
    . . . ; S1^t ; . . . ; S2^t ; S1^{t′} ; . . .
This interleaving has the following effect. The value of y is stored correctly in
S1^t and is restored in S2^t. The problem is that after the assignment y ← zl in S2^t,
the variable y is changed again through the assignment y +← φ(x1, . . . , xn) in S1^{t′},
because thread t′ has not finished the code from (4.39).
A barrier between (4.39) and (4.40) ensures S1^{t′} ≺ S2^t and prevents the
interleaving above from occurring. Instead, we can be sure that all threads have finished
their work in S1. A reasonable point to place the barrier is just before
the loop that opens the reverse section.
The next section shows that we can achieve closure of the adjoint source transformation
σ(P) no matter whether the parallel region P fulfills the exclusive read property
or not.
Reference &x(1) and reference &y(1) are read in assignment s1, and the sum of both
values is assigned to reference &x(1). Both the read and the store of &x(1) are
critical since all threads read and write this memory location simultaneously.
Since we have two critical references in this assignment, the statement does not
fulfill the LCR property, which means that we cannot use our interleaving abstrac-
tion.
Fortunately, the atomic construct of OpenMP allows us to declare that s1 is to be
executed atomically. This means that no thread schedule may happen between the
read access of &x(1) and the succeeding store.
The explanation above suggests using the atomic construct in case that the
original assignment does not fulfill the exclusive read property. The problem is
that the source transformation tool does not obtain the information about the exclusive
read property without performing a static program analysis of the original code.
We presented such a static analysis, called the exclusive read analysis, in Chapter 3.
Without performing the exclusive read analysis, the source transformation tool
has two options. On the one hand, it can be conservative in the sense that it generates
atomic constructs preceding each adjoint assignment. This transformation is
shown in (4.41).
ρ ( y ← φ (x1 , . . . , xn )) :=
y ← STACK(1) f .top();
STACK(1) f .pop();
#pragma omp atomic
x(1)1 +← φx1 (x1 , . . . , xn ) · y(1) ;
#pragma omp atomic
x(1)2 +← φx2(x1, . . . , xn) · y(1);                                       (4.41)
...
#pragma omp atomic
x(1)n +← φxn (x1 , . . . , xn ) · y(1) ;
y(1) ← 0;
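To see what the conservative strategy of (4.41) means in practice, the following C sketch differentiates a body in which a shared variable g is read by every thread. The stack recovery of (4.41) is omitted because nothing is overwritten in this toy example; all names and the thread count are illustrative assumptions.

#include <stdio.h>
#include <omp.h>

#define NT 4

double g = 3.0, g_b = 0.0;            /* shared input, read by every thread      */
double x[NT], x_b[NT], y[NT], y_b[NT];

int main(void) {
    for (int t = 0; t < NT; ++t) { x[t] = 1.0 + t; x_b[t] = 0.0; y_b[t] = 1.0; }

    #pragma omp parallel num_threads(NT)
    {
        int t = omp_get_thread_num();

        y[t] = g * x[t];              /* forward: s reads the shared g           */
        #pragma omp barrier

        /* conservative adjoint in the spirit of (4.41): every increment is
           guarded by an atomic construct, whether it needs one or not           */
        #pragma omp atomic
        g_b += x[t] * y_b[t];         /* g is read by all threads, so this
                                         accumulation genuinely needs the atomic */
        #pragma omp atomic
        x_b[t] += g * y_b[t];         /* x[t] is exclusive to thread t, so this
                                         atomic is superfluous, but harmless     */
        y_b[t] = 0.0;
    }
    printf("g_b = %f (expected 1+2+3+4 = 10)\n", g_b);
    return 0;
}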
Obviously, such blanket use of atomic constructs would have a major impact on the
performance. Therefore, the source transformation tool could be implemented in a way
that lays the responsibility into the hands of the user. For example, the tool could provide
a custom pragma that the user can attach to assignments that contain a reference which
does not fulfill the exclusive read property. Let us call this
pragma #pragma ad exclusive read failure. Then the source transformation augments
only assignments carrying this pragma, as in (4.42). All the other assignments
are transformed without a preceding atomic directive. Obviously, the correctness
of the resulting code depends on the correct placement of the compiler directive
    #pragma ad exclusive read failure
The above solutions should only be taken into account when the implementation
of a static program analysis, such as the one presented in Chapter 3, is not an option.
With the help of such an analysis, the source transformation can obtain the information
whether a reference is read exclusively by one thread or not. For example, let us assume
that the reference &xk does not fulfill the exclusive read property but all the other
references &x{1,...,k−1,k+1,...,n} fulfill this property. With this knowledge, we can
define the transformation ρ as in (4.43).
With the synchronization constructs of OpenMP we have adequate tools at hand
to show the closure property from Definition 39 for the source transformations.
The only requirement for the source transformation is that P is noncritical.
ρ ( y ← φ (x1 , . . . , xn )) :=
y ← STACK(1) f .top();
STACK(1) f .pop();
x(1)1 +← φx1 (x1 , . . . , xn ) · y(1) ;
..
.
x(1)k−1 +← φxk−1 (x1 , . . . , xn ) · y(1) ;
#pragma omp atomic (4.43)
x(1)k +← φxk (x1 , . . . , xn ) · y(1) ;
x(1)k+1 +← φxk+1 (x1 , . . . , xn ) · y(1) ;
..
.
x(1)n +← φxn (x1 , . . . , xn ) · y(1) ;
y(1) ← 0;
Proof. We can build on the properties that we have already shown for the case that
P is contained in SPL.
3. critical: The rules (4.27), (4.29) and (4.30) are defined with the help of a
critical region.
4. atomic: The atomic construct in the input code leads to the same constructs
that are listed here.
The proof shows that we need all of the synchronization constructs
that OpenMP provides to achieve the closure property. Assume we have a transformation
tool at hand that takes SPLOMP 1 code as input and emits SPLOMP 1 code
as output. With Proposition 58 we know that the tool is able to generate derivative
codes of arbitrarily high order through reapplication and, in addition, that these codes
can be executed concurrently since they are noncritical.
The code included through store_checkpoint.c (see the listing below) is responsible
for storing the input values of S. Inside the reverse section, at the
position where the adjoint code of the worksharing construct is located, these input values
are restored and the worksharing construct is executed forward (σ(S)) and then
in reverse (ρ(S)). For example, the following parallel region
#pragma omp parallel
{
    S1
    #pragma omp for
    l: for (k ← 0; k < n; k ← k+1)
    {
        Si
    }
    Sj
}
where S1 , Si , and S j are some sequences of statements, is transformed by the source
transformation σ (P) to the following code:
1   #pragma omp parallel
2   {
3       σ(S1)
4       #include "store_checkpoint.c"
5       STACKc.push(l);
6       #pragma omp for
7       for (k ← 0; k < n; k ← k+1) { Si }
8       STACKc.push(l);
9       σ(Sj)
10      while (not STACKc.empty()) {
11          ρ(Sj)
12          if (STACKc.top() = l) {
13              STACKc.pop();
14              #include "restore_checkpoint.c"
15              #pragma omp for
16              for (k ← 0; k < n; k ← k+1) { σ(Si) }
17              while (STACKc.top() ≠ l) { ρ(Si) }
18              STACKc.pop();
19          }
20          ρ(S1)
21      }
22  }
Let us consider the above listing in more detail. The forward section of this adjoint
code is shown from line 3 to line 9. The sequences S1 and Sj are transformed by σ.
Example 31. This example presents what a counting loop written in C looks like
in SPL. The statements of the loop body are combined into S. The C code

    #pragma omp for
    for (k ← 0; k < n; k ← k+1)
    {
        S
    }

corresponds in SPL to a loop of the form used below, whereby one has to take into
account the fact that k is a private variable for each
thread, see OpenMP 3.1 [63, p. 85].
In addition, we saw in Example 6 that the loop construct distributes the work
through implicit data decomposition. This means that the OpenMP enabled compiler
is responsible for creating code that partitions the value range of the counting
variable (variable k in the code above) into intervals such that each thread gets a
certain part of the value range. In Example 4, we showed a data decomposition
performed in an explicit way.
We want to assume that the loop associated with the loop construct fulfills the
requirements mentioned in the OpenMP 3.1 [63, p. 40]. Briefly said, the initial-
ization assignment, the update statement, as well as the test expression of the loop
counter variable must have a certain well-defined form in C/C++. Furthermore,
the number of iterations must be known before the execution of any iteration of
this loop starts. We do not distinguish between the cases where a nowait clause
is present and where this is not the case. The implicit barrier must be considered
by the transformation tool but for the transformation of the loop itself this has no
effect. Let us define S as
S :=  l0 : #pragma omp for
      {
          l1 : a1 ← e1;
          l2 : while (b) {
              l3 : S';
              l4 : a1 ← e2;
          }
      }
τ(S) := #pragma omp for
        {
            τ(a1 ← e1)
            τ( while (b) { S'; a1 ← e2; } )
        }
(2.22),(2.21)
      ≡     #pragma omp for
            {
                a1 ← e1                                                    (4.44)
                while (b) { τ(S'); a1 ← e2; }
            }
The code in (4.44) displays that the tangent-linear source transformation is quite
similar to the initial code and only the subsequence S’ is transformed. A simple
transformation into a valid C/C++ OpenMP loop construct is possible.
The adjoint source transformation takes more effort and needs more attention
to the specific shape of the loop construct with its associated loop. One important
observation about the loop construct is that the loop iteration variable is private
inside the scope of the loop construct, see OpenMP 3.1 [63, p. 40, and p. 85]. The
loop iteration variable may only be changed by the update assignment a1 ← e2
OpenMP 3.1 [63, p. 40]. In the following, we distinguish between three cases:
1. a1 is invisible outside of the loop construct.
2. a1 is visible and private outside of the loop construct.
3. a1 is visible and shared outside of the loop construct.
ρ(S) := ρ( (S'; a1 ← e2) )
      ≡ if STACK(1)c.top() = l3 {
            STACK(1)c.pop();
            a1 ← STACK(1)i.top();                                          (4.46)
            STACK(1)i.pop();
            ρ(sq); · · · ; ρ(s1);
        }
If one examined the stack STACK(1)c after the forward section of P, the
only trace of the loop construct would be the label l3. All these labels
are consumed during the reverse section.
Please remember that the parallelism of the reverse section is determined by
the work distribution made during the forward section. The iterations of the work-
sharing loop are distributed among the threads, and this distribution is also used
during the reverse section because the individual control flow has been stored on
STACKc. This individual control flow is processed backward during the reverse
section. Hence, the worksharing remains valid during the reverse section. We
conclude this case by noting that we can transform (4.45) into an OpenMP loop
construct.
Case 2: The loop iteration variable a1 is visible and private outside the loop
construct.
Since the value of a1 is visible outside of the loop, we store the value of a1 before
the program execution enters the loop construct. This value is restored after the
adjoint part of the loop construct has been executed.
σ(S) := STACK(1)c.push(l1)
        STACK(1)i.push(a1)
        #pragma omp for
        {                                                                  (4.47)
            a1 ← e1
            σ( while (b) { S'; a1 ← e2; } )
        }
Before we store the value of a1 onto the integer value stack, we mark the position
in the execution where the reverse section should recover the value of a1 by putting
the label l1 onto STACK(1)c. The rewriting of the loop body that we made in (4.45)
also holds for (4.47). The code for the reverse section is also similar, as shown in (4.48).
ρ(S) := ρ( (a1 ← e1) )
        ρ( (S'; a1 ← e2) )

(2.49),(2.55)
      ≡     if STACK(1)c.top() = l1 {
                STACK(1)c.pop();
                a1 ← STACK(1)i.top();                                      (4.48)
                STACK(1)i.pop();
            }
            ρ( (S'; a1 ← e2) )
After the reverse section has executed the code that is shown in (4.46), the label l1
is on top of STACK(1)c . Thus, the old value of a1 is restored from STACK(1)i .
Case 3: The loop iteration variable a1 is visible and shared outside of the
loop construct.
This case takes the most effort in the sense that we have to attend to several
things. The value of a1 shall be stored before the runtime system enters the loop
construct, just as in the previous case. Since a1 is shared, there is only one instance
of it. Thus, only one thread needs to take care of storing and restoring the
value of a1. The method is illustrated in (4.49).
The position where the store of the value of a1 takes place is marked by the
label l0. Subsequently, one thread enters the single region and puts the label l1
onto STACK(1)c and the value of a1 onto STACK(1)i. It is worth mentioning that
each thread puts the label l0 onto its stack, but only one thread afterwards puts the label
l1 onto its stack.
Please note that the implicit barrier of the single construct could be dropped by
defining the nowait clause in case that there is an implicit barrier at the end of the
loop construct. It must be ensured that the value of a1 is stored on the stack before
any thread continues with the code that follows (4.49) in the forward section. This
means that only one of the two worksharing constructs in (4.49) has to have an
implicit barrier.
σ(S) := STACK(1)c.push(l0)
        #pragma omp single
        {
            STACK(1)c.push(l1)
            STACK(1)i.push(a1)
        } /* implicit barrier */                                           (4.49)
        #pragma omp for
        {
            a1 ← e1
            σ( while (b) { S'; a1 ← e2; } )
        }
ρ(S) := if STACK(1)c.top() = l0 {
            STACK(1)c.pop();
            #pragma omp barrier                                            (4.50)
        }
        ρ( (a1 ← e1) )
        ρ( (S'; a1 ← e2) )
During the reverse section, only one thread has the label l1 on STACK(1)c, and this
thread restores the previous value of a1. All threads have the label l0 on their stack
STACK(1)c, and as soon as this label is on top of the stack the test expression in
(4.50) evaluates to true. Therefore, all threads enter the contained
barrier construct at some point.
What should be true for all these cases is that the value of the loop counting
variable a1 has to be the same at two points during the adjoint execution. The one
point is before the control flow enters the loop during the forward section. The
other point is after the control flow has finished the execution of all corresponding
adjoint statements. This property is shown by the following lemma.
Lemma 59. Suppose we have a loop construct with a counting variable a1 . Then
the value of a1 before the loop construct during the forward section is the same as
the value after the loop construct during the reverse section.
After the forward section, the part of STACK(1)c that belongs to the loop construct
matches, for every thread, the pattern l0 (l3)∗; only the one thread t that executed the
single construct additionally holds the label l1.
During the reverse section, all threads execute the adjoint codes associated with
label l3. Subsequently, all threads except thread t have the label l0 on their stack
and therefore enter the branch displayed in (4.50), where they temporarily stop
execution due to the barrier. Meanwhile, thread t has the label l1 on its stack and enters
the branch shown in (4.48), where it restores the value R from its STACK(1)i. The
next label that thread t has on its STACK(1)c is the label l0, and it also enters the branch
with the barrier. Once all threads have reached the barrier shown in (4.50), the
execution continues and the shared variable a1 has its previous value R.
V = V ∪ { sequence_of_sections, section }
We will omit the fact that the first section is optional and therefore we define the
syntax of the sections construct in the following way.
ROMP = ROMP ∪ { c : #pragma omp sections { sequence_of_sections } }
            ∪ { sequence_of_sections : section
                                      | sequence_of_sections section }
            ∪ { section : #pragma omp section { S' } }
The OpenMP sections construct is a control flow statement and is therefore con-
tained in CFSTMT. Suppose a sections construct S is given as
S := #pragma omp sections
     {
         #pragma omp section { S1' }
         #pragma omp section { S2' }
         ...
         #pragma omp section { Sn' }
     }
The source transformation for getting the tangent-linear model and the forward
section of the adjoint code are both defined recursively.
τ(S) := #pragma omp sections
        {
            #pragma omp section { τ(S1') }
            #pragma omp section { τ(S2') }
            ...
            #pragma omp section { τ(Sn') }
        }

σ(S) := #pragma omp sections
        {
            #pragma omp section { σ(S1') }
            #pragma omp section { σ(S2') }
            ...
            #pragma omp section { σ(Sn') }
        }

ρ(S) := ρ(S1')
        ρ(S2')
        ...
        ρ(Sn')
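As an illustration, the following C sketch applies the tangent-linear transformation to a sections construct with two sections; the section bodies and the _t suffix for tangent-linear components are illustrative assumptions.

#include <stdio.h>
#include <math.h>
#include <omp.h>

double y1, y1_t, y2, y2_t;   /* outputs of the two sections and their tangents */

int main(void) {
    double x = 0.5, x_t = 1.0;

    #pragma omp parallel
    {
        /* the sections construct is kept, each section body is transformed by tau */
        #pragma omp sections
        {
            #pragma omp section
            {
                y1_t = cos(x) * x_t;   /* tau(S1') */
                y1   = sin(x);
            }
            #pragma omp section
            {
                y2_t = 2.0 * x * x_t;  /* tau(S2') */
                y2   = x * x;
            }
        }
    }
    printf("y1 = %f (dy1/dx = %f), y2 = %f (dy2/dx = %f)\n", y1, y1_t, y2, y2_t);
    return 0;
}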
σ( #pragma omp single { S' } ) := #pragma omp single { σ(S') }             (4.51)

ρ( #pragma omp single { S' } ) := ρ(S')                                    (4.52)
Suppose thread t encounters a single construct during the forward section. Assuming
that thread t enters the single construct, the label of the first statement in S' is
put onto STACK(1)c. Only thread t has this label on its stack. During the reverse
section, thread t is therefore the only thread that encounters this label and executes the
code given by (4.52). The remaining worksharing constructs are the combined constructs.
τ( #pragma omp parallel for
   for (a1 ← e1; b; a1 ← e2) { D S } )
:= #pragma omp parallel for
   for (a1 ← e1; b; a1 ← e2) { τ(D) τ(S) }
The adjoint source transformation looks very similar to the one defined for the
parallel region in rule (2.32). As explained in Section 2.3.3, the parallel region is
transformed in the joint reversal scheme and we apply the joint reversal scheme to
the combined loop construct as well.
σ(P) := #pragma omp parallel for
        for (a1 ← e1; b; a1 ← e2)
        {
            σ(D)                                                           (4.53)
            σ(S)
            while not STACK(1)c.empty() { ρ(S) }
        }
The second combined parallel construct is the parallel sections construct, which
has the following syntax.
ROMP = ROMP ∪
    P : #pragma omp parallel sections
    {
        #pragma omp section { D1 S1 }
        #pragma omp section { D2 S2 }                                      (4.54)
        ...
        #pragma omp section { Dn Sn }
    }
The adjoint source transformation takes each section and generates the corresponding
adjoint code for it.
This finishes the second extension of our SPL language where we covered the
worksharing constructs of OpenMP. The language that this extension produces is
called SPLOMP 2 . The next extension deals with the data-sharing possibilities of
OpenMP.
Despite the fact that the threadprivate directive is placed in the global scope and
not inside of the parallel region P, we have to define how the source transformation
affects this directive. Suppose that the adjoint model of the parallel region
P makes use of the stacks STACK(1)c, STACK(1)i, and STACK(1)f. Each source
transformation that supplies a higher derivative model leads to another definition
of these three stacks. The second-order tangent-linear model uses the stacks
STACK^(2)c, STACK^(2)i, and STACK^(2)f; the second-order adjoint model stores data
in STACK(2)c, STACK(2)i, and STACK(2)f; and so forth.
If there is a floating-point variable x in P that is declared threadprivate, then
the derivative code contains a threadprivate directive that lists the variable x^(1) as
threadprivate in case that we apply the tangent-linear transformation, or it lists the
variable x(1) as threadprivate in case of the adjoint transformation, respectively.
The tangent-linear transformation of the threadprivate directive is defined as:
τ( #pragma omp threadprivate(list) ) := #pragma omp threadprivate(list1)
where
list1 := { STACK(c|i|f), STACK^(1)(c|i|f) | STACK(c|i|f) ∈ list }
       ∪ { a | a ∈ INTlist }
       ∪ { x, x^(1) | x ∈ FLOATlist }
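A minimal C sketch of a threadprivate directive in a tangent-linear code might look as follows; the name x_t stands for x^(1), the derivative stacks that list1 would additionally contain are omitted, and the per-thread computation is an illustrative assumption.

#include <stdio.h>
#include <omp.h>

/* the original code declares x threadprivate; the tangent-linear code adds the
   tangent-linear component x_t to the directive                                */
double x, x_t;
#pragma omp threadprivate(x, x_t)

int main(void) {
    #pragma omp parallel num_threads(3)
    {
        x   = 1.0 + omp_get_thread_num();  /* each thread owns its own x, x_t   */
        x_t = 1.0;
        x_t = 2.0 * x * x_t;               /* tangent-linear statement first    */
        x   = x * x;                       /* then the value statement          */
        printf("thread %d: x = %f, x_t = %f\n", omp_get_thread_num(), x, x_t);
    }
    return 0;
}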
The following sections explain the transformation of several clauses which can
be used to control the data environment at runtime.
when we apply the forward mode. The application of the reverse mode leads to
the definitions
For each occurrence of a variable v inside of list another instance is defined, namely
vl , which represents the private instance of v. In addition, the transformation must
perform a substitution of variable v by vl . We denote the substitution of v by vl
inside of S with
S[v/vl ] .
τ( #pragma omp parallel private(list) [clauses]
   { D S } )
:= #pragma omp parallel [clausest]
   {
       D1
       τ(S) ∀a, x ∈ VARlist : [a/al][x/xl][x^(1)/x^(1)l]
   }
The line τ(S) ∀a, x ∈ VARlist : [a/al][x/xl][x^(1)/x^(1)l] indicates that every occurrence
of a listed variable inside τ(S) is substituted by its private instance. In clausest,
a private clause must not be present. The adjoint version of a parallel region with a
private clause is given by:
σ( #pragma omp parallel private(list) [clauses]
   { D S } )
:= #pragma omp parallel [clausesa]
   {
       D2
       σ(S) ∀a, x ∈ VARlist : [a/al][x/xl]
       while not STACK(1)c.empty() {
           ρ(S) ∀a, x ∈ VARlist : [a/al][x/xl][x(1)/x(1)l]
       }
   }
We substitute all variables a and x that are part of VARlist . The adjoint values for
xl are stored in x(1)l . The sequence clausesa does not contain a private clause.
Example 32. In this example we assume that a parallel region with a private clause is
given. Its source transformation follows the rules above, where we only replace the
variables a and x inside the forward section. Inside the reverse section we have to
replace a, x, and x(1).
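The substitution mechanism can be illustrated by the following C sketch, in which a private variable x of a hypothetical original region is replaced by the instances x_l and x_l_t declared inside the transformed region, so that no private clause is needed anymore; all names and the body are illustrative assumptions.

#include <stdio.h>
#include <math.h>
#include <omp.h>

#define NT 2

double y[NT], y_t[NT];

int main(void) {
    double g = 2.0, g_t = 1.0;             /* shared input and its tangent       */

    /* Original region (for reference):
     *   #pragma omp parallel private(x)
     *   { x = g * tid; y[tid] = sin(x); }
     * Tangent-linear version: the private clause is dropped; x is replaced by
     * x_l and x_l_t declared inside the region, which makes them thread-private */
    #pragma omp parallel num_threads(NT)
    {
        int    tid = omp_get_thread_num();
        double x_l, x_l_t;                 /* substitutes for x and x^(1)         */

        x_l_t = g_t * tid;                 /* tau of: x = g * tid                 */
        x_l   = g * tid;
        y_t[tid] = cos(x_l) * x_l_t;       /* tau of: y[tid] = sin(x)             */
        y[tid]   = sin(x_l);
    }
    for (int t = 0; t < NT; ++t)
        printf("y[%d] = %f, y_t[%d] = %f\n", t, y[t], t, y_t[t]);
    return 0;
}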
In the following, we consider the case where the private clause is part of a
worksharing construct W . This means W is either for, sections, or single. We
define the sequence S in the parallel region P as:
S := #pragma omp W private(list)
     {
         S'
     }
The description of the private clause in the OpenMP standard does not require
that the private variables be initialized with a value. However, we perform an
initialization with zero since we reuse this code pattern in the next section, where
we cover the firstprivate clause.
We determined in Section 4.3 that we apply the split reversal scheme to a stand-
alone worksharing construct. The split reversal scheme of W means that we have
two separated parts that belong to the transformation of W . The first part is con-
tained in the forward section of P and the second part is in the reverse section of P.
The private variables therefore have two different scopes. Let us first consider the
case where we do not use the above approach of substituting every variable that
occurs in the list of the private clause. In this case, the fact that the scope of the
private variables is left unchanged means that their values must be stored on the
stack at the end of the first scope and restored at the beginning of the second scope.
Since we substitute each variable v in list by a variable vl, and the variable vl is
defined in the scope of the parallel region P, the value stays alive between the two
scopes and therefore we do not need to store and restore the values.
σ(S) := ∀a ∈ INTlist : al ← 0;
        ∀x ∈ FLOATlist : xl ← 0.;
        #pragma omp W                                                      (4.59)
        {
            σ(S') ∀a, x ∈ VARlist : [a/al][x/xl]
        }

ρ(S) := if STACK(1)c.top() = LABEL(S') {
            ρ(S') ∀a, x ∈ VARlist : [a/al][x/xl][x(1)/x(1)l]               (4.60)
        }
This concludes the source transformation for the private clause. The next two
sections cover the firstprivate and the lastprivate clause. These two clauses can be
seen as a superset of the functionality provided by the private clause. Hence, the
definitions of the current section are necessary for these clauses as well.
The firstprivate(list) clause defines that each variable v in list is to be initialized with
the value of the global instance of v when the runtime execution enters the associated
construct. The production rule for this clause is given by
ROMP = ROMP ∪ { cl : firstprivate(list) }
In case that the firstprivate(list) clause is associated with a parallel directive, we
define a sequence of definitions D3 such that it creates private copies of all list
items. The label of the firstprivate clause is assumed to be l. Each variable vl is
initialized by the value of the global instance of v.
D3 := τ(D) ∪ { int al ← a; | a ∈ INTlist }
           ∪ { double xl ← x; double x^(1)l ← x^(1); | x ∈ FLOATlist }
With this sequence of definitions, we define the tangent-linear source transforma-
tion by
τ( #pragma omp parallel firstprivate(list) [clauses]
   { D S } )
:= #pragma omp parallel [clausest]
   {
       D3
       τ(S) ∀a, x ∈ VARlist : [a/al][x/xl][x^(1)/x^(1)l]
   }
where we substitute each occurrence of a, x, or x^(1) in τ(S) with its private instance
al, xl, or x^(1)l. In clausest there must not be a firstprivate clause. This is similar
in case of the adjoint source transformation. We define the sequence of definitions
D4 such that
D4 := σ (D) ∪ {int al ← a; |a ∈ INTlist }
∪ {double xl ← x; double x(1)l ← 0.; |x ∈ FLOATlist }
With the definition of D4 , we obtain a dataflow from outside of P into the parallel
region, namely from variables a and x to variables al and xl . This dataflow has to be
reversed after the code from ρ (P) has been executed. Hence, we define a sequence
of statements S1 which is responsible to reverse the dataflow of all floating-point
elements in VARlist .
S1 := { #pragma omp atomic
        x(1) +← x(1)l;          | x ∈ FLOATlist }                          (4.61)
The scope of the instance x(1) is valid inside the master thread, and this instance is a scalar reference. For each thread in the group of executing threads there is a private copy x(1)l of this instance. The initialization from one scalar variable into p private copies in the forward section is represented in the reverse section by adding up the p values of the adjoint private copies and putting this sum into the adjoint associate of x. Therefore, we have to use the atomic construct to prevent a race condition.
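The following hedged sketch illustrates this fan-out/fan-in pattern by hand; it is not the exact SPLc output and all names are made up. Each thread obtains a private copy x_l of x (as in D4) and a private adjoint x_a_l; after the reverse section the private adjoints are accumulated atomically into the shared adjoint, which corresponds to S1 in (4.61).

#include <math.h>

/* Hand-written analogue of the firstprivate reversal; illustrative names. */
void g_adjoint(int n, double x, double *x_a, double *y, const double *y_a) {
  #pragma omp parallel
  {
    double x_l   = x;    /* firstprivate: copy of the global instance */
    double x_a_l = 0.;   /* adjoint of the private copy               */
    #pragma omp for      /* forward section */
    for (int i = 0; i < n; i++)
      y[i] = sin(x_l);
    #pragma omp for      /* reverse section */
    for (int i = n - 1; i >= 0; i--)
      x_a_l += cos(x_l) * y_a[i];        /* adjoint of y[i] = sin(x_l) */
    #pragma omp atomic   /* reverse the fan-out of x into the copies (S1) */
    *x_a += x_a_l;
  }
}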
The adjoint source transformation for P with a firstprivate clause may look like
σ( #pragma omp parallel firstprivate(list) [clauses]
   { D S } )
:=
   #pragma omp parallel [clausesa]
   {
     D4
     σ(S) ∀a, x ∈ VARlist : [a/al][x/xl]
     while not STACK(1)c.empty() {
       ρ(S) ∀a, x ∈ VARlist : [a/al][x/xl][x(1)/x(1)l]
     }
     S1
   }                                                                (4.62)
where clausesa must not contain a firstprivate clause. The reader experienced with OpenMP may object to S1 in (4.62) and argue that this can also be achieved by another OpenMP construct, namely the reduction clause. Therefore, we display another definition which is semantically equivalent to (4.62) and contains a reduction clause with a list listr composed of all adjoint floating-point elements from list. Since the reduction clause implicitly defines private variables for all variables
The source transformation that emits the code for the forward section initializes the local copies with the values of the global instances of a and x, not with zero. This dataflow must be reversed after the execution has finished the reverse section part of the worksharing construct W. For this reason, we push the label of the firstprivate clause onto the control flow stack STACK(1)c to mark that the dataflow has to be reversed as soon as the reverse section reaches this label l.
σ(S) :=  STACK(1)c.push(l)
         ∀a ∈ INTlist : al ← a;
         ∀x ∈ FLOATlist : xl ← x;
         #pragma omp W
         {
           σ( S' ) ∀a, x ∈ VARlist : [a/al][x/xl]
         }
The transformation ρ(S) for the reverse section contains a branch that checks whether the label l is on top of the control flow stack. If this is the case, the local values of each thread are summed up and put into the global instance that is associated with the private variable. This code is defined in S1 (4.61). The other branch statement in ρ(S) contains the adjoint statements for S' together with the substitutions of the private variables.
ρ(S) :=  if STACK(1)c.top() = l { S1 }
         if STACK(1)c.top() = LABEL( S' ) {
           ρ( S' ) ∀a, x ∈ VARlist : [a/al][x/xl][x(1)/x(1)l]
         }
Let us consider v as an element in list. This means that there are two instances of v: one global instance v{t} that can be read by all threads, and vt as a private variable of thread t. At the end of the construct associated with the lastprivate clause, the global instance of v is assigned the current value of a certain vt. Which private vt is assigned to the global instance depends on the kind of worksharing construct associated with the lastprivate clause. In case the clause is associated with a loop, the thread that executes the sequentially last iteration of the loop writes its thread-local value to the global instance v{t}. The other possible worksharing construct is the sections construct. There, the thread that executes the code of the lexically last section assigns its private value vt to the global instance v{t}.
The test for entering the loop is a2 < N, with a2 being the counting variable of the loop. The easiest approach to transforming the lastprivate(list) clause in the forward mode is to adjust the list such that each occurrence of a floating-point variable x in list leads to two elements x and x(1) in the corresponding list of the tangent-linear model. However, in case of the adjoint code we cannot use this approach, and therefore we present a more general approach for the forward mode as well, to ensure a similar implementation of both source transformations.
Our approach is to check in each iteration whether the last iteration is currently being executed. In case there are several hundred thousand or even millions of loop iterations, this test becomes a bottleneck and the lastprivate variant would probably be the better choice. Since we only consider the source transformation of the inner parallel region and we do not want to define code that is located outside of the parallel region, we keep the above transformation as it is; the software engineer who implements these techniques can decide which solution fits her or his demands.
The tangent-linear transformation of P is shown in (4.63). The sequence clausest contains no lastprivate clause, and D1 is defined as given in (4.56). The adjoint transformation of P, which is shown in (4.64), is performed in the joint reversal scheme. The body of the loop starts with the sequence of definitions D2 from (4.57). The forward section with the substitution of the variables a and x from VARlist follows D2. Subsequently, a branch statement checks whether the current iteration is the last iteration. If this is the case, we perform the dataflow that is defined through the lastprivate clause. This means we assign the values of the thread-local variables al and xl to the global instances a and x. We do not need an atomic statement since only one thread performs these assignments. Please note the assignment x(1)l ← x(1) inside the branch statement with the test expression (a2 = N − 1) shown in (4.64).
τ(P) :=  #pragma omp parallel for [clausest]
         for (a2 ← 0; a2 < N; a2 ← a2 + 1)
         {
           D1
           τ(S) ∀a, x ∈ VARlist : [a/al][x/xl][x(1)/x(1)l]
           if (a2 = N − 1) {
             ∀a ∈ INTlist : a ← al;
             ∀x ∈ FLOATlist : x ← xl; x(1) ← x(1)l;
           }
         }                                                          (4.63)
σ(P) :=  #pragma omp parallel for [clausesa]
         for (a2 ← 0; a2 < N; a2 ← a2 + 1)
         {
           D2
           σ(S) ∀a, x ∈ VARlist : [a/al][x/xl]
           if (a2 = N − 1) {
             ∀a ∈ INTlist : a ← al;
             ∀x ∈ FLOATlist : x ← xl; x(1)l ← x(1);
           }
           while not STACK(1)c.empty() {
             ρ(S) ∀a, x ∈ VARlist : [a/al][x/xl][x(1)/x(1)l]
           }
         }                                                          (4.64)
This assignment is an adjoint assignment that occurs before the code of the reverse section. With this assignment we perform the reversal of the lastprivate clause. At the beginning of the reverse section all adjoint variables have to be zero. For the variable x(1)l this means that it would be initialized with zero and the thread that performed the lastprivate dataflow in the forward section would increment the adjoint x(1)l by the value of x(1). These two steps are semantically equivalent to the assignment x(1)l ← x(1). We could perform this reversal of the dataflow similarly to the solution of the previous section, where we used a label l to mark the position at which the reverse section has to execute the reversal of the dataflow. However, we select the simpler solution, despite the commonly used code pattern of having only adjoint assignments inside the reverse section.
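The following hedged sketch shows this joint-reversal pattern by hand for a loop of the form of (4.63)/(4.64); it is illustrative only, the names and the loop body are not taken from the book. Each iteration checks whether it is the sequentially last one; that iteration performs the lastprivate dataflow x ← x_l and seeds the private adjoint with x_a_l ← x_a before the adjoint of the body is executed.

/* Hand-written analogue of (4.64) for
   #pragma omp parallel for lastprivate(x); illustrative names only. */
void h_adjoint(int N, double *x, const double *x_a,
               const double *v, double *v_a) {
  #pragma omp parallel for
  for (int a2 = 0; a2 < N; a2++) {
    double x_l   = 0.;           /* private substitute for x (D2)     */
    double x_a_l = 0.;           /* its adjoint                       */
    x_l = 3. * v[a2];            /* forward section of the body S     */
    if (a2 == N - 1) {           /* lastprivate dataflow              */
      *x    = x_l;
      x_a_l = *x_a;              /* reversal of the lastprivate flow  */
    }
    v_a[a2] += 3. * x_a_l;       /* reverse section: adjoint of S     */
  }
}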
In this section, the parallel region P is supposed to have the structure of a combined parallel sections construct.
The definition of the syntax in (4.54) determines that each section has its own sequence of definitions and its own sequence of statements. The label of the lastprivate clause is assumed to be l. The sequences of definitions are referred to as Dl,{1,...,N} and the sequences of statements as Sl,{1,...,N}, where N ∈ ℕ. The index l merely serves to avoid confusion with the already defined sequences D1, D2, and so forth. At this point, there is no connection between Dl,1 and the lastprivate clause with label l.
We define two sequences of definitions D5 and D6 that consist of the variables
listed in the lastprivate clause together with their tangent-linear or adjoint asso-
ciates. The sequence of definitions for the tangent-linear code is
D5 := {int al; | a ∈ INTlist} ∪ {double xl; double x(1)l ← 0.; | x ∈ FLOATlist}
τ(P) :=  #pragma omp parallel sections [clausest]
         {
           #pragma omp section {
             τ( Dl,1 )
             D5
             τ( Sl,1 ) ∀a, x ∈ VARlist : [a/al][x/xl][x(1)/x(1)l]
           }
             ...
           #pragma omp section {
             τ( Dl,N )
             D5
             τ( Sl,N ) ∀a, x ∈ VARlist : [a/al][x/xl][x(1)/x(1)l]
             ∀a ∈ INTlist : a ← al;
             ∀x ∈ FLOATlist : x ← xl; x(1) ← x(1)l;
           }
         }                                                          (4.65)
where clausest may not contain a lastprivate clause. Each section defines the pri-
vate variables that are listed in the lastprivate clause. The code inside the last
section contains in addition a sequence of assignments that is semantically equiv-
alent to the dataflow performed by the lastprivate clause.
The last section also contains the assignment x(1)l ← x(1) that reverses the dataflow of the lastprivate clause. This is the same solution as in the combined parallel loop construct.
σ(P) :=  #pragma omp parallel sections [clausesa]
         {
           #pragma omp section {
             σ( Dl,1 )
             D6
             σ( Sl,1 ) ∀a, x ∈ VARlist : [a/al][x/xl]
             while not STACK(1)c.empty() {
               ρ( Sl,1 ) ∀a, x ∈ VARlist : [a/al][x/xl][x(1)/x(1)l]
             }
           }
             ...
           #pragma omp section {
             σ( Dl,N )
             D6
             σ( Sl,N ) ∀a, x ∈ VARlist : [a/al][x/xl]
             ∀a ∈ INTlist : a ← al;
             ∀x ∈ FLOATlist : x ← xl; x(1)l ← x(1);
             while not STACK(1)c.empty() {
               ρ( Sl,N ) ∀a, x ∈ VARlist : [a/al][x/xl][x(1)/x(1)l]
             }
           }
         }                                                          (4.66)
S :=  #pragma omp for lastprivate(list) [clauses]
      {
        a2 ← 0;
        while (a2 < N) { S' ; a2 ← a2 + 1; }
      }                                                             (4.67)
To register the thread that executes the sequentially last iteration, we use an auxiliary variable al,l, indexed twice with the label l to avoid confusion with the variable al. While al is the identifier for the local copy of the variable a contained in the list of lastprivate variables, al,l is used to mark the thread that executes the sequentially last iteration. According to (4.67), the variable al,l stores the current value of the counting variable a2. Once the loop has been executed, only one thread has a private variable al,l that contains the value N − 1. This method avoids performing in every iteration the test that checks whether the current iteration is the last one, which should be a performance gain when the loop has millions of iterations. The sequence of definitions has to be adjusted such that it contains the auxiliary variable al,l. Thus, the tangent-linear transformation τ(P) is supposed to use the sequence of definitions D7 given in (4.68), and the adjoint source transformation σ(P) is assumed to use the corresponding sequence of definitions given in (4.69). In each iteration of (4.70) and (4.71), the assignment al,l ← a2 stores the current value of the counting variable a2.
τ(S) :=  al,l ← 0;
         #pragma omp for [clausest]
         {
           a2 ← 0;
           while (a2 < N) {
             τ( S' ) ∀a, x ∈ VARlist : [a/al][x/xl][x(1)/x(1)l]
             al,l ← a2;
             a2 ← a2 + 1;
           }
         }
         if (al,l = N − 1) {
           ∀a ∈ INTlist : a ← al;
           ∀x ∈ FLOATlist : x ← xl; x(1) ← x(1)l;
         }                                                          (4.70)
σ(S) :=  al,l ← 0;
         #pragma omp for [clausesa]
         {
           a2 ← 0;
           while (a2 < N) {
             σ( S' ) ∀a, x ∈ VARlist : [a/al][x/xl]
             al,l ← a2;
             σ( a2 ← a2 + 1 )
           }
         }
         if (al,l = N − 1) {
           ∀a ∈ INTlist : a ← al;
           ∀x ∈ FLOATlist : x ← xl;
           STACK(1)c.push(l)
         }                                                          (4.71)
The branch statement that follows the loop construct checks whether or not the local variable al,l has the value N − 1. The thread for which this check succeeds must perform the dataflow connected with the lastprivate clause.
The source transformation for the forward section of the adjoint code is defined in (4.71). The label of the lastprivate clause is assumed to be l. The structure of the transformation is similar to the tangent-linear one but, in addition, we push the label l onto the control flow stack to indicate that the thread with this label on its stack has to reverse the dataflow associated with the lastprivate clause.
The code for the reverse section
ρ(S) :=  if STACK(1)c.top() = l {
           ∀x ∈ FLOATlist : x(1)l ← x(1);
         }
         ρ( S' ; a2 ← a2 + 1 ) ∀a, x ∈ VARlist : [a/al][x/xl][x(1)/x(1)l]
contains two parts. The first part is a branch where the test expression is valid if
the executing thread has the label l on its control flow stack. In case that the test
is successful, the executing thread has to reverse the dataflow connected with the
lastprivate clause. We use an assignment and not an incremental assignment to set
the value of the adjoint variable x(1)l . This is correct since this adjoint variable is a
local variable inside the worksharing construct and its value is therefore zero at the
beginning of the adjoint part. The second part contains the adjoint statements of the loop body.
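The last-iteration marker itself can be illustrated with a small hedged sketch; it is not the SPLc output and the names are made up. Each thread records in a private variable the index of the last iteration it executed; after the worksharing loop, only the thread whose marker equals N − 1 performs the lastprivate dataflow, so no per-iteration test is needed.

/* Hand-written illustration of the marker idiom behind (4.70)/(4.71). */
void last_iteration_marker(int N, const double *v, double *x) {
  #pragma omp parallel
  {
    double x_l  = 0.;    /* private substitute for the lastprivate x */
    int    a_ll = -1;    /* auxiliary marker a_{l,l}                 */
    #pragma omp for
    for (int a2 = 0; a2 < N; a2++) {
      x_l  = 3. * v[a2]; /* loop body with x replaced by x_l          */
      a_ll = a2;         /* remember the last iteration of this thread */
    }
    if (a_ll == N - 1)   /* exactly one thread passes this test       */
      *x = x_l;          /* lastprivate dataflow                      */
  }
}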
We assume that the parallel region has the sequences of definitions as defined for the worksharing loop in (4.68) and (4.69). The tangent-linear transformation is similar to the one of the combined parallel sections construct shown in (4.65). The only difference is that the sequences of definitions D5 and τ(Dl,1), ..., τ(Dl,N) are missing. For this reason we do not display the definition of τ(S) here and refer to (4.65).
The adjoint source transformation differs from (4.66) because we used the joint
reversal scheme there and we use here the split reversal scheme for the stand-alone
worksharing construct. The transformation for the forward section part is
σ( S' ) :=  #pragma omp sections [clausest]
            {
              #pragma omp section
              { σ( Sl,1 ) ∀a, x ∈ VARlist : [a/al][x/xl] }
                ...
              #pragma omp section
              { σ( Sl,N−1 ) ∀a, x ∈ VARlist : [a/al][x/xl] }
              #pragma omp section
              {
                σ( Sl,N ) ∀a, x ∈ VARlist : [a/al][x/xl]
                ∀a ∈ INTlist : a ← al;
                ∀x ∈ FLOATlist : x ← xl;
                STACK(1)c.push(l)
              }
            }
where we perform the dataflow connected with the lastprivate clause in the last
section. We assume that the label of the lastprivate clause is l. Therefore, we push
label l onto the control flow stack to indicate that the dataflow connected to the
lastprivate clause has to be reversed during the reverse section.
The code for the reverse section contains the adjoint statements for all the sections Sl,1 to Sl,N. In addition, it contains a branch statement that checks whether the label l is on top of the control flow stack. In this case, the executing thread has to reverse the dataflow that is connected with the lastprivate clause.
For simplicity, we assume that all variables in list are floating-point variables. According to the description in the OpenMP 3.1 specification [33] (page 52), P can be expressed equivalently without the reduction clause; this rewriting is shown in (4.72),
where clausesr does not contain a reduction clause and the label of the reduction
clause in P is assumed to be l. For each element x ∈ FLOATlist each thread creates
a private instance xl that is initialized with ID(⊗). ID maps the given operator to
its associated identity element. This means that in case of an addition, ID(+) is zero, and in case of a multiplication, ID(∗) yields the value one.
The reader may ask why the reduction clause has been introduced in OpenMP
when it can be expressed by the above method. The answer is that a reduction is a
widely used operation in scientific computing and the clause hides the underlying
synchronization need from the user. Another advantage is that the back-end com-
piler can decide how this synchronization is being implemented on a lower level.
The fact that the reduction clause often occurs in scientific computing motivates
this section although we already have the tools at hand to transform (4.72).
Reduction of a Sum
Let us assume that P has the shape
P :=  #pragma omp parallel reduction(+:list)
      { D S }
where list only contains floating-point variables. We define for each variable in
list a private variable that is initialized with zero. In addition we need a derivative
component for this private variable. The sequence of definitions for the tangent-
linear code is
D9 := τ(D) ∪ {double xl ← 0.; double x(1)l ← 0.; | x ∈ FLOATlist}
where
S5 :=  #pragma omp atomic
       x +← xl;
       x(1)l ← x(1);                           x ∈ FLOATlist
is the sequence of statements that is responsible for the reduction. More precisely, the first statement in S5 performs the reduction with an atomic assignment, whereas the second assignment is the adjoint counterpart of the first assignment. We do not use an atomic construct for the second assignment since the left-hand side reference x(1)l is a private variable. As in the section where we covered the lastprivate clause, we set the adjoint variable here despite the fact that the execution is still in the forward section.
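A hand-written analogue of this pattern may look as follows; it is a hedged sketch with illustrative names, not the literal transformation output. Each thread keeps a private partial sum x_l with adjoint x_a_l; the partial sums are folded into x atomically, the private adjoint is seeded from the global adjoint as in S5, and the reverse section propagates x_a_l into the inputs.

/* Hand-written sketch of the reduction(+:x) reversal; names illustrative. */
void sum_reduction_adjoint(int n, const double *v, double *v_a,
                           double *x, const double *x_a) {
  #pragma omp parallel
  {
    double x_l   = 0.;              /* identity element ID(+)            */
    double x_a_l = 0.;
    #pragma omp for                 /* forward section                   */
    for (int i = 0; i < n; i++)
      x_l += v[i] * v[i];
    #pragma omp atomic              /* first statement of S5             */
    *x += x_l;
    x_a_l = *x_a;                   /* second statement of S5: adjoint   */
    #pragma omp for                 /* reverse section                   */
    for (int i = n - 1; i >= 0; i--)
      v_a[i] += 2. * v[i] * x_a_l;
  }
}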
Suppose that a worksharing construct W inside of the parallel region P is aug-
mented with a reduction clause as shown in (4.76). We expect in list again only
floating-point variables.
S := #pragma omp W reduction(+:list) { S' }                         (4.76)
Since we substitute the variables contained in list we assume that τ (P) contains
D9 , and σ (P) includes D10 as their sequence of definitions, see (4.73) and (4.75).
The tangent-linear source transformation is defined as
τ(S) :=  #pragma omp W
         { τ( S' ) ∀x ∈ FLOATlist : [x/xl][x(1)/x(1)l] }
         S4                                                         (4.77)
with
S6 :=  #pragma omp atomic
       x +← xl;                                x ∈ FLOATlist
The single construct stores the values of the variables from list in the auxiliary variables zl,x. The implicit barrier ensures that no thread enters the worksharing construct before all the values have been stored. Before the reduction is started by executing the sequence S6, we mark this position with the label l on the control flow stack. This label indicates to the reverse section that the adjoint counterpart of the reduction must be executed. In S6, we use an atomic assignment for each x in list to perform the reduction operation. The barrier at the end of σ(S) ensures that all threads have finished the reduction before its adjoint counterpart is executed. In case there are multiple reduction clauses, a better solution would be to emit one barrier between the forward and the reverse section.
The code that belongs to the reverse section is
ρ(S) :=  if STACK(1)c.top() = LABEL( S' ) {
           ρ( S' ) ∀x ∈ FLOATlist : [x/xl][x(1)/x(1)l]
         }
         if STACK(1)c.top() = l {
           ∀x ∈ FLOATlist : x(1)l ← x(1);
           #pragma omp single
           { ∀x ∈ FLOATlist : x ← zl,x; }
         }                                                          (4.79)
where the first branch statement contains the adjoint statements for the body of the worksharing construct. The second branch statement is responsible for two things. On the one hand, the reversal of the dataflow inherited from the reduction is performed. On the other hand, the single construct restores the values of the global instances of all elements of list. One could ask why this happens at this point, where the reverse section part of the worksharing construct still lies ahead. But this does not influence the adjoint code of the worksharing construct since all the instances of x from list are substituted by xl.
Reduction of a Multiplication
The parallel region P is defined as
P :=  #pragma omp parallel reduction(*:list)
      { D S }
Again, we assume that list only contains floating-point variables. We have to ini-
tialize the private copies of variable x contained in list with the identity element
of the multiplication. The sequence of definitions for the tangent-linear source
transformation is
D11 := τ(D) ∪ {double xl ← 1.; double x(1)l ← 0.; | x ∈ FLOATlist}
where, for each variable in list, the following auxiliary variables are necessary: z1,x is a vector with p elements, whereas z2,x is a scalar variable.
The tangent-linear transformation of P is
τ(P) :=  #pragma omp parallel
         {
           D11
           τ(S) ∀x ∈ FLOATlist : [x/xl][x(1)/x(1)l]
           #pragma omp critical
           { S7 }
         }                                                          (4.80)
where
(" # )
(1)
x(1) ← x(1) ∗ xl + xl ∗ x;
S7 := x ∈ FLOATlist (4.81)
x ← x ∗ xl ;
shape x ∗← e. OpenMP, on the other hand, knows this shape and allows such an assignment to be declared as atomic. The atomic assignment would certainly be the better alternative to a critical region, but here we do not have a choice.
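The reason becomes apparent in a hand-written sketch of the tangent-linear multiplication reduction; it is an illustration with made-up names, not the SPLc output. The two coupled updates corresponding to S7 read and write both shared variables x and its tangent, so a single atomic update of the form x ∗← e cannot cover them and a critical region is used instead.

/* Hand-written sketch of the tangent-linear multiplication reduction
   following the pattern of (4.80)/(4.81); names illustrative.        */
void prod_reduction_tangent(int n, const double *v, const double *v_t,
                            double *x, double *x_t) {
  #pragma omp parallel
  {
    double x_l   = 1.;          /* identity element ID(*) as in D11   */
    double x_t_l = 0.;          /* tangent of the partial product     */
    #pragma omp for
    for (int i = 0; i < n; i++) {
      x_t_l = x_t_l * v[i] + x_l * v_t[i];   /* tangent of x_l *= v[i] */
      x_l   = x_l * v[i];
    }
    #pragma omp critical        /* S7: coupled updates of x_t and x   */
    {
      *x_t = *x_t * x_l + x_t_l * *x;
      *x   = *x * x_l;
    }
  }
}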
In Section 4.2.3, we saw that situations can occur where it is necessary to store
the order of the threads that execute a certain part of P. The reason for this is
the fact that a reference occurs on both sides of an assignment. We can avoid the
overhead connected to the storing of the thread order in case of a multiplication
as reduction operation. This is done by calculating the derivative of the reduction
result with respect to all its input references. The result of a reduction that is
connected with a multiplication of a floating-point variable x is
xa ← xb · ∏_{t=0}^{p−1} x^t                                         (4.82)
where xa denotes the state of the reference x after the reduction, and xb the state of the reference x before the reduction. This means that the references &xa and &xb are the same. We need this notation to distinguish between these two instances of x. The private copies of the individual threads are referred to as x^0, ..., x^{p−1}. For illustration purposes, we assume for the moment that none of x^0, ..., x^{p−1} is zero. Then the adjoint model of
(4.82) can be expressed by:
x(1)^0 +← x(1)a · xa / x^0;                                         (4.83)
x(1)^1 +← x(1)a · xa / x^1;
   ...
x(1)^{p−1} +← x(1)a · xa / x^{p−1};                                 (4.84)
x(1)b ← x(1)a · xa / xb;                                            (4.85)
Since the adjoint references of all thread local references are zero at the beginning
of the reverse section, we can rewrite the assignments from (4.83) to (4.84) as
common assignments instead of incremental assignments. (4.85) indicates that the
result before the reduction as well as the result after the reduction are necessary.
In order to cover the special case that an intermediate result can be zero, we use
several auxiliary variables. The number of threads is denoted by p.
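The underlying arithmetic can be sketched by hand as follows; this is a hedged illustration with made-up names, not the book's transformation output. The partial product of every thread is stored in z1[t] and the value of x before the reduction in z2; the adjoint of each partial product is then the adjoint of the result times the product of z2 and all other partial products, so no division is needed and no thread order has to be recorded.

/* Hand-written sketch of the idea behind (4.83)-(4.85); names illustrative. */
void prod_reduction_adjoint_fold(int p, const double *z1, double z2,
                                 double x_a, double *z1_a, double *z2_a) {
  #pragma omp parallel for
  for (int t = 0; t < p; t++) {       /* one adjoint per partial product */
    double prod = z2;                 /* x before the reduction          */
    for (int s = 0; s < p; s++)
      if (s != t)
        prod *= z1[s];                /* all partial products except own */
    z1_a[t] += x_a * prod;            /* (4.83)-(4.84) without division  */
  }
  double prod_all = 1.;
  for (int s = 0; s < p; s++)
    prod_all *= z1[s];
  *z2_a += x_a * prod_all;            /* (4.85): adjoint of x before     */
}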
The joint reversal scheme of P is shown in (4.86). There, we assume that the
reduction(list) clause is labeled with l and list contains the floating-point variable
x. D12 defines the local copy xl of x. Before the augmented forward section of
S starts, a single thread stores the value of x in z2,x . The other threads wait at
the implicit barrier of the single construct. After the forward section of S, the
intermediate result of each thread is stored by
z1,x,t ← xl ;
The reduction is performed with the help of a critical region. This critical region ends the forward section of P, and a barrier ensures that all threads have finished their forward section.
The reverse section starts by setting the local adjoint x(1)l. This assignment corresponds to the assignments shown in (4.83) to (4.84), where the fraction xa / x^t is represented here by a product. The product multiplies all intermediate values except the one from the current thread (omp_get_thread_num ≠ t). Eventually, the variable x is restored, and the adjoint x(1) is computed inside of a single construct.
Now, we explain how a reduction connected with a worksharing construct W is transformed. Let S be a worksharing construct W that is augmented with a multiplication reduction clause. In the reverse section (4.89), the label l is first popped from the stack. Subsequently, the adjoints of the local copies x(1)l are computed. This assignment corresponds to the assignments shown in (4.83) to (4.84). The product consists of the multiplication of all intermediate results except the one from the current thread (omp_get_thread_num ≠ t). Afterwards, the single construct is used to restore the value of x and to compute the adjoint x(1). The value of the adjoint is given by the product of all intermediate values z1,x,0,...,p−1.
σ(P) :=  #pragma omp parallel
         {
           D12
           #pragma omp single
           { ∀x ∈ FLOATlist : z2,x ← x; }
           σ(S) ∀x ∈ FLOATlist : [x/xl]
           ∀x ∈ FLOATlist : z1,x,t ← xl;
           #pragma omp critical
           { ∀x ∈ FLOATlist : x ← x ∗ xl; }
           #pragma omp barrier
           ∀x ∈ FLOATlist :
             x(1)l ← x(1) · z2,x · ∏_{t=0, omp_get_thread_num ≠ t}^{p−1} z1,x,t;
           while not STACK(1)c.empty() {
             ρ(S) ∀x ∈ FLOATlist : [x/xl][x(1)/x(1)l]
           }
           #pragma omp single
           {
             ∀x ∈ FLOATlist : x ← z2,x;
             ∀x ∈ FLOATlist : x(1) ← x(1) · ∏_{t=0}^{p−1} z1,x,t;
           }
         }                                                          (4.86)
τ(S) :=  #pragma omp W
         { τ( S' ) ∀x ∈ FLOATlist : [x/xl][x(1)/x(1)l] }
         #pragma omp critical
         { S7 }                                                     (4.87)
σ(S) :=  #pragma omp single
         { ∀x ∈ FLOATlist : z2,x ← x; }
         #pragma omp W
         { σ( S' ) ∀x ∈ FLOATlist : [x/xl] }
         ∀x ∈ FLOATlist : z1,x,t ← xl;
         STACK(1)c.push(l)
         #pragma omp critical
         { ∀x ∈ FLOATlist : x ← x ∗ xl; }
         #pragma omp barrier                                        (4.88)
ρ(S) :=  if STACK(1)c.top() = LABEL( S' ) {
           ρ( S' ) ∀x ∈ FLOATlist : [x/xl][x(1)/x(1)l]
         }
         if STACK(1)c.top() = l {
           STACK(1)c.pop();
           x(1)l ← x(1) · z2,x · ∏_{t=0, omp_get_thread_num ≠ t}^{p−1} z1,x,t;
           #pragma omp single
           {
             ∀x ∈ FLOATlist :
               x ← z2,x;
               x(1) ← x(1) · ∏_{t=0}^{p−1} z1,x,t;
           }
         }                                                          (4.89)
4.5 Summary
In this chapter we covered the case that the input code for the AD source trans-
formation consists of an OpenMP parallel region. This region can consist of the
different possible directives and constructs described in the OpenMP 3.1 standard.
In order to obtain rules for an AD source transformation of the possible OpenMP pragmas, we took the context-free grammar for the language SPL together with the transformation rules for SPL from Chapter 2 as a starting point. The production rules of the grammar were extended step by step throughout this chapter. In addition, the corresponding tangent-linear and adjoint source transformation rules were defined.
The first decision a software engineer who wants to implement an adjoint source transformation has to make is how to store the values that can potentially be overwritten during the execution of the forward section. We presented two possibilities in Section 4.1, both of which have their assets and drawbacks.
In Section 4.2 we introduced the transformations of the synchronization con-
structs barrier, master, critical, and atomic of OpenMP. We call the language that
is produced by the context-free grammar G OMP together with rules for recog-
nizing the syntax of the synchronization constructs SPLOMP 1. This was a very important section because, with the help of the synchronization constructs, we were able to show the closure property of our source transformations in Section 4.2.5.
The two most important worksharing constructs of OpenMP are the loop con-
struct and the sections construct. We covered these constructs in Section 4.3.
The language that contains the worksharing constructs of OpenMP is denoted by
SPLOMP 2. The combined constructs parallel for and parallel sections were also part of this section. The adjoint transformation of the combined constructs differs from the adjoint transformation of the stand-alone worksharing constructs: for the former, we used the joint reversal scheme; for the latter, the split reversal scheme.
OpenMP allows the definition of several clauses that control the data sharing among threads. These data-sharing options were covered in Section 4.4. The language that arises from applying all the production rules of this chapter is referred to as SPLOMP 3. We introduced the transformations of the threadprivate directive and, in addition, the transformations of the private, firstprivate, lastprivate, and reduction clauses. Again, we had to distinguish between the joint reversal and the split reversal scheme, depending on whether the data-sharing clause was defined for a combined construct or for a stand-alone worksharing construct.
5 Experimental Results
The previous chapter presented source code transformation rules that take a parallel region as input and provide the tangent-linear or the adjoint model as output
code. These source transformation rules have been implemented in a tool called
SPLc which is described in Appendix A. The current chapter presents runtime re-
sults of several derivative codes and of applications where derivative codes are
used. The corresponding derivative codes were all generated by SPLc.
Our target computer architecture was a shared memory system that is a node of the high performance computing cluster of RWTH Aachen University. This cluster was ranked 37th in the TOP 500 supercomputer list1 when the system was established in 2011. The node of the system that we use for testing our approach is a compound of four physical nodes, where the connection between these physical nodes is provided by the proprietary Bull Coherent Switch (BCS). Each physical node is equipped with a board carrying four Intel X7550 processors. The X7550 processors are clocked at 2 GHz and have eight cores each. In total, the system node has 16 sockets with 8 cores per socket, which allows the user to run programs that exploit 128 cores and up to one terabyte of memory. Detailed information about the architecture can be found in Table 5.1.
To measure the performance of the tangent-linear and the adjoint models that are
the result of the source transformation rules from Chapter 4, we will present a test
suite that contains example codes with different OpenMP pragmas. The derivative
codes from these code examples are obtained by applying our tool SPLc. Two
compilers are used to build the binaries containing the derivative codes. The first
compiler is from the GNU Compiler Collection in version 4.8.0 and the second
compiler is the Intel Compiler in version 14.0. The detailed runtime results from
the test suite are presented in Section 5.1. To investigate the scaling properties
of a second derivative code, we choose one test case from the test suite. By reapplying SPLc to the first derivative code of this test case, we obtain the second derivative code. The scaling properties of the second derivative codes are presented in Section 5.2.
In Section 5.3, we investigate the performance that we gain from applying the exclusive read analysis (ERA).
1 www.top500.org
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 16
NUMA node(s): 16
Vendor ID: GenuineIntel
CPU family: 6
Model: 46
Stepping: 6
CPU MHz: 2000.175
BogoMIPS: 3999.51
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 18432K
Table 5.1: The output of the shell command lscpu provides a detailed description about the
test environment’s architecture.
This static program analysis was introduced in Chapter 3. We apply the source transformation twice: on the one hand, we generate the derivative code without using the ERA; on the other hand, we exploit the additional information of the static analysis to generate the derivative code. The difference between the two source transformation results is the number of atomic statements.
The least-squares problem from the introduction (Section 1.1.1) was imple-
mented and its scaling properties are presented in Section 5.4. The second mo-
tivating example in the introduction was the nonlinear constrained optimization
problem that we introduced in Section 1.2.3. The runtime results of the implemen-
tation are shown in Section 5.5.
5.1 Test Suite
The listings of the example codes can be found in Appendix B. The computation kernel in each example code makes use of two dynamically allocated arrays: the one-dimensional array x and the two-dimensional array A. The size of x is around one megabyte, the size of A about 125 gigabytes.
We used two compilers for the runtime tests, the Intel compiler version 14.0 and the GNU Compiler Collection (GCC) version 4.8.0. Subsequently, we abbreviate these two compilers by icpc and g++. Both compilers support a certain set of static program analysis techniques. Since each of the compilers supports a command line option for disabling program optimization (-O0) and for sophisticated program optimization (-O3), we use these two options to obtain two different binaries. When we note that optimization level 0 has been used for a test, we compiled the source files with the command line option -O0. Alternatively, we write optimization level 3 when the corresponding binary was compiled with -O3. Please note that these optimization levels have nothing to do with the exclusive read analysis.
In order to achieve precise results in addition to only measuring the runtime,
we use hardware counters to measure the CPU cycles and the floating-point opera-
tions. For accessing these hardware counters, we use the Performance API (PAPI)
in version 5.2.02 . For a given runtime test we display the runtime in seconds, CPU
cycles, million floating-point operations per second (MFLOPS), and the speedup.
MFLOPS is a commonly used measure of computer performance, typically used
2 The PAPI project website is http://icl.cs.utk.edu/papi/software/.
in the domain of scientific computing. The speedup is the ratio between two executions: one execution is performed with only one thread, the other uses a group of, for example, p threads. The number of CPU cycles of the run with one thread divided by the number of CPU cycles with p threads yields the speedup.
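A hedged sketch of such a measurement using the PAPI 5.x high-level interface is shown below; it is not the actual test-suite driver, the function name is made up, and whether the PAPI_FP_OPS event is available depends on the CPU.

#include <stdio.h>
#include <papi.h>

/* Read total cycles and floating-point operations around a kernel call and
   derive the MFLOPS rate; the speedup is obtained separately by dividing the
   single-thread cycle count by the p-thread cycle count.                   */
double measure_mflops(void (*kernel)(void)) {
  int events[2] = { PAPI_TOT_CYC, PAPI_FP_OPS };
  long long counters[2];
  PAPI_library_init(PAPI_VER_CURRENT);
  long long t0 = PAPI_get_real_usec();
  PAPI_start_counters(events, 2);
  kernel();                                   /* e.g. the parallel region */
  PAPI_stop_counters(counters, 2);
  long long t1 = PAPI_get_real_usec();
  double seconds = (t1 - t0) * 1e-6;
  double mflops  = counters[1] / seconds * 1e-6;
  printf("cycles=%lld  MFLOPS=%.1f\n", counters[0], mflops);
  return mflops;
}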
The scalability of the three different codes, namely the original code P, the tangent-linear code τ(P), and the adjoint code σ(P), is measured by using an increasing number of threads ranging from one to 128. All measured values are displayed in a table, and for better illustration we show for each table a bar chart that summarizes the speedup and MFLOPS results. An example bar chart is shown in Figure 5.1. The x-axis represents different executions with an increasing number of threads, starting with only one thread and ending with the maximum number of threads that the test architecture provides (for example 16 in Figure 5.1). The bar chart has two y-axes: the left y-axis shows the speedup values (light gray bars), the right y-axis shows the MFLOPS values (dark gray bars). The results from each individual execution are shown in a group of six bars. The six bars can be seen as three pairs of light and dark gray bars. The first pair displays the results of the original code, the second pair corresponds to the results of the tangent-linear code, and the last pair displays the adjoint code results. This is illustrated for the execution with four threads below Figure 5.1.
Figure 5.1: Example plot for describing the bar charts in this chapter.
Comparing the CPU cycles of the three codes, one recognizes that the tangent-linear code evaluation takes about twice as many cycles as the evaluation of the original code. The execution of the adjoint code needs about five times more CPU cycles than the evaluation of the original parallel region. These factors fit our experience with AD source transformation, and similar factors are presented in [57]. Comparing the columns that show
the MFLOPS values, one recognizes that the values for the original code and the tangent-linear code are very similar. If we compare the MFLOPS values of the original code with those of the adjoint code, then the adjoint code execution has a rate that lies between one third and two thirds of the MFLOPS values of the original code. This is probably the impact of the stack operations which store and restore data of a size of 250 GB.
The lower part of the table shows the memory consumption of the evaluation of the adjoint code. Overall, the evaluation of the forward section needs 250 gigabytes of data, where half of the memory is consumed by STACKf and each of STACKi and STACKc needs one fourth of the 250 gigabytes. These numbers are halved each time we double the number of threads, since the data for each thread is cut in half. For example, when we use 128 threads to evaluate the parallel region, each thread has only about two gigabytes of stack data to store. This fact, together with the fact that the test machine supports the NUMA architecture (see Table 5.1), allows each individual thread to use memory that is close to the CPU core that executes the code for this thread.
Table 5.3 shows the results of a test run where we use the Intel compiler and
optimization level 3 (-O3). The original code execution needs 9590 CPU cycles
(1032 MFLOPS) with one thread and it takes 170 CPU cycles (58149 MFLOPS)
when the computation is made by 128 threads simultaneously. The tangent-linear
code consumes 10494 CPU cycles (1056 MFLOPS) when 1 thread is used and
128 threads reduce the runtime to 308 CPU cycles (36040 MFLOPS). The adjoint
code takes about three times longer than the original code. If only one thread is used, the runtime is 30964 CPU cycles (1265 MFLOPS), and with 128 threads the computation lasts 580 CPU cycles (67510 MFLOPS). The original code achieves its highest speedup value of 56 with 128 threads. The tangent-linear code only reaches a speedup of 34, whereas the adjoint code scales with a value of 53.
The MFLOPS values as well as the speedup values for the binary compiled without optimization (-O0) are much better than the values obtained with enabled optimization (-O3). A possible reason for this is that the code optimization of the Intel compiler is better suited for improving the sequential execution than the parallel execution. One fact is that the number of CPU cycles of the tangent-linear code in Table 5.3 is almost the same as the number for the original code, but in Table 5.2 we see that the number of operations is almost doubled by the computation of the derivative values. Without optimization, the adjoint code needs almost four times more CPU cycles than the original code. With optimization, the factor between the CPU cycles of the original code and the adjoint code is about three. This means the code optimization of the Intel compiler improves the tangent-linear code such that the derivative computation comes almost for free.
Table 5.2: Results of the test ‘plain parallel region O0 icpc’ (cf. Figure 5.2).
Original code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 1299 25635 1.0 1296
2 661 13043 1.97 2547
4 329 6521 3.93 5094
8 162 3213 7.98 10339
16 82 1615 15.86 20562
32 41 817 31.38 40667
64 21 409 62.59 81128
128 11 209 122.33 158559
Tangent-linear code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 2629 52031 1.0 1319
2 1336 26430 1.97 2597
4 674 13326 3.9 5152
8 336 6618 7.86 10374
16 164 3258 15.97 21069
32 83 1642 31.68 41810
64 42 818 63.57 83899
128 21 425 122.38 161508
Adjoint code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 6032 113644 1.0 898
2 2997 57681 1.97 1771
4 1586 29332 3.87 3483
8 807 14693 7.73 6955
16 388 7245 15.68 14102
32 201 3701 30.71 27625
64 121 2130 53.35 47993
128 96 1639 69.3 62330
Table 5.3: Results of the test ‘plain parallel region O3 icpc’ (cf. Figure 5.3).
Original code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 486 9590 1.0 1032
2 261 5149 1.86 1923
4 151 2916 3.29 3400
8 75 1497 6.4 6618
16 37 748 12.81 13243
32 19 380 25.22 26068
64 11 230 41.63 43048
128 8 170 56.24 58149
Tangent-linear code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 533 10494 1.0 1056
2 270 5327 1.97 2081
4 138 2739 3.83 4050
8 69 1372 7.65 8078
16 34 682 15.37 16246
32 20 411 25.52 26983
64 18 366 28.62 30318
128 15 308 34.01 36040
Adjoint code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 1733 30964 1.0 1265
2 864 15497 2.0 2528
4 441 7995 3.87 4901
8 218 3951 7.84 9917
16 109 1994 15.52 19642
32 56 1039 29.79 37705
64 45 805 38.43 48635
128 31 580 53.33 67510
Figure 5.2: Plot for the test ‘plain parallel region O0 icpc’. The actual values are shown in
Table 5.2. The original as well as the derivative codes are presented in Appendix B.1.
Figure 5.3: Plot for the test ‘plain parallel region O3 icpc’. Table 5.3 presents the values of
this chart. The original as well as the derivative codes are presented in Appendix B.1.
Figure 5.4: Plot for the test ‘plain parallel region O0 g++’. Table 5.4 displays the corre-
sponding values. The original as well as the derivative codes are presented in Appendix
B.1.
Figure 5.5: Plot for the test ‘plain parallel region O3 g++’. The values illustrated in this
chart can be found in Table 5.5. The original as well as the derivative codes are presented
in Appendix B.1.
Table 5.4: Results of the test ‘plain parallel region O0 g++’ (cf. Figure 5.4).
Original code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 6074 120129 1.0 192
2 3142 61862 1.94 373
4 1568 30932 3.88 747
8 770 15251 7.88 1515
16 382 7570 15.87 3053
32 192 3807 31.55 6068
64 95 1894 63.43 12201
128 48 942 127.44 24515
Tangent-linear code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 12185 240801 1.0 194
2 6208 122263 1.97 383
4 3104 60946 3.95 769
8 1533 30301 7.95 1548
16 765 15144 15.9 3098
32 383 7560 31.85 6205
64 193 3800 63.36 12345
128 97 1926 125.0 24357
Adjoint code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 19658 379776 1.0 181
2 10007 195077 1.95 353
4 5009 96957 3.92 710
8 2468 48416 7.84 1423
16 1223 23933 15.87 2880
32 612 11964 31.74 5761
64 311 6027 63.01 11436
128 192 3417 111.12 20170
Table 5.5: Results of the test ‘plain parallel region O3 g++’ (cf. Figure 5.5).
Original code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 1155 22833 1.0 589
2 604 11859 1.93 1134
4 301 5935 3.85 2266
8 148 2938 7.77 4577
16 74 1470 15.53 9149
32 37 732 31.18 18369
64 19 365 62.51 36824
128 10 189 120.74 71125
Tangent-linear code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 1233 24295 1.0 650
2 617 12185 1.99 1296
4 320 6284 3.87 2514
8 154 3072 7.91 5143
16 78 1559 15.58 10130
32 39 775 31.34 20384
64 21 420 57.75 37559
128 16 322 75.42 49055
Adjoint code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 12814 245440 1.0 197
2 6426 125100 1.96 387
4 3204 62222 3.94 779
8 1594 31052 7.9 1561
16 800 15455 15.88 3137
32 401 7744 31.69 6260
64 208 3917 62.66 12377
128 143 2457 99.85 19729
The adjoint code can be improved by code optimization, but by far not as much as the tangent-linear code. This indicates that additional performance for the adjoint code can only be achieved through parallel execution. Indeed, with 128 threads the adjoint code achieves 67510 MFLOPS, a better result than the tangent-linear code (36040 MFLOPS).
GCC 4.8.0:
The results for the test run without optimization (-O0) are presented as a bar chart in Figure 5.4 and as a table in Table 5.4. The tangent-linear code needs about twice as many CPU cycles (240801) as the original code (120129). The 192 MFLOPS for the original code and the 194 MFLOPS for the tangent-linear code show that both codes provide quite similar performance. The CPU cycles consumed by the adjoint code are about three times higher than the values of the original code, and the MFLOPS values are slightly below those of the original code. If we compare Table 5.4 with the results from the Intel compiler (Table 5.2), it is obvious that the Intel compiler supplies a binary that is more than twice as fast as the binary created by g++.
Table 5.5 contains the details of the test run with a binary provided by g++ with optimization level 3 (-O3). The corresponding bar chart is Figure 5.5. The original code evaluation takes 22833 CPU cycles with one thread and 189 cycles with 128 threads. The corresponding MFLOPS values are 589 and 71125. The tangent-linear code spends 24295 CPU cycles with one thread (650 MFLOPS) and 322 cycles with 128 threads (49055 MFLOPS). An interesting fact is that the difference between the original code and the adjoint code is about a factor of 10. This factor is three when we compare the original code and the adjoint code obtained without code optimization. As in the Intel case, the adjoint code does not seem well suited for code optimization. The execution with 128 threads consumes 2457 CPU cycles (19729 MFLOPS), which is 13 times slower than the original code. The speedup value of the adjoint with 128 threads is around 100 and therefore bigger than 75, which is the speedup value for the tangent-linear execution.
The reader finds details about the test run with optimization level 0 (-O0) in Table 5.6; the corresponding bar chart is Figure 5.6. The sequential execution of the original code consumes 4163 CPU cycles (165 MFLOPS). In case we use all possible 128 cores, the runtime decreases to 239 CPU cycles (2880 MFLOPS). The tangent-linear code achieves a sequential runtime of 7944 CPU cycles (259 MFLOPS), in contrast to the execution with 128 threads, which takes 387 CPU cycles (5318 MFLOPS). This means, again, the tangent-linear execution needs about twice as long as the original code. A maximal speedup value of 20 with 128 threads is anything but efficient and is probably the result of the overhead connected with the synchronization introduced by the barrier.
The sequential execution of the adjoint code needs 37875 cycles (64 MFLOPS), whereas the execution of the same code with 128 threads consumes 812 cycles (2997 MFLOPS). One barrier in the original code leads to two barriers in the adjoint code, one in the forward section and one in the reverse section. This means the overhead for synchronization increases with each application of the adjoint source transformation. However, we achieve a maximal speedup of 46 with 128 threads, which is better than the maximal speedup of the original and the tangent-linear code. Considering the columns where the speedup values are listed, one observes that the adjoint code scales acceptably up to 64 threads. The original and the tangent-linear code, on the other hand, have a good speedup until they reach eight threads.
The bar chart in Figure 5.7 summarizes the detailed information from Table 5.7, where the underlying test run was compiled with optimization level 3 (-O3). The original code execution needs 2863 CPU cycles (144 MFLOPS) with one thread, and when the execution uses 128 threads it takes 170 CPU cycles (2365 MFLOPS). The tangent-linear evaluation consumes 4656 CPU cycles (235 MFLOPS) with one thread, and with 128 threads 284 CPU cycles (3777 MFLOPS) are necessary. The runtime of the adjoint code with one thread is 15633 CPU cycles (143 MFLOPS), and with 128 threads the CPU takes 624 cycles (3605 MFLOPS).
The speedup values 47 (-O0) and 25 (-O3) are better than the results from the
tangent-linear codes and even better than the speedup values from the original
codes. This is a fact one would not expect, because the adjoint transformation of a barrier ((4.3), (4.4)) introduces a barrier construct in the forward section as well as in the reverse section. But the impact of the synchronization seems to be small compared with the improvement due to smaller stack sizes when a higher number of threads is used. The smaller stacks can be stored locally, which means on the one hand on a memory chip near the CPU socket where the thread is executed, and on the other hand on the physical node where the thread is executed. We always have to keep in mind that the machine with 128 threads is a compound of four physical nodes where each node has 32 cores.
GCC 4.8.0:
The results from the test run without any optimization (-O0) are shown in Table 5.8 and in Figure 5.8. The original code execution takes 4195 CPU cycles (163 MFLOPS) with one thread and 195 CPU cycles (3523 MFLOPS) with 128 threads. The related tangent-linear code runs for 8411 CPU cycles (245 MFLOPS) with one thread and for 356 CPU cycles (5789 MFLOPS) when 128 threads are used. The speedup of the original and the tangent-linear code is almost linear until 16 threads are used. A further increase in the number of threads does not reduce the runtime much; the synchronization overhead seems to be the major factor. The adjoint computation takes 27284 CPU cycles (88 MFLOPS) with one thread, and with 128 threads it needs 772 CPU cycles (3126 MFLOPS). The adjoint source transformation doubles the number of barriers in the output code. Therefore, the synchronization overhead of the adjoint execution is twice as high as in the original execution. This is reflected by low MFLOPS values, and the speedup does not grow much when we use more than 8 threads.
If the compilation is done with optimization level 3 (-O3), we achieve the re-
sults presented in Table 5.9 and the corresponding bar chart is shown in Figure
5.9. The original code execution is performed in 3222 CPU cycles (213 MFLOPS)
when only one thread is used and it takes 207 CPU cycles (3316 MFLOPS) with
128 threads. The tangent-linear execution finishes after 5346 CPU cycles (387
MFLOPS) with one thread and when all 128 threads are used, the runtime is down
at 376 CPU cycles (5503 MFLOPS). The adjoint code consumes 12429 CPU cy-
cles (168 MFLOPS) with one thread whereby the execution with 128 threads re-
duces the runtime to 750 CPU cycles (2806 MFLOPS). All three codes have a
speedup value of about 13 with 16 threads. Afterwards, further increase in the
number of threads shows a stagnation of the speedup.
Table 5.6: Results of the test ‘Barrier O0 icpc’ (cf. Figure 5.6).
Original code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 214 4163 1.0 165
2 108 2137 1.95 321
4 53 1053 3.95 653
8 27 536 7.75 1281
16 15 307 13.55 2240
32 11 225 18.44 3048
64 11 232 17.9 2963
128 12 239 17.38 2880
Tangent-linear code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 405 7944 1.0 259
2 203 4010 1.98 514
4 102 1996 3.98 1032
8 51 1018 7.8 2024
16 29 580 13.67 3548
32 21 430 18.44 4785
64 20 397 19.98 5186
128 19 387 20.48 5318
Adjoint code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 2030 37875 1.0 64
2 1062 20027 1.89 121
4 545 10214 3.71 238
8 282 5353 7.08 454
16 159 3038 12.47 800
32 86 1649 22.96 1474
64 46 881 42.95 2759
128 43 812 46.61 2997
Table 5.7: Results of the test ‘Barrier O3 icpc’ (cf. Figure 5.7).
Original code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 146 2863 1.0 144
2 71 1380 2.07 297
4 40 803 3.56 515
8 22 442 6.48 937
16 11 236 12.1 1754
32 8 165 17.26 2496
64 8 175 16.28 2328
128 8 170 16.8 2365
Tangent-linear code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 237 4656 1.0 235
2 117 2298 2.03 469
4 65 1291 3.6 849
8 34 686 6.79 1585
16 22 436 10.67 2488
32 19 377 12.34 2868
64 14 289 16.1 3732
128 14 284 16.38 3777
Adjoint code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 941 15633 1.0 143
2 493 7880 1.98 283
4 230 4044 3.87 554
8 143 2128 7.35 1052
16 71 1140 13.71 1965
32 46 834 18.74 2691
64 40 647 24.16 3477
128 35 624 25.01 3605
Figure 5.6: Plot for the test ‘Barrier O0 icpc’. We refer to Table 5.6 for details.
Figure 5.7: Plot for the test ‘Barrier O3 icpc’. In Table 5.7, the reader can find the connected
values.
Figure 5.8: Plot for the test ‘Barrier O0 g++’. The actual values are shown in Table 5.8.
Figure 5.9: Plot for the test ‘Barrier O3 g++’. Table 5.9 presents the values of this chart.
Table 5.8: Results of the test ‘Barrier O0 g++’ (cf. Figure 5.8).
Original code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 215 4195 1.0 163
2 98 1903 2.2 361
4 55 1092 3.84 629
8 28 541 7.75 1270
16 15 298 14.06 2304
32 11 218 19.22 3149
64 10 213 19.61 3214
128 9 195 21.49 3523
Tangent-linear code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 426 8411 1.0 245
2 206 4081 2.06 505
4 109 2156 3.9 956
8 54 1065 7.89 1934
16 29 589 14.26 3495
32 20 410 20.51 5027
64 15 309 27.22 6672
128 17 356 23.61 5789
Adjoint code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 1498 27284 1.0 88
2 857 15830 1.72 152
4 388 7116 3.83 338
8 195 3582 7.62 672
16 140 2654 10.28 908
32 61 1144 23.84 2106
64 36 665 40.97 3621
128 40 772 35.34 3126
Table 5.9: Results of the test ‘Barrier O3 g++’ (cf. Figure 5.9).
Original code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 166 3222 1.0 213
2 80 1533 2.1 448
4 47 945 3.41 726
8 22 451 7.13 1521
16 12 239 13.45 2870
32 10 204 15.77 3365
64 9 192 16.76 3578
128 10 207 15.52 3316
Tangent-linear code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 274 5346 1.0 387
2 136 2664 2.01 777
4 71 1411 3.79 1467
8 34 673 7.94 3071
16 21 435 12.29 4754
32 19 393 13.58 5250
64 18 357 14.96 5788
128 19 376 14.21 5503
Adjoint code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 737 12429 1.0 168
2 367 6206 2.0 336
4 209 3682 3.38 568
8 100 1750 7.1 1193
16 54 939 13.23 2226
32 36 660 18.81 3170
64 32 603 20.59 3486
128 39 750 16.57 2806
Similar to the binary obtained by the Intel compiler, the adjoint code provides the best speedup values. However, the best speedup value is not achieved with 128 threads but with 64 threads. It seems that the code produced by the Intel compiler can better handle the overhead that is involved in connecting the four physical nodes.
With 128 threads, the tangent-linear code needs 302 CPU cycles (126363 MFLOPS). The adjoint computation with one thread finishes after 48470 CPU cycles (1210 MFLOPS), and 128 threads provide the adjoints after 683 CPU cycles (85070 MFLOPS). Regarding scalability, one recognizes that the original code and the tangent-linear code scale well up to 32 threads. The execution of the adjoint code achieves a speedup factor of 27 with 32 threads, and the highest speedup of 71 is achieved with 128 threads.
GCC 4.8.0:
The details of a test run where the binary was compiled without using optimiza-
tion (-O0) are shown in Table 5.12 and in Figure 5.12. The runtime of the origi-
nal code is 122932 CPU cycles (182 MFLOPS) with one thread and it takes 961
CPU cycles (23359 MFLOPS) with 128 threads. The tangent-linear execution lasts
250849 CPU cycles (187 MFLOPS) with one thread and 1962 CPU cycles (23923
MFLOPS) with 128 threads. 407296 CPU cycles (172 MFLOPS) are necessary to
provide the adjoints with one thread and 128 threads are able to provide the same
values in 4570 CPU cycles (15358 MFLOPS).
When optimization level 3 (-O3) is used to compile the binary, we achieve the results displayed in Table 5.13; the corresponding bar chart is shown in Figure 5.13. The original code execution consumes 23719 CPU cycles (581 MFLOPS) with one thread, whereas 128 threads perform the same computation in 222 CPU cycles (62090 MFLOPS). The tangent-linear runtime with one thread takes 24469 CPU cycles (726 MFLOPS) and 370 cycles (48019 MFLOPS) with 128 threads. The adjoint computation lasts 61907 CPU cycles (521 MFLOPS) with one thread and 767 CPU cycles (42020 MFLOPS) with 128 threads. Comparing the columns where the speedup values are listed, one recognizes that neither of the derivative codes reaches the speedup that the original code achieves.
The next test case is a parallel region containing a critical region. This section
starts on page 291.
Table 5.10: Results of the test ‘Master O0 icpc’ (cf. Figure 5.10).
Original code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 1309 25878 1.0 1315
2 672 13272 1.95 2571
4 339 6569 3.94 5184
8 167 3310 7.82 10304
16 84 1668 15.51 20438
32 42 833 31.04 40901
64 21 423 61.13 80520
128 11 226 114.06 150308
Tangent-linear code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 2674 52606 1.0 1381
2 1396 27408 1.92 2651
4 677 13350 3.94 5442
8 345 6654 7.9 10916
16 168 3331 15.79 21807
32 85 1673 31.44 43420
64 43 855 61.52 84968
128 24 452 116.23 160541
Adjoint code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 6312 118448 1.0 924
2 3227 60542 1.96 1808
4 1603 30545 3.88 3584
8 815 15423 7.68 7099
16 416 7925 14.95 13817
32 218 4196 28.22 26097
64 119 2276 52.03 48112
128 81 1485 79.75 73755
Table 5.11: Results of the test ‘Master O3 icpc’ (cf. Figure 5.11).
Original code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 429 8461 1.0 1179
2 210 4171 2.03 2392
4 110 2179 3.88 4579
8 63 1106 7.65 9016
16 28 556 15.21 17941
32 14 282 29.92 35285
64 9 193 43.66 51536
128 7 151 55.89 66049
Tangent-linear code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 1104 21702 1.0 1761
2 553 10937 1.98 3496
4 277 5472 3.97 6988
8 142 2810 7.72 13606
16 69 1375 15.78 27801
32 35 700 30.99 54613
64 21 434 49.99 88100
128 15 302 71.69 126363
Adjoint code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 2604 48470 1.0 1210
2 1366 25054 1.93 2330
4 659 12448 3.89 4678
8 335 6405 7.57 9081
16 173 3305 14.66 17582
32 93 1789 27.09 32480
64 53 989 48.97 58709
128 42 683 70.95 85070
Figure 5.10: Plot for the test ‘Master O0 icpc’. The detailed results can be found in Table
5.10. The original code for this test can be found in Appendix B.3 on page 380.
Figure 5.11: Plot for the test ‘Master O3 icpc’. The values of this chart are presented in
Table 5.11. Appendix B.3 on page 380 displays the corresponding original code for this test
case.
Figure 5.12: Plot for the test ‘Master O0 g++’. Table 5.12 displays the corresponding values.
The original code for this test can be found in Appendix B.3 on page 380.
Figure 5.13: Plot for the test ‘Master O3 g++’. The values illustrated in this chart can be
found in Table 5.13. Appendix B.3 on page 380 displays the corresponding original code
for this test case.
Table 5.12: Results of the test ‘Master O0 g++’ (cf. Figure 5.12).
Original code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 6216 122932 1.0 182
2 3217 63520 1.94 353
4 1596 31609 3.89 710
8 797 15669 7.85 1433
16 391 7737 15.89 2903
32 196 3892 31.58 5772
64 98 1948 63.1 11535
128 49 961 127.81 23359
Tangent-linear code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 12670 250849 1.0 187
2 6430 127496 1.97 367
4 3250 64232 3.91 730
8 1606 31824 7.88 1474
16 796 15762 15.91 2976
32 398 7918 31.68 5928
64 202 3987 62.91 11774
128 101 1962 127.85 23923
Adjoint code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 20676 407296 1.0 172
2 10673 209124 1.95 335
4 5369 105019 3.88 667
8 2727 53087 7.67 1320
16 1371 26986 15.09 2597
32 738 14574 27.95 4812
64 409 8072 50.46 8691
128 232 4570 89.12 15358
Table 5.13: Results of the test ‘Master O3 g++’ (cf. Figure 5.13).
Original code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 1198 23719 1.0 581
2 623 12254 1.94 1125
4 315 6158 3.85 2240
8 152 3021 7.85 4565
16 77 1505 15.75 9161
32 38 751 31.58 18365
64 19 377 62.81 36530
128 11 222 106.75 62090
Tangent-linear code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 1239 24469 1.0 726
2 637 12583 1.94 1413
4 315 6215 3.94 2861
8 159 3132 7.81 5678
16 79 1567 15.61 11350
32 39 788 31.04 22564
64 24 491 49.75 36169
128 18 370 66.04 48019
Adjoint code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 3256 61907 1.0 521
2 1880 36147 1.71 892
4 860 16406 3.77 1966
8 448 8568 7.22 3763
16 266 5149 12.02 6260
32 130 2502 24.74 12881
64 73 1409 43.93 22863
128 40 767 80.69 42020
Table 5.14 shows the details of a test run without compiler optimization (-O0). The
corresponding bar chart is Figure 5.14. The processing time for the original code is
4301 CPU cycles (159 MFLOPS) with one thread and 156 CPU cycles (4406 MFLOPS)
with 128 threads. The tangent-linear execution needs 7614 CPU cycles (270 MFLOPS)
with one thread and 263 CPU cycles (7830 MFLOPS) with 128 threads. The adjoint
code consumes 33951 CPU cycles (71 MFLOPS) with one thread and 1141 CPU cycles
(2116 MFLOPS) with 128 threads. These numbers emphasize that the critical section
is an expensive construct in terms of runtime. The synchronization overhead grows
with an increasing number of threads. The original and the tangent-linear code scale
well up to 16 threads; beyond that, a further increase in the number of threads does
not improve the runtime much.
In the case of the adjoint code we should remind ourselves that the source transfor-
mation described in Section 4.2.3 is defined such that the order in which the threads
enter the critical section during the forward section must be tracked and reversed
during the reverse section. This makes things even worse in terms of synchronization
overhead, which is clearly recognizable in the poor speedup results of the adjoint
code. A noticeable fact is that the adjoint code reaches a peak speedup of 30, which
is better than the peak values of the original and the tangent-linear code with 28 and
29, respectively. Another thing to note is that the gain in performance from doubling
the number of threads remains similar for the adjoint code until we reach 128 threads,
whereas for the two other codes the gain is marginal when more than 32 threads are used.
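The idea behind this reversal can be sketched in a few lines of C++ with OpenMP. The sketch below only illustrates the scheme described in Section 4.2.3 and is not the transformation rule itself; the names order, my_slot, and counter are chosen for this illustration.

#include <omp.h>
#include <vector>

void critical_region_reversal_sketch() {
  std::vector<int> order;   // order in which threads entered the critical region
  int counter = 0;          // number of threads that finished their adjoint part
  #pragma omp parallel
  {
    int my_slot;
    // Forward section: record the entry order while executing the region.
    #pragma omp critical
    {
      my_slot = (int)order.size();
      order.push_back(omp_get_thread_num());
      // ... augmented forward computation of the critical region ...
    }
    #pragma omp barrier
    // Reverse section: the thread that entered last must perform its
    // adjoint computation first, so every thread waits for its turn.
    bool done = false;
    while (!done) {
      #pragma omp critical
      {
        if (counter == (int)order.size() - 1 - my_slot) {
          // ... adjoint statements of the critical region ...
          counter = counter + 1;
          done = true;
        }
      }
    }
  }
}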
The bar chart shown in Figure 5.15 summarizes the details of Table 5.15. The table
contains the results achieved when the compiler-provided optimization (-O3) is used.
The original code needs 1679 CPU cycles (243 MFLOPS) for its computation with one
thread, and it takes 145 CPU cycles (2816 MFLOPS) when 128 threads execute the
same code.
The tangent-linear code execution consumes 4112 CPU cycles (264 MFLOPS) with one
thread and 277 CPU cycles (3896 MFLOPS) with 128 threads. The adjoint code needs
the most CPU cycles, namely 13904 CPU cycles (160 MFLOPS) when the code is
processed by one thread and 623 CPU cycles (3605 MFLOPS) when 128 threads are
used. As in the non-optimized case (-O0), the scalability is poor, with a maximal
speedup value of about 15 for the original and the tangent-linear code. The maximal
speedup value of the adjoint code is 22 and therefore better than the values of the
original and the tangent-linear code.
The overhead for synchronizing the access to the critical region seems to have
the greatest influence on the parallel execution. The performance of the adjoint code
is somewhat surprising when one thinks of what the adjoint code of a critical region
has to do. The source transformation rules (4.29) and (4.30) introduce a lot of
overhead, but this overhead does not make things worse, which is recognizable from
the speedup values of 29.75 and 22.29.
GCC 4.8.0:
Table 5.16 lists the detailed results of the non-optimized compilation (-O0); the bar
chart in Figure 5.16 gives a first impression of them. The original code needs 3471
CPU cycles (198 MFLOPS) for its computation with one thread, whereas the same
computation takes 155 CPU cycles (4419 MFLOPS) with 128 threads. The tangent-linear
code consumes 8018 CPU cycles (257 MFLOPS) with one thread and 278 CPU cycles
(7402 MFLOPS) with 128 threads. In the adjoint case, we measure 22351 CPU cycles
(107 MFLOPS) when the binary is executed with one thread; if 128 threads are used,
the CPU performs 41832 cycles (57 MFLOPS).
The original code and the tangent-linear code provide an acceptable speedup until
the number of threads reaches 32, but afterwards the speedup stagnates. The data for
the adjoint code reveals that an execution with more than eight threads is slower
than the sequential execution. This is probably due to the fact that we have to
synchronize the order of the threads that enter the critical region.
Table 5.17 displays the results of an execution where optimization level 3 (-O3) was
used to build the binary. Figure 5.17 shows the corresponding bar chart. The original
code execution lasts 2900 CPU cycles (237 MFLOPS) with one thread and consumes 144
CPU cycles (4782 MFLOPS) when 128 threads are used. The computation of the
tangent-linear code takes 4483 CPU cycles (461 MFLOPS) with one thread and 270 CPU
cycles (7658 MFLOPS) with 128 threads. The adjoint code results are 12420 CPU cycles
(170 MFLOPS) with one thread and 20448 CPU cycles (104 MFLOPS) with 128 threads.
Similar to the non-optimized case, the speedup values of the original and the
tangent-linear code grow up to 20 and 17, respectively; using more than 32 threads
does not impact the performance much. The speedup values for the adjoint code are
even worse: up to 16 threads the speedup grows slowly, and afterwards it decreases
until the execution with 128 threads ends up slower than the sequential execution.
Comparing the results for the adjoint code over the four different runs, it is con-
spicuous that the Intel compiler provides a binary that scales far better than the
code supplied by g++. The binary from the Intel compiler scales up to 128 threads
and even has better speedup values with 128 threads than the original and the
tangent-linear code. The MFLOPS values are acceptable when we consider the
effort that the adjoint code has to carry out. The binary that g++ produces does
not seem to have the best implementation of the critical region. More precisely,
the scalability drops when the number of threads is higher than eight (Table 5.16)
or 16 (Table 5.17). Beyond that point the performance falls off and we even observe
a slowdown compared to the sequential execution. This shows that the performance
of an OpenMP implementation can differ greatly depending on which compiler
one uses.
Original code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 219 4301 1.0 159
2 106 2076 2.07 331
4 55 1090 3.94 630
8 29 581 7.4 1182
16 15 299 14.35 2293
32 10 210 20.47 3272
64 8 175 24.49 3915
128 7 156 27.54 4406
Tangent-linear code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 386 7614 1.0 270
2 203 3994 1.91 516
4 105 2085 3.65 988
8 55 1076 7.07 1915
16 29 586 12.99 3518
32 17 352 21.61 5852
64 16 326 23.32 6318
128 13 263 28.89 7830
Adjoint code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 2011 33951 1.0 71
2 1179 20701 1.64 116
4 759 14123 2.4 170
8 357 6840 4.96 352
16 254 4746 7.15 508
32 159 2984 11.38 808
64 102 1861 18.24 1296
128 63 1141 29.75 2116
Original code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 86 1679 1.0 243
2 73 1408 1.19 291
4 39 772 2.17 536
8 22 445 3.77 933
16 10 202 8.29 2039
32 6 122 13.7 3389
64 7 141 11.86 2923
128 7 145 11.57 2816
Tangent-linear code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 210 4112 1.0 264
2 117 2306 1.78 468
4 65 1274 3.23 853
8 35 701 5.86 1556
16 21 419 9.81 2592
32 12 243 16.9 4455
64 14 280 14.65 3853
128 14 277 14.83 3896
Adjoint code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 954 13904 1.0 160
2 447 7557 1.84 295
4 262 4697 2.96 476
8 148 2673 5.2 839
16 90 1673 8.31 1341
32 60 1098 12.66 2044
64 40 722 19.25 3114
128 36 623 22.29 3605
Figure 5.14: Plot for the test ‘Critical O0 intel’. The corresponding values can be found in
Table 5.14. The original code for this test can be found in Appendix B.4 on page 387.
Figure 5.15: Plot for the test ‘Critical O3 intel’. The details can be found in Table
5.15. Appendix B.4 on page 387 displays the corresponding original code for this
test case.
Figure 5.16: Plot for the test ‘Critical O0 g++’. We refer to Table 5.16 for details. The
original code for this test can be found in Appendix B.4 on page 387.
Figure 5.17: Plot for the test ‘Critical O3 g++’. Table 5.17 represents the actual values of
this chart. Appendix B.4 on page 387 displays the corresponding original code for this test
case.
Original code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 176 3471 1.0 198
2 109 2143 1.62 320
4 56 1116 3.11 615
8 28 558 6.22 1231
16 15 292 11.88 2354
32 11 218 15.92 3154
64 9 196 17.69 3505
128 7 155 22.29 4419
Tangent-linear code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 407 8018 1.0 257
2 218 4292 1.87 480
4 111 2210 3.63 932
8 56 1101 7.28 1872
16 29 576 13.92 3579
32 22 435 18.42 4736
64 16 322 24.85 6393
128 14 278 28.76 7402
Adjoint code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 1406 22351 1.0 107
2 676 12201 1.83 197
4 372 6821 3.28 353
8 275 4867 4.59 495
16 1153 22495 0.99 107
32 1439 28304 0.79 85
64 3226 63135 0.35 38
128 2288 41832 0.53 57
Original code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 150 2900 1.0 237
2 67 1325 2.19 518
4 47 943 3.08 728
8 23 470 6.16 1461
16 12 247 11.73 2782
32 8 172 16.78 3981
64 7 141 20.46 4859
128 7 144 20.1 4782
Tangent-linear code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 228 4483 1.0 461
2 134 2629 1.71 786
4 70 1381 3.25 1498
8 37 740 6.05 2792
16 26 519 8.62 3978
32 15 302 14.8 6834
64 13 274 16.35 7551
128 13 270 16.56 7658
Adjoint code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 832 12420 1.0 170
2 412 6965 1.78 304
4 244 4341 2.86 489
8 131 2353 5.28 904
16 86 1585 7.83 1343
32 210 4089 3.04 520
64 590 11538 1.08 184
128 1142 20448 0.61 104
GCC 4.8.0:
If we do not use any optimization (-O0) for compiling the binary, we obtain the
results displayed in Table 5.20. These results are additionally illustrated as a bar
chart in Figure 5.20. The original code needs 3859 CPU cycles (178 MFLOPS) to finish
its computation when one thread is used; the parallel computation with 128 threads
lasts 150 CPU cycles (4576 MFLOPS). The tangent-linear code takes 8309 CPU cycles
(248 MFLOPS) if we use one thread and 256 CPU cycles (8043 MFLOPS) in case that we
use 128 threads. The execution of the adjoint code with one thread consumes 21889
CPU cycles (110 MFLOPS), whereas the parallel computation with 128 threads reduces
the runtime to 687 CPU cycles (3518 MFLOPS). All three codes have their peak speedup
value with 64 threads.
Optimization level 3 of GCC (-O3) provides the runtime results shown in Table 5.21.
The corresponding bar chart is shown in Figure 5.21. The original code execution
lasts 1779 CPU cycles (386 MFLOPS) with one thread, and the same computation
performed by 128 threads reduces the runtime to 153 CPU cycles (4495 MFLOPS). The
tangent-linear computation with one thread finishes after 3094 CPU cycles (667
MFLOPS), whereas 128 threads finish the execution after 239 CPU cycles (8660
MFLOPS). The adjoint code needs 11929 CPU cycles (175 MFLOPS) if it is executed with
one thread. With 128 threads the computation consumes 737 CPU cycles (2918 MFLOPS).
It is striking that the speedup values for the adjoint code are clearly above the
speedup values of the original and the tangent-linear code. A possible explanation
is that the critical region introduced by the source transformation³ does not have a
big impact, while the fact that we do not store all the intermediate results
obviously saves a lot of stack operations and therefore improves the performance.
This concludes the test suite, and we continue with the runtime results of three
second derivative codes in Section 5.2 on page 315. As an example, we use the
‘plain-parallel’ test case from Section 5.1.1. In detail, we will see results for
the forward-over-forward mode, the forward-over-reverse mode, and the reverse-
over-forward mode.
Original code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 174 3438 1.0 200
2 107 2108 1.63 326
4 55 1092 3.15 629
8 27 540 6.36 1272
16 14 275 12.48 2496
32 9 186 18.46 3693
64 8 159 21.58 4319
128 7 143 23.93 4792
Tangent-linear code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 406 7991 1.0 258
2 207 4075 1.96 505
4 103 2022 3.95 1019
8 53 1054 7.58 1956
16 26 522 15.28 3943
32 18 367 21.74 5611
64 13 268 29.76 7683
128 13 268 29.72 7674
Adjoint code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 1815 30951 1.0 78
2 965 16157 1.92 151
4 432 7934 3.9 307
8 231 4067 7.61 600
16 125 2328 13.29 1048
32 65 1180 26.21 2067
64 39 694 44.54 3514
128 34 621 49.81 3932
Original code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 141 2769 1.0 149
2 75 1491 1.86 278
4 41 823 3.36 504
8 19 390 7.1 1064
16 11 217 12.71 1909
32 7 153 18.05 2702
64 7 147 18.78 2797
128 7 148 18.68 2735
Tangent-linear code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 134 2624 1.0 402
2 121 2370 1.11 451
4 66 1318 1.99 812
8 33 653 4.02 1635
16 19 393 6.67 2710
32 16 329 7.96 3221
64 13 264 9.91 4008
128 13 272 9.64 3891
Adjoint code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 789 12585 1.0 180
2 395 6560 1.92 346
4 201 3459 3.64 657
8 99 1718 7.32 1324
16 56 981 12.82 2320
32 36 640 19.65 3565
64 33 586 21.45 3906
128 32 602 20.9 3837
Figure 5.18: Plot for the test ‘Atomic O0 intel’. The details are illustrated in Table 5.18.
The original code for this test can be found in Appendix B.5 on page 392.
Figure 5.19: Plot for the test ‘Atomic O3 intel’. Table 5.19 shows the details of this test.
Appendix B.5 on page 392 displays the corresponding original code for this test case.
Figure 5.20: Plot for the test ‘Atomic O0 g++’. The details are illustrated in Table 5.20. The
original code for this test can be found in Appendix B.5 on page 392.
Figure 5.21: Plot for the test ‘Atomic O3 g++’. Table 5.21 shows the details of this test.
Appendix B.5 on page 392 displays the corresponding original code for this test case.
Original code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 196 3859 1.0 178
2 99 1967 1.96 349
4 55 1108 3.48 620
8 27 542 7.12 1268
16 14 281 13.7 2440
32 7 156 24.59 4381
64 6 126 30.43 5421
128 7 150 25.67 4576
Tangent-linear code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 421 8309 1.0 248
2 219 4303 1.93 479
4 109 2160 3.85 954
8 56 1113 7.46 1852
16 27 550 15.08 3741
32 15 305 27.22 6753
64 12 250 33.23 8247
128 13 256 32.4 8043
Adjoint code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 1321 21889 1.0 110
2 634 11221 1.95 214
4 309 5522 3.96 436
8 157 2783 7.86 865
16 87 1575 13.9 1530
32 49 854 25.63 2823
64 33 572 38.22 4214
128 37 687 31.85 3518
Original code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 92 1779 1.0 386
2 86 1692 1.05 406
4 48 951 1.87 722
8 23 469 3.79 1464
16 13 257 6.91 2670
32 9 186 9.54 3687
64 7 157 11.32 4384
128 7 153 11.58 4495
Tangent-linear code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 158 3094 1.0 667
2 98 1921 1.61 1075
4 75 1484 2.08 1393
8 37 733 4.22 2821
16 24 486 6.36 4253
32 18 362 8.53 5706
64 15 307 10.06 6733
128 12 239 12.91 8660
Adjoint code
#Threads Runtime (sec) CPU cycles (·10−9 ) Speedup MFLOPS
1 719 11929 1.0 175
2 442 5635 2.12 371
4 191 3234 3.69 650
8 111 1772 6.73 1185
16 63 969 12.3 2171
32 49 800 14.9 2644
64 43 720 16.55 2964
128 45 737 16.18 2918
Figure 5.22: This bar chart shows the speedup results for the second derivative codes of the
test suite example ‘plain-parallel’, see Section 5.1.1. The original code of this test case is
shown in Appendix B.1. The Intel compiler was used with optimization level 3 (-O3).
Figure 5.22 summarizes the runtime results for the second derivative codes of
the ‘plain-parallel’ example code. The binary for this test was created by the In-
tel compiler with optimization level 3. The corresponding values are presented in
Table 5.22. Each bar represents one execution of the corresponding second deriva-
tive code. The x-axis shows the number of threads, the y-axis shows the speedup
values of the execution. The red bar stands for one evaluation of the second-order
tangent-linear code. The blue bar represents the second-order adjoint execution
in forward-over-reverse mode while the yellow bar illustrates the reverse-over-
forward mode. We did not consider the reverse-over-reverse mode due to its huge
memory usage.
In order to put the results from Figure 5.22 and Table 5.22 into the right context,
we remind the reader that the second-order models have different output dimen-
sions. Let us assume that the original function is F : ℝⁿ → ℝᵐ. According to
Definition 10, the second-order tangent-linear model has the output dimension m.
The second-order adjoint model, on the other hand, has the output dimension n,
as determined in Definition 11. Depending on the actual values of n and m, the
choice of the second-order model can have a big impact. For example, in case that
m is one, the Hessian is a matrix with n times n elements. One needs n² evaluations
of the second-order tangent-linear model to compute the whole Hessian, whereas with
the second-order adjoint model the whole Hessian can be computed with n evaluations.
These remarks emphasize that the first impression given by Figure 5.22 is no reason
to jump to conclusions.
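As a back-of-the-envelope illustration of this dimension argument (a count of model evaluations only, not a measured result), consider the scalar-valued case with the input size used below:

\[
m = 1,\; n = 100{,}000:\qquad
\underbrace{n^2 = 10^{10}}_{\text{second-order tangent-linear evaluations}}
\quad\text{versus}\quad
\underbrace{n = 10^{5}}_{\text{second-order adjoint evaluations}}.
\]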
The second and third column in Table 5.22 show the results for the forward-over-
forward mode (f-o-f), the fourth and fifth column contain the results for the
forward-over-reverse mode (f-o-r), and the last two columns show the reverse-
over-forward (r-o-f) results. The input size n for the f-o-f and the f-o-r mode is
100,000. For the r-o-f code we had to reduce the size to 32,000 because otherwise
the process would consume more than one terabyte of memory, which exceeds the
system's resources. The code example uses a two-dimensional array A with
10¹⁰ double elements. This means that A consumes about 75 gigabytes of memory. The
peak memory consumption for the execution of the f-o-f mode is 300 gigabytes; the
f-o-r mode needs 500 gigabytes. The peak for the r-o-f mode, where we reduced the
input size n to 32,000, is 200 gigabytes. The f-o-f mode needs 758 seconds with
one thread (819 MFLOPS) and 12 seconds with 128 threads (72727 MFLOPS).
This is a speedup of 63. The f-o-r mode lasts 1466 seconds (1268 MFLOPS) with
one thread and 21 seconds (68425 MFLOPS) with 128 threads, which represents a
speedup factor of 70. Due to the reduced input size of the r-o-f mode, we cannot
really compare this mode with the other results. Nevertheless, we present them
for the sake of completeness. With one thread the execution of the r-o-f mode
needs 158 seconds (1436 MFLOPS); with 128 threads the runtime is reduced to 3
seconds (89549 MFLOPS). This is a speedup value of 52.
This clearly favors the f-o-r mode over the r-o-f mode; in addition, the memory
usage in the r-o-f case is much bigger. The best performance in terms of MFLOPS
is achieved by the f-o-f mode. However, one has to keep in mind that the complexity
of the second-order tangent-linear model differs from that of the second-order
adjoint model. Therefore, the actual performance can be quite different depending
on the number of input and output variables of the original function.
Another observation from Figure 5.22 is that it makes quite a difference when the
execution uses more than 32 threads, because in this case more than one physical
node is involved in the execution. The speedup values of the f-o-r code do not
grow much between eight and 32 threads. As soon as one uses 64 or 128 threads, the
speedup grows quickly, which is probably connected with the distribution of the
data among more than one physical node.
without any additional analysis information it has to assume that its input code P
does not fulfill the exclusive read property. This leads to an adjoint code that
probably contains unnecessary atomic constructs. For this reason, we developed
the exclusive read analysis (ERA) in Chapter 3. The information provided by this
static analysis supports the AD tool in the decision whether an adjoint statement
has to be synchronized or not. In the current section we compare the runtime
results of the adjoint code created with the support of the ERA information with
those of the adjoint code generated under the conservative assumption that every
adjoint assignment has to be synchronized.
Listing 3.1 on page 139 serves as the example code for this section. In this code we
consider the two-dimensional array A as a data structure for storing the coefficients
of polynomials of degree four. The code consists of two loops: the outer loop is
responsible for computing different polynomial values, the inner loop takes the
current value of x[i] and calculates the value of the polynomial. The method used to
compute the polynomial value is a very simple one; some multiplications could
be spared by using Horner's method instead.
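For illustration, the straightforward evaluation and Horner's method differ as follows for a degree-four polynomial (a generic sketch; the coefficient layout of Listing 3.1 may differ):

// Naive evaluation: up to ten multiplications for degree four.
//   p = c[0] + c[1]*x + c[2]*x*x + c[3]*x*x*x + c[4]*x*x*x*x;
double horner4(const double c[5], double x) {
  // Horner's method: four multiplications and four additions.
  return (((c[4] * x + c[3]) * x + c[2]) * x + c[1]) * x + c[0];
}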
The read access to x[i] is an exclusive read for each thread since the value of the
index variable i is determined by the data decomposition. This means that no
component of x is read by more than one thread. If the source transformation tool
does not use the ERA, it cannot tell whether x[i] is read by only one thread or by
several. Therefore, the tool conservatively assumes that multiple threads read the
memory reference x[i]. Without the exclusive read property for the original code,
multiple threads may store to the same memory location in the reverse section of
the adjoint code. The corresponding race condition has to
be synchronized by an atomic construct.
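The consequence for the generated adjoint code can be sketched as follows. This is an illustration in the spirit of Listing 3.1, not the listing itself; partial stands for the partial derivative of the polynomial value with respect to x[i], and the _a1 suffix marks hypothetical adjoint memory.

#include <omp.h>

// Without ERA: the tool must assume that several threads may read x[i],
// so every adjoint increment of x_a1[i] is synchronized.
void adjoint_increment_conservative(double* x_a1, int i, double partial, double y_a1) {
  #pragma omp atomic
  x_a1[i] += partial * y_a1;
}

// With ERA: the analysis proves that the read of x[i] is exclusive to one
// thread, so the same increment can be emitted without synchronization.
void adjoint_increment_with_era(double* x_a1, int i, double partial, double y_a1) {
  x_a1[i] += partial * y_a1;
}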
As for the test suite, we used the compilers g++ and icpc to obtain the binaries
for this test. In addition, the two common options -O0 and -O3 were used, producing
one binary without code optimization and one with it, which results in four
different binaries. These setups are applied once to the code that SPLc produces
when the ERA is used and once to the code produced without the static analysis,
resulting in eight different binaries in total. Figure 5.23 shows the results as a
bar chart; the detailed runtime results are explained in the following two paragraphs.
Intel compiler 14.0: Let us first consider the runtime results for the binary ob-
tained without backend code optimization (-O0). The results without ERA are
displayed in Table 5.23, whereas Table 5.24 shows the corresponding results obtained
when the ERA is applied before the source transformation. The last column shows the
improvement between these two tables. The improvement values lie between 1.51 and
1.59 for all parallel executions, and the best improvement is achieved with 128 threads.
Figure 5.23: This bar chart summarizes the runtime measurements of this test case. The bars
are grouped into pairs of the same color. Within a pair of bars of the same color, the
first bar shows the runtime without using the ERA, and the second bar shows the result
for the adjoint code obtained with the ERA. The first four bars (red and blue) show
results from the Intel compiler, whereas the last four bars (yellow and green) display
the corresponding results obtained by compiling with g++. Both compilers were called
once without optimization (-O0) and once with optimization (-O3).
The results with code optimization (-O3) are shown in Table 5.26 and Table 5.25.
The improvements are completely different from those with optimization level 0.
The best improvement, where the execution was over seven times faster (7.74), was
achieved with only one thread. The improvement factor then falls until it reaches
1.69 for the execution with 128 threads.
The likely reason why the improvement factor declines with an increasing number of
threads is that the assembler instruction implementing the atomic statement needs
more CPU cycles than the regular instruction. For example, let us assume that the
atomic instruction consumes two CPU cycles and the regular instruction one cycle.
If one thread executes these instructions a hundred times, the overall runtimes are
200 and 100 cycles, respectively. With two threads, each thread only executes fifty
of these instructions, so the difference per thread shrinks to 50 CPU cycles. With an
increasing number of threads this difference is reduced further, and with it the
corresponding improvement between the two executions.
g++ 4.8.0: The results without any code optimization by g++ (-O0) are shown in
Table 5.27 (without ERA) and in Table 5.28 (with ERA). The ratio between the runs
with and without the ERA lies between 1.50 and 2.23, and the best improvement is
achieved with 128 threads. It is interesting that the runtime of 54 seconds is
clearly below the 74 seconds of the Intel binary; g++ seems to produce code that is
much better suited for this high number of threads.
Next, we consider the two binaries where optimization level 3 (-O3) of g++ is used.
Table 5.29 illustrates the runtime results without ERA; the corresponding results
with ERA are contained in Table 5.30. As with the Intel binary, we achieve the peak
improvement with one thread. Here the improvement factor is 5.54, compared with the
7.74 of the Intel binary. The improvement values go down until they reach 1.38 when
the execution uses 128 threads.
In summary, the average improvement factor of 3.11 is a strong argument for using
the ERA before the source transformation in order to create a more efficient
adjoint code.
Figure 5.24: Runtime results of the least squares problem. The blue bars represent the
speedup factors of the binaries compiled with the g++ compiler, the red bars represent the
results for the binary compiled with the Intel compiler. The detailed results are shown in
Table 5.31 (g++) and in Table 5.32 (Intel).
have 1024 equations that are all independent of one another and therefore the val-
ues for these equations can be computed concurrently. A possible implementation
of the objective function is shown in Listing 1.5.
The Ipopt software⁴ has been used to solve this problem. The first- and second-
derivative codes of Listing 1.5 were obtained with the help of SPLc, where the
forward-over-reverse mode was used to obtain the second-order adjoint model.
The Hessian of the objective function is therefore computed row by row through the
second-order adjoint model. The computations of the Hessian rows do not have any
data dependencies among each other and can be performed independently. This
provides, besides the OpenMP parallel region inside the derivative code, another
possibility to parallelize the computation. Thus, we use two nested OpenMP parallel
regions in the implementation of this test case. One parallel region starts in front
of the loop that is responsible for computing the Hessian row by row; we describe
this parallel region as the outer region of this example. The second parallel
region is the one introduced by our source transformation inside the derivative
code. This parallel region is meant when we speak about the inner region.
⁴ https://projects.coin-or.org/Ipopt
The OpenMP standard specifies that nested parallelism is disabled by default. This
means that the inner region has no effect unless nested OpenMP is enabled. Similar
to the previous test cases, we use the g++ and the Intel compiler to obtain the
executable files. In order to compare the effect of nested OpenMP, we used two
different test setups. The first test run does not use nested OpenMP, which means
that the inner level is executed sequentially. The second test run then tries to
achieve better performance by enabling nested OpenMP. The results with nested
OpenMP disabled are shown in Table 5.33.
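A minimal sketch of the nesting used in this test case follows (illustrative only; the actual driver code differs): the outer region distributes the Hessian rows, the inner region corresponds to the parallel region inside the SPLc-generated derivative code, and nested parallelism has to be enabled explicitly, for instance via omp_set_nested or the environment variable OMP_NESTED.

#include <omp.h>

void hessian_driver_sketch(int n) {
  omp_set_nested(1);                        // nested parallelism is off by default
  #pragma omp parallel for                  // outer region: one Hessian row per iteration
  for (int i = 0; i < n; ++i) {
    // Seed the i-th direction and evaluate the second-order adjoint model.
    // The generated derivative code contains its own parallel region:
    #pragma omp parallel num_threads(2)     // inner region (two threads in our setup)
    {
      // ... SPLc-generated forward-over-reverse computation ...
    }
  }
}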
The results are quite good until the maximal number of threads for one physical
node, namely 32, is reached. Up to this number of threads the results for the g++
and the Intel binary do not diverge much: the g++ binary needs 268 seconds of
runtime, whereas the Intel binary consumes 237 seconds. However, the runtime
results for 64 and 128 threads are completely different for the two binaries. The
g++ binary is able to further reduce the runtime to 222 seconds with 128 threads,
whereas the runtime of the Intel binary rises to 2158 seconds with 128 threads.
One may assume that the runtime improves when nested OpenMP is enabled. The test
setup that we used determines that the inner OpenMP level uses two threads and the
outer level uses the number of threads presented in the first column of Table 5.34.
The results for the g++ binary are sobering. Even the execution with one thread on
the outer and two threads on the inner level consumes 15,000 seconds and therefore
needs more than twice as much time as the sequential execution of the g++ binary in
Table 5.33. If one increases the number of threads, this gets even worse, and one
observes a slowdown instead of a performance improvement. The Intel compiler, on
the other hand, produces a binary that has a runtime of 5154 seconds with one thread
on the outer level and two threads on the inner level. This runtime is 34 percent
faster than the 6899 seconds of the sequential execution from Table 5.34. The
execution with four threads in total (two threads on the outer and two on the inner
level) reveals a speedup value of 1.85; the runtime is 26 percent faster (2783
seconds) than the 3497.54 seconds from Table 5.33. With four threads on the outer
level and two threads on the inner level (eight threads overall), the runtime is
1676 seconds. Eight threads is the maximum for one CPU socket of our testing
machine. This is probably the reason why a further increase in the number of threads
does not improve the performance much.
The comparison above shows that we can achieve better performance by using the
inherent parallelism in the derivative codes. However, when we consider the total
number of threads and the runtime results that we achieved with nested OpenMP, the
results from Table 5.33 clearly indicate that it is better to use only one level of
OpenMP. Figure 5.25 summarizes the speedup results of the current test case. It
would be interesting to see how the runtime behaves for heterogeneous parallel
setups, for example when MPI is used on the outer level and OpenMP on the inner
level. Another possibility is to use OpenMP on the outer level and a GPU on the
inner level. This should be examined in further studies.
Figure 5.25: Runtime results of four different binaries. Two binaries were compiled by the
g++ compiler and two by the Intel compiler. For a given number of threads, the first and
second bar show the speedup values for the code that uses only the parallelization on the
outer level. The third and the fourth bar present the speedup values for the nested OpenMP
approach. The details are shown in Table 5.33 and Table 5.34.
5.6 Summary
This chapter presented several experimental results testing the scaling properties
of the derivative codes obtained from our source transformations. The source code
package of our implementation tool SPLc (see Appendix A) contains a test suite
comprising different scenarios of an OpenMP parallel region. Besides the case of a
pure parallel region, it was important to examine the influence of synchronization
constructs on the derivative code.
All in all, the test suite confirms that the first derivative code of an OpenMP
parallel region can carry the parallel performance of the original code over to the
derivative code. However, there are in part big differences between the results for
the binaries from g++ and from the Intel compiler. One example for this is the
adjoint code of the critical region: the Intel binary achieves more than 3500 MFLOPS,
whereas the g++ binary supplies less than 1300 MFLOPS.
To investigate the scaling properties of the different second derivative codes,
the forward-over-forward, the forward-over-reverse, and the reverse-over-forward
model of the ‘plain-parallel’ example from the test suite were generated by SPLc.
The results showed that the forward-over-reverse mode provides the best speedup of
70 when using 128 threads. The forward-over-forward mode achieves a speedup of 63
and the reverse-over-forward mode a speedup of 53. In addition, the reverse-over-
forward model seems to suffer from the large memory usage of the adjoint code,
because we had to reduce the input size in order to run the test case.
To measure the impact of the exclusive read analysis, we compared two different
executions. On the one hand, we performed a test where the adjoint code was
generated by conservatively augmenting each adjoint assignment with an atomic
construct. On the other hand, the adjoint code was generated by using the
information obtained by the ERA. The MFLOPS values of the execution without the
ERA could be improved by an average factor of 3.11. The lowest improvement was a
factor of 1.50 and the best improvement factor was 7.74. This strongly emphasizes
the use of static analysis to prevent synchronization overhead in the adjoint
computation.
The two motivating examples from the introductory chapter were also part of these
experimental results. The implementation of the least-squares problem shows that we
achieve a speedup, but the fact that these speedup values were all below 20 with a
maximum number of 128 threads indicates that further investigations should be made
to find the bottleneck. The same holds for the constrained optimization problem,
where we achieved a relatively good speedup as long as we stay below 32 threads.
Another observation from the last test case was that nested OpenMP does not seem to
be well suited for this kind of problem: the g++ binary did not show any speedup at
all when nested OpenMP was used. Further investigations should examine these
observations.
6 Conclusions
In this dissertation we developed rules that define a source code transformation
for OpenMP parallel regions. This source transformation implements the methods
known as algorithmic differentiation (AD). Our focus was mainly on OpenMP parallel
regions, but the basic approach can be applied to further parallel programming APIs,
provided they are based on an approach where code regions are declared as
parallelizable by a compiler directive.
The main contribution of this thesis is the evidence that the defined source trans-
formations are closed. This means that each transformation fulfills two properties.
On the one hand, the source transformation has to supply code that is contained in
the same language as the one that serves as its input language. On the other hand,
the source transformation must maintain the property of the original code of being
correct in terms of a concurrent execution.
In the next section, we summarize the main results of this work. We conclude
with an outlook that discusses possible research directions to broaden the results
of this dissertation.
6.1 Results
The main results of this dissertation can be summarized as follows.
Source transformation rules for the SPL language as well as for constructs
occurring in the OpenMP standard. A simple parallel language (SPL) was de-
fined with two requirements in mind. On the one hand, SPL is simplified in order
to avoid the combinatorial explosion of possible statement combinations during
the correctness proof. On the other hand, SPL is expressive enough to cover
most of the occurring numerical kernels.
This work presented source transformation rules starting from a point where the
input code was assumed to be a pure parallel region that does not contain any fur-
ther pragmas. Later we extended the SPL language and the corresponding source
transformation rules step by step to cover most of the possible OpenMP constructs.
To cover most of the occurring OpenMP codes we included synchronization con-
structs such as barrier and critical, but also worksharing constructs such as the loop
construct and the sections construct. The worksharing constructs combined with a
parallel construct were also covered.
The simple structure and the formal definitions of the source transformation al-
lowed us to prove certain properties of the derivative codes. In addition, a software
engineer who wants to implement AD by source transformation may adapt these
formal rules easily to another programming language or to a different parallel pro-
gramming model.
Our tangent-linear and adjoint source transformations fulfill the closure
property. As long as the software engineer implements the source transformation
rules of this work correctly, it is ensured that the resulting derivative code can be
executed concurrently and that the source transformation can be applied again to
its own output. This is guaranteed because we proved the closure property of the
source transformations. We showed how important the closure property is when we
implemented the code for solving a nonlinear constrained optimization problem.
in the GPU assembly at the moment. The second solution is to use the exclusive
read analysis that has been developed in this work. With this static analysis, we
obtain information that indicates whether or not a certain adjoint assignment has to
be synchronized. The experimental results showed runtime improvements by a factor
of between 1.5 and 7.74; the average improvement factor was 3.11.
An Implementation called SPLc and a test suite to show the scalability. The
SPLc tool implements the source transformation rules from Chapter 2 and from
Section 4.2. The exclusive read analysis from Chapter 3 has also been imple-
mented. The documentation in Appendix A describes the steps for building the
SPLc tool from source.
ensures that this variable has the correct value range. If the static analysis
knows the semantics of this assert statement, the possible value range of x can be
recognized as [0, 100] during compile-time analysis. The approach where the soft-
ware engineer expresses certain assumptions in the form of assert statements is
known as programming by contract.
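A minimal sketch of such a contract follows (hypothetical names, not code from SPLc):

#include <cassert>

void kernel(double* a, int x) {
  // Contract: the caller guarantees 0 <= x <= 100.  A static analysis that
  // knows the semantics of assert can narrow the value range of x to
  // [0, 100] at compile time and reuse this interval, for example in the
  // exclusive read analysis.
  assert(x >= 0 && x <= 100);
  a[x] += 1.0;
}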
Another interesting approach is a mixture of static analysis and dynamic informa-
tion obtained during runtime. For example, the intervals that the ERA computes at
compile time can lead to different versions of the adjoint code, and at runtime the
execution can decide between these versions depending on dynamic information
obtained during the execution. This increases the code size but could allow a
customized adjoint code execution where synchronization is only used when necessary.
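Such a hybrid scheme could look as follows (purely illustrative; neither variant is generated by SPLc today):

// Two variants of the same adjoint region, generated at compile time.
void adjoint_region_synchronized()   { /* adjoint statements with atomic constructs */ }
void adjoint_region_unsynchronized() { /* adjoint statements without synchronization */ }

void run_adjoint_region(bool reads_are_exclusive) {
  // The flag could be derived at runtime, e.g. from the actual index ranges,
  // complementing the compile-time intervals of the ERA.
  if (reads_are_exclusive)
    adjoint_region_unsynchronized();
  else
    adjoint_region_synchronized();
}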
A SPLc - A SPL compiler
The implementation of this dissertation is a compiler tool called simple parallel language
compiler (SPLc). The SPLc source package can be obtained from the author¹. Called
without arguments, SPLc reads the parallel region from standard input:
./SPLc
Afterwards, one can type in a parallel region and finish the input with ‘END’.
SPLc will generate the tangent-linear code in t1_stdin.spl, and the adjoint code can be
found in a1_stdin.spl. In addition, several files are generated, such as symbol_table.txt,
which contains an overview of the symbols used in the original code. The files with the ex-
tension ‘.dot’ are input files for the dot tool that is part of the software package graphviz².
In case dot is installed, the files with the ‘.dot’ extension are converted into PDF files
automatically. The following list explains the generated files and their content.
• ast.dot (Abstract syntax tree of the original code)
• cfg.dot (Control flow graph of the original code)
• ast-tl.dot (Abstract syntax tree of the tangent-linear code)
• ast-adj.dot (Abstract syntax tree of the adjoint code)
• ast-adj-forward-section.dot (Just the AST of the forward section)
• ast-adj-reverse-section.dot (Just the AST of the reverse section)
The more common usage is the case where the parallel region is contained in a file.
The name of the file that contains the parallel region is provided to SPLc as the first
argument. As the second argument, the following options are possible:
• -O0 : Do not use the exclusive read analysis before the transformation.
• -O1 : Use the exclusive read analysis to provide better adjoint code.
• --suppress-atomic : Do not use an atomic construct before the adjoint statements.
One should do this only if it is ensured that the original code fulfills the exclusive
read property.
The option -O1 enables the exclusive read analysis. The results of this static analysis can
be examined in the files
2 http://www.graphviz.org/
• exclusive-read-results-001.txt
• exclusive-read-results-002.txt
• ...
• exclusive-read-results-endresult.txt
Each file in the list represents one result of the fixed point iteration, including widening
and narrowing. The last file represents the approximation of the fixed point. This file also
contains, for each floating-point memory reference, the result whether or not it fulfills the
exclusive read property.
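The widening and narrowing steps behind these intermediate files follow the usual interval scheme, sketched below (a generic illustration of the technique, not SPLc's actual implementation):

#include <limits>

struct Interval { double lo, hi; };   // abstract value of the analysis

// Widening: if a bound still grows from one iteration to the next, jump to
// infinity so that the fixed point iteration terminates.
Interval widen(Interval oldv, Interval newv) {
  Interval w = oldv;
  if (newv.lo < oldv.lo) w.lo = -std::numeric_limits<double>::infinity();
  if (newv.hi > oldv.hi) w.hi =  std::numeric_limits<double>::infinity();
  return w;
}

// Narrowing: afterwards, infinite bounds are tightened again using the
// refined result of a further iteration.
Interval narrow(Interval widened, Interval refined) {
  Interval n = widened;
  if (n.lo == -std::numeric_limits<double>::infinity()) n.lo = refined.lo;
  if (n.hi ==  std::numeric_limits<double>::infinity()) n.hi = refined.hi;
  return n;
}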
The reader probably wants to see some examples to get a feel for the source trans-
formation procedure. In this case the test suite provides a perfect starting point. The first
steps and the contained examples are explained in Appendix B.
cd SPLc/test-suite
export SPLDIR=$PWD/..
If one wants to use the PAPI software, please execute the following install script.
./install-papi.sh
This script creates a dummy file to indicate whether or not the PAPI software is installed.
If one does not want to use the PAPI software, the dummy file can be created by hand
inside the main folder of the test suite.
touch PAPI_NOT_SUCCESSFUL
After these setup steps the test suite is ready to use. For example, enter the folder atomic
and start the scaling tests with optimization levels 0 and 3 by typing:
cd atomic;
make
If one wants to run through the whole test suite, just type make inside the test suite
folder.
make
Inside a certain test case, one can use further options to run certain tests.
cd atomic
make scale_test_O0 # Run scale test with optimization level 0.
make scale_test_O3 # Run scale test with optimization level 3.
make scale_test # Run both of the above scale tests.
make finite_differences_test
# Compares the AD results with the one from FD.
Examples for the exclusive read analysis are contained in the folder
exclusive-read-examples.
For example, the reader may try the example that shows the impact of the exclusive read
analysis.
cd exclusive-read-examples
SPLc second-example.in -O0
grep atomic a1_second-example.in.spl
Compare the number of atomic assignments in the adjoint code with the following output.
The results for the individual memory references can be considered by:
tail -n 18 exclusive-read-results-endresult.txt
The individual listings for the examples are presented on the following pages.
tid ← omp_get_thread_num();
p ← omp_get_num_threads();
c ← n/p;
lb ← tid*c;
ub ← (tid+1)*c-1;
i ← lb;
thread_result[tid] ← 0.;
while (i ≤ ub) {
  y ← 0.;
  z ← 0.;
  j ← 0;
  while (j < n) {
    y +← sin(A[i*n+j] * A[i*n+j]);
    z +← cos(A[i*n+j] * A[i*n+j]);
    j ← j+1;
  }
  thread_result[tid] +← y*z;
  i ← i+1;
}
dummy("", (void*) thread_result);
}
Listing B.1: This code is the original code for the test case ‘plain-parallel’.
double v^{(1)0}_1;
double v^0_2;
double v^{(1)0}_2;
double v^0_3;
double v^{(1)0}_3;
tid ← omp_get_thread_num();
p ← omp_get_num_threads();
c ← n/p;
lb ← tid*c;
ub ← (tid+1)*c-1;
i ← lb;
v^{(1)0}_0 ← 0.;
v^0_0 ← 0.;
thread_result^{(1)}[tid] ← v^{(1)0}_0;
thread_result[tid] ← v^0_0;
while (i ≤ ub) {
  v^{(1)0}_0 ← 0.;
  v^0_0 ← 0.;
  y^{(1)} ← v^{(1)0}_0;
  y ← v^0_0;
  v^{(1)0}_0 ← 0.;
  v^0_0 ← 0.;
  z^{(1)} ← v^{(1)0}_0;
  z ← v^0_0;
  j ← 0;
  while (j < n) {
    v^{(1)0}_0 ← A^{(1)}[i*n+j];
    v^0_0 ← A[i*n+j];
    v^{(1)0}_1 ← A^{(1)}[i*n+j];
    v^0_1 ← A[i*n+j];
    v^{(1)0}_2 ← v^{(1)0}_0 * v^0_1 + v^0_0 * v^{(1)0}_1;
    v^0_2 ← v^0_0 * v^0_1;
    v^{(1)0}_3 ← v^{(1)0}_2 * cos(v^0_2);
    v^0_3 ← sin(v^0_2);
    y^{(1)} +← v^{(1)0}_3;
    y +← v^0_3;
    v^{(1)0}_0 ← A^{(1)}[i*n+j];
    v^0_0 ← A[i*n+j];
    v^{(1)0}_1 ← A^{(1)}[i*n+j];
    v^0_1 ← A[i*n+j];
    v^{(1)0}_2 ← v^{(1)0}_0 * v^0_1 + v^0_0 * v^{(1)0}_1;
    v^0_2 ← v^0_0 * v^0_1;
    v^{(1)0}_3 ← v^{(1)0}_2 * (0. - sin(v^0_2));
    v^0_3 ← cos(v^0_2);
    z^{(1)} +← v^{(1)0}_3;
    z +← v^0_3;
    j ← j+1;
  }
  v^{(1)0}_0 ← y^{(1)};
  v^0_0 ← y;
  v^{(1)0}_1 ← z^{(1)};
  v^0_1 ← z;
  v^{(1)0}_2 ← v^{(1)0}_0 * v^0_1 + v^0_0 * v^{(1)0}_1;
  v^0_2 ← v^0_0 * v^0_1;
  thread_result^{(1)}[tid] +← v^{(1)0}_2;
  thread_result[tid] +← v^0_2;
  i ← i+1;
}
dummy("", (void*) thread_result, (void*) thread_result^{(1)});
}
Listing B.2: This listing shows the tangent-linear code of the ‘plain-parallel’ test case
(Listing B.1).
STACK^{(1)}_c.push(48);
STACK^{(1)}_i.push(tid);
tid ← omp_get_thread_num();
STACK^{(1)}_i.push(p);
p ← omp_get_num_threads();
STACK^{(1)}_i.push(c);
c ← n/p;
STACK^{(1)}_i.push(lb);
lb ← tid*c;
STACK^{(1)}_i.push(ub);
ub ← (tid+1)*c-1;
STACK^{(1)}_i.push(i);
i ← lb;
STACK^{(1)}_f.push(thread_result[tid]);
thread_result[tid] ← 0.;
while (i ≤ ub) {
  STACK^{(1)}_c.push(86);
  STACK^{(1)}_f.push(y);
  y ← 0.;
  STACK^{(1)}_f.push(z);
  z ← 0.;
  STACK^{(1)}_i.push(j);
  j ← 0;
  while (j < n) {
    STACK^{(1)}_c.push(118);
    STACK^{(1)}_f.push(y);
    y +← sin(A[i*n+j] * A[i*n+j]);
    STACK^{(1)}_f.push(z);
    z +← cos(A[i*n+j] * A[i*n+j]);
    STACK^{(1)}_i.push(j);
    j ← j+1;
  }
  STACK^{(1)}_c.push(154);
  STACK^{(1)}_f.push(thread_result[tid]);
  thread_result[tid] +← y*z;
  STACK^{(1)}_i.push(i);
  i ← i+1;
}
dummy("", (void*) thread_result);
while (not STACK^{(1)}_c.empty()) {
  if (STACK^{(1)}_c.top() = 48) {
    STACK^{(1)}_c.pop();
    thread_result[tid] ← STACK^{(1)}_f.top();
    STACK^{(1)}_f.pop();
    v^0_0 ← 0.;
    v^0_{(1)0} ← thread_result_{(1)}[tid];
    thread_result_{(1)}[tid] ← 0.;
    i ← STACK^{(1)}_i.top();
    STACK^{(1)}_i.pop();
    ub ← STACK^{(1)}_i.top();
    STACK^{(1)}_i.pop();
    lb ← STACK^{(1)}_i.top();
    STACK^{(1)}_i.pop();
    c ← STACK^{(1)}_i.top();
    STACK^{(1)}_i.pop();
    p ← STACK^{(1)}_i.top();
    STACK^{(1)}_i.pop();
    tid ← STACK^{(1)}_i.top();
    STACK^{(1)}_i.pop();
  }
  if (STACK^{(1)}_c.top() = 86) {
    STACK^{(1)}_c.pop();
    j ← STACK^{(1)}_i.top();
    STACK^{(1)}_i.pop();
    z ← STACK^{(1)}_f.top();
    STACK^{(1)}_f.pop();
    v^0_0 ← 0.;
    v^0_{(1)0} ← z_{(1)};
    z_{(1)} ← 0.;
    y ← STACK^{(1)}_f.top();
    STACK^{(1)}_f.pop();
    v^0_0 ← 0.;
    v^0_{(1)0} ← y_{(1)};
    y_{(1)} ← 0.;
  }
  if (STACK^{(1)}_c.top() = 118) {
    STACK^{(1)}_c.pop();
    j ← STACK^{(1)}_i.top();
    STACK^{(1)}_i.pop();
    z ← STACK^{(1)}_f.top();
    STACK^{(1)}_f.pop();
    v^0_0 ← A[i*n+j];
    v^0_1 ← A[i*n+j];
    v^0_2 ← v^0_0 * v^0_1;
    v^0_3 ← cos(v^0_2);
    v^0_{(1)3} ← z_{(1)};
    v^0_{(1)2} ← v^0_{(1)3} * (0. - sin(v^0_2));
    v^0_{(1)0} ← v^0_{(1)2} * v^0_1;
    v^0_{(1)1} ← v^0_{(1)2} * v^0_0;
    #pragma omp atomic
    A_{(1)}[i*n+j] +← v^0_{(1)0};
    #pragma omp atomic
    A_{(1)}[i*n+j] +← v^0_{(1)1};
    y ← STACK^{(1)}_f.top();
    STACK^{(1)}_f.pop();
    v^0_0 ← A[i*n+j];
    v^0_1 ← A[i*n+j];
    v^0_2 ← v^0_0 * v^0_1;
    v^0_3 ← sin(v^0_2);
    v^0_{(1)3} ← y_{(1)};
    v^0_{(1)2} ← v^0_{(1)3} * cos(v^0_2);
    v^0_{(1)0} ← v^0_{(1)2} * v^0_1;
    v^0_{(1)1} ← v^0_{(1)2} * v^0_0;
    #pragma omp atomic
    A_{(1)}[i*n+j] +← v^0_{(1)0};
    #pragma omp atomic
    A_{(1)}[i*n+j] +← v^0_{(1)1};
  }
  if (STACK^{(1)}_c.top() = 154) {
    STACK^{(1)}_c.pop();
    i ← STACK^{(1)}_i.top();
    STACK^{(1)}_i.pop();
    thread_result[tid] ← STACK^{(1)}_f.top();
    STACK^{(1)}_f.pop();
    v^0_0 ← y;
    v^0_1 ← z;
    v^0_2 ← v^0_0 * v^0_1;
    v^0_{(1)2} ← thread_result_{(1)}[tid];
    v^0_{(1)0} ← v^0_{(1)2} * v^0_1;
    v^0_{(1)1} ← v^0_{(1)2} * v^0_0;
    y_{(1)} +← v^0_{(1)0};
    z_{(1)} +← v^0_{(1)1};
  }
  dummy("", (void*) thread_result, (void*) thread_result_{(1)});
}
}
Listing B.3: The adjoint code of the ‘plain-parallel’ test case (Listing B.1), generated
without using the exclusive read analysis to obtain static program information. With the
help of the exclusive read analysis, the four atomic constructs in this code can be
avoided.
d o u b l e v 10 ;
(2)1
d o u b l e v0 ;
d o u b l e v 11 ;
(2)1
d o u b l e v1 ;
d o u b l e v 12 ;
(2)1
d o u b l e v2 ;
d o u b l e v 13 ;
(2)1
d o u b l e v3 ;
d o u b l e v 14 ;
(2)1
d o u b l e v4 ;
d o u b l e v 15 ;
(2)1
d o u b l e v5 ;
d o u b l e v 16 ;
(2)1
d o u b l e v6 ;
t i d ←omp_get_thread_num ( ) ;
p←omp_get_num_threads ( ) ;
c←n / p ;
l b ← t i d ∗c ;
ub← ( t i d +1)∗c −1;
i←lb ;
(2)1
v0 ← 0 . ;
v 10 ← 0 . ;
(1,2)0 (2)1
v0 ←v 0 ;
(1)0
v 0 ←v 10 ;
(2)1
v0 ← 0 . ;
v 10 ← 0 . ;
(2)0 (2)1
v 0 ←v 0 ;
v 00 ←v 10 ;
(2)1 (1,2)0
v 0 ←v 0 ;
(1)0
v 10 ←v 0 ;
(1,2) (2)1
t h r e a d _ r e s u l t tid ←v 0 ;
(1)
t h r e a d _ r e s u l t tid ←v 10 ;
(2)1 (2)0
v 0 ←v 0 ;
v 10 ←v 00 ;
(2) (2)1
t h r e a d _ r e s u l t tid ←v 0 ;
t h r e a d _ r e s u l t tid ←v 10 ;
w h i l e ( i ≤ ub ) {
(2)1
v0 ← 0 . ;
v 10 ← 0 . ;
(1,2)0 (2)1
v0 ←v 0 ;
(1)0 1
v 0 ←v 0 ;
(2)1
v0 ← 0 . ;
v 10 ← 0 . ;
(2)0 (2)1
v 0 ←v 0 ;
v 00 ←v 10 ;
(2)1 (1,2)0
v 0 ←v 0 ;
(1)0
v 10 ←v 0 ;
(2)1
y (1,2) ←v 0 ;
y (1) ←v 10 ;
(2)1 (2)0
v 0 ←v 0 ;
v 10 ←v 00 ;
(2)1
y (2) ←v 0 ;
y←v 10 ;
(2)1
v0 ← 0 . ;
v 10 ← 0 . ;
(1,2)0 (2)1
v0 ←v 0 ;
(1)0 1
v 0 ←v 0 ;
(2)1
v0 ← 0 . ;
v 10 ← 0 . ;
(2)0 (2)1
v 0 ←v 0 ;
0 1
v 0 ←v 0 ;
(2)1 (1,2)0
v 0 ←v 0 ;
1 (1)0
v 0 ←v 0 ;
(2)1
z (1,2) ←v 0 ;
z (1) ←v 10 ;
(2)1 (2)0
v 0 ←v 0 ;
v 10 ←v 00 ;
(2)1
z (2) ←v 0 ;
z←v 10 ;
j←0;
w h i l e ( j <n ) {
(2)1 (1,2)
v 0 ←Ai∗n+ j ;
(1)
v 10 ←Ai∗n+ j ;
(1,2)0 (2)1
v0 ←v 0 ;
(1)0
v 0 ←v 10 ;
(2)1 (2)
v 0 ←Ai∗n+ j ;
1
v 0 ←Ai∗n+ j ;
(2)0 (2)1
v 0 ←v 0 ;
0 1
v 0 ←v 0 ;
(2)1 (1,2)
v 0 ←Ai∗n+ j ;
(1)
v 10 ←Ai∗n+ j ;
(1,2)0 (2)1
v1 ←v 0 ;
(1)0 1
v 1 ←v 0 ;
(2)1 (2)
v 0 ←Ai∗n+ j ;
v 10 ←Ai∗n+ j ;
(2)0 (2)1
v 1 ←v 0 ;
v 01 ←v 10 ;
(2)1 (1,2)0
v 0 ←v 0 ;
(1)0
v 10 ←v 0 ;
(2)1 (2)0
v 1 ←v 1 ;
v 11 ←v 01 ;
(2)1 (2)1 (2)1
v 2 ←v 0 ∗ v 11+v 10 ∗ v 1 ;
v 12 ←v 10 ∗ v 11 ;
(2)1 (2)0
v 3 ←v 0 ;
v 13 ←v 00 ;
(2)1 (1,2)0
v 4 ←v 1 ;
(1)0
v 14 ←v 1 ;
(2)1 (2)1 (2)1
v 5 ←v 3 ∗ v 14+v 13 ∗ v 4 ;
v 15 ←v 13 ∗ v 14 ;
(2)1 (2)1 (2)1
v 6 ←v 2 +v 5 ;
v 16 ←v 12+v 15 ;
(1,2)0 (2)1
v2 ←v 6 ;
(1)0 1
v 2 ←v 6 ;
(2)1 (2)0
v 0 ←v 0 ;
v 10 ←v 00 ;
(2)1 (2)0
v 1 ←v 1 ;
v 11 ←v 01 ;
(2)1 (2)1 (2)1
v 2 ←v 0 ∗ v 11+v 10 ∗ v 1 ;
v 12 ←v 10 ∗ v 11 ;
(2)0 (2)1
v 2 ←v 2 ;
v 02 ←v 12 ;
(2)1 (1,2)0
v 0 ←v 2 ;
(1)0
v 10 ←v 2 ;
(2)1 (2)0
v 1 ←v 2 ;
v 11 ←v 02 ;
(2)1 (2)1
v 2 ←v 1 ∗(0. − s i n ( v 11 ) );
v 12 ← c o s ( v 11 ) ;
(2)1 (2)1 (2)1
v 3 ←v 0 ∗ v 12+v 10 ∗ v 2 ;
1 1 1
v 3 ←v 0 ∗ v 2 ;
(1,2)0 (2)1
v3 ←v 3 ;
(1)0 1
v 3 ←v 3 ;
(2)1 (2)0
v 0 ←v 2 ;
v 10 ←v 02 ;
(2)1 (2)1
v 1 ←v 0 ∗ c o s ( v 10 ) ;
v 11 ← s i n ( v 10 ) ;
(2)0 (2)1
v 3 ←v 1 ;
v 03 ←v 11 ;
(2)1 (1,2)0
v 0 ←v 3 ;
(1)0
v 10 ←v 3 ;
(2)1
y (1,2)+←v 0 ;
(1) 1
y +←v 0 ;
(2)1 (2)0
v 0 ←v 3 ;
1 0
v 0 ←v 3 ;
(2)1
y (2)+←v 0 ;
1
y+←v 0 ;
(2)1 (1,2)
v 0 ←Ai∗n+ j ;
(1)
v 10 ←Ai∗n+ j ;
(1,2)0 (2)1
v0 ←v 0 ;
(1)0 1
v 0 ←v 0 ;
(2)1 (2)
v 0 ←Ai∗n+ j ;
v 10 ←Ai∗n+ j ;
(2)0 (2)1
v 0 ←v 0 ;
v 00 ←v 10 ;
(2)1 (1,2)
v 0 ←Ai∗n+ j ;
(1)
v 10 ←Ai∗n+ j ;
(1,2)0 (2)1
v1 ←v 0 ;
(1)0
v 1 ←v 10 ;
(2)1 (2)
v 0 ←Ai∗n+ j ;
v 10 ←Ai∗n+ j ;
(2)0 (2)1
v 1 ←v 0 ;
v 01 ←v 10 ;
(2)1 (1,2)0
v 0 ←v 0 ;
1 (1)0
v 0 ←v 0 ;
(2)1 (2)0
v 1 ←v 1 ;
v 11 ←v 01 ;
(2)1 (2)1 (2)1
v 2 ←v 0 ∗ v 11+v 10 ∗ v 1 ;
v 12 ←v 10 ∗ v 11 ;
(2)1 (2)0
v 3 ←v 0 ;
v 13 ←v 00 ;
(2)1 (1,2)0
v 4 ←v 1 ;
(1)0
v 14 ←v 1 ;
(2)1 (2)1 (2)1
v 5 ←v 3 ∗ v 14+v 13 ∗ v 4 ;
v 15 ←v 13 ∗ v 14 ;
(2)1 (2)1 (2)1
v 6 ←v 2 +v 5 ;
v 16 ←v 12+v 15 ;
(1,2)0 (2)1
v2 ←v 6 ;
(1)0 1
v 2 ←v 6 ;
(2)1 (2)0
v 0 ←v 0 ;
v 10 ←v 00 ;
(2)1 (2)0
v 1 ←v 1 ;
v 11 ←v 01 ;
(2)1 (2)1 (2)1
v 2 ←v 0 ∗ v 11+v 10 ∗ v 1 ;
v 12 ←v 10 ∗ v 11 ;
(2)0 (2)1
v 2 ←v 2 ;
v 02 ←v 12 ;
(2)1 (1,2)0
v 0 ←v 2 ;
(1)0
v 10 ←v 2 ;
(2)1
v1 ← 0 . ;
v 11 ← 0 . ;
(2)1 (2)0
v 2 ←v 2 ;
v 12 ←v 02 ;
(2)1 (2)1
v 3 ←v 2 ∗ c o s ( v 12 ) ;
v 13 ← s i n ( v 12 ) ;
(2)1 (2)1 (2)1
v 4 ←v 1 −v 3 ;
1 1 1
v 4 ←v 1−v 3 ;
d o u b l e z (2) ← 0 . ;
d o u b l e z (1) ← 0 . ;
(2)
d o u b l e z (1) ← 0 . ;
d o u b l e v 00 ;
(2)0
d o u b l e v0 ← 0 . ;
d o u b l e v 0(1)0 ;
(2)0
d o u b l e v (1)0 ← 0 . ;
d o u b l e v 01 ;
(2)0
d o u b l e v1 ← 0 . ;
d o u b l e v 0(1)1 ;
(2)0
d o u b l e v (1)1 ← 0 . ;
d o u b l e v 02 ;
(2)0
d o u b l e v2 ← 0 . ;
d o u b l e v 0(1)2 ;
(2)0
d o u b l e v (1)2 ← 0 . ;
d o u b l e v 03 ;
(2)0
d o u b l e v3 ← 0 . ;
d o u b l e v 0(1)3 ;
(2)0
d o u b l e v (1)3 ← 0 . ;
d o u b l e v 10 ;
(2)1
d o u b l e v0 ;
d o u b l e v 11 ;
(2)1
d o u b l e v1 ;
d o u b l e v 12 ;
(2)1
d o u b l e v2 ;
d o u b l e v 13 ;
(2)1
d o u b l e v3 ;
d o u b l e v 14 ;
(2)1
d o u b l e v4 ;
d o u b l e v 15 ;
(2)1
d o u b l e v5 ;
STACK(1)c . p u s h ( 4 8 ) ;
STACK(1)i . p u s h ( t i d ) ;
t i d ←omp_get_thread_num ( ) ;
STACK(1)i . p u s h ( p ) ;
p←omp_get_num_threads ( ) ;
STACK(1)i . p u s h ( c ) ;
c←n / p ;
STACK(1)i . p u s h ( l b ) ;
l b ← t i d ∗c ;
STACK(1)i . p u s h ( ub ) ;
ub← ( t i d +1)∗c −1;
STACK(1)i . p u s h ( i ) ;
i←lb ;
(2)
STACK(2) f . p u s h ( t h r e a d _ r e s u l t tid ) ;
STACK(1) f . p u s h ( t h r e a d _ r e s u l t tid ) ;
(2)1
v0 ← 0 . ;
v 10 ← 0 . ;
(2) (2)1
t h r e a d _ r e s u l t tid ←v 0 ;
t h r e a d _ r e s u l t tid ←v 10 ;
w h i l e ( i ≤ ub ) {
STACK(1)c . p u s h ( 8 6 ) ;
STACK(2) f . p u s h ( y (2) ) ;
STACK(1) f . p u s h ( y ) ;
(2)1
v0 ← 0 . ;
v 10 ← 0 . ;
(2)1
y (2) ←v 0 ;
y←v 10 ;
STACK(2) f . p u s h ( z (2) ) ;
STACK(1) f . p u s h ( z ) ;
(2)1
v0 ← 0 . ;
v 10 ← 0 . ;
(2)1
z (2) ←v 0 ;
z←v 10 ;
STACK(1)i . p u s h ( j ) ;
j←0;
w h i l e ( j <n ) {
STACK(1)c . p u s h ( 1 1 8 ) ;
STACK(2) f . p u s h ( y (2) ) ;
STACK(1) f . p u s h ( y ) ;
(2)1 (2)
v 0 ←Ai∗n+ j ;
v 10 ←Ai∗n+ j ;
(2)1 (2)
v 1 ←Ai∗n+ j ;
v 11 ←Ai∗n+ j ;
(2)1 (2)1 (2)1
v 2 ←v 0 ∗ v 11+v 10 ∗ v 1 ;
v 12 ←v 10 ∗ v 11 ;
(2)1 (2)1
v 3 ←v 2 ∗ c o s ( v 12 ) ;
v 13 ← s i n ( v 12 ) ;
(2)1
y (2)+←v 3 ;
y+←v 13 ;
STACK(2) f . p u s h ( z (2) ) ;
STACK(1) f . p u s h ( z ) ;
(2)1 (2)
v 0 ←Ai∗n+ j ;
v 10 ←Ai∗n+ j ;
(2)1 (2)
v 1 ←Ai∗n+ j ;
v 11 ←Ai∗n+ j ;
(2)1 (2)1 (2)1
v 2 ←v 0 ∗ v 11+v 10 ∗ v 1 ;
v 12 ←v 10 ∗ v 11 ;
(2)1 (2)1
v 3 ←v 2 ∗(0. − s i n ( v 12 ) ) ;
v 13 ← c o s ( v 12 ) ;
(2)1
z (2)+←v 3 ;
z+←v 13 ;
STACK(1)i . p u s h ( j ) ;
j ← j +1;
}
STACK(1)c . p u s h ( 1 5 4 ) ;
(2)
STACK(2) f . p u s h ( t h r e a d _ r e s u l t tid ) ;
STACK(1) f . p u s h ( t h r e a d _ r e s u l t tid ) ;
(2)1
v 0 ←y (2) ;
v 10 ←y ;
(2)1
v 1 ←z (2) ;
v 11 ←z ;
(2)1 (2)1 (2)1
v 2 ←v 0 ∗ v 11+v 10 ∗ v 1 ;
v 12 ←v 10 ∗ v 11 ;
(2) (2)1
t h r e a d _ r e s u l t tid +←v 2 ;
t h r e a d _ r e s u l t tid+←v 12 ;
STACK(1)i . p u s h ( i ) ;
i ← i +1;
}
dummy ( " " , ( v o i d ∗ ) t h r e a d _ r e s u l t , ( v o i d ∗ ) t h r e a d _ r e s u l t (2) ) ;
w h i l e ( n o t STACK(1)c . empty ( ) ) {
i f ( STACK(1)c . t o p ( ) = 4 8 ) {
STACK(1)c . pop ( ) ;
(2)
t h r e a d _ r e s u l t tid ←STACK(2) f . t o p ( ) ;
t h r e a d _ r e s u l t tid ←STACK(1) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
STACK(1) f . pop ( ) ;
(2)1
v0 ← 0 . ;
v 10 ← 0 . ;
(2)0 (2)1
v 0 ←v 0 ;
v 00 ←v 10 ;
(2)1 (2)
v 0 ← t h r e a d _ r e s u l t (1)tid ;
v 10 ← t h r e a d _ r e s u l t (1)tid ;
(2)0 (2)1
v (1)0 ←v 0 ;
v 0(1)0 ←v 10 ;
(2)1
v0 ← 0 . ;
v 10 ← 0 . ;
(2) (2)1
t h r e a d _ r e s u l t (1)tid ←v 0 ;
t h r e a d _ r e s u l t (1)tid ←v 10 ;
i ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
ub←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
l b ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
c←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
p←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
t i d ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
}
i f ( STACK(1)c . t o p ( ) = 8 6 ) {
STACK(1)c . pop ( ) ;
j ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
z (2) ←STACK(2) f . t o p ( ) ;
z←STACK(1) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
STACK(1) f . pop ( ) ;
(2)1
v0 ← 0 . ;
v 10 ← 0 . ;
(2)0 (2)1
v 0 ←v 0 ;
v 00 ←v 10 ;
(2)1 (2)
v 0 ←z (1) ;
v 10 ←z (1) ;
(2)0 (2)1
v (1)0 ←v 0 ;
v (1)0 ←v 10 ;
0
(2)1
v0 ← 0 . ;
v 10 ← 0 . ;
(2) (2)1
z (1) ←v 0 ;
1
z (1) ←v 0 ;
(2)
y ←STACK(2) f . t o p ( ) ;
y←STACK(1) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
STACK(1) f . pop ( ) ;
(2)1
v0 ← 0 . ;
v 10 ← 0 . ;
(2)0 (2)1
v 0 ←v 0 ;
v 00 ←v 10 ;
(2)1 (2)
v 0 ←y (1) ;
v 10 ←y (1) ;
(2)0 (2)1
v (1)0 ←v 0 ;
v 0(1)0 ←v 10 ;
(2)1
v0 ← 0 . ;
v 10 ← 0 . ;
(2) (2)1
y (1) ←v 0 ;
y (1) ←v 10 ;
}
i f ( STACK(1)c . t o p ( ) = 1 1 8 ) {
STACK(1)c . pop ( ) ;
j ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
z (2) ←STACK(2) f . t o p ( ) ;
z←STACK(1) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
STACK(1) f . pop ( ) ;
(2)1 (2)
v 0 ←Ai∗n+ j ;
v 10 ←Ai∗n+ j ;
(2)0 (2)1
v 0 ←v 0 ;
v 00 ←v 10 ;
(2)1 (2)
v 0 ←Ai∗n+ j ;
v 10 ←Ai∗n+ j ;
(2)0 (2)1
v 1 ←v 0 ;
v 01 ←v 10 ;
(2)1 (2)0
v 0 ←v 0 ;
v 10 ←v 00 ;
(2)1 (2)0
v 1 ←v 1 ;
v 11 ←v 01 ;
(2)1 (2)1 (2)1
v 2 ←v 0 ∗ v 11+v 10 ∗ v 1 ;
v 12 ←v 10 ∗ v 11 ;
(2)0 (2)1
v 2 ←v 2 ;
v 02 ←v 12 ;
(2)1 (2)0
v 0 ←v 2 ;
v 10 ←v 02 ;
(2)1 (2)1
v 1 ←v 0 ∗(0. − s i n ( v 10 ) ) ;
v 11 ← c o s ( v 10 ) ;
(2)0 (2)1
v 3 ←v 1 ;
v 03 ←v 11 ;
(2)1 (2)
v 0 ←z (1) ;
v 10 ←z (1) ;
(2)0 (2)1
v (1)3 ←v 0 ;
v 0(1)3 ←v 10 ;
(2)1 (2)0
v 0 ←v (1)3 ;
v 10 ←v 0(1)3 ;
(2)1
v1 ← 0 . ;
v 11 ← 0 . ;
(2)1 (2)0
v 2 ←v 2 ;
1 0
v 2 ←v 2 ;
(2)1 (2)1
v 3 ←v 2 ∗ c o s ( v 12 ) ;
1
v 3 ← s i n ( v 12 ) ;
(2)1 (2)1 (2)1
v 4 ←v 1 −v 3 ;
1 1 1
v 4 ←v 1−v 3 ;
(2)1 (2)1 (2)1
v 5 ←v 0 ∗ v 14+v 10 ∗ v 4 ;
v 15 ←v 10 ∗ v 14 ;
(2)0 (2)1
v (1)2 ←v 5 ;
v 0(1)2 ←v 15 ;
(2)1 (2)0
v 0 ←v (1)2 ;
1 0
v 0 ←v (1)2 ;
(2)1 (2)0
v 1 ←v 1 ;
v 11 ←v 01 ;
(2)1 (2)1 (2)1
v 2 ←v 0 ∗ v 11+v 10 ∗ v 1 ;
v 12 ←v 10 ∗ v 11 ;
(2)0 (2)1
v (1)0 ←v 2 ;
v 0(1)0 ←v 12 ;
(2)1 (2)0
v 0 ←v (1)2 ;
v 10 ←v 0(1)2 ;
(2)1 (2)0
v 1 ←v 0 ;
v 11 ←v 00 ;
(2)1 (2)1 (2)1
v 2 ←v 0 ∗ v 11+v 10 ∗ v 1 ;
v 12 ←v 10 ∗ v 11 ;
(2)0 (2)1
v (1)1 ←v 2 ;
v 0(1)1 ←v 12 ;
(2)1 (2)0
v 0 ←v (1)0 ;
v 10 ←v 0(1)0
;
#pragma omp a t o m i c
(2) (2)1
A(1)i∗n+ j+←v 0 ;
#pragma omp a t o m i c
A(1)i∗n+ j+←v 10 ;
(2)1 (2)0
v 0 ←v (1)1 ;
v 10 ←v 0(1)1
;
#pragma omp a t o m i c
(2) (2)1
A(1)i∗n+ j+←v 0 ;
#pragma omp a t o m i c
A(1)i∗n+ j+←v 10 ;
y (2) ←STACK(2) f . t o p ( ) ;
y←STACK(1) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
STACK(1) f . pop ( ) ;
(2)1 (2)
v 0 ←Ai∗n+ j ;
v 10 ←Ai∗n+ j ;
(2)0 (2)1
v 0 ←v 0 ;
v 00 ←v 10 ;
(2)1 (2)
v 0 ←Ai∗n+ j ;
v 10 ←Ai∗n+ j ;
(2)0 (2)1
v 1 ←v 0 ;
v 01 ←v 10 ;
(2)1 (2)0
v 0 ←v 0 ;
v 10 ←v 00 ;
(2)1 (2)0
v 1 ←v 1 ;
v 11 ←v 01 ;
(2)1 (2)1 (2)1
v 2 ←v 0 ∗ v 11+v 10 ∗ v 1 ;
v 12 ←v 10 ∗ v 11 ;
(2)0 (2)1
v 2 ←v 2 ;
v 02 ←v 12 ;
(2)1 (2)0
v 0 ←v 2 ;
v 10 ←v 02 ;
(2)1 (2)1
v 1 ←v 0 ∗ c o s ( v 10 ) ;
v 11 ← s i n ( v 10 ) ;
(2)0 (2)1
v 3 ←v 1 ;
v 03 ←v 11 ;
(2)1 (2)
v 0 ←y (1) ;
v 10 ←y (1) ;
(2)0 (2)1
v (1)3 ←v 0 ;
v 0(1)3 ←v 10 ;
(2)1 (2)0
v 0 ←v (1)3 ;
v 10 ←v 0(1)3 ;
(2)1 (2)0
v 1 ←v 2 ;
v 11 ←v 02 ;
(2)1 (2)1
v 2 ←v 1 ∗(0. − s i n ( v 11 ) );
v 12 ← c o s ( v 11 ) ;
(2)1 (2)1 (2)1
v 3 ←v 0 ∗ v 12+v 10 ∗ v 2 ;
v 13 ←v 10 ∗ v 12 ;
(2)0 (2)1
v (1)2 ←v 3 ;
0 1
v (1)2 ←v 3 ;
(2)1 (2)0
v 0 ←v (1)2 ;
v 10 ←v 0(1)2 ;
(2)1 (2)0
v 1 ←v 1 ;
1 0
v 1 ←v 1 ;
(2)1 (2)1 (2)1
v 2 ←v 0 ∗ v 11+v 10 ∗ v 1 ;
1 1 1
v 2 ←v 0 ∗ v 1 ;
(2)0 (2)1
v (1)0 ←v 2 ;
0 1
v (1)0 ←v 2 ;
(2)1 (2)0
v 0 ←v (1)2 ;
1 0
v 0 ←v (1)2 ;
(2)1 (2)0
v 1 ←v 0 ;
v 11 ←v 00 ;
(2)1 (2)1 (2)1
v 2 ←v 0 ∗ v 11+v 10 ∗ v 1 ;
v 12 ←v 10 ∗ v 11 ;
(2)0 (2)1
v (1)1 ←v 2 ;
v 0(1)1 ←v 12 ;
(2)1 (2)0
v 0 ←v (1)0 ;
1 0
v 0 ←v (1)0 ;
#pragma omp a t o m i c
(2) (2)1
A(1)i∗n+ j+←v 0 ;
#pragma omp a t o m i c
A(1)i∗n+ j+←v 10 ;
(2)1 (2)0
v 0 ←v (1)1 ;
v 10 ←v 0(1)1
;
#pragma omp a t o m i c
(2) (2)1
A(1)i∗n+ j+←v 0 ;
#pragma omp a t o m i c
A(1)i∗n+ j+←v 10 ;
}
i f ( STACK(1)c . t o p ( ) = 1 5 4 ) {
STACK(1)c . pop ( ) ;
i ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
(2)
t h r e a d _ r e s u l t tid ←STACK(2) f . t o p ( ) ;
t h r e a d _ r e s u l t tid ←STACK(1) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
STACK(1) f . pop ( ) ;
(2)1
v 0 ←y (2) ;
v 10 ←y ;
(2)0 (2)1
v 0 ←v 0 ;
v 00 ←v 10 ;
(2)1
v 0 ←z (2) ;
v 10 ←z ;
(2)0 (2)1
v 1 ←v 0 ;
v 01 ←v 10 ;
(2)1 (2)0
v 0 ←v 0 ;
v 10 ←v 00 ;
(2)1 (2)0
v 1 ←v 1 ;
v 11 ←v 01 ;
(2)1 (2)1 (2)1
v 2 ←v 0 ∗ v 11+v 10 ∗ v 1 ;
v 12 ←v 10 ∗ v 11 ;
(2)0 (2)1
v 2 ←v 2 ;
v 02 ←v 12 ;
(2)1 (2)
v 0 ← t h r e a d _ r e s u l t (1)tid ;
v 10 ← t h r e a d _ r e s u l t (1)tid ;
(2)0 (2)1
v (1)2 ←v 0 ;
v 0(1)2 ←v 10 ;
(2)1 (2)0
v 0 ←v (1)2 ;
v 10 ←v 0(1)2 ;
(2)1 (2)0
v 1 ←v 1 ;
1 0
v 1 ←v 1 ;
(2)1 (2)1 (2)1
v 2 ←v 0 ∗ v 11+v 10 ∗ v 1 ;
1 1 1
v 2 ←v 0 ∗ v 1 ;
(2)0 (2)1
v (1)0 ←v 2 ;
0 1
v (1)0 ←v 2 ;
(2)1 (2)0
v 0 ←v (1)2 ;
1 0
v 0 ←v (1)2 ;
(2)1 (2)0
v 1 ←v 0 ;
v 11 ←v 00 ;
(2)1 (2)1 (2)1
v 2 ←v 0 ∗ v 11+v 10 ∗ v 1 ;
v 12 ←v 10 ∗ v 11 ;
(2)0 (2)1
v (1)1 ←v 2 ;
v 0(1)1 ←v 12 ;
(2)1 (2)0
v 0 ←v (1)0 ;
1 0
v 0 ←v (1)0 ;
(2) (2)1
y (1)+←v 0 ;
1
y (1)+←v 0 ;
(2)1 (2)0
v 0 ←v (1)1 ;
v 10 ←v 0(1)1 ;
(2) (2)1
z (1)+←v 0 ;
z (1)+←v 10 ;
}
dummy ( " " , ( v o i d ∗ ) t h r e a d _ r e s u l t , ( v o i d ∗ ) t h r e a d _ r e s u l t (1) , ( v o i d ∗ )
(2)
t h r e a d _ r e s u l t (2) , ( v o i d ∗ ) t h r e a d _ r e s u l t (1) ) ;
}
}
Listing B.5: This code represents the second-order adjoint code of the ‘plain-parallel’
test case (Listing B.1) in forward-over-reverse mode. It is obtained by applying first the
reverse mode and then the forward mode, in other words by applying the source
transformation τ(σ(P)).
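To make the forward-over-reverse composition concrete outside of the notation used above, the following minimal C sketch re-differentiates the first-order adjoint of the scalar function f(x) = sin(x) in forward mode. The suffixes _a (adjoints) and _t (tangents of adjoint-code variables) are conventions of this sketch only and do not appear in the generated code.

#include <math.h>
#include <stdio.h>

/* Minimal forward-over-reverse sketch for f(x) = sin(x): the adjoint
 * statement x_a += y_a * cos(x) is re-differentiated in forward mode.
 * x_t is the tangent of x, y_a_t and x_a_t are the tangents of the
 * adjoint variables y_a and x_a (names assumed for this sketch). */
void sin_fwd_over_rev(double x, double x_t, double y_a, double y_a_t,
                      double *x_a, double *x_a_t) {
  /* tangent of the adjoint statement (product and chain rule) */
  *x_a_t += y_a_t * cos(x) + y_a * (-sin(x)) * x_t;
  /* the first-order adjoint statement itself */
  *x_a += y_a * cos(x);
}

int main(void) {
  double x = 0.7, x_a = 0., x_a_t = 0.;
  /* seeding x_t = 1, y_a = 1, y_a_t = 0 delivers f''(x) = -sin(x) in x_a_t */
  sin_fwd_over_rev(x, 1., 1., 0., &x_a, &x_a_t);
  printf("f'(x)  = %f, expected %f\n", x_a, cos(x));
  printf("f''(x) = %f, expected %f\n", x_a_t, -sin(x));
  return 0;
}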
Reverse-over-Forward
#pragma omp p a r a l l e l
{
int tid←0;
i n t p← 0 ;
i n t c← 0 ;
int lb←0;
i n t ub← 0 ;
int i←0;
int j←0;
double y ;
d o u b l e y (2) ← 0 . ;
d o u b l e y (1) ← 0 . ;
(1)
double y (2) ← 0 . ;
double z;
double z (2) ← 0 . ;
double z (1) ← 0 . ;
(1)
d o u b l e z (2) ← 0 . ;
d o u b l e v 00 ;
d o u b l e v 0(2)0 ← 0 . ;
(1)0
d o u b l e v0 ;
(1)0
d o u b l e v (2)0 ← 0 . ;
d o u b l e v 01 ;
d o u b l e v 0(2)1 ← 0 . ;
(1)0
d o u b l e v1 ;
(1)0
d o u b l e v (2)1 ← 0 . ;
d o u b l e v 02 ;
d o u b l e v 0(2)2 ← 0 . ;
(1)0
d o u b l e v2 ;
(1)0
d o u b l e v (2)2 ← 0 . ;
d o u b l e v 03 ;
d o u b l e v 0(2)3 ← 0 . ;
(1)0
d o u b l e v3 ;
(1)0
d o u b l e v (2)3 ← 0 . ;
double v 10 ;
double v 1(2)0 ;
double v 11 ;
double v 1(2)1 ;
double v 12 ;
double v 1(2)2 ;
double v 13 ;
double v 1(2)3 ;
double v 14 ;
double v 1(2)4 ;
double v 15 ;
double v 1(2)5 ;
double v 16 ;
double v 1(2)6 ;
STACK(2)c . p u s h ( 7 8 ) ;
STACK(2)i . p u s h ( t i d ) ;
t i d ←omp_get_thread_num ( ) ;
STACK(2)i . p u s h ( p ) ;
p←omp_get_num_threads ( ) ;
STACK(2)i . p u s h ( c ) ;
c←n / p ;
STACK(2)i . p u s h ( l b ) ;
l b ← t i d ∗c ;
STACK(2)i . p u s h ( ub ) ;
ub← ( t i d +1)∗c −1;
STACK(2)i . p u s h ( i ) ;
i←lb ;
(1)0
STACK(2) f . p u s h ( v 0 );
(1)0
v0 ← 0 . ;
STACK(2) f . p u s h ( v 00 ) ;
v 00 ← 0 . ;
(1)
STACK(2) f . p u s h ( t h r e a d _ r e s u l t tid ) ;
(1) (1)0
t h r e a d _ r e s u l t tid ←v 0 ;
STACK(2) f . p u s h ( t h r e a d _ r e s u l t tid ) ;
t h r e a d _ r e s u l t tid ←v 00 ;
w h i l e ( i ≤ ub ) {
STACK(2)c . p u s h ( 1 2 7 ) ;
(1)0
STACK(2) f . p u s h ( v 0 );
(1)0
v0 ← 0 . ;
STACK(2) f . p u s h ( v 00 ) ;
v 00 ← 0 . ;
STACK(2) f . p u s h ( y (1) ) ;
(1)0
y (1) ←v 0 ;
STACK(2) f . p u s h ( y ) ;
y←v 00 ;
(1)0
STACK(2) f . p u s h ( v 0 );
(1)0
v0 ← 0 . ;
STACK(2) f . p u s h ( v 00 ) ;
v 00 ← 0 . ;
STACK(2) f . p u s h ( z (1) ) ;
(1)0
z (1) ←v 0 ;
STACK(2) f . p u s h ( z ) ;
z←v 00 ;
STACK(2)i . p u s h ( j ) ;
j←0;
w h i l e ( j <n ) {
STACK(2)c . p u s h ( 1 6 6 ) ;
(1)0
STACK(2) f . p u s h ( v 0 );
(1)0 (1)
v 0 ←Ai∗n+ j ;
STACK(2) f . p u s h ( v 00 ) ;
v 00 ←Ai∗n+ j ;
(1)0
STACK(2) f . p u s h ( v 1 );
(1)0 (1)
v 1 ←Ai∗n+ j ;
STACK(2) f . p u s h ( v 01 ) ;
v 01 ←Ai∗n+ j ;
(1)0
STACK(2) f . p u s h ( v 2 );
(1)0 (1)0 (1)0
v 2 ←v 0 ∗ v 01+v 00 ∗ v 1 ;
STACK(2) f . p u s h ( v 02 ) ;
v 02 ←v 00 ∗ v 01 ;
(1)0
STACK(2) f . p u s h ( v 3 ) ;
(1)0 (1)0 0
v 3 ←v 2 ∗ c o s ( v 2 ) ;
STACK(2) f . p u s h ( v 03 ) ;
v 03 ← s i n ( v 02 ) ;
STACK(2) f . p u s h ( y (1) ) ;
(1)0
y (1)+←v 3 ;
STACK(2) f . p u s h ( y ) ;
y+←v 03 ;
(1)0
STACK(2) f . p u s h ( v 0 ) ;
(1)0 (1)
v 0 ←Ai∗n+ j ;
STACK(2) f . p u s h ( v 00 ) ;
v 00 ←Ai∗n+ j ;
(1)0
STACK(2) f . p u s h ( v 1 ) ;
(1)0 (1)
v 1 ←Ai∗n+ j ;
STACK(2) f . p u s h ( v 01 ) ;
v 01 ←Ai∗n+ j ;
(1)0
STACK(2) f . p u s h ( v 2 ) ;
(1)0 (1)0 0 0 (1)0
v 2 ←v 0 ∗ v 1+v 0 ∗ v 1 ;
STACK(2) f . p u s h ( v 02 ) ;
v 02 ←v 00 ∗ v 01 ;
(1)0
STACK(2) f . p u s h ( v 3 ) ;
(1)0 (1)0
v 3 ←v 2 ∗(0. − s i n ( v 02 ) );
STACK(2) f . p u s h ( v 03 ) ;
v 03 ← c o s ( v 02 ) ;
STACK(2) f . p u s h ( z (1) ) ;
(1)0
z (1)+←v 3 ;
STACK(2) f . p u s h ( z ) ;
z+←v 03 ;
STACK(2)i . p u s h ( j ) ;
j ← j +1;
}
STACK(2)c . p u s h ( 3 1 6 ) ;
(1)0
STACK(2) f . p u s h ( v 0 );
(1)0
v 0 ←y (1);
STACK(2) f . p u s h ( v 00 ) ;
v 00 ←y ;
(1)0
STACK(2) f . p u s h ( v 1 );
(1)0
v 1 ←z (1);
STACK(2) f . p u s h ( v 01 ) ;
v 01 ←z ;
(1)0
STACK(2) f . p u s h ( v 2 );
(1)0 (1)0 (1)0
v 2 ←v 0 ∗ v 01+v 00 ∗ v 1 ;
STACK(2) f . p u s h ( v 02 ) ;
v 02 ←v 00 ∗ v 01 ;
(1)
STACK(2) f . p u s h ( t h r e a d _ r e s u l t tid ) ;
(1) (1)0
t h r e a d _ r e s u l t tid +←v 2 ;
STACK(2) f . p u s h ( t h r e a d _ r e s u l t tid );
t h r e a d _ r e s u l t tid+←v 02 ;
STACK(2)i . p u s h ( i ) ;
i ← i +1;
}
dummy ( " " , ( v o i d ∗ ) t h r e a d _ r e s u l t , ( v o i d ∗ ) t h r e a d _ r e s u l t (1) ) ;
w h i l e ( n o t STACK(2)c . empty ( ) ) {
i f ( STACK(2)c . t o p ( ) = 7 8 ) {
STACK(2)c . pop ( ) ;
t h r e a d _ r e s u l t tid ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
v 10 ←v 00 ;
v 1(2)0 ← t h r e a d _ r e s u l t (2)tid ;
t h r e a d _ r e s u l t (2)tid ← 0 . ;
v 0(2)0+←v 1(2)0 ;
(1)
t h r e a d _ r e s u l t tid ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
(1)0
v 10 ←v 0 ;
(1)
v 1(2)0 ← t h r e a d _ r e s u l t (2)tid ;
(1)
t h r e a d _ r e s u l t (2)tid ← 0 . ;
(1)0
v (2)0+←v 1(2)0 ;
v 00 ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
v 10 ← 0 . ;
v 1(2)0 ←v 0(2)0 ;
v 0(2)0 ← 0 . ;
(1)0
v 0 ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
v 10 ← 0 . ;
(1)0
v 1(2)0 ←v (2)0 ;
(1)0
v (2)0 ← 0 . ;
i ←STACK(2)i . t o p ( ) ;
STACK(2)i . pop ( ) ;
ub←STACK(2)i . t o p ( ) ;
STACK(2)i . pop ( ) ;
l b ←STACK(2)i . t o p ( ) ;
STACK(2)i . pop ( ) ;
c←STACK(2)i . t o p ( ) ;
STACK(2)i . pop ( ) ;
p←STACK(2)i . t o p ( ) ;
STACK(2)i . pop ( ) ;
t i d ←STACK(2)i . t o p ( ) ;
STACK(2)i . pop ( ) ;
}
i f ( STACK(2)c . t o p ( ) = 1 2 7 ) {
STACK(2)c . pop ( ) ;
j ←STACK(2)i . t o p ( ) ;
STACK(2)i . pop ( ) ;
z←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
v 10 ←v 00 ;
v 1(2)0 ←z (2) ;
z (2) ← 0 . ;
v 0(2)0+←v 1(2)0 ;
z (1) ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
(1)0
v 10 ←v 0 ;
(1)
v 1(2)0 ←z (2) ;
(1)
z (2) ← 0 . ;
(1)0
v (2)0+←v 1(2)0 ;
v 00 ←STACK(2) f . top ( ) ;
STACK(2) f . pop ( ) ;
v 10 ← 0 . ;
v 1(2)0 ←v 0(2)0 ;
v 0(2)0 ← 0 . ;
(1)0
v 0 ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
v 10 ← 0 . ;
(1)0
v 1(2)0 ←v (2)0 ;
(1)0
v (2)0 ← 0 . ;
y←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
v 10 ←v 00 ;
v 1(2)0 ←y (2) ;
y (2) ← 0 . ;
v 0(2)0+←v 1(2)0 ;
y (1) ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
(1)0
v 10 ←v 0 ;
(1)
v 1(2)0 ←y (2) ;
(1)
y (2) ← 0 . ;
(1)0
v (2)0+←v 1(2)0 ;
v 00 ←STACK(2) f . top ( ) ;
STACK(2) f . pop ( ) ;
v 10 ← 0 . ;
v 1(2)0 ←v 0(2)0 ;
v 0(2)0 ← 0 . ;
(1)0
v 0 ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
v 10 ← 0 . ;
(1)0
v 1(2)0 ←v (2)0 ;
(1)0
v (2)0 ← 0 . ;
}
i f ( STACK(2)c . t o p ( ) = 1 6 6 ) {
STACK(2)c . pop ( ) ;
j ←STACK(2)i . t o p ( ) ;
STACK(2)i . pop ( ) ;
z←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
v 10 ←v 03 ;
v 1(2)0 ←z (2) ;
v 0(2)3+←v 1(2)0 ;
z (1) ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
(1)0
v 10 ←v 3 ;
(1)
v 1(2)0 ←z (2) ;
(1)0
v (2)3+←v 1(2)0 ;
v 03 ←STACK(2) f . top ( ) ;
STACK(2) f . pop ( ) ;
v 10 ←v 02 ;
v 11 ← c o s ( v 10 ) ;
v 1(2)1 ←v 0(2)3 ;
v 0(2)3 ← 0 . ;
v 1(2)0 ←v 1(2)1 ∗(0. − s i n ( v 10 ) ) ;
v 0(2)2+←v 1(2)0 ;
(1)0
v 3 ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
(1)0
v 10 ←v 2 ;
v 11 ← 0 . ;
v 12 ←v 02 ;
v 13 ← s i n ( v 12 ) ;
v 14 ←v 11−v 13 ;
v 15 ←v 10 ∗ v 14 ;
(1)0
v 1(2)5 ←v (2)3 ;
(1)0
v (2)3 ← 0 . ;
v 1(2)0 ←v 1(2)5 ∗ v 14 ;
v 1(2)4 ←v 1(2)5 ∗ v 10 ;
(1)0
v (2)2+←v 1(2)0 ;
v 1(2)1 ←v 1(2)4 ;
v 1(2)3 ←0. − v 1(2)4 ;
v 1(2)2 ←v 1(2)3 ∗ c o s ( v 12 ) ;
v 0(2)2+←v 1(2)2 ;
v 02 ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
v 10 ←v 00 ;
v 11 ←v 01 ;
v 12 ←v 10 ∗ v 11 ;
v 1(2)2 ←v 0(2)2 ;
v 0(2)2 ← 0 . ;
v 1(2)0 ←v 1(2)2 ∗ v 11 ;
v 1(2)1 ←v 1(2)2 ∗ v 10 ;
v 0(2)0+←v 1(2)0 ;
v 0(2)1+←v 1(2)1 ;
(1)0
v 2 ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
(1)0
v 10 ←v 0 ;
v 11 ←v 01 ;
v 12 ←v 10 ∗ v 11 ;
v 13 ←v 00 ;
(1)0
v 14 ←v 1 ;
v 15 ←v 13 ∗ v 14 ;
v 16 ←v 12+v 15 ;
(1)0
v 1(2)6 ←v (2)2 ;
(1)0
v (2)2 ← 0 . ;
v 1(2)2 ←v 1(2)6 ;
v 1(2)5 ←v 1(2)6 ;
v 1(2)0 ←v 1(2)2 ∗ v 11 ;
v 1(2)1 ←v 1(2)2 ∗ v 10 ;
(1)0
v (2)0+←v 1(2)0 ;
v 0(2)1+←v 1(2)1 ;
v 1(2)3 ←v 1(2)5 ∗ v 14 ;
v 1(2)4 ←v 1(2)5 ∗ v 13 ;
v 0(2)0+←v 1(2)3 ;
(1)0
v (2)1+←v 1(2)4 ;
v 01 ←STACK(2) f . top ( ) ;
STACK(2) f . pop ( ) ;
v 10 ←Ai∗n+ j ;
v 1(2)0 ←v 0(2)1 ;
v 0(2)1 ← 0 . ;
#pragma omp a t o m i c
(1)0
v 1(2)3 ←v (2)3 ;
(1)0
v (2)3 ← 0 . ;
v 1(2)0 ←v 1(2)3 ∗ v 12 ;
v 1(2)2 ←v 1(2)3 ∗ v 10 ;
(1)0
v (2)2+←v 1(2)0 ;
v 1(2)1 ←v 1(2)2 ∗(0. − s i n ( v 11 ) );
v 0(2)2+←v 1(2)1 ;
v 02 ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
v 10 ←v 00 ;
v 11 ←v 01 ;
v 12 ←v 10 ∗ v 11 ;
v 1(2)2 ←v 0(2)2 ;
v 0(2)2 ← 0 . ;
v 1(2)0 ←v 1(2)2 ∗ v 11 ;
v 1(2)1 ←v 1(2)2 ∗ v 10 ;
v 0(2)0+←v 1(2)0 ;
v 0(2)1+←v 1(2)1 ;
(1)0
v 2 ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
(1)0
v 10 ←v 0 ;
v 11 ←v 01 ;
v 12 ←v 10 ∗ v 11 ;
v 13 ←v 00 ;
(1)0
v 14 ←v 1 ;
v 15 ←v 13 ∗ v 14 ;
v 16 ←v 12+v 15 ;
(1)0
v 1(2)6 ←v (2)2 ;
(1)0
v (2)2 ← 0 . ;
v 1(2)2 ←v 1(2)6 ;
v 1(2)5 ←v 1(2)6 ;
v 1(2)0 ←v 1(2)2 ∗ v 11 ;
v 1(2)1 ←v 1(2)2 ∗ v 10 ;
(1)0
v (2)0+←v 1(2)0 ;
v 0(2)1+←v 1(2)1 ;
v 1(2)3 ←v 1(2)5 ∗ v 14 ;
v 1(2)4 ←v 1(2)5 ∗ v 13 ;
v 0(2)0+←v 1(2)3 ;
(1)0
v (2)1+←v 1(2)4 ;
v 01 ←STACK(2) f . top ( ) ;
STACK(2) f . pop ( ) ;
v 10 ←Ai∗n+ j ;
v 1(2)0 ←v 0(2)1 ;
v 0(2)1 ← 0 . ;
#pragma omp a t o m i c
A(2)i∗n+ j+←v 1(2)0 ;
(1)0
v 1 ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
(1)
v 10 ←Ai∗n+ j ;
(1)0
v 1(2)0 ←v (2)1 ;
(1)0
v (2)1 ← 0 . ;
#pragma omp a t o m i c
(1)
A(2)i∗n+ j+←v 1(2)0 ;
v 00 ←STACK(2) f . top ( ) ;
STACK(2) f . pop ( ) ;
v 10 ←Ai∗n+ j ;
v 1(2)0 ←v 0(2)0 ;
v 0(2)0 ← 0 . ;
#pragma omp a t o m i c
A(2)i∗n+ j+←v 1(2)0 ;
(1)0
v 0 ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
(1)
v 10 ←Ai∗n+ j ;
(1)0
v 1(2)0 ←v (2)0 ;
(1)0
v (2)0 ← 0 . ;
#pragma omp a t o m i c
(1)
A(2)i∗n+ j+←v 1(2)0 ;
}
i f ( STACK(2)c . t o p ( ) = 3 1 6 ) {
STACK(2)c . pop ( ) ;
i ←STACK(2)i . t o p ( ) ;
STACK(2)i . pop ( ) ;
t h r e a d _ r e s u l t tid ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
v 10 ←v 02 ;
v 1(2)0 ← t h r e a d _ r e s u l t (2)tid ;
v 0(2)2+←v 1(2)0 ;
(1)
t h r e a d _ r e s u l t tid ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
(1)0
v 10 ←v 2 ;
(1)
v 1(2)0 ← t h r e a d _ r e s u l t (2)tid ;
(1)0 1
v (2)2+←v (2)0 ;
v 02 ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
v 10 ←v 00 ;
v 11 ←v 01 ;
v 12 ←v 10 ∗ v 11 ;
v 1(2)2 ←v 0(2)2 ;
v 0(2)2 ← 0 . ;
v 1(2)0 ←v 1(2)2 ∗ v 11 ;
v 1(2)1 ←v 1(2)2 ∗ v 10 ;
v 0(2)0+←v 1(2)0 ;
v 0(2)1+←v 1(2)1 ;
(1)0
v 2 ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
(1)0
v 10 ←v 0 ;
v 11 ←v 01 ;
v 12 ←v 10 ∗ v 11 ;
v 13 ←v 00 ;
(1)0
v 14 ←v 1 ;
v 15 ←v 13 ∗ v 14 ;
v 16 ←v 12+v 15 ;
(1)0
v 1(2)6 ←v (2)2 ;
(1)0
v (2)2 ← 0 . ;
v 1(2)2 ←v 1(2)6 ;
v 1(2)5 ←v 1(2)6 ;
v 1(2)0 ←v 1(2)2 ∗ v 11 ;
v 1(2)1 ←v 1(2)2 ∗ v 10 ;
(1)0
v (2)0+←v 1(2)0 ;
v 0(2)1+←v 1(2)1 ;
v 1(2)3 ←v 1(2)5 ∗ v 14 ;
v 1(2)4 ←v 1(2)5 ∗ v 13 ;
v 0(2)0+←v 1(2)3 ;
(1)0
v (2)1+←v 1(2)4 ;
v 01 ←STACK(2) f . top ( ) ;
STACK(2) f . pop ( ) ;
v 10 ←z ;
v 1(2)0 ←v 0(2)1 ;
v 0(2)1 ← 0 . ;
z (2)+←v 1(2)0 ;
(1)0
v 1 ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
v 10 ←z (1) ;
(1)0
v 1(2)0 ←v (2)1 ;
(1)0
v (2)1 ← 0 . ;
(1)
z (2)+←v 1(2)0 ;
v 00 ←STACK(2) f . top ( ) ;
STACK(2) f . pop ( ) ;
v 10 ←y ;
v 1(2)0 ←v 0(2)0 ;
v 0(2)0 ← 0 . ;
y (2)+←v 1(2)0 ;
(1)0
v 0 ←STACK(2) f . t o p ( ) ;
STACK(2) f . pop ( ) ;
v 10 ←y (1) ;
(1)0
v 1(2)0 ←v (2)0 ;
(1)0
v (2)0 ← 0 . ;
(1)
y (2)+←v 1(2)0 ;
}
dummy ( " " , ( v o i d ∗ ) t h r e a d _ r e s u l t , ( v o i d ∗ ) t h r e a d _ r e s u l t (1) , ( v o i d ∗ )
(1)
t h r e a d _ r e s u l t (2) , ( v o i d ∗ ) t h r e a d _ r e s u l t (2) ) ;
}
}
Listing B.6: This code represents the second-order adjoint code of the ‘plain-parallel’
test case (Listing B.1) in reverse-over-forward mode. It is obtained by applying first the
forward mode and then the reverse mode, in other words by applying the source
transformation σ(τ(P)).
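For comparison, a minimal C sketch of the opposite composition σ(τ(P)) for the same scalar function f(x) = sin(x): the two tangent-linear statements are differentiated in reverse mode. The suffixes _t and _a are again conventions of this sketch only; with the seeding shown in main, the same second derivative −sin(x) is obtained.

#include <math.h>
#include <stdio.h>

/* Minimal reverse-over-forward sketch for f(x) = sin(x): the tangent-linear
 * statements y = sin(x); y_t = cos(x)*x_t; are differentiated in reverse
 * mode. x_a and x_t_a are the adjoints of x and x_t (names assumed). */
void sin_rev_over_fwd(double x, double x_t, double y_a, double y_t_a,
                      double *x_a, double *x_t_a) {
  /* adjoint of  y_t = cos(x) * x_t  */
  *x_t_a += cos(x) * y_t_a;
  *x_a += (-sin(x)) * x_t * y_t_a;
  /* adjoint of  y = sin(x)  */
  *x_a += cos(x) * y_a;
}

int main(void) {
  double x = 0.7, x_a = 0., x_t_a = 0.;
  /* seeding x_t = 1, y_t_a = 1, y_a = 0 again delivers f''(x) = -sin(x), now in x_a */
  sin_rev_over_fwd(x, 1., 0., 1., &x_a, &x_t_a);
  printf("f''(x) = %f, expected %f\n", x_a, -sin(x));
  return 0;
}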
tid ← omp_get_thread_num();
p ← omp_get_num_threads();
c ← n/p;
lb ← tid∗c;
ub ← (tid+1)∗c−1;
i ← lb;
z ← 0.;
while (i ≤ ub) {
  j ← 0;
  y ← 0.;
  while (j < n) {
    y +← A_{i∗n+j} ∗ x_j;
    j ← j+1;
  }
  z +← y;
  i ← i+1;
}
thread_result_tid ← z;
#pragma omp barrier
if (tid = 0) {
  i ← 0;
  y ← 0.;
  while (i < p) {
    y +← (thread_result_i);
    i ← i+1;
  }
  thread_result_0 ← y;
  dummy("", (void∗) thread_result);
}
}
Listing B.7: This code is the original code for the test case ‘barrier’.
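The following compilable C/OpenMP rendering of the ‘barrier’ test case is a sketch added for orientation; the array names A, x, and thread_result follow the listing, while the function signature is an assumption of this sketch and the dummy call is omitted.

#include <omp.h>

/* C/OpenMP rendering of the 'barrier' test case: each thread sums its chunk
 * of rows of A*x, and after the barrier thread 0 reduces the partial sums. */
void barrier_testcase(int n, const double *A, const double *x,
                      double *thread_result) {
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();
    int p = omp_get_num_threads();
    int c = n / p;
    int lb = tid * c;
    int ub = (tid + 1) * c - 1;
    double z = 0.;
    for (int i = lb; i <= ub; ++i) {      /* this thread's chunk of rows */
      double y = 0.;
      for (int j = 0; j < n; ++j)
        y += A[i * n + j] * x[j];
      z += y;
    }
    thread_result[tid] = z;
    #pragma omp barrier                   /* all partial sums are written */
    if (tid == 0) {                       /* thread 0 reduces the partial sums */
      double y = 0.;
      for (int i = 0; i < p; ++i)
        y += thread_result[i];
      thread_result[0] = y;
    }
  }
}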
t i d ←omp_get_thread_num ( ) ;
p←omp_get_num_threads ( ) ;
c←n / p ;
l b ← t i d ∗c ;
ub← ( t i d +1)∗c −1;
i←lb ;
(1)0
v0 ← 0 . ;
v 00 ← 0 . ;
(1)0
z (1) ←v 0 ;
z←v 00 ;
w h i l e ( i ≤ ub ) {
j←0;
(1)0
v0 ← 0 . ;
v 00 ← 0 . ;
(1)0
y (1) ←v 0 ;
y←v 00 ;
w h i l e ( j <n ) {
(1)0 (1)
v 0 ←Ai∗n+ j ;
v 00 ←Ai∗n+ j ;
(1)0 (1)
v 1 ←x j ;
v 01 ←x j ;
(1)0 (1)0 (1)0
v 2 ←v 0 ∗ v 01+v 00 ∗ v 1 ;
v 02 ←v 00 ∗ v 01 ;
(1)0
y (1)+←v 2 ;
y+←v 02 ;
j ← j +1;
}
(1)0
v 0 ←y (1) ;
v 00 ←y ;
(1)0
z (1)+←v 0 ;
z+←v 00 ;
i ← i +1;
}
(1)0
v 0 ←z (1) ;
v 00 ←z ;
(1) (1)0
t h r e a d _ r e s u l t tid ←v 0 ;
t h r e a d _ r e s u l t tid ←v 00 ;
#pragma omp b a r r i e r
i f ( t i d = 0) {
i←0;
(1)0
v0 ← 0 . ;
v 00 ← 0 . ;
(1)0
y (1) ←v 0 ;
y←v 00 ;
w h i l e ( i <p ) {
(1)0 (1)
v0 ← t h r e a d _ r e s u l t i ;
v 00 ← t h r e a d _ r e s u l t i ;
(1)0
y (1)+←v 0 ;
y+←v 00 ;
i ← i +1;
}
(1)0
v 0 ←y (1) ;
v 00 ←y ;
(1) (1)0
t h r e a d _ r e s u l t 0 ←v 0 ;
t h r e a d _ r e s u l t 0 ←v 00 ;
dummy ( " " , ( v o i d ∗ ) t h r e a d _ r e s u l t , ( v o i d ∗ ) t h r e a d _ r e s u l t (1) ) ;
}
}
Listing B.8: This listing is the first-order tangent-linear code of the test case ‘barrier’
(Listing B.7).
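The innermost statement y +← A_{i∗n+j} ∗ x_j of the test case dominates the structure of the tangent-linear code above. A small C sketch of its tangent-linear form follows; the _t suffix for tangents is a convention of this sketch, not the notation of the listing.

/* Tangent-linear form of the statement  y +← A_{i*n+j} * x_j  as it recurs in
 * Listing B.8: the tangent statement (product rule) precedes the value
 * statement. */
static void inner_tangent(int n, int i, int j,
                          const double *A, const double *A_t,
                          const double *x, const double *x_t,
                          double *y, double *y_t) {
  *y_t += A_t[i * n + j] * x[j] + A[i * n + j] * x_t[j];
  *y += A[i * n + j] * x[j];
}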
unsigned i n t p ;
unsigned i n t c ;
unsigned i n t i ;
unsigned i n t j ;
unsigned i n t lb ;
u n s i g n e d i n t ub ;
d o u b l e v 00 ;
d o u b l e v 0(1)0 ;
d o u b l e v 01 ;
d o u b l e v 0(1)1 ;
d o u b l e v 02 ;
d o u b l e v 0(1)2 ;
STACK(1)c . p u s h ( 3 4 ) ;
STACK(1)i . p u s h ( t i d ) ;
t i d ←omp_get_thread_num ( ) ;
STACK(1)i . p u s h ( p ) ;
p←omp_get_num_threads ( ) ;
STACK(1)i . p u s h ( c ) ;
c←n / p ;
STACK(1)i . p u s h ( l b ) ;
l b ← t i d ∗c ;
STACK(1)i . p u s h ( ub ) ;
ub← ( t i d +1)∗c −1;
STACK(1)i . p u s h ( i ) ;
i←lb ;
STACK(1) f . p u s h ( z ) ;
z← 0 . ;
w h i l e ( i ≤ ub ) {
STACK(1)c . p u s h ( 7 0 ) ;
STACK(1)i . p u s h ( j ) ;
j←0;
STACK(1) f . p u s h ( y ) ;
y← 0 . ;
w h i l e ( j <n ) {
STACK(1)c . p u s h ( 9 2 ) ;
STACK(1) f . p u s h ( y ) ;
y+←Ai∗n+ j ∗ x j ;
STACK(1)i . p u s h ( j ) ;
j ← j +1;
}
STACK(1)c . p u s h ( 1 0 2 ) ;
STACK(1) f . p u s h ( z ) ;
z+←y ;
STACK(1)i . p u s h ( i ) ;
i ← i +1;
}
STACK(1)c . p u s h ( 1 1 3 ) ;
STACK(1) f . p u s h ( t h r e a d _ r e s u l t tid ) ;
t h r e a d _ r e s u l t tid ←z ;
#pragma omp b a r r i e r
i f ( t i d = 0) {
STACK(1)c . p u s h ( 1 2 0 ) ;
STACK(1)i . p u s h ( i ) ;
i←0;
STACK(1) f . p u s h ( y ) ;
y← 0 . ;
w h i l e ( i <p ) {
STACK(1)c . p u s h ( 1 3 3 ) ;
STACK(1) f . p u s h ( y ) ;
y+← ( t h r e a d _ r e s u l t i ) ;
STACK(1)i . p u s h ( i ) ;
i ← i +1;
}
STACK(1)c . p u s h ( 1 4 5 ) ;
STACK(1) f . p u s h ( t h r e a d _ r e s u l t 0 ) ;
t h r e a d _ r e s u l t 0 ←y ;
dummy ( " " , ( v o i d ∗ ) t h r e a d _ r e s u l t ) ;
}
w h i l e ( n o t STACK(1)c . empty ( ) ) {
i f ( STACK(1)c . t o p ( ) = 3 4 ) {
STACK(1)c . pop ( ) ;
z←STACK(1) f . t o p ( ) ;
STACK(1) f . pop ( ) ;
v 00 ← 0 . ;
v 0(1)0 ←z (1) ;
z (1) ← 0 . ;
i ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
ub←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
l b ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
c←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
p←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
t i d ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
}
i f ( STACK(1)c . t o p ( ) = 7 0 ) {
STACK(1)c . pop ( ) ;
y←STACK(1) f . t o p ( ) ;
STACK(1) f . pop ( ) ;
v 00 ← 0 . ;
v 0(1)0 ←y (1) ;
y (1) ← 0 . ;
j ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
}
i f ( STACK(1)c . t o p ( ) = 9 2 ) {
STACK(1)c . pop ( ) ;
j ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
y←STACK(1) f . t o p ( ) ;
STACK(1) f . pop ( ) ;
v 00 ←Ai∗n+ j ;
v 01 ←x j ;
v 02 ←v 00 ∗ v 01 ;
v 0(1)2 ←y (1) ;
v 0(1)0 ←v 0(1)2 ∗ v 01 ;
v 0(1)1 ←v 0(1)2 ∗ v 00 ;
A(1)i∗n+ j+←v 0(1)0 ;
#pragma omp a t o m i c
x (1) j+←v 0(1)1 ;
}
i f ( STACK(1)c . t o p ( ) = 1 0 2 ) {
STACK(1)c . pop ( ) ;
i ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
z←STACK(1) f . t o p ( ) ;
STACK(1) f . pop ( ) ;
v 00 ←y ;
v 0(1)0 ←z (1) ;
y (1)+←v 0(1)0 ;
}
i f ( STACK(1)c . t o p ( ) = 1 1 3 ) {
STACK(1)c . pop ( ) ;
#pragma omp b a r r i e r
t h r e a d _ r e s u l t tid ←STACK(1) f . t o p ( ) ;
STACK(1) f . pop ( ) ;
v 00 ←z ;
v 0(1)0 ← t h r e a d _ r e s u l t (1)tid ;
t h r e a d _ r e s u l t (1)tid ← 0 . ;
z (1)+←v 0(1)0 ;
}
i f ( STACK(1)c . t o p ( ) = 1 2 0 ) {
STACK(1)c . pop ( ) ;
y←STACK(1) f . t o p ( ) ;
STACK(1) f . pop ( ) ;
v 00 ← 0 . ;
v 0(1)0 ←y (1) ;
y (1) ← 0 . ;
i ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
}
i f ( STACK(1)c . t o p ( ) = 1 3 3 ) {
STACK(1)c . pop ( ) ;
i ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
y←STACK(1) f . t o p ( ) ;
STACK(1) f . pop ( ) ;
v 00 ← t h r e a d _ r e s u l t i ;
v 0(1)0 ←y (1) ;
#pragma omp a t o m i c
t h r e a d _ r e s u l t (1)i+←v 0(1)0 ;
}
i f ( STACK(1)c . t o p ( ) = 1 4 5 ) {
STACK(1)c . pop ( ) ;
t h r e a d _ r e s u l t 0 ←STACK(1) f . t o p ( ) ;
STACK(1) f . pop ( ) ;
v 00 ←y ;
v 0(1)0 ← t h r e a d _ r e s u l t (1)0 ;
t h r e a d _ r e s u l t (1)0 ← 0 . ;
y (1)+←v 0(1)0 ;
}
dummy ( " " , ( v o i d ∗ ) t h r e a d _ r e s u l t , ( v o i d ∗ ) t h r e a d _ r e s u l t (1) ) ;
}
}
Listing B.9: This code represents the adjoint code of the test case ‘barrier’ (Listing B.7).
The exclusive read analysis recognizes that the atomic construct in line 120 can be
omitted, whereas the one in line 122 is necessary. The atomic construct in line 163 is
not recognized as an exclusive access because the analysis does not detect that this code
is only executed by the master thread.
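The reverse-section counterpart of the same innermost statement can be sketched in C as follows; as stated in the caption, the adjoint of x needs an atomic update because several threads read the same x_j, whereas the adjoint of A does not (exclusive read). The _a suffix for adjoints is a convention of this sketch.

/* Reverse-section counterpart of  y +← A_{i*n+j} * x_j  (cf. Listing B.9). */
static void inner_adjoint(int n, int i, int j,
                          const double *A, double *A_a,
                          const double *x, double *x_a,
                          double y_a) {
  A_a[i * n + j] += y_a * x[j];     /* exclusive read: no atomic needed */
  #pragma omp atomic
  x_a[j] += y_a * A[i * n + j];     /* x_j is read by all threads       */
}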
tid ← omp_get_thread_num();
p ← omp_get_num_threads();
c ← n/p;
lb ← tid∗c;
ub ← (tid+1)∗c−1;
i ← lb;
y ← 1.;
while (i ≤ ub) {
  j ← 0;
  while (j < n) {
    y ← y ∗ sin(A_{i∗n+j} ∗ x_j) ∗ cos(A_{i∗n+j} ∗ x_j);
    j ← j+1;
  }
  i ← i+1;
}
thread_result_tid ← y;
#pragma omp barrier
#pragma omp master
{
  i ← 1;
  while (i < p) {
    thread_result_0 ← thread_result_0 ∗ thread_result_i;
    i ← i+1;
  }
  dummy("", (void∗) thread_result);
}
}
Listing B.10: This code is the original code for the test case ‘master’.
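A compilable C/OpenMP sketch of the ‘master’ test case, under the same assumptions as the sketch for the ‘barrier’ case (the signature is illustrative and the dummy call is omitted):

#include <math.h>
#include <omp.h>

/* C/OpenMP rendering of the 'master' test case. */
void master_testcase(int n, const double *A, const double *x,
                     double *thread_result) {
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();
    int p = omp_get_num_threads();
    int c = n / p;
    int lb = tid * c;
    int ub = (tid + 1) * c - 1;
    double y = 1.;
    for (int i = lb; i <= ub; ++i)
      for (int j = 0; j < n; ++j)
        y = y * sin(A[i * n + j] * x[j]) * cos(A[i * n + j] * x[j]);
    thread_result[tid] = y;
    #pragma omp barrier      /* all partial products must be written first */
    #pragma omp master       /* only the master thread combines them       */
    {
      for (int i = 1; i < p; ++i)
        thread_result[0] = thread_result[0] * thread_result[i];
    }
  }
}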
t i d ←omp_get_thread_num ( ) ;
p←omp_get_num_threads ( ) ;
c←n / p ;
l b ← t i d ∗c ;
ub← ( t i d +1)∗c −1;
i←lb ;
(1)0
v0 ← 0 . ;
v 00 ← 1 . ;
(1)0
y (1) ←v 0 ;
y←v 00 ;
w h i l e ( i ≤ ub ) {
j←0;
w h i l e ( j <n ) {
(1)0
v 0 ←y (1) ;
v 00 ←y ;
(1)0 (1)
v 1 ←Ai∗n+ j ;
v 01 ←Ai∗n+ j ;
(1)0 (1)
v 2 ←x j ;
v 02 ←x j ;
(1)0 (1)0 (1)0
v 3 ←v 1 ∗ v 02+v 01 ∗ v 2 ;
v 03 ←v 01 ∗ v 02 ;
(1)0 (1)0
v 4 ←v 3 ∗ c o s ( v 03 ) ;
v 04 ← s i n ( v 03 ) ;
(1)0 (1)0 (1)0
v 5 ←v 0 ∗ v 04+v 00 ∗ v 4 ;
v 05 ←v 00 ∗ v 04 ;
(1)0 (1)
v 6 ←Ai∗n+ j ;
v 06 ←Ai∗n+ j ;
(1)0 (1)
v 7 ←x j ;
v 07 ←x j ;
(1)0 (1)0 (1)0
v 8 ←v 6 ∗ v 07+v 06 ∗ v 7 ;
v 08 ←v 06 ∗ v 07 ;
(1)0 (1)0
v 9 ←v 8 ∗(0. − s i n ( v 08 ) ) ;
v 09 ← c o s ( v 08 ) ;
(1)0 (1)0
v (1)010 ←v 5 ∗ v 09+v 05 ∗ v 9 ;
v 010 ←v 05 ∗ v 09 ;
y (1) ←v (1)010 ;
y←v 010 ;
j ← j +1;
}
i ← i +1;
}
(1)0
v 0 ←y (1) ;
v 00 ←y ;
(1) (1)0
t h r e a d _ r e s u l t tid ←v 0 ;
t h r e a d _ r e s u l t tid ←v 00 ;
#pragma omp b a r r i e r
#pragma omp m a s t e r
{
i←1;
w h i l e ( i <p ) {
(1)0 (1)
v0 ← t h r e a d _ r e s u l t 0 ;
v 00 ← t h r e a d _ r e s u l t 0 ;
(1)0 (1)
v1 ← t h r e a d _ r e s u l t i ;
v 01 ← t h r e a d _ r e s u l t i ;
(1)0 (1)0 (1)0
v 2 ←v 0 ∗ v 01+v 00 ∗ v 1 ;
v 02 ←v 00 ∗ v 01 ;
(1) (1)0
t h r e a d _ r e s u l t 0 ←v 2 ;
t h r e a d _ r e s u l t 0 ←v 02 ;
i ← i +1;
}
dummy ( " " , ( v o i d ∗ ) t h r e a d _ r e s u l t , ( v o i d ∗ ) t h r e a d _ r e s u l t (1) ) ;
}
}
Listing B.11: This listing shows the first-order tangent-linear code for the test case
‘master’ (Listing B.10).
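The loop body y ← y ∗ sin(u) ∗ cos(u) with u = A_{i∗n+j} ∗ x_j drives the tangent-linear code above. A small C sketch of its tangent-linear form, applying the product and chain rules statement by statement as in Listing B.11; the _t suffix marks tangents (a convention of this sketch).

#include <math.h>

/* Tangent-linear form of  y ← y * sin(u) * cos(u)  from the 'master' test case. */
static void body_tangent(double u, double u_t, double *y, double *y_t) {
  double s = sin(u), c = cos(u);
  double s_t = cos(u) * u_t;       /* tangent of sin(u) */
  double c_t = -sin(u) * u_t;      /* tangent of cos(u) */
  *y_t = (*y_t) * s * c + (*y) * s_t * c + (*y) * s * c_t;
  *y = (*y) * s * c;
}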
d o u b l e v 0(1)9 ;
d o u b l e v 010 ;
d o u b l e v (1) 010 ;
STACK(1)c . p u s h ( 5 4 ) ;
STACK(1)i . p u s h ( t i d ) ;
t i d ←omp_get_thread_num ( ) ;
STACK(1)i . p u s h ( p ) ;
p←omp_get_num_threads ( ) ;
STACK(1)i . p u s h ( c ) ;
c←n / p ;
STACK(1)i . p u s h ( l b ) ;
l b ← t i d ∗c ;
STACK(1)i . p u s h ( ub ) ;
ub← ( t i d +1)∗c −1;
STACK(1)i . p u s h ( i ) ;
i←lb ;
STACK(1) f . p u s h ( y ) ;
y← 1 . ;
w h i l e ( i ≤ ub ) {
STACK(1)c . p u s h ( 9 0 ) ;
STACK(1)i . p u s h ( j ) ;
j←0;
w h i l e ( j <n ) {
STACK(1)c . p u s h ( 1 2 7 ) ;
STACK(1) f . p u s h ( y ) ;
y←y ∗ s i n (Ai∗n+ j ∗ x j ) ∗ c o s (Ai∗n+ j ∗ x j ) ;
STACK(1)i . p u s h ( j ) ;
j ← j +1;
}
STACK(1)c . p u s h ( 1 3 9 ) ;
STACK(1)i . p u s h ( i ) ;
i ← i +1;
}
STACK(1)c . p u s h ( 1 4 5 ) ;
STACK(1) f . p u s h ( t h r e a d _ r e s u l t tid ) ;
t h r e a d _ r e s u l t tid ←y ;
#pragma omp b a r r i e r
#pragma omp m a s t e r
{
STACK(1)c . p u s h ( 1 4 9 ) ;
STACK(1)i . p u s h ( i ) ;
i←1;
w h i l e ( i <p ) {
STACK(1)c . p u s h ( 1 6 4 ) ;
STACK(1) f . p u s h ( t h r e a d _ r e s u l t 0 ) ;
t h r e a d _ r e s u l t 0← t h r e a d _ r e s u l t 0 ∗ t h r e a d _ r e s u l t i ;
STACK(1)i . p u s h ( i ) ;
i ← i +1;
}
dummy ( " " , ( v o i d ∗ ) t h r e a d _ r e s u l t ) ;
}
w h i l e ( n o t STACK(1)c . empty ( ) ) {
i f ( STACK(1)c . t o p ( ) = 5 4 ) {
STACK(1)c . pop ( ) ;
y←STACK(1) f . t o p ( ) ;
STACK(1) f . pop ( ) ;
v 00 ← 1 . ;
v 0(1)0 ←y (1) ;
y (1) ← 0 . ;
i ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
ub←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
l b ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
c←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
p←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
t i d ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
}
i f ( STACK(1)c . t o p ( ) = 9 0 ) {
STACK(1)c . pop ( ) ;
j ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
}
i f ( STACK(1)c . t o p ( ) = 1 2 7 ) {
STACK(1)c . pop ( ) ;
j ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
y←STACK(1) f . t o p ( ) ;
STACK(1) f . pop ( ) ;
v 00 ←y ;
v 01 ←Ai∗n+ j ;
v 02 ←x j ;
v 03 ←v 01 ∗ v 02 ;
v 04 ← s i n ( v 03 ) ;
v 05 ←v 00 ∗ v 04 ;
v 06 ←Ai∗n+ j ;
v 07 ←x j ;
v 08 ←v 06 ∗ v 07 ;
v 09 ← c o s ( v 08 ) ;
v 010 ←v 05 ∗ v 09 ;
v (1) 010 ←y (1) ;
y (1) ← 0 . ;
v 0(1)5 ←v (1) 010 ∗ v 09 ;
v 0(1)9 ←v (1) 010 ∗ v 05 ;
v 0(1)0 ←v 0(1)5 ∗ v 04 ;
v 0(1)4 ←v 0(1)5 ∗ v 00 ;
y (1)+←v 0(1)0 ;
v 0(1)3 ←v 0(1)4 ∗ c o s ( v 03 ) ;
v 0(1)1 ←v 0(1)3 ∗ v 02 ;
v 0(1)2 ←v 0(1)3 ∗ v 01 ;
#pragma omp a t o m i c
A(1)i∗n+ j+←v 0(1)1 ;
#pragma omp a t o m i c
x (1) j+←v 0(1)2 ;
v 0(1)8 ←v 0(1)9 ∗(0. − s i n ( v 08 ) ) ;
v 0(1)6 ←v 0(1)8 ∗ v 07 ;
v 0(1)7 ←v 0(1)8 ∗ v 06 ;
#pragma omp a t o m i c
A(1)i∗n+ j+←v 0(1)6 ;
#pragma omp a t o m i c
x (1) j+←v 0(1)7 ;
}
i f ( STACK(1)c . t o p ( ) = 1 3 9 ) {
STACK(1)c . pop ( ) ;
i ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
}
i f ( STACK(1)c . t o p ( ) = 1 4 5 ) {
STACK(1)c . pop ( ) ;
#pragma omp b a r r i e r
t h r e a d _ r e s u l t tid ←STACK(1) f . t o p ( ) ;
STACK(1) f . pop ( ) ;
v 00 ←y ;
v 0(1)0 ← t h r e a d _ r e s u l t (1)tid ;
t h r e a d _ r e s u l t (1)tid ← 0 . ;
y (1)+←v 0(1)0 ;
}
i f ( STACK(1)c . t o p ( ) = 1 4 9 ) {
STACK(1)c . pop ( ) ;
i ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
}
i f ( STACK(1)c . t o p ( ) = 1 6 4 ) {
STACK(1)c . pop ( ) ;
i ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
t h r e a d _ r e s u l t 0 ←STACK(1) f . t o p ( ) ;
STACK(1) f . pop ( ) ;
v 00 ← t h r e a d _ r e s u l t 0 ;
v 01 ← t h r e a d _ r e s u l t i ;
v 02 ←v 00 ∗ v 01 ;
v 0(1)2 ← t h r e a d _ r e s u l t (1)0 ;
t h r e a d _ r e s u l t (1)0 ← 0 . ;
v 0(1)0 ←v 0(1)2 ∗ v 01 ;
v 0(1)1 ←v 0(1)2 ∗ v 00 ;
#pragma omp a t o m i c
t h r e a d _ r e s u l t (1)0+←v 0(1)0 ;
#pragma omp a t o m i c
t h r e a d _ r e s u l t (1)i+←v 0(1)1 ;
}
dummy ( " " , ( v o i d ∗ ) t h r e a d _ r e s u l t , ( v o i d ∗ ) t h r e a d _ r e s u l t (1) ) ;
}
}
Listing B.12: This listing presents the first-order adjoint code for the test case ‘master’
(Listing B.10). The exclusive read analysis was not used to obtain static information
about the original code. The atomic constructs in lines 182 and 184 could be omitted
because these assignments are only executed by the master thread; recognizing this is a
possible extension of the exclusive read analysis, which currently does not detect this
fact.
tid ← omp_get_thread_num();
p ← omp_get_num_threads();
c ← n/p;
lb ← tid∗c;
ub ← (tid+1)∗c−1;
i ← lb;
y ← 0.;
while (i ≤ ub) {
  j ← 0;
  while (j < n) {
    y +← A_{i∗n+j} ∗ x_j;
    j ← j+1;
  }
  i ← i+1;
}
#pragma omp critical
{
  thread_result_0 ← sin(thread_result_0) ∗ sin(y);
}
}
Listing B.13: This code is the original code for the test case ‘critical’.
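A compilable C/OpenMP sketch of the ‘critical’ test case (the signature is an assumption of this sketch):

#include <math.h>
#include <omp.h>

/* C/OpenMP rendering of the 'critical' test case. */
void critical_testcase(int n, const double *A, const double *x,
                       double *thread_result) {
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();
    int p = omp_get_num_threads();
    int c = n / p;
    int lb = tid * c;
    int ub = (tid + 1) * c - 1;
    double y = 0.;
    for (int i = lb; i <= ub; ++i)
      for (int j = 0; j < n; ++j)
        y += A[i * n + j] * x[j];
    #pragma omp critical     /* threads fold their local result in one at a time */
    {
      thread_result[0] = sin(thread_result[0]) * sin(y);
    }
  }
}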
t i d ←omp_get_thread_num ( ) ;
p←omp_get_num_threads ( ) ;
c←n / p ;
l b ← t i d ∗c ;
ub← ( t i d +1)∗c −1;
i←lb ;
(1)0
v0 ← 0 . ;
v 00 ← 0 . ;
(1)0
y (1) ←v 0 ;
y←v 00 ;
w h i l e ( i ≤ ub ) {
j←0;
w h i l e ( j <n ) {
(1)0 (1)
v 0 ←Ai∗n+ j ;
v 00 ←Ai∗n+ j ;
(1)0 (1)
v 1 ←x j ;
v 01 ←x j ;
(1)0 (1)0 (1)0
v 2 ←v 0 ∗ v 01+v 00 ∗ v 1 ;
v 02 ←v 00 ∗ v 01 ;
(1)0
y (1)+←v 2 ;
y+←v 02 ;
j ← j +1;
}
i ← i +1;
}
#pragma omp c r i t i c a l
{
(1)0 (1)
v0 ← t h r e a d _ r e s u l t 0 ;
v 00 ← t h r e a d _ r e s u l t 0 ;
(1)0 (1)0
v 1 ←v 0 ∗ c o s ( v 00 ) ;
v 01 ← s i n ( v 00 ) ;
(1)0
v 2 ←y (1) ;
v 02 ←y ;
(1)0 (1)0
v 3 ←v 2 ∗ c o s ( v 02 ) ;
v 03 ← s i n ( v 02 ) ;
(1)0 (1)0 (1)0
v 4 ←v 1 ∗ v 03+v 01 ∗ v 3 ;
v 04 ←v 01 ∗ v 03 ;
(1) (1)0
t h r e a d _ r e s u l t 0 ←v 4 ;
t h r e a d _ r e s u l t 0 ←v 04 ;
}
}
Listing B.14: This listing presents the first-order tangent-linear code for the test case
‘critical’ (Listing B.13).
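In the tangent-linear code the critical region remains a critical region, and value and tangent are updated together inside it. A C sketch of this behaviour for the update of thread_result_0; r stands for thread_result_0 and the _t suffix marks tangents (both conventions of this sketch).

#include <math.h>
#include <omp.h>

/* Tangent-linear counterpart of the critical region in the 'critical' test case. */
void critical_tangent(double *r, double *r_t, double y, double y_t) {
  #pragma omp critical
  {
    double v = sin(*r), w = sin(y);
    double v_t = cos(*r) * (*r_t);
    double w_t = cos(y) * y_t;
    *r_t = v_t * w + v * w_t;     /* tangent of the product */
    *r = v * w;
  }
}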
STACK(1)c . p u s h ( 4 6 ) ;
STACK(1)i . p u s h ( t i d ) ;
t i d ←omp_get_thread_num ( ) ;
STACK(1)i . p u s h ( p ) ;
p←omp_get_num_threads ( ) ;
STACK(1)i . p u s h ( c ) ;
c←n / p ;
STACK(1)i . p u s h ( l b ) ;
l b ← t i d ∗c ;
STACK(1)i . p u s h ( ub ) ;
STACK(1)c . pop ( ) ;
j ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
}
i f ( STACK(1)c . t o p ( ) = 1 0 1 ) {
STACK(1)c . pop ( ) ;
j ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
y←STACK(1) f . t o p ( ) ;
STACK(1) f . pop ( ) ;
v 00 ←Ai∗n+ j ;
v 01 ←x j ;
v 02 ←v 00 ∗ v 01 ;
v 0(1)2 ←y (1) ;
v 0(1)0 ←v 0(1)2 ∗ v 01 ;
v 0(1)1 ←v 0(1)2 ∗ v 00 ;
A(1)i∗n+ j+←v 0(1)0 ;
#pragma omp a t o m i c
x (1) j+←v 0(1)1 ;
}
i f ( STACK(1)c . t o p ( ) = 1 1 3 ) {
STACK(1)c . pop ( ) ;
i ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
}
i f ( STACK(1)c . t o p ( ) = 1 2 7 ) {
STACK(1)c . pop ( ) ;
w h i l e ( STACK(1)c . t o p ( ) 6= 1 2 7 ) {
#pragma omp c r i t i c a l
{
i f ( STACK(1)c . t o p ( ) = c r i t i c a l _ c o u n t e r _ 1 0 8 7 −1) {
STACK(1)c . pop ( ) ;
c r i t i c a l _ c o u n t e r _ 1 0 8 7 ← c r i t i c a l _ c o u n t e r _ 1 0 8 7 −1;
w h i l e ( STACK(1)c . t o p ( ) 6= 1 2 7 ) {
i f ( STACK(1)c . t o p ( ) = 1 2 5 ) {
STACK(1)c . pop ( ) ;
t h r e a d _ r e s u l t 0 ←STACK(1) f . t o p ( ) ;
STACK(1) f . pop ( ) ;
v 00 ← t h r e a d _ r e s u l t 0 ;
v 01 ← s i n ( v 00 ) ;
v 02 ←y ;
v 03 ← s i n ( v 02 ) ;
v 04 ←v 01 ∗ v 03 ;
v 0(1)4 ← t h r e a d _ r e s u l t (1)0 ;
t h r e a d _ r e s u l t (1)0 ← 0 . ;
v 0(1)1 ←v 0(1)4 ∗ v 03 ;
v 0(1)3 ←v 0(1)4 ∗ v 01 ;
v 0(1)0 ←v 0(1)1 ∗ c o s ( v 00 ) ;
t h r e a d _ r e s u l t (1)0+←v 0(1)0 ;
v 0(1)2 ←v 0(1)3 ∗ c o s ( v 02 ) ;
y (1)+←v 0(1)2 ;
}
}
}
}
}
STACK(1)c . pop ( ) ;
}
}
}
Listing B.15: This listing presents the first-order adjoint code for the test case ‘critical’
(Listing B.13). The atomic construct in line 102 can be avoided by using the exclusive
read analysis.
tid ← omp_get_thread_num();
p ← omp_get_num_threads();
c ← n/p;
lb ← tid∗c;
ub ← (tid+1)∗c−1;
i ← lb;
y ← 0.;
while (i ≤ ub) {
  j ← 0;
  while (j < n) {
    y +← A_{i∗n+j} ∗ x_j;
    j ← j+1;
  }
  i ← i+1;
}
#pragma omp atomic
thread_result_0 +← y;
}
Listing B.16: This listing shows the original code for the test case ‘atomic’.
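A compilable C/OpenMP sketch of the ‘atomic’ test case (the signature is an assumption of this sketch):

#include <omp.h>

/* C/OpenMP rendering of the 'atomic' test case. */
void atomic_testcase(int n, const double *A, const double *x,
                     double *thread_result) {
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();
    int p = omp_get_num_threads();
    int c = n / p;
    int lb = tid * c;
    int ub = (tid + 1) * c - 1;
    double y = 0.;
    for (int i = lb; i <= ub; ++i)
      for (int j = 0; j < n; ++j)
        y += A[i * n + j] * x[j];
    #pragma omp atomic       /* incremental update of the shared result */
    thread_result[0] += y;
  }
}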
#pragma omp p a r a l l e l
{
int tid←0;
i n t p← 0 ;
i n t c← 0 ;
int lb←0;
i n t ub← 0 ;
int i←0;
int j←0;
double y ;
d o u b l e y (1) ← 0 . ;
d o u b l e v 00 ;
(1)0
d o u b l e v0 ;
d o u b l e v 01 ;
(1)0
d o u b l e v1 ;
d o u b l e v 02 ;
(1)0
d o u b l e v2 ;
t i d ←omp_get_thread_num ( ) ;
p←omp_get_num_threads ( ) ;
c←n / p ;
l b ← t i d ∗c ;
ub← ( t i d +1)∗c −1;
i←lb ;
(1)0
v0 ← 0 . ;
v 00 ← 0 . ;
(1)0
y (1) ←v 0 ;
y←v 00 ;
w h i l e ( i ≤ ub ) {
j←0;
w h i l e ( j <n ) {
(1)0 (1)
v 0 ←Ai∗n+ j ;
v 00 ←Ai∗n+ j ;
(1)0 (1)
v 1 ←x j ;
v 01 ←x j ;
(1)0 (1)0 (1)0
v 2 ←v 0 ∗ v 01+v 00 ∗ v 1 ;
v 02 ←v 00 ∗ v 01 ;
(1)0
y (1)+←v 2 ;
y+←v 02 ;
j ← j +1;
}
i ← i +1;
}
(1)0
v 0 ←y (1) ;
v 00 ←y ;
#pragma omp a t o m i c
(1) (1)0
t h r e a d _ r e s u l t 0 +←v 0 ;
#pragma omp a t o m i c
t h r e a d _ r e s u l t 0+←v 00 ;
}
Listing B.17: This listing presents the first-order tangent-linear code for the test case
‘atomic’ (Listing B.16).
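In the tangent-linear code the atomic update is simply duplicated: the tangent of the shared result is incremented atomically as well. A minimal C sketch, with r standing for thread_result_0 and the _t suffix marking tangents (conventions of this sketch):

#include <omp.h>

/* Tangent-linear counterpart of the atomic update (cf. Listing B.17). */
void atomic_tangent(double *r, double *r_t, double y, double y_t) {
  #pragma omp atomic
  *r_t += y_t;     /* tangent of the shared result */
  #pragma omp atomic
  *r += y;         /* original incremental update  */
}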
#pragma omp m a s t e r
{
atomic_flag_120← 0 ;
}
#pragma omp b a r r i e r
STACK(1)c . p u s h ( 4 6 ) ;
STACK(1)i . p u s h ( t i d ) ;
t i d ←omp_get_thread_num ( ) ;
STACK(1)i . p u s h ( p ) ;
p←omp_get_num_threads ( ) ;
STACK(1)i . p u s h ( c ) ;
c←n / p ;
STACK(1)i . p u s h ( l b ) ;
l b ← t i d ∗c ;
STACK(1)i . p u s h ( ub ) ;
ub← ( t i d +1)∗c −1;
STACK(1)i . p u s h ( i ) ;
i←lb ;
STACK(1) f . p u s h ( y ) ;
y← 0 . ;
w h i l e ( i ≤ ub ) {
STACK(1)c . p u s h ( 8 2 ) ;
STACK(1)i . p u s h ( j ) ;
j←0;
w h i l e ( j <n ) {
STACK(1)c . p u s h ( 1 0 1 ) ;
STACK(1) f . p u s h ( y ) ;
y+←Ai∗n+ j ∗ x j ;
STACK(1)i . p u s h ( j ) ;
j ← j +1;
}
STACK(1)c . p u s h ( 1 1 3 ) ;
STACK(1)i . p u s h ( i ) ;
i ← i +1;
}
STACK(1)c . p u s h ( 1 2 0 ) ;
#pragma omp c r i t i c a l
{
i f ( atomic_flag_120 = 0) {
atomic_flag_120← 1 ;
atomic_storage_120← t h r e a d _ r e s u l t 0 ;
}
t h r e a d _ r e s u l t 0+←y ;
}
#pragma omp b a r r i e r
w h i l e ( n o t STACK(1)c . empty ( ) ) {
i f ( STACK(1)c . t o p ( ) = 4 6 ) {
STACK(1)c . pop ( ) ;
y←STACK(1) f . t o p ( ) ;
STACK(1) f . pop ( ) ;
v 00 ← 0 . ;
v 0(1)0 ←y (1) ;
y (1) ← 0 . ;
i ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
ub←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
l b ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
c←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
p←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
t i d ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
}
i f ( STACK(1)c . t o p ( ) = 8 2 ) {
STACK(1)c . pop ( ) ;
j ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
}
i f ( STACK(1)c . t o p ( ) = 1 0 1 ) {
STACK(1)c . pop ( ) ;
j ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
y←STACK(1) f . t o p ( ) ;
STACK(1) f . pop ( ) ;
v 00 ←Ai∗n+ j ;
v 01 ←x j ;
v 02 ←v 00 ∗ v 01 ;
v 0(1)2 ←y (1) ;
v 0(1)0 ←v 0(1)2 ∗ v 01 ;
v 0(1)1 ←v 0(1)2 ∗ v 00 ;
A(1)i∗n+ j+←v 0(1)0 ;
#pragma omp a t o m i c
x (1) j+←v 0(1)1 ;
}
i f ( STACK(1)c . t o p ( ) = 1 1 3 ) {
STACK(1)c . pop ( ) ;
i ←STACK(1)i . t o p ( ) ;
STACK(1)i . pop ( ) ;
}
i f ( STACK(1)c . t o p ( ) = 1 2 0 ) {
STACK(1)c . pop ( ) ;
v 00 ←y ;
v 0(1)0 ← t h r e a d _ r e s u l t (1)0 ;
y (1)+←v 0(1)0 ;
#pragma omp c r i t i c a l
{
i f ( a t o m i c _ f l a g _ 1 2 0 6= 2 ) {
atomic_flag_120← 2 ;
t h r e a d _ r e s u l t 0← a t o m i c _ s t o r a g e _ 1 2 0 ;
}
}
}
}
}
Listing B.18: This listing shows the first-order adjoint code for the test case ‘atomic’
(Listing B.16). The atomic construct in line 102 can be avoided by using the exclusive
read analysis.
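The listing reverses the atomic increment with a flag/storage pair: inside a critical region the first thread to execute the augmented update saves the old value of the shared variable, and during the reverse section this value is restored exactly once after the adjoints have been propagated. A C sketch of this mechanism under assumed names (flag, storage, r, y_a, r_a); in the generated code there is one flag/storage pair per atomic construct, and the flag is reset before the parallel region.

#include <omp.h>

static double storage;   /* value of r before the concurrent increments  */
static int flag = 0;     /* 0: not yet saved, 1: saved, 2: restored       */

/* augmented forward section of the update  r +← y  */
void augmented_forward(double *r, double y) {
  #pragma omp critical
  {
    if (flag == 0) { flag = 1; storage = *r; }   /* remember old value once */
    *r += y;                                     /* the original update     */
  }
}

/* reverse section: propagate the adjoint, then restore the saved value once */
void reverse_section(double *r, double r_a, double *y_a) {
  *y_a += r_a;     /* adjoint of the incremental assignment r += y */
  #pragma omp critical
  {
    if (flag != 2) { flag = 2; *r = storage; }   /* restore r exactly once */
  }
}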
Bibliography
[1] A. Aho, M. Lam, R. Sethi, and J. Ullman. Compilers. Principles, Techniques, and
Tools (Second Edition). Addison-Wesley Educational Publishers, Incorporated, 2007.
[2] R. Altenfeld, M. Apel, D. an Mey, B. Böttger, S. Benke, and C. Bischof. Parallelis-
ing Computational Microstructure Simulations for Metallic Materials with OpenMP.
2011.
[3] H. Anton, I.C. Bivens, and S. Davis. Calculus Early Transcendentals, 10th Edition
E-Text. Wiley, 2011.
[4] A.W. Appel and M. Ginsburg. Modern Compiler Implementation in C. Cambridge
University Press, 2004.
[5] M. Ben-Ari. Principles of Concurrent and Distributed Programming. Prentice-Hall
International Series in Computer Science. Addison-Wesley, 2006.
[6] C. Bischof, N. Guertler, A. Kowarz, and A. Walther. Parallel Reverse Mode Auto-
matic Differentiation for OpenMP Programs with ADOL-C., pages 163–173. Berlin:
Springer, 2008.
[7] B. Braunschweig and R. Gani. Software Architectures and Tools for Computer Aided
Process Engineering: Computer-Aided Chemical Engineering. Computer Aided
Chemical Engineering. Elsevier Science, 2002.
[8] S.C. Brenner and R. Scott. The Mathematical Theory of Finite Element Methods.
Texts in Applied Mathematics. Springer, 2008.
[9] C. Breshears. The Art of Concurrency - A Thread Monkey’s Guide to Writing Parallel
Applications. O’Reilly, 2009.
[10] M. Bücker, B. Lang, D. an Mey, and C. Bischof. Bringing together automatic differ-
entiation and OpenMP. In ICS ’01: Proceedings of the 15th international conference
on Supercomputing, pages 246–251, New York, 2001. ACM.
[11] M. Bücker, B. Lang, A. Rasch, C. Bischof, and D. an Mey. Explicit Loop Scheduling
in OpenMP for Parallel Automatic Differentiation. High Performance Computing
Systems and Applications, Annual International Symposium on, 0:121, 2002.
[12] M. Bücker, A. Rasch, and A. Wolf. A class of OpenMP applications involving nested
parallelism. In SAC ’04: Proceedings of the 2004 ACM symposium on Applied com-
puting, pages 220–224, New York, 2004. ACM.
[13] D. Butenhof. Programming with POSIX threads. Addison-Wesley Longman Publish-
ing Co., Inc., Boston, MA, USA, 1997.
[44] Khronos OpenCL Working Group. The OpenCL Specification, version 1.0.29, 2008.
[45] B. Knaster. Un théorème sur les fonctions d’ensembles. Annales de la Société Polon-
aise de Mathématiques, 6:133–134, 1928.
[46] D.E. Knuth. The Art of Computer Programming. The Art of Computer Programming:
Seminumerical Algorithms. Addison-Wesley, 2001.
[47] A. Kowarz and A. Walther. Optimal checkpointing for time-stepping procedures in
ADOL-C. In V. Alexandrov, G. van Albada, P. Sloot, and J. Dongarra, editors, Com-
putational Science – ICCS 2006, volume 3994 of Lecture Notes in Computer Science,
pages 541–549, Heidelberg, 2006. Springer.
[48] A. Kowarz and A. Walther. Parallel Derivative Computation using ADOL-C. In
W. Nagel, R. Hoffmann, and A. Koch, editors, 9th Workshop on Parallel Systems
and Algorithms (PASA) held at the 21st Conference on the Architecture of Comput-
ing Systems (ARCS), February 26th, 2008, in Dresden, Germany, volume 124 of LNI,
pages 83–92. GI, 2008.
[49] M. Lange, G. Gorman, M. Weiland, L. Mitchell, and J. Southern. Achieving Effi-
cient Strong Scaling with PETSc using Hybrid MPI/OpenMP Optimisation. CoRR,
abs/1303.5275, 2013.
[50] K. Levenberg. A Method for the Solution of Certain Problems in Least Squares, 1944.
[51] K. Madsen, B. Nielsen, and O. Tingleff. Methods for Non-Linear Least-Squares Prob-
lems, 2004.
[52] Z. Manna and A. Pnueli. The Temporal Logic of Reactive and Concurrent Systems -
Specification. Springer, 1992.
[53] D. Marquardt. An Algorithm for Least Squares Estimation of Nonlinear Parameters,
1963.
[54] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, Ver-
sion 2.1. Specification, 2008.
[55] R. Moona. "Assembly Language Programming In Gnu/Linux For IA32 Architectures".
Prentice-Hall, New Delhi, 2007.
[56] S. Muchnick. Advanced Compiler Design Implementation. Morgan Kaufmann Pub-
lishers, 1997.
[57] U. Naumann. The Art of Differentiating Computer Programs. Software, Environments
and Tools. Society for Industrial and Applied Mathematics, 2012.
[58] U. Naumann, J. Utke, J. Riehme, P. Hovland, and C. Hill. A Framework for Prov-
ing Correctness of Adjoint Message Passing Programs. In Proceedings of EU-
ROPVM/MPI 2008, pages 316–321, 2008.
[59] F. Nielson, H.R. Nielson, and C. Hankin. Principles of Program Analysis. Springer,
1999.
[60] J. Nocedal and S. Wright. Numerical Optimization. Springer, New York, 2nd edition,
2006.
[61] NVIDIA. CUDA Technology. Technical report, 2007.
[62] OpenACC Architecture Review Board. OpenACC Application Program Interface.
Specification, 2011.
[63] OpenMP Architecture Review Board. OpenMP Application Program Interface. Spec-
ification, 2011.
[64] F. Potra and S. Wright. Primal-dual interior-point methods. SIAM, 1997.
[65] W.H. Press. Numerical Recipes in C++: The Art of Scientific Computing. Cambridge
University Press, 2002.
[66] T. Rauber and G. Rünger. Parallel Programming for Multicore and Cluster Systems.
Springer Verlag, 2010.
[67] H. Ricardo. A Modern Introduction to Differential Equations. Elsevier Science, 2009.
[68] H. Rice. Classes of Recursively Enumerable Sets and Their Decision Problems. Trans-
actions of the American Mathematical Society, 74(2):pp. 358–366, 1953.
[69] J. Sanders and E. Kandrot. CUDA by Example: An Introduction to General-Purpose
GPU Programming. Pearson Education, 2010.
[70] M. Schanen, M. Förster, B. Gendler, and U. Naumann. Compiler-based Differentiation
of Higher-Order Numerical Simulation Codes using Interprocedural Checkpointing.
International Journal on Advances in Software, 5(1&2):27–35, 2012.
[71] M. Schanen, U. Naumann, and M. Förster. Second-order adjoint algorithmic differ-
entiation by source transformation of MPI code. In Recent Advances in the Mes-
sage Passing Interface, Lecture Notes in Computer Science, pages 257–264. Springer,
2010.
[72] M. Schanen, U. Naumann, L. Hascoët, and J. Utke. Interpretative adjoints for numer-
ical simulation codes using MPI. In Procedia Computer Science, pages 1819–1827.
Elsevier, 2010.
[73] M. Schwartzbach. Lecture Notes on Static Analysis, 2014 (accessed February 23,
2014). http://www.itu.dk/people/brabrand/UFPE/Data-Flow-Analysis/
static.pdf.
[74] A. Silberschatz, P.B. Galvin, and G. Gagne. Operating System Concepts Essentials.
Wiley, 2010.
[75] A.S. Tanenbaum and A.S. Woodhull. Operating Systems Design and Implementation.
Pearson Education, 2011.
[76] A. Tarski. A lattice-theoretical fixpoint theorem and its applications. 1955.
[77] J. Utke, L. Hascoët, C. Hill, P. Hovland, and U. Naumann. Toward Adjoinable MPI.
In Proceedings of IPDPS 2009, 2009.
[78] A. Walther and A. Griewank. Bounding The Number Of Processes And Checkpoints
Needed In ... , 2001.
[79] R. Wilhelm and D. Maurer. Übersetzerbau: Theorie, Konstruktion, Generierung.
Springer-Lehrbuch. Springer, 1997.
[80] M.J. Wolfe. High Performance Compilers for Parallel Computing. ADDISON WES-
LEY Publishing Company Incorporated, 1996.
[81] S. Wolfram. The Mathematica Book (5. ed.). Wolfram-Media, 2003.
[82] A. Wächter and L. Biegler. On the implementation of an interior-point filter line-
search algorithm for large-scale nonlinear programming. Mathematical Programming,
106(1):25–57, 2006.
[83] D. Zill, W. Wright, and M. Cullen. Differential Equations With Boundary-Value Prob-
lems. Textbooks Available with Cengage Youbook. BROOKS COLE Publishing Com-
pany, 2012.
Index

bounded waiting, 44
chain, 162
closure of a source transformation, 116, 118, 218
closure property, 21, 79
complete lattice, 158
constrained optimization problem, 6
control flow, 74
critical parallel region, 75
critical reference, 36, 63, 64, 66, 67, 75, 76
DAE, 24
DAG, see directed acyclic graph
damped least-squares method, 8
data decomposition, 35
    explicit, 35
    implicit, 38
    semantically equivalent, 38
finite differences, 5, 11
forward mode, 13
forward section, 16, 94, 95
forward-difference, 11
forward-over-forward mode, 18
forward-over-reverse, 20
FPGA, 1
GLB, see greatest lower bound
GPU, see graphic processing units
gradient, 7
graphic processing units, 1
greatest lower bound, 146, 154
Hessian, 8
high performance computing, 1, 29
HPC, see high performance computing
incremental assignment, 82
interior point method, 26
set of possible interleavings, 68
shared memory, 2
shared memory parallel programming, 55
shared variables, 2
simple parallel language, see SPL
single assignment code, see SAC
SMP, 188
SMT, see Satisfiability modulo theories
SPL, 79
split reversal, 114, 223, 225, 244, 257
SPMD, 35, 138, 165
stack, 80
straight-line code, 74
symbolic differentiation, 12
symmetric multiprocessing, see SMP
tangent-linear code, 22
tangent-linear model, 13, 86
true dependence, 198
vector mode, 52