Unit 4
Shared-memory parallel programming with OpenMP
Objectives
A very brief introduction to OpenMP
An application programming interface (API) based mostly on a set of compiler directives
Today's most widely used approach for shared-memory parallel programming
Basic OpenMP programming through examples
(More advanced OpenMP programming will be taught in later chapters.)
First things first
The applicable hardware context: shared memory
All processors can directly access all data in a shared memory; no need for explicit communication between the processors
OpenMP: a parallel programming standard for shared-memory parallel computers
A set of compiler directives (with additional clauses)
A small number of library functions
A few environment variables
Advantages of compiler directives
An OpenMP compiler directive is a "parallelization hint": an OpenMP-capable compiler uses it to automatically create parallel code, whereas a non-OpenMP compiler simply ignores it.
#pragma omp ...
It is possible to maintain the same code for both the serial implementation and the parallel OpenMP implementation.
Allows incremental parallelization, which simplifies the coding effort and makes debugging easier.
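A minimal sketch of this point (file name and messages are illustrative): the same source builds with or without OpenMP, because the directive and the standard _OPENMP macro are simply ignored or undefined in a serial build.

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void) {
  /* The directive is only a hint; a non-OpenMP compiler ignores it
     and the loop runs serially. */
  #pragma omp parallel for
  for (int i = 0; i < 4; i++) {
#ifdef _OPENMP
    printf("iteration %d run by thread %d\n", i, omp_get_thread_num());
#else
    printf("iteration %d (serial build)\n", i);
#endif
  }
  return 0;
}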
Threads in OpenMP
The central execution entities in an OpenMP program are threads (lightweight processes).
The OpenMP threads share a common address space and can mutually access data.
Spawning a thread is much less costly than forking a new process, because threads share everything except the instruction pointer, stack pointer and register state.
If desired, each thread can have some "private variables" (by means of the local stack pointer).
“Fork-join” model
In any OpenMP program, a single thread, the master thread, runs immediately after startup.
The master thread can spawn (also called fork) a number of additional threads when entering a so-called parallel region.
Inside a parallel region, the master thread and the spawned threads execute instruction streams concurrently.
Each thread has a unique ID.
Different threads work on different parts of the shared data or carry out different tasks.
OpenMP has compiler directives for dividing the work among the threads.
At the end of a parallel region, the threads are joined: all threads except the master thread are terminated.
There can be multiple parallel regions in an OpenMP program.
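A minimal sketch of the fork-join pattern (the printed messages are illustrative): serial code runs on the master thread alone, each parallel region forks a team, and execution becomes serial again after the join.

#include <stdio.h>
#include <omp.h>

int main(void) {
  printf("serial part: master thread only\n");

  #pragma omp parallel            /* fork: a team of threads is created */
  {
    printf("first parallel region, thread %d\n", omp_get_thread_num());
  }                               /* join: only the master thread continues */

  printf("serial part again\n");

  #pragma omp parallel            /* a program may contain several parallel regions */
  {
    printf("second parallel region, thread %d\n", omp_get_thread_num());
  }
  return 0;
}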
OpenMP’s parallel execution model
Hello world in OpenMP
#include <stdio.h>
#include <omp.h>
int main (int nargs, char **args)
{
  #pragma omp parallel
  {
    printf("Hello world!\n");
  }
  return 0;
}
Example of compilation: gcc -fopenmp hello_world.c
Parallel execution: ./a.out
What do you get?
Control the number of threads at runtime
Example on a Linux system:
gcc -fopenmp hello_world.c
export OMP_NUM_THREADS=6
./a.out
export OMP_NUM_THREADS=8
./a.out
OMP_NUM_THREADS is an environment variable (understood by
OpenMP)
It is also possible to hard-code the number of threads inside an
OpenMP program:
#pragma omp parallel num_threads(6)
{
// ......
}
But this approach is normally not recommended!
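If the thread count must be chosen inside the program, a somewhat more flexible alternative to the num_threads clause is the library function omp_set_num_threads(), as in the following sketch (the value 6 is just an example; it could also be computed or read in at runtime):

#include <stdio.h>
#include <omp.h>

int main(void) {
  int n = 6;                /* example value, could be determined at runtime */
  omp_set_num_threads(n);   /* affects subsequently created parallel regions */

  #pragma omp parallel
  {
    printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}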
Hello world in OpenMP (a bit more interesting example)
#include <stdio.h>
#include <omp.h>
int main (int nargs, char **args)
{
  printf("I'm the master thread, I'm alone.\n");
  #pragma omp parallel
  {
    int num_threads, thread_id;
    num_threads = omp_get_num_threads();
    thread_id = omp_get_thread_num();
    printf("Hello world! I'm thread No.%d out of %d threads.\n",
           thread_id, num_threads);
  }
  return 0;
}
“Manual” loop parallelization
Now, let’s try to “manually” parallelize a for-loop, that is, divide the
iterations evenly among the threads.
Example:
for (i=0; i<N; i++)
  a[i] = b[i] + c[i];
Important observation: Assuming that the arrays a, b and c do
not overlap (that is, no aliasing), then the iterations of this for-loop
are independent of each other, thus safe to be executed by multiple
threads concurrently.
“Manual” loop parallelization (2)
How to divide the iterations evenly among the threads?
Given num_threads as the total number of threads, one way to
divide N iterations for thread with ID thread_id is as follows:
int blen, bstart;
blen = N/num_threads;
if (thread_id < (N%num_threads)) {
  blen = blen + 1;
  bstart = blen * thread_id;
}
else {
  bstart = blen * thread_id + (N%num_threads);
}
Why is this a fair division?
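Worked example (illustrative numbers): for N = 10 and num_threads = 4, N%num_threads = 2, so threads 0 and 1 get blen = 3 (bstart = 0 and 3), while threads 2 and 3 get blen = 2 (bstart = 6 and 8). The chunks 0-2, 3-5, 6-7 and 8-9 cover all N iterations exactly once, and no thread gets more than one extra iteration.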
“Manual” loop parallelization (3)
OpenMP coding
#pragma omp parallel
{
  int num_threads, thread_id;
  int blen, bstart, bend, i;
  num_threads = omp_get_num_threads();
  thread_id = omp_get_thread_num();
  blen = N/num_threads;
  if (thread_id < (N%num_threads)) {
    blen = blen + 1;
    bstart = blen * thread_id;
  }
  else {
    bstart = blen * thread_id + (N%num_threads);
  }
  bend = bstart + blen;
  for (i=bstart; i<bend; i++)
    a[i] = b[i] + c[i];
} // end of parallel region
Data scoping
Any variables that existed before a parallel region still exist inside
the parallel region, and are by default shared between all threads.
Often it will be necessary for the threads to have some private
variables.
Each thread can either declare new local variables inside the parallel region; these variables are private "by birth".
Or, each thread can "privatize" some of the shared variables that already existed before the parallel region (using the private clause):
int blen, bstart, bend;
#pragma omp parallel private(blen, bstart, bend)
{
// ...
}
Each “privatized” variable has one (uninitialized) instance per
thread;
The private variables exist only until the end of the parallel region.
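A related clause, firstprivate, appears in a later example: like private, it gives every thread its own copy, but the copy is initialized with the value the variable had before the parallel region. A minimal sketch (variable names and values are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
  int a = 1, b = 1;

  #pragma omp parallel private(a) firstprivate(b)
  {
    /* a: private copy, uninitialized inside the region
       b: private copy, starts with the value 1 from outside */
    a = omp_get_thread_num();
    b = b + omp_get_thread_num();
    printf("thread %d: a=%d b=%d\n", omp_get_thread_num(), a, b);
  }

  /* neither clause copies anything back: the outer a and b are still 1 */
  printf("after the region: a=%d b=%d\n", a, b);
  return 0;
}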
Actually, parallelizing a for-loop is easy in OpenMP
Parallelizing for-loops is OpenMP’s main work-sharing
mechanism.
OpenMP has several built-in strategies for dividing the
iterations among the threads.
No need to manually calculate each thread’s loop
bounds
#pragma omp parallel
{
#pragma omp for
  for (i=0; i<N; i++)
    a[i] = b[i] + c[i];
} // end of parallel region
or simply
#pragma omp parallel for
for (i=0; i<N; i++)
  a[i] = b[i] + c[i];
More remarks
The loop cannot contain break, return or exit statements.
The continue statement is allowed.
The index update has to be an increment (or decrement) by a
fixed amount.
The loop index variable is automatically private, and changes
to it inside the loop are not allowed.
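A small sketch of a loop that obeys these rules (array size and contents are illustrative): the index is incremented by a fixed amount, continue is used, and the body never modifies the index.

#include <stdio.h>
#include <omp.h>

#define N 100

int main(void) {
  static double a[N];   /* zero-initialized */
  int i;

  #pragma omp parallel for
  for (i = 0; i < N; i += 2) {   /* fixed increment */
    if (i == 50) continue;       /* continue is allowed ... */
    /* ... but break, return or exit() would not be */
    a[i] = 1.0/(i + 1);
  }
  printf("a[2] = %f\n", a[2]);
  return 0;
}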
Numerical integration
How to numerically calculate ∫_{x0}^{x1} f(x) dx ?
Numerical integration for calculating π
Serial implementation:
int N = 1000000;   /* number of subintervals (example value) */
int i;
double w = 1.0/N, x, approx_pi;
double sum = 0.;
for (i=1; i<=N; i++) {
  x = w*(i-0.5);
  sum = sum + 4.0/(1.0+x*x);
}
approx_pi = w*sum;
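This is the midpoint rule applied to π = ∫_0^1 4/(1+x²) dx: with N subintervals of width w = 1/N and midpoints x_i = (i-0.5)·w, the sum w·Σ_{i=1}^{N} 4/(1+x_i²) approaches π as N grows.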
A naive OpenMP implementation
int N = 1000000, i;   /* example value for N */
double w = 1.0/N, x, approx_pi = 0.;
double sum = 0.;
#pragma omp parallel private(x) firstprivate(sum)
{
  #pragma omp for
  for (i=1; i<=N; i++) {
    x = w*(i-0.5);
    sum = sum + 4.0/(1.0+x*x);
  }
  #pragma omp critical
  {
    approx_pi = approx_pi + w*sum;
  }
} // end of the parallel region
OpenMP critical regions
Concurrent write accesses to a shared variable must be avoided by all means to circumvent race conditions.
An OpenMP critical code block is executed by one thread at a time. This is one way to avoid race conditions. (There are other ways; see the sketch below.)
The variable approx_pi in the above example is a shared variable to which all the threads will write. Thus, "protection" is provided by a critical code block.
Use of the critical directive will incur overhead.
Improper use of the critical directive may lead to deadlock.
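One of the "other ways" is the atomic directive: when the protected code is a single update of one scalar variable, #pragma omp atomic typically has lower overhead than a critical region. A self-contained sketch of the π example rewritten this way (the value of N is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
  const int N = 1000000;        /* example value */
  const double w = 1.0/N;
  double approx_pi = 0.;

  #pragma omp parallel
  {
    double sum = 0., x;         /* private "by birth" */
    int i;
    #pragma omp for
    for (i = 1; i <= N; i++) {
      x = w*(i-0.5);
      sum = sum + 4.0/(1.0+x*x);
    }
    #pragma omp atomic          /* protects the single scalar update */
    approx_pi += w*sum;
  }
  printf("approx_pi = %.12f\n", approx_pi);
  return 0;
}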
Use of OpenMP's reduction clause
Actually, using critical to prevent concurrent writes to the variable approx_pi is overkill. The reduction clause of OpenMP is designed for exactly this purpose:
int N = 1000000, i;   /* example value for N */
double sum = 0.;
double w = 1.0/N, x, approx_pi;
#pragma omp parallel for private(x) reduction(+:sum)
for (i=1; i<=N; i++) {
  x = w*(i-0.5);
  sum = sum + 4.0/(1.0+x*x);
}
approx_pi = w*sum;
Another example of using the reduction clause
#pragma omp parallel for reduction(+:s)
for (i=0; i<N; i++)
  s = s + a[i]*a[i];
Loop scheduling
#pragma omp parallel for
for (i=0; i<N; i++)
  a[i] = b[i] + c[i];
How exactly are the loop iterations divided among the threads?
Mapping of loop iterations to threads is configurable in
OpenMP.
The “secret” is the schedule clause:
#pragma omp parallel for schedule(static|dynamic|guided [,chunk])
Default scheduling is static (no need to specify), which divides the iterations into contiguous chunks of (roughly) equal size.
Other alternatives of scheduling: dynamic and guided
Examples of different schedulers
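A sketch of how the three strategies are requested (N, the chunk size 4 and the helper do_work() are illustrative; compile with -fopenmp and link the math library):

#include <stdio.h>
#include <math.h>
#include <omp.h>

#define N 1000                      /* example size */

/* placeholder for work whose cost varies from iteration to iteration */
static double do_work(int i) { return sqrt((double)i); }

int main(void) {
  static double a[N], b[N], c[N];
  int i;

  /* static (the default): contiguous chunks of roughly equal size */
  #pragma omp parallel for schedule(static)
  for (i = 0; i < N; i++) a[i] = b[i] + c[i];

  /* dynamic: chunks of 4 iterations handed out to threads on demand,
     useful when the cost per iteration varies a lot */
  #pragma omp parallel for schedule(dynamic, 4)
  for (i = 0; i < N; i++) a[i] = do_work(i);

  /* guided: like dynamic, but the chunk size shrinks as the loop progresses */
  #pragma omp parallel for schedule(guided)
  for (i = 0; i < N; i++) a[i] = do_work(i);

  printf("a[N-1] = %f\n", a[N-1]);
  return 0;
}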
Tasking
A task can be defined by OpenMP's task directive, containing the code to be executed.
When a thread encounters a task construct, it may execute it
right away or set up the appropriate data environment and
defer its execution. The task is then ready to be executed later
by any thread of the team.
An example of OpenMP tasks
#pragma omp parallel private(r, i)
{
  #pragma omp single
  {
    for (i=0; i<N; i++) {
      r = rand();   // a randomly generated number
      if (p[i] > r) {
        #pragma omp task
        {
          do_some_work (p[i]);
        }
      } // end of if-test
    } // end of for-loop
  } // end of the single directive
} // end of the parallel region
The actual number of calls to do_some_work is unknown, so
tasking is a natural choice for work division.
single and master
A "single" code block in OpenMP will be entered by one thread only, namely the thread that reaches the single directive first. All the others skip the code and wait at the end of the single block due to an implicit barrier.
A "master" code block is only entered by the master thread; all the other threads skip over it without waiting for the master thread to finish.
OpenMP has a separate "barrier" directive for explicit synchronization among the threads. (Use with care!)
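A minimal sketch contrasting the three directives (variable names and messages are illustrative): single ends with an implicit barrier, while master needs an explicit barrier if the other threads must wait for its result.

#include <stdio.h>
#include <omp.h>

int main(void) {
  int shared_value = 0;

  #pragma omp parallel
  {
    #pragma omp single        /* executed by one (arbitrary) thread */
    {
      shared_value = 42;
    }                         /* implicit barrier at the end of single */

    /* thanks to that barrier, every thread sees the value written above */
    printf("thread %d sees %d\n", omp_get_thread_num(), shared_value);

    #pragma omp barrier       /* wait until all threads have printed */

    #pragma omp master        /* executed by the master thread only, */
    {                         /* the others do NOT wait ...          */
      shared_value = 100;
    }
    #pragma omp barrier       /* ... unless an explicit barrier is added */

    printf("thread %d now sees %d\n", omp_get_thread_num(), shared_value);
  }
  return 0;
}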
Yet another example
int myid, numthreads;
#pragma omp parallel private(myid)
{
  myid = omp_get_thread_num();
  #pragma omp single
  {
    numthreads = omp_get_num_threads();
  }
  #pragma omp critical // not strictly necessary
  {
    printf("This is thread No.%d out of %d threads\n", myid, numthreads);
  }
} // end of the parallel region
Jacobi algorithm
Serial C implementation (slightly different from that of Chapter 3):
/* max() is assumed to be a macro defined elsewhere; fabs() is from <math.h> */
double maxdelta = 1.0, eps = 1.0e-14;
while (maxdelta > eps) {
  maxdelta = 0.;
  for (k=1; k<kmax-1; k++)
    for (i=1; i<imax-1; i++) {
      phi_new[k][i] = (phi[k-1][i]+phi[k][i-1]
                      +phi[k][i+1]+phi[k+1][i])*0.25;
      maxdelta = max(maxdelta, fabs(phi_new[k][i]-phi[k][i]));
    }
  /* pointer swapping */
  temp_ptr = phi_new;
  phi_new = phi;
  phi = temp_ptr;
}
OpenMP-parallel Jacobi algorithm
double maxdelta = 1.0, eps = 1.0e-14;
while (maxdelta > eps) {
  maxdelta = 0.;
  #pragma omp parallel for reduction(max: maxdelta) private(i)
  for (k=1; k<kmax-1; k++)
    for (i=1; i<imax-1; i++) {
      phi_new[k][i] = (phi[k-1][i]+phi[k][i-1]
                      +phi[k][i+1]+phi[k+1][i])*0.25;
      maxdelta = max(maxdelta, fabs(phi_new[k][i]-phi[k][i]));
    }
  /* pointer swapping (serial part, outside the parallel region) */
  temp_ptr = phi_new;
  phi_new = phi;
  phi = temp_ptr;
}
OpenMP-parallel Jacobi algorithm (version 2)
double maxdelta = 1.0, eps = 1.0e-14;
#pragma omp parallel
{
  while (maxdelta > eps) {
    #pragma omp barrier   /* all threads must test the old maxdelta before it is reset */
    #pragma omp single
    {
      maxdelta = 0.;
    }
    #pragma omp for reduction(max: maxdelta) private(i)
    for (k=1; k<kmax-1; k++)
      for (i=1; i<imax-1; i++) {
        phi_new[k][i] = (phi[k-1][i]+phi[k][i-1]
                        +phi[k][i+1]+phi[k+1][i])*0.25;
        maxdelta = max(maxdelta, fabs(phi_new[k][i]-phi[k][i]));
      }
    #pragma omp master
    {
      /* pointer swapping */
      temp_ptr = phi_new;
      phi_new = phi;
      phi = temp_ptr;
    }
    #pragma omp barrier   /* wait until the master thread has swapped the pointers */
  } // end of while loop
} // end of the parallel region
Challenge: Parallelizing 3D Gauss-Seidel algorithm
What if the iterations of a triple loop nest are not entirely independent? (There are loop-carried dependences.)
Example: 3D Gauss-Seidel algorithm (computational core)
for (k=1; k<kmax-1; k++)
  for (j=1; j<jmax-1; j++)
    for (i=1; i<imax-1; i++)
      phi[k][j][i] = (phi[k-1][j][i] + phi[k][j-1][i]
                     +phi[k][j][i-1] + phi[k][j][i+1]
                     +phi[k][j+1][i] + phi[k+1][j][i])/6.0;
We cannot just add #pragma omp parallel for before the k-indexed loop.
Note: The upper limits of k, j and i are different from those given
in Chapter 6 of the textbook.
Wavefront parallelization
Although not as simple as the Jacobi algorithm, it is still possible to
parallelize the Gauss-Seidel algorithm with OpenMP.
The key idea is to find a way of traversing the 3D lattice that
fulfills the dependency constraints imposed by the stencil update.
A wavefront travels in the k direction. The dimension along which to parallelize is j. Each of the threads, T0, T1, ..., T(t-1), is assigned a consecutive chunk of the j indices.
Wavefront parallelization (2)
Wavefront parallelization (3)
Important observations
The k index goes between 1 and kmax-2.
All the j indices 1, 2, ..., jmax-2 are divided evenly into consecutive chunks: J0, J1, ..., J(t-1) (one chunk per thread).
Total number of wavefronts: (kmax-2)+t-1, for computing through the entire 3D lattice
Wavefront W1 has only one block: (k=1, J0)
Wavefront W2 has two concurrent blocks: (k=1, J1) and (k=2, J0)
Wavefront W3 has three concurrent blocks: (k=1, J2), (k=2, J1) and (k=3, J0)
···
Wavefronts Wt, W(t+1), ..., W(kmax-2) each have t concurrent blocks
Wavefronts W(kmax-1), ..., W(kmax-2+t-1) have fewer and fewer concurrent blocks (the wind-down phase).
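Worked example (illustrative numbers): with kmax-2 = 5 values of k and t = 3 threads there are 5+3-1 = 7 wavefronts. W1 = {(1, J0)}, W2 = {(1, J1), (2, J0)}, W3 = {(1, J2), (2, J1), (3, J0)}, W4 = {(2, J2), (3, J1), (4, J0)}, W5 = {(3, J2), (4, J1), (5, J0)}, W6 = {(4, J2), (5, J1)}, W7 = {(5, J2)}. Only the wavefronts W3 to W5 keep all three threads busy.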
OpenMP wavefront parallelization
#pragma omp parallel private(k,j,i)
{
  int numthreads, threadID, jstart, jend, m;
  numthreads = omp_get_num_threads();
  threadID = omp_get_thread_num();
  jstart = ((jmax-2)*threadID)/numthreads + 1;
  jend = ((jmax-2)*(threadID+1))/numthreads;
  for (m=1; m<=kmax+numthreads-3; m++) { // loop over the wavefronts
    k = m - threadID;
    if (k>=1 && k<=kmax-2) {
      for (j=jstart; j<=jend; j++)
        for (i=1; i<imax-1; i++)
          phi[k][j][i] = (phi[k-1][j][i] + phi[k][j-1][i]
                         +phi[k][j][i-1] + phi[k][j][i+1]
                         +phi[k][j+1][i] + phi[k+1][j][i])/6.0;
    }
    #pragma omp barrier
  }
} // end of the parallel region