High Performance Computing
(HPC)
Lecture 3
By: Dr. Maha Dessokey
Programming with Shared Memory
(OpenMP)
Parallel Computer Memory Architectures
Shared Memory
All processors access all memory as a single global address
space.
Data sharing is fast.
Lack of scalability between memory and CPUs
Multithreading vs. Multiprocessing
Threads ("lightweight"): share the same process, i.e. the memory space and global variables are shared between routines.
Processes ("heavyweight"): each is a completely separate program with its own variables, stack, and memory allocation.
Programming with Shared Memory
The most popular shared memory multithreading API is
POSIX Threads (Pthreads).
OpenMP
Agenda
Introduction to OpenMP
Creating Threads
Synchronization
Parallel Loops
What is OpenMP?
OpenMP: An API for Writing Multithreaded Applications
“Standard” API for defining multi-threaded shared-memory
programs
Set of compiler directives and library routines for parallel
application programmers
Greatly simplifies writing multi-threaded (MT)
programs in Fortran, C and C++
OpenMP Solution Stack
A Programmer’s View of OpenMP
OpenMP will:
Allow a programmer to separate a program into serial regions and parallel
regions, rather than concurrently-executing threads.
Hide stack management
Provide synchronization constructs
OpenMP will not:
Parallelize automatically
Guarantee speedup
Provide freedom from data races
race condition: when the program’s outcome changes as
the threads are scheduled differently
A process is an instance of a program; its threads interact through reads and writes to a shared address space.
The OS scheduler decides when to run which threads, interleaving them for fairness.
Synchronization is needed to assure that every legal ordering of the threads produces correct results.
OpenMP core syntax
Most of the constructs in OpenMP are compiler directives.
Example: #pragma omp parallel num_threads(4)
where omp is an OpenMP keyword.
Function prototypes and types in the file:
#include <omp.h>
OpenMP constructs apply to a “structured block”.
Structured block: a block of one or more statements with one point of entry at
the top and one point of exit at the bottom.
It’s OK to have an exit() within the structured block.
A non-structured block lacks clear control flow and can lead to "spaghetti code,"
where the logic is tangled and difficult to follow. This often includes the use of
GOTO statements or deeply nested control structures.
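For instance (a minimal sketch; the condition is contrived purely for illustration), the braces below form a structured block with one entry at the top and one exit at the bottom:

#include <omp.h>
#include <stdlib.h>

int main()
{
    #pragma omp parallel
    {                                   // single entry point: the top of the block
        int id = omp_get_thread_num();
        if (id < 0)                     // contrived check, for illustration only
            exit(1);                    // exit() is the one early termination allowed
        // a goto or return jumping out of this block would make it non-structured
    }                                   // single exit point: the bottom of the block
    return 0;
}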
A multi-threaded “Hello world” program
Write a multithreaded program where each thread prints “hello
world”
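A minimal sketch of one way to write it (each thread in the team executes the body of the parallel region and prints its own ID):

#include <omp.h>
#include <stdio.h>

int main()
{
    #pragma omp parallel                      // fork a team of threads
    {
        int ID = omp_get_thread_num();        // each thread gets its own ID
        printf("hello world from thread %d\n", ID);
    }                                         // join: implicit barrier at the end
    return 0;
}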
How do threads interact?
OpenMP is a multi-threading, shared address model.
Threads communicate by sharing variables.
Unintended sharing of data causes race conditions:
Race Condition: when the program’s outcome changes as the threads are
scheduled differently.
To control race conditions: – Use synchronization to protect data conflicts.
OpenMP Programming Model
Fork-Join Model:
The master thread spawns a team of threads as needed.
Parallelism is added incrementally until performance goals are met, i.e. the sequential program evolves into a parallel program.
Thread Creation: Parallel Regions
You create threads in OpenMP with the parallel construct.
For example, to create a 4-thread parallel region:
Each thread calls pooh(ID, A) for ID = 0 to 3.
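A sketch of the kind of code this refers to (pooh() and the shared array A stand in for real work and data):

#include <omp.h>
void pooh(int ID, double *A);            // placeholder: the work each thread does

void create_team(double *A)
{
    omp_set_num_threads(4);              // request 4 threads for the next region
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();   // ID = 0, 1, 2, 3 (one per thread)
        pooh(ID, A);                     // each thread calls pooh with its own ID
    }                                    // implicit barrier: all threads finish here
}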
Example: Numerical Integration
Mathematically, we know that the integral of 4/(1+x²) from 0 to 1 equals π.
We can approximate the integral as a sum of rectangles:
    π ≈ Σ F(xᵢ) ∆x   (summed over intervals i = 1 .. N)
where each rectangle has width ∆x and height F(xᵢ) = 4/(1+xᵢ²) evaluated at the middle of interval i.
Serial PI Program

static long num_steps = 100000;
double step;

int main ()
{
    int i; double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;             // midpoint of interval i
        sum = sum + 4.0/(1.0 + x*x);      // accumulate F(x) = 4/(1+x^2)
    }
    pi = step * sum;                      // pi ≈ sum of rectangle areas
    return 0;
}
A simple Parallel pi program
To create a parallel version of the pi program, pay close attention to shared versus private variables.
We will need the runtime library routines:
    int omp_get_num_threads();    → number of threads in the team
    int omp_get_thread_num();     → thread ID (rank)
    double omp_get_wtime();       → wall-clock time in seconds since a fixed point in the past
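A tiny standalone sketch of these three routines, independent of the pi program:

#include <omp.h>
#include <stdio.h>

int main()
{
    double t0 = omp_get_wtime();              // seconds since some fixed point
    #pragma omp parallel
    {
        int id     = omp_get_thread_num();    // this thread's ID (rank)
        int nthrds = omp_get_num_threads();   // team size, valid inside the region
        printf("thread %d of %d\n", id, nthrds);
    }
    printf("elapsed: %f seconds\n", omp_get_wtime() - t0);
    return 0;
}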
A simple Parallel pi program
#include <omp.h>
static long num_steps = 100000; double step;
#define NUM_THREADS 2

int main ()
{
    int i, nthreads; double pi, sum[NUM_THREADS];
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int i, id, nthrds; double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;      // one thread records the actual team size
        for (i = id, sum[id] = 0.0; i < num_steps; i = i + nthrds) {   // cyclic distribution of iterations
            x = (i + 0.5) * step;
            sum[id] += 4.0/(1.0 + x*x);      // per-thread partial sum
        }
    }  // end of parallel region
    for (i = 0, pi = 0.0; i < nthreads; i++)  // serial reduction of the partial sums
        pi += sum[i] * step;
    return 0;
}
How to calculate the runtime?
#include <omp.h>
#include <stdio.h>
static long num_steps = 100000; double step;
#define NUM_THREADS 2

int main ()
{
    int i, nthreads; double pi, sum[NUM_THREADS];
    double runtime;
    runtime = omp_get_wtime();               // start the timer
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int i, id, nthrds; double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;
        for (i = id, sum[id] = 0.0; i < num_steps; i = i + nthrds) {
            x = (i + 0.5) * step;
            sum[id] += 4.0/(1.0 + x*x);
        }
    }  // end of parallel region
    for (i = 0, pi = 0.0; i < nthreads; i++)
        pi += sum[i] * step;
    runtime = omp_get_wtime() - runtime;     // stop the timer
    printf("In %lf seconds, pi = %lf\n", runtime, pi);
    return 0;
}
Algorithm strategy
The SPMD (Single Program Multiple Data) design pattern:
Run the same program on P processing elements, where P can be arbitrarily large.
Use the rank (an ID ranging from 0 to P-1) to select between a set of tasks and to manage any shared data structures.
This pattern is very general and has been used to support most (if not all) of the algorithm strategy patterns.
MPI programs almost always use this pattern; it is probably the most commonly used pattern in the history of parallel programming.
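A generic sketch of the SPMD idea in OpenMP (work() is a placeholder): every thread runs the same code, and the rank decides which iterations it owns, which is exactly the cyclic split used in the pi program above.

#include <omp.h>
void work(int i);                          // placeholder for per-iteration work

void spmd_loop(int N)
{
    #pragma omp parallel
    {
        int rank = omp_get_thread_num();   // this thread's ID: 0 .. P-1
        int P    = omp_get_num_threads();  // number of processing elements
        for (int i = rank; i < N; i += P)  // the rank selects this thread's share
            work(i);
    }
}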
Synchronization
Synchronization: bringing one or more threads to a well
defined and known point in their execution.
Synchronization is used to impose order constraints and to
protect access to shared data
The two most common forms of synchronization are
Barrier: each thread waits at the barrier until all threads arrive.
Mutual exclusion: Define a block of code that only one thread
at a time can execute.
Synchronization: Barrier
Barrier: Each thread waits until all threads arrive.
#pragma omp parallel
{
    int id = omp_get_thread_num();
    A[id] = big_calc1(id);
    #pragma omp barrier            // B[] will not be calculated until all threads
    B[id] = big_calc2(id, A);      // have completed their A[] calculations
}
Synchronization: Mutual exclusion
Mutual exclusion: Only one thread at a time can enter a critical region
float res;
#pragma omp parallel
{
    float B; int i, id, nthrds;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    for (i = id; i < niters; i += nthrds) {
        B = big_job(i);
        #pragma omp critical       // threads wait their turn: only one at a
        res += consume(B);         // time calls consume()
    }
}
Synchronization: Atomic
Atomic: provides mutual exclusion but only applies to the update of a
memory location (the update of X in the following example)
#pragma omp parallel
{
    double tmp, B;
    B = DOIT();
    tmp = big_ugly(B);             // the expensive work is done outside the atomic
    #pragma omp atomic             // atomic only protects the read/update of X
    X += tmp;
}
SPMD vs. worksharing
A parallel construct by itself creates an SPMD or “Single Program
Multiple Data” program … i.e., each thread redundantly executes
the same code.
How do you split up pathways through the code between threads
within a team?
This is called worksharing:
Loop construct
Sections/section constructs
Single construct
Task construct
(the constructs other than the loop construct are out of our scope)
The loop worksharing Constructs
The loop worksharing construct splits up loop iterations among the
threads in a team
#pragma omp parallel
{
    #pragma omp for
    for (I = 0; I < N; I++)
    {
        NEAT_STUFF(I);    // the loop variable I is made "private"
    }                     // to each thread by default
}
The loop worksharing Constructs
Sequential code:
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region (block distribution of loop iterations):
    #pragma omp parallel
    {
        int id, i, Nthrds, Step, istart, iend;
        id = omp_get_thread_num();
        Nthrds = omp_get_num_threads();
        Step = N / Nthrds;
        istart = id * Step;
        iend = (id + 1) * Step;
        if (id == Nthrds - 1) iend = N;   // last thread takes the remainder
        for (i = istart; i < iend; i++)
        {
            a[i] = a[i] + b[i];
        }
    }
The loop worksharing Constructs
Sequential code:
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region and a worksharing for construct:
    #pragma omp parallel
    #pragma omp for
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

Combined parallel/worksharing construct (OpenMP shortcut: put the "parallel" and the worksharing directive on the same line):
    #pragma omp parallel for
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
loop worksharing constructs:
The schedule clause
The schedule clause affects how loop iterations are mapped onto threads
schedule(static [,chunk])
Deal out blocks of iterations of size “chunk” to each thread.
schedule(dynamic[,chunk])
Each thread grabs “chunk” iterations off a queue until all iterations have been handled.
schedule(guided[,chunk])
Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down
to size “chunk” as the calculation proceeds.
schedule(runtime)
Schedule and chunk size taken from the OMP_SCHEDULE environment variable (or the runtime
library).
schedule(auto) – Schedule is left up to the runtime to choose (does not have to be any of
the above)
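A short sketch of how the schedule clause is written in practice (process() is a placeholder; static and dynamic are shown, and the other kinds use the same syntax):

#include <omp.h>
void process(int i);                                 // placeholder per-iteration work

void scheduled_loops(int N)
{
    #pragma omp parallel for schedule(static, 4)     // chunks of 4 iterations dealt out to threads
    for (int i = 0; i < N; i++)
        process(i);

    #pragma omp parallel for schedule(dynamic, 8)    // threads grab chunks of 8 from a queue
    for (int i = 0; i < N; i++)
        process(i);
}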
Assignment 1: Parallel Matrix Addition Using OpenMP
You will implement a parallel program that performs matrix addition using OpenMP.
This exercise will help you understand how to use parallel computing to enhance performance.
Due: two weeks from today.