OpenMP Basics
Philip Blood
Scientific Specialist
Pittsburgh Supercomputing Center
• Amdahl’s Law: the maximum speedup on N processors is
  Speedup = 1 / (F + (1 - F)/N)
  F = Fraction of serial execution time that cannot be parallelized
  N = Number of processors
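• For example, if 10% of the run time cannot be parallelized (F = 0.1), then with N = 8 processors the best possible speedup is 1 / (0.1 + 0.9/8) ≈ 4.7, no matter how well the parallel part scales.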
• Tru64: -mp
• SGI IRIX: -mp
• IBM AIX: -qsmp=omp
• Portland Group: -mp
• Intel: -openmp
• gcc (4.2): -fopenmp
Compiling and Running OpenMP
• The OMP_NUM_THREADS environment
variable sets the number of threads the
OpenMP program will use.
• Example script
#!/bin/tcsh
setenv OMP_NUM_THREADS 4
mycode < my.in > my.out
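As a minimal sketch (the file name hello_omp.c and the printed message are only illustrative, not from the slides), a tiny test program for this setup could be:

/* hello_omp.c: minimal OpenMP test (illustrative) */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* every thread in the team prints its own ID */
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

Compiled with, e.g., gcc -fopenmp hello_omp.c -o hello_omp, it should print one line per thread when run under the script above.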
OpenMP Basics:
2 Approaches to Parallelism
• Divide loop iterations among threads: loop-level parallelism
• Divide various sections of code between threads: functional parallelism
We will focus mainly on loop-level parallelism in this lecture.
Sections: Functional parallelism
#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section
    block1
    #pragma omp section
    block2
  }
}
Image from:
https://computing.llnl.gov/tutorials/openMP
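A self-contained version of this pattern, with placeholder printouts standing in for block1 and block2, might look like:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            printf("block1 executed by thread %d\n", omp_get_thread_num());

            #pragma omp section
            printf("block2 executed by thread %d\n", omp_get_thread_num());
        }  /* implicit barrier at the end of the sections construct */
    }      /* end of the parallel region */
    return 0;
}

Each section is executed once, by whichever thread picks it up; with only two sections, at most two threads do useful work here.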
Parallel DO/for:
Loop level parallelism
Fortran:
!$omp parallel do
do i = 1, n
a(i) = b(i) + c(i)
enddo
C/C++:
#pragma omp parallel for
for(i=1; i<=n; i++)
a[i] = b[i] + c[i];
Image from:
https://computing.llnl.gov/tutorials/openMP
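For reference, a compilable version of the C loop above (array size and initialization are only illustrative) could be:

#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N+1], b[N+1], c[N+1];
    int i;

    for (i = 1; i <= N; i++) {   /* serial initialization */
        b[i] = i;
        c[i] = 2.0 * i;
    }

    /* iterations are divided among the threads; the loop index i is
       private to each thread, while a, b, c are shared */
    #pragma omp parallel for
    for (i = 1; i <= N; i++)
        a[i] = b[i] + c[i];

    printf("a[%d] = %f\n", N, a[N]);
    return 0;
}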
Pitfall #1: Data dependencies
• Consider the following code:
a[0] = 1;
for(i=1; i<5; i++)
a[i] = i + a[i-1];
• Each iteration needs a[i-1] from the previous iteration, so the iterations cannot simply be split into chunks and handed to different threads: a thread may read a[i-1] before the thread that owns iteration i-1 has written it, producing wrong results.
(Figure: iterations of the loop divided into chunks, each chunk assigned to a different thread.)
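A sketch of the danger (illustrative only; the pragma is deliberately shown commented out): enabling parallel for on this loop as-is would let threads race on a[i-1]:

#include <stdio.h>

#define N 5

int main(void)
{
    int a[N];
    int i;

    a[0] = 1;

    /* WRONG to parallelize as-is: iteration i needs a[i-1] from
       iteration i-1, so the pragma below must NOT be enabled here.
    #pragma omp parallel for
    */
    for (i = 1; i < N; i++)
        a[i] = i + a[i - 1];

    for (i = 0; i < N; i++)
        printf("a[%d] = %d\n", i, a[i]);
    return 0;
}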
Optimization: Scheduling
• This default strategy (one equal, contiguous chunk of iterations per thread) has the least amount of overhead.
• However, if not all iterations take the same amount of time, this simple strategy will lead to load imbalance.
(Figure: 8 loop iterations, 0-7, divided into 4 equal chunks of two iterations each; chunk 0 goes to thread 0, chunk 1 to thread 1, chunk 2 to thread 2, chunk 3 to thread 3.)
Optimization: Scheduling
• OpenMP offers a variety of scheduling
strategies:
– schedule(static,[chunksize])
• Divides workload into equal-sized chunks
• Default chunksize is Nwork/Nthreads
– Setting chunksize to less than this will result in chunks
being assigned in an interleaved manner
• Lowest overhead
• Least optimal workload distribution
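As a sketch (the owner array and the chunksize of 2 are only for illustration, assuming OMP_NUM_THREADS is 4 as in the script earlier), a chunksize smaller than Nwork/Nthreads deals chunks out round-robin:

#include <stdio.h>
#include <omp.h>

#define N 16

int main(void)
{
    int owner[N];
    int i;

    /* with 4 threads, chunks of 2 iterations are handed out round-robin:
       thread 0 gets 0-1 and 8-9, thread 1 gets 2-3 and 10-11, and so on */
    #pragma omp parallel for schedule(static, 2)
    for (i = 0; i < N; i++)
        owner[i] = omp_get_thread_num();

    for (i = 0; i < N; i++)
        printf("iteration %2d ran on thread %d\n", i, owner[i]);
    return 0;
}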
Optimization: Scheduling
– schedule(dynamic,[chunksize])
• Chunks are assigned to threads dynamically, as each thread finishes its previous chunk
• Default chunksize is 1
• Highest overhead
• Optimal workload distribution
– schedule(guided,[chunksize])
• Starts with big chunks proportional to (number of
unassigned iterations)/(number of threads), then makes
them progressively smaller until chunksize is reached
• Seeks a balance between overhead and
workload optimization
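A sketch of a loop with unequal iteration costs (the triangular inner loop is only an illustration) where dynamic or guided scheduling pays off:

#include <math.h>
#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N];
    int i, j;

    /* the work grows with i, so equal-sized static chunks would be
       unbalanced; chunks of 16 are handed to threads as they become
       free (schedule(guided, 16) would also work here) */
    #pragma omp parallel for schedule(dynamic, 16) private(j)
    for (i = 0; i < N; i++) {
        a[i] = 0.0;
        for (j = 0; j <= i; j++)
            a[i] += sqrt((double)j);
    }

    printf("a[%d] = %f\n", N - 1, a[N - 1]);
    return 0;
}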
Optimization: Scheduling
– schedule(runtime)
• Scheduling can be selected at runtime using
OMP_SCHEDULE
• e.g. setenv OMP_SCHEDULE “guided, 100”
– In practice, often use:
• Default scheduling (static, large chunks)
• Guided with default chunksize
– Experiment with your code to determine
optimal strategy
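A sketch of the runtime option (array and loop body are only illustrative): the code says only schedule(runtime), and the actual strategy comes from OMP_SCHEDULE, set with the tcsh syntax used earlier:

#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N];
    int i;

    /* strategy and chunksize are read from the environment, e.g.
       setenv OMP_SCHEDULE "guided,100"   before running */
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < N; i++)
        a[i] = 2.0 * i;

    printf("a[%d] = %f\n", N - 1, a[N - 1]);
    return 0;
}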
What we have learned
• How to compile and run OpenMP programs
• Private vs. shared variables
• Critical sections and reductions for
updating scalar shared variables
• Techniques for minimizing thread
spawning/exiting overhead
• Different scheduling strategies
Summary
• OpenMP is often the easiest way to achieve
moderate parallelism on shared memory
machines
• In practice, to achieve decent scaling, you will
probably need to invest some effort
in tuning your application.
• More information available at:
– https://computing.llnl.gov/tutorials/openMP/
– http://www.openmp.org
– Chapman, Jost, and van der Pas, Using OpenMP, MIT Press, 2008
Hands-On
If you’ve finished parallelizing the Laplace code
(or you want a break from MPI):