OpenMP Tutorial
Shao-Ching Huang
2013-02-11
Overview
Part I: Parallel Computing Basic Concepts
– Memory models
– Data parallelism
Part II: OpenMP Tutorial
– Thread programming
Part I : Basic Concepts
Why Parallel Computing?
Bigger data
– High-res simulation
– Single machine too small to hold/process all data
Utilize all resources to solve one problem
– All new computers are parallel computers
– Multi-core phones, laptops, desktops
– Multi-node clusters, supercomputers
Memory models
Two types:
Shared memory model
Distributed memory model
Shared Memory
All CPUs have access to the (shared) memory
(e.g. your laptop/desktop computer)
Distributed Memory
Each CPU has its own (local) memory, invisible to other CPUs
Hybrid Model
Shared-memory style within a node
Distributed-memory style across nodes
Parallel Scalability
Strong scaling
– fixed global problem size
– local (per-processor) size decreases as the number of processors N increases
– ideal case: T × N = const, i.e. T ∝ 1/N (e.g. 100 s on 1 processor becomes 12.5 s on 8)
Weak scaling
– fixed local problem size (per processor)
– global size increases as N increases
– ideal case: T = const
[Figures not reproduced in this text version: run time T vs. N for real code and for the ideal case, for strong and weak scaling]
Identify Data Parallelism – some typical examples
“High-throughput” calculations
– Many independent jobs
Mesh-based problems
– Structured or unstructured mesh
– Mesh viewed as a graph – partition the graph
– For structured mesh one can simply partition along coord. axes
Particle-based problems
– Short-range interaction
• Group particles in cells – partition the cells
– Long-range interaction
• Parallel fast multipole method – partition the tree
Portable parallel programming – OpenMP example
OpenMP
– Compiler support
– Works on ONE multi-core computer
Compile (with OpenMP support):
$ ifort -openmp foo.f90
Run with 8 “threads”:
$ export OMP_NUM_THREADS=8
$ ./a.out
Typically you will see CPU utilization over 100%, because the program is using multiple CPU cores.
Portable parallel programming – MPI example
Example machine file (one line per node):
n32 slots=8
n48 slots=8
n50 slots=8
The exact format of the machine file may vary slightly with each MPI implementation. More on this in the MPI class...
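As a sketch only (the launcher name and flags vary by MPI implementation; the Open MPI form is shown and the machine file name "machines" is assumed), a 24-process job over the three nodes above might be started with:
$ mpirun -np 24 --hostfile machines ./a.out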
Part II : OpenMP Tutorial
(thread programming)
What is OpenMP?
A standard API (compiler directives, a runtime library and environment variables) for shared-memory parallel programming in C/C++ and Fortran
http://www.openmp.org
Elements of Shared-memory Programming
Fork/join threads
Synchronization
– barrier
– mutual exclusion (mutex)
OpenMP Execution Model
[Figure not reproduced in this text version: the fork/join execution model, in which a master thread forks a team of threads at each parallel region and joins them at its end. Source: wikipedia.org]
saxpy operation (C)
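The saxpy code itself is not reproduced in this text version; a minimal sketch of the operation y = a*x + y (function and variable names assumed) might look like:
void saxpy(int n, float a, float *x, float *y)
{
  int i;
  // distribute the loop iterations over the threads of a team
  #pragma omp parallel for shared(n,a,x,y) private(i)
  for (i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}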
Parallel Region
To fork a team of N threads, numbered 0,1,...,N-1
Probably the most important construct in OpenMP
Implicit barrier at the end of the region
The C/C++ and Fortran snippets on this slide are not reproduced in this text version; see the sketch below.
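A minimal C sketch of forking and joining a team (the printed message is assumed):
#include <stdio.h>
#include <omp.h>
int main(void)
{
  // sequential code here (master thread)
  #pragma omp parallel
  {
    // this block is executed by every thread of the team
    printf("hello from thread %d\n", omp_get_thread_num());
  }   // implicit barrier: the team joins here
  // sequential code here (master thread)
  return 0;
}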
Clauses for Parallel Construct
if, num_threads, default, firstprivate, copyin, reduction, nowait
Clause “Private”
The values of private data are undefined upon entry to and exit
from the specific construct
To ensure the last value is accessible after the construct,
consider using “lastprivate”
To pre-initialize private variables with values available prior to the region, consider using “firstprivate” (see the sketch below)
The loop iteration variable is private by default
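A small sketch (variable names assumed) of how firstprivate and lastprivate behave:
int i, x = 10, last = -1;
#pragma omp parallel for firstprivate(x) lastprivate(last)
for (i = 0; i < 100; i++) {
  // each thread starts with its own copy of x, initialized to 10
  last = i + x;
}
// after the loop, last holds the value from the sequentially final iteration: 99 + 10 = 109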
Clause “Shared”
Shared among the team of threads executing the region
Each thread can read or modify shared variables
Data corruption is possible when multiple threads attempt to update the same memory location
– Data race condition (see the sketch below)
– Memory store operations are not necessarily atomic
Code correctness is the user’s responsibility
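As a small sketch of the problem (variable names assumed): the first loop below can lose updates, while the second protects each update with an atomic construct (a reduction, shown later, is usually the better choice):
int i, count = 0;
#pragma omp parallel for shared(count)
for (i = 0; i < 1000000; i++)
  count++;            // data race: the read-modify-write is not atomic, so increments can be lost

count = 0;
#pragma omp parallel for shared(count)
for (i = 0; i < 1000000; i++) {
  #pragma omp atomic
  count++;            // each update is now indivisible; the result is always 1000000
}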
nowait
C/C++:
#pragma omp for nowait
// for loop here
Fortran:
!$omp do
! do-loop here
!$omp end do nowait
If clause
if (integer expression)
– determines whether the region should run in parallel
– a useful option when the data is too small (or too large)
Example (the C/C++ and Fortran code is not reproduced in this text version; see the sketch below)
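A minimal C sketch of the if clause (function name and threshold assumed):
void scale(double *a, int n)
{
  int i;
  // parallelize only when n is large enough to outweigh the threading overhead
  #pragma omp parallel for if (n > 10000) shared(n,a) private(i)
  for (i = 0; i < n; i++)
    a[i] = 2.0 * a[i];
}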
Work Sharing
We have not yet discussed how work is distributed among
threads...
Without specifying how to share work, all threads will redundantly
execute all the work (i.e. no speedup!)
The choice of work-share method is important for performance
OpenMP work-sharing constructs
– loop (“for” in C/C++; “do” in Fortran)
– sections
– single
Loop Construct (work sharing)
Clauses: private, firstprivate, lastprivate, reduction, ordered, schedule, nowait
C/C++:
#pragma omp parallel shared(n,a,b) private(i)
{
  #pragma omp for
  for (i=0; i<n; i++)
    a[i]=i;
  #pragma omp for
  for (i=0; i<n; i++)
    b[i] = 2 * a[i];
}
Fortran:
!$omp parallel shared(n,a,b) private(i)
!$omp do
do i=1,n
  a(i)=i
end do
!$omp end do
...
Parallel Loop (C/C++)
Style 1 and Style 2 (the code snippets are not reproduced in this text version; see the sketch below)
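A sketch of the two styles (array and function names assumed):
// Style 1: a parallel region containing a work-sharing for construct
void fill_style1(int *a, int n)
{
  int i;
  #pragma omp parallel shared(n,a) private(i)
  {
    #pragma omp for
    for (i = 0; i < n; i++)
      a[i] = i;
  }
}

// Style 2: the combined parallel for construct
void fill_style2(int *a, int n)
{
  int i;
  #pragma omp parallel for shared(n,a) private(i)
  for (i = 0; i < n; i++)
    a[i] = i;
}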
Parallel Loop (Fortran)
Style 1 (!$omp parallel plus !$omp do) and Style 2 (!$omp parallel do), analogous to the C/C++ styles above (code not reproduced in this text version)
Loop Scheduling
Scheduling types:
– static: each thread is assigned a fixed-size chunk (default)
– dynamic: work is assigned as a thread requests it
– guided: big chunks first and smaller and smaller chunks later
– runtime: use environment variable to control scheduling
[Figures not reproduced in this text version: static, dynamic and guided scheduling illustrations, and a loop schedule example; see the sketch below]
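A minimal sketch of the schedule clause (function name and chunk sizes assumed):
void run(int n)
{
  extern void work(int i);   // assumed user function
  int i;

  // static: iterations are divided into chunks of 4 and handed out round-robin up front
  #pragma omp parallel for schedule(static,4)
  for (i = 0; i < n; i++) work(i);

  // dynamic: each thread grabs the next chunk of 4 whenever it finishes its current one
  #pragma omp parallel for schedule(dynamic,4)
  for (i = 0; i < n; i++) work(i);

  // runtime: the schedule is taken from the OMP_SCHEDULE environment variable,
  // e.g.  $ export OMP_SCHEDULE="guided,8"
  #pragma omp parallel for schedule(runtime)
  for (i = 0; i < n; i++) work(i);
}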
Sections
One thread executes one section
– If there are “too many” sections, some threads execute more than one section (round-robin)
– If there are “too few” sections, some threads are idle
– We don’t know in advance which thread will execute which section
Each section is executed exactly once
C/C++:
#pragma omp sections
{
  #pragma omp section
  { foo(); }
  #pragma omp section
  { bar(); }
  #pragma omp section
  { beer(); }
} // end of sections
Fortran:
!$omp sections
!$omp section
call foo()
!$omp end section
!$omp section
call bar()
!$omp end section
!$omp end sections
Single
Only one thread executes the single block; the other threads wait at the implicit barrier at its end, so all of them see a = 10 in the loop that follows.
C/C++:
#pragma omp single
{
  a = 10;
}
#pragma omp for
for (i=0; i<N; i++)
  b[i] = a;
Fortran:
!$omp single
a = 10
!$omp end single
!$omp do
do i=1,n
  b(i) = a
end do
!$omp end do
Computing the Sum
We want to compute the sum of a[0] through a[N-1]:
C/C++:
sum = 0;
for (i=0; i<N; i++)
  sum += a[i];
Fortran:
sum = 0
do i=1,n
  sum = sum + a(i)
end do
Computing the sum
The correct OpenMP way:
sum = 0;
#pragma omp parallel shared(n,a,sum) private(sum_local)
{
  sum_local = 0;
  #pragma omp for
  for (i=0; i<n; i++)
    sum_local += a[i];   // form per-thread local sum
  #pragma omp critical
  {
    sum += sum_local;    // combine the per-thread sums, one thread at a time
  }
}
Reduction operation
Sum example from the previous slide:
sum = 0;
#pragma omp parallel \
    shared(...) private(...)
{
  sum_local = 0;
  #pragma omp for
  for (i=0; i<n; i++)
    sum_local += a[i];
  #pragma omp critical
  {
    sum += sum_local;
  }
}
A cleaner solution:
sum = 0;
#pragma omp parallel for \
    shared(...) private(...) \
    reduction(+:sum)
for (i=0; i<n; i++)
  sum += a[i];
Reduction operations +, *, -, &, |, ^, && and || are supported.
Barrier
int x = 2;
#pragma omp parallel shared(x)
{
  int tid = omp_get_thread_num();
  if (tid == 0)
    x = 5;
  else
    // some threads may still see x = 2 here
    printf("[1] thread %2d: x = %d\n", tid, x);
  #pragma omp barrier   // cache flush + thread synchronization; all threads see x = 5 beyond this point
  ...
}
Resource Query Functions
Max number of threads
omp_get_max_threads()
Number of processors
omp_get_num_procs()
Number of threads (inside a parallel region)
omp_get_num_threads()
Get thread ID
omp_get_thread_num()
Query function example:
#include <omp.h>

void bar(float *x, int istart, int ipts)
{
  for (int i=0; i<ipts; i++)
    x[istart+i] = 3.14159;
}

void foo(float *x, int npts)
{
  int tid, ntids, ipts, istart;
  #pragma omp parallel private(tid,ntids,ipts,istart)
  {
    tid = omp_get_thread_num();       // thread ID
    ntids = omp_get_num_threads();    // total number of threads
    ipts = npts / ntids;              // points per thread
    istart = tid * ipts;              // starting index for this thread
    if (tid == ntids-1) ipts = npts - istart;   // last thread picks up the remainder
    bar(x, istart, ipts);             // each thread calls bar on its own chunk
  }
}

int main()
{
  float *array = new float[10000];
  foo(array, 10000);
}
Control the Number of Threads
Parallel region clause
#pragma omp parallel num_threads(integer)
Run-time function
omp_set_num_threads()
Environment variable
export OMP_NUM_THREADS=n
Listed from highest to lowest priority: the clause overrides the run-time function, which overrides the environment variable (see the sketch below).
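A small sketch showing this precedence (thread counts assumed):
#include <stdio.h>
#include <omp.h>
int main(void)
{
  omp_set_num_threads(4);              // overrides OMP_NUM_THREADS
  #pragma omp parallel num_threads(2)  // the clause overrides omp_set_num_threads()
  {
    #pragma omp single
    printf("team size: %d\n", omp_get_num_threads());   // prints 2
  }
  return 0;
}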
Which OpenMP version do I have?
GNU compiler on my desktop:
$ g++ --version
g++ (Ubuntu/Linaro 4.4.4-14ubuntu5) 4.4.5
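The compiler version alone does not tell you the OpenMP version; the _OPENMP macro (defined when compiling with OpenMP enabled, e.g. g++ -fopenmp) expands to a yyyymm date code that identifies it. A small sketch:
#include <stdio.h>
int main(void)
{
#ifdef _OPENMP
  // e.g. 200805 = OpenMP 3.0, 201107 = OpenMP 3.1, 201307 = OpenMP 4.0
  printf("_OPENMP = %d\n", _OPENMP);
#else
  printf("OpenMP is not enabled\n");
#endif
  return 0;
}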
Parallel Region in Subroutines
Main program is “sequential”
subroutines/functions are parallelized
Parallel Region in “main” Program
Parallel region is in the main program
– subroutines/functions called inside the region are executed by every thread
Nested Parallel Regions
Need available hardware resources (e.g. CPU cores) to gain performance (see the sketch below)
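A minimal sketch of nesting (OpenMP 3.x interface; the thread counts and printed message are assumed):
#include <stdio.h>
#include <omp.h>
int main(void)
{
  omp_set_nested(1);                    // allow nested parallel regions
  #pragma omp parallel num_threads(2)   // outer team: 2 threads
  {
    int outer = omp_get_thread_num();
    #pragma omp parallel num_threads(3) // each outer thread forks an inner team of 3
    printf("outer thread %d, inner thread %d\n", outer, omp_get_thread_num());
  }
  return 0;
}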
Single Source Code
Use _OPENMP to separate sequential and parallel code within
the same source file
Redefine runtime library functions to avoid linking errors
#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_max_threads() 1
#define omp_get_thread_num() 0
#endif
Good Things about OpenMP
Simplicity
– In many cases, “the right way” to do it is clean and simple
Incremental parallelization possible
– Can incrementally parallelize a sequential code, one block at
a time
– Great for debugging & validation
Leave thread management to the compiler
It is directly supported by the compiler
– No need to install additional libraries (unlike MPI)
Other things about OpenMP
Data race conditions can be hard to detect/debug
– The code may run correctly with a small number of threads!
– True for all thread programming, not only OpenMP
– Some tools may help
It may take some work to get parallel performance right
– In some cases, the performance is limited by memory bandwidth (i.e. a hardware issue)
Other types of parallel programming
MPI
– works on both shared- and distributed-memory systems
– relatively low level (i.e. lots of details)
– in the form of a library
PGAS languages
– Partitioned Global Address Space
– native compiler support for parallelization
– UPC, Co-array Fortran and several others
Summary
Identify compute-intensive, data-parallel parts of your code
Use OpenMP constructs to parallelize your code
– Spawn threads (parallel regions)
– In parallel regions, distinguish shared variables from private ones
– Assign work to individual threads
  • loop, schedule, etc.
– Watch out for variable initialization before/after the parallel region
– Single thread required? (single/critical)
Experiment and improve performance
Thank you.