
Parallel Computing and OpenMP Tutorial

Shao-Ching Huang

IDRE High Performance Computing Workshop

2013-02-11
Overview
 Part I: Parallel Computing Basic Concepts
– Memory models
– Data parallelism
 Part II: OpenMP Tutorial
– Important features
– Examples & programming tips
Part I: Basic Concepts
Why Parallel Computing?
 Bigger data
– High-res simulation
– Single machine too small to hold/process all data
 Utilize all resources to solve one problem
– All new computers are parallel computers
– Multi-core phones, laptops, desktops
– Multi-node clusters, supercomputers

Memory models

Parallel computing is about data processing.


In practice, memory models determine how we write parallel
programs.

Two types:
 Shared memory model
 Distributed memory model
Shared Memory
All CPUs have access to the (shared) memory
(e.g. Your laptop/desktop computer)

Distributed Memory

Each CPU has its own (local) memory, invisible to other CPUs

High speed networking (e.g. Infiniband) for good performance

Hybrid Model
 Shared-memory style within a node
 Distributed-memory style across nodes

For example, this is one node of Hoffman2 cluster

Parallel Scalability
 Strong scaling
– the global problem size is fixed
– local size decreases as N is increased
– ideal case: T*N = const (linear decay)
 Weak scaling
– the local problem size (per processor) is fixed
– global size increases as N increases
– ideal case: T = const.

(The plots on this slide compare the run time of real code against the ideal case for each scaling type.)

T(N) = wall clock run time
N = number of processors
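A compact way to quantify these curves (standard definitions, not from the original slide), with T(1) the single-processor run time:

  speedup      S(N) = T(1) / T(N)
  efficiency   E(N) = S(N) / N

Ideal strong scaling gives S(N) = N and E(N) = 1, i.e. the T*N = const behavior above.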

Identify Data Parallelism – some typical examples
 “High-throughput” calculations
– Many independent jobs
 Mesh-based problems
– Structured or unstructured mesh
– Mesh viewed as a graph – partition the graph
– For structured mesh one can simply partition along coord. axes

 Particle-based problems
– Short-range interaction
• Group particles in cells – partition the cells
– Long-range interaction
• Parallel fast multipole method – partition the tree
Portable parallel programming – OpenMP example

 OpenMP
– Compiler support
– Works on ONE multi-core computer

Compile (with OpenMP support):
  $ ifort -openmp foo.f90
Run with 8 “threads”:
  $ export OMP_NUM_THREADS=8
  $ ./a.out

Typically you will see CPU utilization over 100% (because the
program is utilizing multiple CPUs).
Portable parallel programming – MPI example

 Works on any computer

Compile with the MPI compiler wrapper:
  $ mpicc foo.c
Run on 32 CPUs across 4 physical computers:
  $ mpirun -n 32 -machinefile mach ./foo

'mach' is a file listing the computers the program will run on, e.g.
  n25 slots=8
  n32 slots=8
  n48 slots=8
  n50 slots=8

The exact format of the machine file may vary slightly with each MPI
implementation. More on this in the MPI class...
Part II: OpenMP Tutorial

(thread programming)
What is OpenMP?

 API for shared-memory parallel programming


– compiler directives + functions

 Supported by mainstream compilers – portable code


– Fortran 77/9x/20xx
– C and C++

 Has a long history, standard defined by a consortium


– Version 1.0, released in 1997
– Version 2.5, released in 2005
– Version 3.0, released in 2008

– Version 3.1, released in 2011

 http://www.openmp.org
Elements of Shared-memory Programming
 Fork/join threads
 Synchronization
– barrier
– mutual exclusion (mutex)

 Assign/distribute work to threads


– work share
– task queue

 Run time control


– query/request available resources
– interaction with OS, compiler, etc.

OpenMP Execution Model

 We get speedup by running multiple threads simultaneously.

(Fork/join diagram; source: wikipedia.org)
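A minimal fork/join illustration (not from the original slides; compile with -fopenmp as shown on the next slides):

#include <stdio.h>
#include <omp.h>

int main(void)
{
  printf("before: master thread only\n");
  #pragma omp parallel               // fork a team of threads
  {
    printf("hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }                                  // join: implicit barrier at the end
  printf("after: back to the master thread\n");
  return 0;
}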
saxpy operation (C)

Sequential code:

const int n = 10000;
float x[n], y[n], a;
int i;

for (i=0; i<n; i++) {
  y[i] = a * x[i] + y[i];
}

Compile:  gcc saxpy.c

OpenMP code:

const int n = 10000;
float x[n], y[n], a;
int i;

#pragma omp parallel for
for (i=0; i<n; i++) {
  y[i] = a * x[i] + y[i];
}

Compile:  gcc saxpy.c -fopenmp   (the -fopenmp flag enables OpenMP support)
saxpy operation (Fortran)

Sequential code:

integer, parameter :: n=10000
real :: x(n), y(n), a
integer :: i

do i=1,n
  y(i) = a*x(i) + y(i)
end do

Compile:  gfortran saxpy.f90

OpenMP code:

integer, parameter :: n=10000
real :: x(n), y(n), a
integer :: i

!$omp parallel do
do i=1,n
  y(i) = a*x(i) + y(i)
end do

Compile:  gfortran saxpy.f90 -fopenmp   (the -fopenmp flag enables OpenMP support)
Private vs. shared – threads' point of view

 Loop index “i” is private


– each thread maintains its own “i” value and range
– private variable “i” becomes undefined after “parallel for”
 Everything else is shared
– all threads update y, but at different memory locations
– a,n,x are read-only (ok to share)
const int n = 10000;
float x[n], y[n], a = 0.5;
int i;
#pragma omp parallel for
for (i=0; i<n; i++) {
y[i] = a * x[i] + y[i];
}
Nested loop – outer loop is parallelized

C/C++:

#pragma omp parallel for
for (j=0; j<n; j++) {
  for (i=0; i<n; i++) {
    //… do some work here
  } // i-loop
} // j-loop

Fortran:

!$omp parallel do
do j=1,n
  do i=1,n
    !… do some work here
  end do
end do

 By default, only j (the outer loop index) is private
 But we want both i and j to be private
 Solution (overriding the OpenMP default):

#pragma omp parallel for private(i)
!$omp parallel do private(i)

(j is already private by default)
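Two alternative sketches (not from the original slide): in C99/C++ you can declare loop indices inside the loops so they are automatically private, and OpenMP 3.0 adds collapse to parallelize the combined iteration space:

#pragma omp parallel for
for (int j=0; j<n; j++) {        // j is the parallelized loop index (private)
  for (int i=0; i<n; i++) {      // i declared here is private to each thread
    // … do some work here
  }
}

#pragma omp parallel for collapse(2)   // OpenMP 3.0+: both loops share the work
for (int j=0; j<n; j++)
  for (int i=0; i<n; i++) {
    // … do some work here
  }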
OpenMP General Syntax

 Header file:
#include <omp.h>

 Parallel region:

C/C++:
#pragma omp construct_name [clauses…]
{
  // … do some work here
} // end of parallel region/block

Fortran:
!$omp construct_name [clauses…]
!… do some work here
!$omp end construct_name

 Clauses specify the precise “behavior” of the parallel region
 Environment variables and functions (discussed later)
Parallel Region
 To fork a team of N threads, numbered 0,1,…,N-1
 Probably the most important construct in OpenMP
 Implicit barrier at the end of the region

C/C++:
// sequential code here (master thread)

#pragma omp parallel [clauses]
{
  // parallel computing here
  // …
}

// sequential code here (master thread)

Fortran:
! sequential code here (master thread)

!$omp parallel [clauses]
  ! parallel computing here
  ! …
!$omp end parallel

! sequential code here (master thread)
Clauses for Parallel Construct

C/C++:   #pragma omp parallel clauses, clauses, …
Fortran: !$omp parallel clauses, clauses, …

Some commonly-used clauses:
 shared
 private
 firstprivate
 nowait
 if
 num_threads
 reduction
 default
 copyin
Clause “Private”
 The values of private data are undefined upon entry to and exit
from the specific construct
 To ensure the last value is accessible after the construct,
consider using “lastprivate”
 To pre-initialize private variables with values available prior to the
region, consider using “firstprivate”
 Loop iteration variable is private by default
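A small sketch of firstprivate and lastprivate (illustrative only, not from the original slide):

int t = 10, last_i = -1;
#pragma omp parallel for firstprivate(t) lastprivate(last_i)
for (int i=0; i<n; i++) {
  t += 1;          // each thread's private copy of t starts at 10
  last_i = i;      // value from the sequentially last iteration survives
}
// here the original t is still 10, and last_i == n-1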

Clause “Shared”
 Shared among the team of threads executing the region
 Each thread can read or modify shared variables
 Data corruption is possible when multiple threads attempt to
update the same memory location
– Data race condition
– Memory store operation not necessarily atomic
 Code correctness is user’s responsibility

nowait

C/C++:
#pragma omp for nowait
// for loop here

#pragma omp for nowait
// … some other code

Fortran:
!$omp do
  ! do-loop here
!$omp end do nowait

!$omp do
  ! … some other code

 Useful inside a big parallel region
 Allows threads that finish earlier to proceed without waiting
– More flexibility for scheduling threads (i.e. less
synchronization – may improve performance)
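A concrete sketch (illustrative; assumes arrays a and b of length n with no dependence between the two loops):

#pragma omp parallel shared(a,b,n) private(i)
{
  #pragma omp for nowait       // no barrier: fast threads move on immediately
  for (i=0; i<n; i++)
    a[i] = 2.0 * a[i];

  #pragma omp for              // implicit barrier at the end of this loop
  for (i=0; i<n; i++)
    b[i] = b[i] + 1.0;         // does not read a[], so skipping the barrier is safe
}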
If clause

 if (integer expression)
– determines whether the region should run in parallel
– a useful option when the data set is too small (or too large)

 Example

C/C++:
#pragma omp parallel if (n>100)
{
  //…some stuff
}

Fortran:
!$omp parallel if (n>100)
  !…some stuff
!$omp end parallel
Work Sharing
 We have not yet discussed how work is distributed among
threads...
 Without specifying how to share work, all threads will redundantly
execute all the work (i.e. no speedup!)
 The choice of work-share method is important for performance
 OpenMP work-sharing constructs
– loop (“for” in C/C++; “do” in Fortran)
– sections
– single

Loop Construct (work sharing)

Clauses:
 private
 firstprivate
 lastprivate
 reduction
 ordered
 schedule
 nowait

C/C++:
#pragma omp parallel shared(n,a,b) private(i)
{
  #pragma omp for
  for (i=0; i<n; i++)
    a[i] = i;
  #pragma omp for
  for (i=0; i<n; i++)
    b[i] = 2 * a[i];
}

Fortran:
!$omp parallel shared(n,a,b) private(i)
!$omp do
do i=1,n
  a(i) = i
end do
!$omp end do
...
Parallel Loop (C/C++)

Style 1:

#pragma omp parallel
{
  // …
  #pragma omp for
  for (i=0; i<N; i++)
  {
    …
  } // end of for
} // end of parallel

Style 2:

#pragma omp parallel for
for (i=0; i<N; i++)
{
  …
} // end of for
Parallel Loop (Fortran)

Style 1:

!$omp parallel
! ...
!$omp do
do i=1,n
  ...
end do
!$omp end do
!$omp end parallel

Style 2:

!$omp parallel do
do i=1,n
  ...
end do
!$omp end parallel do
Loop Scheduling

How is the loop divided among the threads?

#pragma omp parallel for
for (i=0; i<1000; i++)
{ foo(i); }

Scheduling types:
– static: each thread is assigned a fixed-size chunk (default)
– dynamic: work is assigned as a thread requests it
– guided: big chunks first, then smaller and smaller chunks
– runtime: use an environment variable to control scheduling
Static scheduling
Dynamic scheduling
Guided scheduling
Loop Schedule Example

#pragma omp parallel for schedule(dynamic,5) \
        shared(n) private(i,j)
for (i=0; i<n; i++) {
  for (j=0; j<i; j++) {
    foo(i,j);
  } // j-loop
} // i-loop

“dynamic” is useful when the amount of work in foo(i,j)
depends on i and j.
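A related sketch (not from the original slides): schedule(runtime) defers the choice to the OMP_SCHEDULE environment variable, which is handy for experimenting without recompiling.

#pragma omp parallel for schedule(runtime)
for (i=0; i<n; i++)
  foo(i);

/* at run time, for example:
   $ export OMP_SCHEDULE="guided,10"
   $ ./a.out                          */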
Sections

 One thread executes one section
– If there are “too many” sections, some threads execute more
than one section (round-robin)
– If there are “too few” sections, some threads are idle
– We don’t know in advance which thread will execute which section
 Each section is executed exactly once

C/C++:
#pragma omp sections
{
  #pragma omp section
  { foo(); }
  #pragma omp section
  { bar(); }
  #pragma omp section
  { beer(); }
} // end of sections

Fortran:
!$omp sections
!$omp section
  call foo()
!$omp end section
!$omp section
  call bar()
!$omp end section
!$omp end sections
Single

A “single” block is executed by one thread
– Useful for initializing shared variables
– We don’t know exactly which thread will execute the block
– Only one thread executes the “single” region; others bypass it.

C/C++:
#pragma omp single
{
  a = 10;
}
#pragma omp for
for (i=0; i<N; i++)
  b[i] = a;

Fortran:
!$omp single
  a = 10
!$omp end single
!$omp do
do i=1,n
  b(i) = a
end do
!$omp end do
Computing the Sum
 We want to compute the sum of a[0], a[1], …, a[N-1]:

C/C++:
sum = 0;
for (i=0; i<N; i++)
  sum += a[i];

Fortran:
sum = 0
do i=1,n
  sum = sum + a(i)
end do

 A “naive” OpenMP implementation (incorrect):

C/C++:
sum = 0;
#pragma omp parallel for
for (i=0; i<N; i++)
  sum += a[i];

Fortran:
sum = 0
!$omp parallel do
do i=1,n
  sum = sum + a(i)
end do
!$omp end parallel do

Race condition!
Critical

C/C++:
#pragma omp critical
{
  //...some stuff
}

Fortran:
!$omp critical
  !...some stuff
!$omp end critical

 One thread at a time
– ALL threads will execute the region eventually
– Note the difference between “single” and “critical”
 Mutual exclusion
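For a single simple update, “atomic” is a lighter-weight alternative to “critical”; a sketch (not from the original slides, and usually still slower than the reduction clause shown later):

#pragma omp parallel for
for (i=0; i<n; i++) {
  #pragma omp atomic
  sum += a[i];       // only this one memory update is protected
}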
Computing the sum
The correct OpenMP way:

sum = 0;
#pragma omp parallel shared(n,a,sum) private(sum_local)
{
  sum_local = 0;
  #pragma omp for
  for (i=0; i<n; i++)
    sum_local += a[i];    // form per-thread local sum

  #pragma omp critical
  {
    sum += sum_local;     // form global sum
  }
}
Reduction operation

The sum example from the previous slide:

sum = 0;
#pragma omp parallel \
        shared(...) private(...)
{
  sum_local = 0;
  #pragma omp for
  for (i=0; i<n; i++)
    sum_local += a[i];
  #pragma omp critical
  {
    sum += sum_local;
  }
}

A cleaner solution:

sum = 0;
#pragma omp parallel for \
        shared(...) private(...) \
        reduction(+:sum)
for (i=0; i<n; i++)
  sum += a[i];

Reduction operations +, *, -, &, |, ^, &&, || are supported.
Barrier

int x = 2;
#pragma omp parallel shared(x)
{
  int tid = omp_get_thread_num();
  if (tid == 0)
    x = 5;
  else
    // some threads may still have x=2 here
    printf("[1] thread %2d: x = %d\n", tid, x);

  #pragma omp barrier   // cache flush + thread synchronization

  // all threads have x=5 here
  printf("[2] thread %2d: x = %d\n", tid, x);
}
Resource Query Functions
 Max number of threads
omp_get_max_threads()
 Number of processors
omp_get_num_procs()
 Number of threads (inside a parallel region)
omp_get_num_threads()
 Get thread ID
omp_get_thread_num()

 See OpenMP specification for more functions.
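Another commonly used runtime function, not listed above, is omp_get_wtime(), which returns wall-clock time in seconds; a minimal timing sketch (reusing the saxpy arrays from earlier):

double t0 = omp_get_wtime();
#pragma omp parallel for
for (i=0; i<n; i++)
  y[i] = a * x[i] + y[i];
double t1 = omp_get_wtime();
printf("elapsed time: %f seconds\n", t1 - t0);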

Query function example:

#include <omp.h>

void bar(float *x, int istart, int ipts)
{
  for (int i=0; i<ipts; i++)
    x[istart+i] = 3.14159;
}

void foo(float *x, int npts)
{
  int tid, ntids, ipts, istart;
  #pragma omp parallel private(tid,ntids,ipts,istart)
  {
    tid = omp_get_thread_num();      // thread ID
    ntids = omp_get_num_threads();   // total number of threads
    ipts = npts / ntids;
    istart = tid * ipts;
    if (tid == ntids-1) ipts = npts - istart;
    bar(x, istart, ipts);            // each thread calls bar
  }
}

int main()
{
  float *array = new float[10000];
  foo(array, 10000);
}
Control the Number of Threads
Listed from highest to lowest priority:
 Parallel region clause
#pragma omp parallel num_threads(integer)
 Run-time function
omp_set_num_threads()
 Environment variable
export OMP_NUM_THREADS=n

 High-priority settings override low-priority ones.
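A sketch showing the three levels together (illustrative; the thread counts are arbitrary):

// environment variable (lowest priority):  export OMP_NUM_THREADS=8

omp_set_num_threads(4);                // run-time call overrides the environment variable

#pragma omp parallel num_threads(2)    // clause has the highest priority: 2 threads here
{
  printf("threads in this region: %d\n", omp_get_num_threads());
}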
Which OpenMP version do I have?

GNU compiler on my desktop:
$ g++ --version
g++ (Ubuntu/Linaro 4.4.4-14ubuntu5) 4.4.5
$ g++ version.cpp -fopenmp
$ a.out
version : 200805

Intel compiler on Hoffman2:
$ icpc --version
icpc (ICC) 11.1 20090630
$ icpc version.cpp -openmp
$ a.out
version : 200805

version.cpp:
#include <iostream>
using namespace std;
int main()
{
  cout << "version : " << _OPENMP << endl;
}

Version  Date
3.0      May 2008
2.5      May 2005
2.0      March 2002

http://openmp.org
OpenMP Environment Variables
 OMP_SCHEDULE
– Loop scheduling policy
 OMP_NUM_THREADS
– number of threads
 OMP_STACKSIZE
– size of the stack for threads created by the OpenMP runtime

 See OpenMP specification for many others.
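For example, in a bash shell (the values shown are arbitrary):

$ export OMP_NUM_THREADS=8
$ export OMP_SCHEDULE="dynamic,100"
$ export OMP_STACKSIZE=64M
$ ./a.out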

Parallel Region in Subroutines
 Main program is “sequential”
 Subroutines/functions are parallelized

int main()
{
  foo();
}

void foo()
{
  #pragma omp parallel
  {
    // some fancy stuff here
  }
}
Parallel Region in “main” Program
 The parallel region is in the main program
 Subroutines/functions called inside the region contain only
sequential code (each thread calls them)

int main()
{
  #pragma omp parallel
  {
    i = some_index;
    foo(i);
  }
}

void foo(int i)
{
  // sequential code
}
Nested Parallel Regions
 Need available hardware resources (e.g. CPUs) to gain
performance

int main()
{
  #pragma omp parallel
  {
    i = some_index;
    foo(i);
  }
}

void foo(int i)
{
  #pragma omp parallel
  {
    // some fancy stuff here
  }
}

Each thread from main forks a team of threads.
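Note that nested parallelism is typically disabled by default; a sketch of enabling it (OpenMP 3.x style, as used elsewhere in these slides):

omp_set_nested(1);                      // or: export OMP_NESTED=true
#pragma omp parallel num_threads(2)
{
  #pragma omp parallel num_threads(4)   // each outer thread forks 4 inner threads
  {
    // up to 2 x 4 = 8 threads in total (if resources allow)
  }
}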
Conditional Compilation

Check _OPENMP to see if OpenMP is supported by the compiler:

#include <omp.h>
#include <iostream>
using namespace std;
int main()
{
#ifdef _OPENMP
  cout << "Have OpenMP support\n";
#else
  cout << "No OpenMP support\n";
#endif
  return 0;
}

$ g++ check_openmp.cpp -fopenmp
$ a.out
Have OpenMP support

$ g++ check_openmp.cpp
$ a.out
No OpenMP support
Single Source Code
 Use _OPENMP to separate sequential and parallel code within
the same source file
 Redefine runtime library functions to avoid linking errors

#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_max_threads() 1
#define omp_get_thread_num() 0
#endif

To simulate a single-thread run

Good Things about OpenMP
 Simplicity
– In many cases, “the right way” to do it is clean and simple
 Incremental parallelization possible
– Can incrementally parallelize a sequential code, one block at
a time
– Great for debugging & validation
 Leave thread management to the compiler
 It is directly supported by the compiler
– No need to install additional libraries (unlike MPI)

Other things about OpenMP
 Data race condition can be hard to detect/debug
– The code may run correctly with a small number of threads!
– True for all thread programming, not only OpenMP
– Some tools may help
 It may take some work to get parallel performance right
– In some cases, the performance is limited by memory
bandwidth (i.e. a hardware issue)

Other types of parallel programming
 MPI
– works on both shared- and distributed-memory systems
– relatively low level (i.e. lots of details)
– in the form of a library
 PGAS languages
– Partitioned Global Address Space
– native compiler support for parallelization
– UPC, Co-array Fortran and several others

Summary
 Identify compute-intensive, data parallel parts of your code
 Use OpenMP constructs to parallelize your code
– Spawn threads (parallel regions)
– In parallel regions, distinguish shared variables from the
private ones
– Assign work to individual threads
• loop, schedule, etc.
– Watch out for variable initialization before/after the parallel region
– Single thread required? (single/critical)
 Experiment and improve performance

Thank you.
