
MPSoC Architectures
OpenMP

Alberto Bosio, Associate Professor – UM
Microelectronics Department
[email protected]

Introduction to OpenMP
l What is OpenMP?
l Open specification for Multi-Processing
l “Standard” API for defining multi-threaded shared-memory programs
– www.openmp.org – talks, examples, forums, etc.

l High-level API
l Preprocessor (compiler) directives (~80%)
l Library calls (~19%)
l Environment variables (~1%)


A Programmer’s View of OpenMP


l OpenMP is a portable, threaded, shared-memory programming specification with “light” syntax
l Exact behavior depends on the OpenMP implementation!
l Requires compiler support (C or Fortran)

l OpenMP will:
l Allow a programmer to separate a program into serial regions and parallel regions, rather than as T concurrently-executing threads
l Hide stack management
l Provide synchronization constructs

l OpenMP will not:
l Parallelize (or detect!) dependencies (see the example below)
l Guarantee speedup
l Provide freedom from data races
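
For instance, a minimal illustration (the array and function below are hypothetical) of a loop-carried dependency that OpenMP cannot detect:

void prefix_fill(int *a, int n) {
  /* Loop-carried dependency: iteration i reads the value written by
     iteration i-1, so the iterations cannot run concurrently.
     OpenMP would happily accept "#pragma omp parallel for" here,
     but the result would be wrong. */
  for (int i = 1; i < n; i++)
    a[i] = a[i-1] + 1;
}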

Outline
l Introduction
l Motivating example
l Parallel Programming is Hard
l OpenMP Programming Model
l Easier than PThreads
l Microbenchmark Performance Comparison
l vs. PThreads
l Discussion
l specOMP


Current Parallel Programming


1. Start with a parallel algorithm
2. Implement, keeping in mind:
• Data races
• Synchronization
• Threading Syntax
3. Test & Debug
4. Debug
5. Debug

Motivation – Threading Library


#include <stdio.h>
#include <pthread.h>

void* SayHello(void *foo) {
  printf( "Hello, world!\n" );
  return NULL;
}

int main() {
  pthread_attr_t attr;
  pthread_t threads[16];
  int tn;
  pthread_attr_init(&attr);
  pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
  for(tn=0; tn<16; tn++) {
    pthread_create(&threads[tn], &attr, SayHello, NULL);
  }
  for(tn=0; tn<16; tn++) {
    pthread_join(threads[tn], NULL);
  }
  return 0;
}


Motivation

• Thread libraries are hard to use
– P-Threads/Solaris threads have many library calls for initialization, synchronization, thread creation, condition variables, etc.
– Programmer must code with multiple threads in mind

• Synchronization between threads introduces a new dimension of program correctness

Motivation

l Wouldn’t it be nice to write serial programs and somehow parallelize them “automatically”?

l OpenMP can parallelize many serial programs with relatively few annotations that specify parallelism and independence

l OpenMP is a small API that hides cumbersome threading calls with simpler directives


Better Parallel Programming


1. Start with some algorithm
• Embarrassing parallelism is helpful, but not
necessary
2. Implement serially, ignoring:
• Data Races
• Synchronization
• Threading Syntax
3. Test and Debug
4. Automatically (magically?) parallelize
• Expect linear speedup

Motivation – OpenMP

int main() {

// Do this part in parallel

printf( "Hello, World!\n" );

return 0;
}


Motivation – OpenMP

#include <stdio.h>
#include <omp.h>

int main() {
  omp_set_num_threads(16);

  // Do this part in parallel
  #pragma omp parallel
  {
    printf( "Hello, World!\n" );
  }

  return 0;
}

OpenMP Parallel Programming


1. Start with a parallelizable algorithm
• Embarrassing parallelism is good, loop-level
parallelism is necessary
2. Implement serially, mostly ignoring:
• Data Races
• Synchronization
• Threading Syntax
3. Test and Debug
4. Annotate the code with parallelization (and
synchronization) directives
• Hope for linear speedup
5. Test and Debug


Programming Model - Threading


l Serial regions by default, annotate to create parallel regions
l Generic parallel regions
l Parallelized loops
l Sectioned parallel regions

l Thread-like Fork/Join model
l Arbitrary number of logical thread creation/destruction events

(Diagram: the master thread forks a team of threads, which later join back.)

Programming Model - Threading


int main() {
  // serial region
  printf("Hello…");

  // parallel region (fork)
  #pragma omp parallel
  {
    printf("World");
  }

  // serial again (join)
  printf("!");
}

Possible output (4 threads): Hello…WorldWorldWorldWorld!


Programming Model – Nested Threading

• Fork/Join can be nested
– Nesting complication handled “automagically” at compile-time
– Independent of the number of threads actually running

(Diagram: an outer fork/join region containing a nested fork/join.)

Programming Model – Thread Identification

Master Thread
• Thread with ID = 0
• Only thread that exists in sequential regions
• Depending on implementation, may have special purpose inside parallel regions
• Some special directives affect only the master thread (like master)

(Diagram: thread 0 forks into threads 0–7, which join back into thread 0.)


Example

#include <stdio.h>
#include <omp.h>

int main() {
  int tid, nthreads;

  omp_set_num_threads(16);

  // Do this part in parallel
  #pragma omp parallel private(nthreads, tid)
  {
    printf( "Hello, World!\n" );
    /* Obtain and print thread id */
    tid = omp_get_thread_num();
    if (tid == 0)
    {
      nthreads = omp_get_num_threads();
      printf("I'm the master, Number of threads = %d\n", nthreads);
    }
  } /* end of parallel region */

  return 0;
}

Programming Model – Data/Control Parallelism

l Data parallelism
l Threads perform similar functions, guided by thread identifier

l Control parallelism
l Threads perform differing functions
- One thread for I/O, one for computation, etc.


Programming model: Summary

Memory Model
l Shared-memory communication
l Threads cooperate by accessing shared variables
l The sharing is defined syntactically
l Any variable that is seen by two or more threads is shared
l Any variable that is seen by one thread only is private
l Race conditions are possible
l Use synchronization to protect against conflicts
l Change how data is stored to minimize the synchronization


Structure

Programming Model – Concurrent Loops

l OpenMP easily parallelizes loops
l Requirement: no data dependencies between iterations!

l Preprocessor calculates loop bounds for each thread directly from the serial source

#pragma omp parallel for
for( i=0; i < 25; i++ ) {
  printf("Foo");
}


The problem
l A plain parallel region executes the same code as many times as there are threads
l How many threads do we have? Set with omp_set_num_threads(n)
l What is the use of repeating the same work n times in parallel? We can use omp_get_thread_num() to distribute the work between threads.
l In the example, D is shared between the threads; i and sum are private
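
A minimal sketch of that idea, assuming a shared array D of size N (both names are illustrative) and a per-thread partial sum combined at the end:

#include <stdio.h>
#include <omp.h>

#define N 1000

int main() {
  double D[N];                         /* shared: visible to all threads */
  for (int i = 0; i < N; i++) D[i] = i;

  double total = 0.0;
  omp_set_num_threads(4);

  #pragma omp parallel
  {
    int tid = omp_get_thread_num();    /* private: this thread's ID */
    int nth = omp_get_num_threads();   /* team size */
    double sum = 0.0;                  /* private partial sum */

    /* distribute the work: thread tid handles iterations i with i % nth == tid */
    for (int i = tid; i < N; i += nth)
      sum += D[i];

    #pragma omp critical               /* combine partial sums safely (critical is covered later) */
    total += sum;
  }

  printf("total = %f\n", total);
  return 0;
}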


Programming Model – Concurrent Loops

l Load balancing
l If all the iterations execute at the same speed, the processors are used optimally. If some iterations are faster than others, some processors may become idle, reducing the speedup.
l We don't always know the distribution of work; we may need to re-distribute it dynamically.
l Granularity
l Thread creation and synchronization take time. Assigning work to threads at per-iteration resolution may take more time than the execution itself! We need to coalesce the work into coarse chunks to overcome the threading overhead.
l Trade-off between load balancing and granularity!


Controlling Granularity
l #pragma omp parallel if (expression)
l Can be used to disable parallelization in some cases (when the input is determined to be too small to be beneficially multithreaded)
l #pragma omp parallel num_threads (expression)
l Controls the number of threads used for this parallel region
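
A small sketch combining both clauses (the threshold of 1000 elements and the team of 4 threads are arbitrary choices for illustration):

#include <omp.h>

void scale(float *a, int n, float f) {
  /* Parallelize only when the array is large enough to be worth it,
     and cap the team at 4 threads for this region. */
  #pragma omp parallel for if(n > 1000) num_threads(4)
  for (int i = 0; i < n; i++)
    a[i] *= f;
}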

Programming Model – Loop Scheduling

• The schedule clause determines how loop iterations are divided among the thread team
– static([chunk]) divides iterations statically between threads
  - Each thread receives [chunk] iterations, rounding as necessary to account for all iterations
  - Default [chunk] is ceil( # iterations / # threads )
– dynamic([chunk]) allocates [chunk] iterations per thread, allocating an additional [chunk] iterations when a thread finishes
  - Forms a logical work queue consisting of all loop iterations
  - Default [chunk] is 1
– guided([chunk]) allocates dynamically, but [chunk] is exponentially reduced with each allocation
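
A sketch showing the three schedule kinds on the same loop (the chunk size of 8 and the work function are illustrative assumptions):

#include <math.h>
#include <omp.h>

#define N 1000

static double result[N];

static void work(int i) {              /* stand-in for per-iteration work */
  result[i] = sin((double)i);
}

void run(void) {
  int i;

  /* static: iterations are split into chunks of 8 and handed out round-robin up front */
  #pragma omp parallel for schedule(static, 8)
  for (i = 0; i < N; i++) work(i);

  /* dynamic: each thread grabs 8 iterations at a time from a logical work queue */
  #pragma omp parallel for schedule(dynamic, 8)
  for (i = 0; i < N; i++) work(i);

  /* guided: chunks start large and shrink (down to 8) as iterations run out */
  #pragma omp parallel for schedule(guided, 8)
  for (i = 0; i < N; i++) work(i);
}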



Example

l The function TestForPrime (usually) takes little time, but it can take long if the number is indeed a prime
l Solution: use dynamic scheduling, but with chunks
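
A minimal sketch of that solution; the TestForPrime implementation below is only an assumption, since the slide does not show it:

#include <omp.h>

static int TestForPrime(int n) {       /* assumed trial-division predicate */
  for (int d = 3; d * d <= n; d += 2)
    if (n % d == 0) return 0;
  return n > 2;
}

int CountPrimes(int limit) {
  int count = 0;

  /* dynamic scheduling with chunks of 100: threads that draw "cheap"
     composite numbers quickly come back for more work, while the chunk
     size keeps the scheduling overhead low. (The reduction clause used
     to combine the per-thread counts is covered later.) */
  #pragma omp parallel for schedule(dynamic, 100) reduction(+:count)
  for (int i = 3; i <= limit; i += 2)
    if (TestForPrime(i)) count++;

  return count;
}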


Work sharing: Sections

Sections
l The SECTIONS directive is a non-iterative work-sharing construct. It specifies that the enclosed section(s) of code are to be divided among the threads in the team.
l Independent SECTION directives are nested within a SECTIONS directive.
l Each SECTION is executed once by a thread in the team. Different sections may be executed by different threads. It is possible for a thread to execute more than one section if it is quick enough and the implementation permits it.


Example
#include <omp.h>
#define N 1000

int main ()
{
int i;
float a[N], b[N], c[N], d[N];
/* Some initializations */
for (i=0; i < N; i++) {
a[i] = i * 1.5;
b[i] = i + 22.35;
}

Example
#pragma omp parallel shared(a,b,c,d) private(i)
{
#pragma omp sections
{
#pragma omp section
for (i=0; i < N; i++)
c[i] = a[i] + b[i];
#pragma omp section
for (i=0; i < N; i++)
d[i] = a[i] * b[i];
} /* end of sections */
} /* end of parallel section */
}


Data Sharing
l Shared Memory programming model
l Most variables are shared by default
l We can define a variable as private

// Do this part in parallel
#pragma omp parallel private(nthreads, tid)
{
  printf( "Hello, World!\n" );
  if (tid == 0)
  {
    ...
  }
}

Programming Model – Data Sharing

l Parallel programs often employ two types of data
l Shared data, visible to all threads, similarly named
l Private data, visible to a single thread (often stack-allocated)

• PThreads:
– Global-scoped variables are shared
– Stack-allocated variables are private

• OpenMP:
– shared variables are shared
– private variables are private

int bigdata[1024];

void* foo(void* bar) {
  int tid;

  #pragma omp parallel \
      shared ( bigdata ) \
      private ( tid )
  {
    /* Calc. here */
  }
}


Programming Model – Data Sharing

l private:
l A copy of the variable is created for each thread.
l No connection between the original variable and the private copies
l Can achieve the same using variables declared inside { }

int i;
#pragma omp parallel for private(i)
for (i=0; i<n; i++) { ... }

Programming Model – Data Sharing

l firstprivate:
l Same as private, but the initial value is copied from the original (master) copy
l lastprivate:
l Same as private, but the value from the sequentially last iteration is copied back to the original copy
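
A brief sketch of both clauses on one loop (variable names are illustrative):

#include <stdio.h>
#include <omp.h>

int main() {
  int offset = 100;    /* read by every iteration */
  int last = -1;       /* will receive the value from the last iteration */
  int i;

  /* firstprivate(offset): each thread's private copy starts at 100
     lastprivate(last):    after the loop, last holds the value written
                           by the sequentially last iteration (i = 9)  */
  #pragma omp parallel for firstprivate(offset) lastprivate(last)
  for (i = 0; i < 10; i++) {
    last = offset + i;
  }

  printf("last = %d\n", last);   /* prints 109 */
  return 0;
}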


Threadprivate
l Similar to private, but defined per variable
l Declared with a directive immediately after the variable definition
l Must be visible in all translation units; persistent between parallel regions
l Can be initialized from the master's copy with the copyin clause
l More efficient than private, but it is a global variable!
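
A short sketch of threadprivate with copyin (the counter variable is illustrative):

#include <stdio.h>
#include <omp.h>

int counter = 10;                      /* file-scope (global) variable */
#pragma omp threadprivate(counter)     /* each thread keeps its own persistent copy */

int main() {
  omp_set_dynamic(0);                  /* keep the team size fixed so the copies persist */
  counter = 42;                        /* master's copy */

  /* copyin: every thread's copy is initialized from the master's value */
  #pragma omp parallel copyin(counter)
  {
    counter += omp_get_thread_num();   /* each thread updates its own copy */
  }

  /* the per-thread copies are still there in the next parallel region */
  #pragma omp parallel
  {
    printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
  }
  return 0;
}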

Synchronization
l What should the result be (assuming 2 threads)?

X = 0;
#pragma omp parallel
X = X + 1;


Synchronization
l 2 is the expected answer, but the result can be 1 with unfortunate interleaving
l OpenMP assumes that the programmer knows what he is doing
l Regions of code that are marked to run in parallel are assumed to be independent. If access collisions are possible, it is the programmer's responsibility to insert protection.

Synchronization
l Provides many of the existing mechanisms for shared-memory programming
l OpenMP synchronization constructs:
l nowait (turns synchronization off!; see the sketch below)
l Single/Master execution
l Critical sections, atomic updates
l Ordered
l Barriers
l Flush (memory subsystem synchronization)
l Reduction (special case)
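
A minimal sketch of the nowait clause, which removes the implicit barrier at the end of a work-sharing construct (the arrays are illustrative):

#include <omp.h>

void process(float *a, float *b, int n) {
  #pragma omp parallel
  {
    /* nowait: threads that finish this loop do not wait for the others... */
    #pragma omp for nowait
    for (int i = 0; i < n; i++)
      a[i] = a[i] * 2.0f;

    /* ...they move straight on to this independent loop */
    #pragma omp for
    for (int i = 0; i < n; i++)
      b[i] = b[i] + 1.0f;
  }
}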


Single/Master
l #pragma omp single
l Only one of the threads will execute the following block of code
l The rest will wait for it to complete
l Good for non-thread-safe regions of code (such as I/O)
l Must be used inside a parallel region
l Applicable to parallel for sections

Single/Master
l #pragma omp master
l The following block will be executed by the master thread
l No synchronization involved
l Applicable only to parallel sections

#pragma omp parallel
{
  do_preprocessing();
  #pragma omp single
  read_input();
  #pragma omp master
  notify_input_consumed();
  do_processing();
}


Critical Sections
l #pragma omp critical [name]
l Standard critical section functionality
l Critical sections are global in the program
l Can be used to protect a single resource accessed in different functions
l Critical sections are identified by their name
l All unnamed critical sections are mutually exclusive throughout the program
l All critical sections having the same name are mutually exclusive with each other

Critical Sections
int x=0;
#pragma omp parallel shared(x)
{
#pragma omp critical
x++;
}
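
The example above uses an unnamed critical section. A named critical section lets different functions protect the same resource (the name update_x and the functions are illustrative):

#include <omp.h>

int x = 0;                             /* shared resource */

void increment(void) {
  #pragma omp critical (update_x)      /* same name as below... */
  x++;
}

void decrement(void) {
  #pragma omp critical (update_x)      /* ...so these two never run concurrently */
  x--;
}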


Ordered
l #pragma omp ordered statement
l Executes the statement in the sequential order of iterations
l Example:
#pragma omp parallel for ordered
for (j=0; j<N; j++) {
  int result = j*j;
  #pragma omp ordered
  printf("computation(%d) = %d\n", j, result);
}

Barrier synchronization
l #pragma omp barrier
l Performs a barrier synchronization between all the threads in a team at the given point.
l Example:
#pragma omp parallel
{
  int result = heavy_computation_part1();
  #pragma omp atomic
  sum += result;
  #pragma omp barrier
  heavy_computation_part2(sum);
}


Explicit Locking
l Lock variables can be passed around (unlike critical sections!)
l Can be used to implement more involved synchronization constructs
l Functions:
l omp_init_lock(), omp_destroy_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock()
l The usual semantics
l Use #pragma omp flush to synchronize memory
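
A short sketch of the lock API protecting a shared counter (the counter itself is illustrative):

#include <stdio.h>
#include <omp.h>

int main() {
  omp_lock_t lock;
  int shared_count = 0;

  omp_init_lock(&lock);                /* create the lock */

  #pragma omp parallel
  {
    omp_set_lock(&lock);               /* blocks until the lock is acquired */
    shared_count++;                    /* protected update */
    omp_unset_lock(&lock);             /* release */
  }

  omp_destroy_lock(&lock);             /* free the lock */
  printf("count = %d\n", shared_count);
  return 0;
}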

Consistency Violation?

Reduction
for (j=0; j<N; j++) {
  sum = sum + a[j]*b[j];
}
l How to parallelize this code?
l sum is not private, but accessing it atomically is too expensive
l Have a private copy of sum in each thread, then add them up
l Use the reduction clause!
l #pragma omp parallel for reduction(+: sum)
l An operator must be specified: +, -, *, ...
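
Putting the clause on the loop above gives a complete sketch (the array contents are illustrative):

#include <stdio.h>
#include <omp.h>

#define N 1000

int main() {
  float a[N], b[N], sum = 0.0f;

  for (int j = 0; j < N; j++) { a[j] = 1.0f; b[j] = 2.0f; }

  /* Each thread accumulates into a private copy of sum (initialized to 0);
     the private copies are combined with + at the end of the loop. */
  #pragma omp parallel for reduction(+: sum)
  for (int j = 0; j < N; j++)
    sum = sum + a[j] * b[j];

  printf("sum = %f\n", sum);   /* 2000.0 */
  return 0;
}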


Synchronization Overhead
l Lost time waiting for locks
l Prefer structures that are as lock-free as possible!

Summary
l OpenMP is a compiler-based technique to create concurrent code from (mostly) serial code
l OpenMP can enable (easy) parallelization of loop-based code
l Lightweight syntactic language extensions

l OpenMP performs comparably to manually-coded threading
l Scalable
l Portable

l Not a silver bullet for all applications


More Information

• www.openmp.org
l OpenMP official site

• www.llnl.gov/computing/tutorials/openMP/
l A handy OpenMP tutorial

• www.nersc.gov/nusers/help/tutorials/openmp/
l Another OpenMP tutorial and reference

Backup Slides
Syntax, etc


OpenMP Syntax

l General syntax for OpenMP directives:

#pragma omp directive [clause…] CR

l Directive specifies the type of OpenMP operation
l Parallelization
l Synchronization
l Etc.
l Clauses (optional) modify the semantics of the directive

OpenMP Syntax

l PARALLEL syntax
#pragma omp parallel [clause…] CR
structured_block

Ex:
#pragma omp parallel
{
  printf("Hello!\n");
} // implicit barrier

Output (T=4):
Hello!
Hello!
Hello!
Hello!


OpenMP Syntax

l DO/for syntax (DO – Fortran, for – C)
#pragma omp for [clause…] CR
for_loop

Ex:
#pragma omp parallel
{
  #pragma omp for private(i) shared(x) \
      schedule(static,x/N)
  for(i=0;i<x;i++) printf("Hello!\n");
} // implicit barrier
Note: must reside inside a parallel section

OpenMP Syntax

More on clauses
• private() – A variable in the private list is private to each thread
• shared() – Variables in the shared list are visible to all threads
l Implies no synchronization, or even consistency!
• schedule() – Determines how iterations will be divided among threads
– schedule(static, C) – Each thread will be given C iterations
  - Usually T*C = total number of iterations
– schedule(dynamic) – Each thread will be given additional iterations as needed
  - Often less efficient than a considered static allocation
• nowait – Removes the implicit barrier from the end of the block


OpenMP Syntax

l PARALLEL FOR (combines parallel and for)
#pragma omp parallel for [clause…] CR
for_loop

Ex:
#pragma omp parallel for shared(x) \
    private(i) \
    schedule(dynamic)
for(i=0;i<x;i++) {
  printf("Hello!\n");
}
Example: AddMatrix

Files:
(Makefile)
addmatrix.c // omp-parallelized
matrixmain.c // non-omp
printmatrix.c // non-omp


OpenMP Syntax

l ATOMIC syntax
#pragma omp atomic CR
simple_statement

Ex:
#pragma omp parallel shared(x)
{
  #pragma omp atomic
  x++;
} // implicit barrier

OpenMP Syntax

• CRITICAL syntax
#pragma omp critical CR
structured_block

Ex:
#pragma omp parallel shared(x)
{
  #pragma omp critical
  {
    // only one thread in here
  }
} // implicit barrier


OpenMP Syntax

ATOMIC vs. CRITICAL

l Use ATOMIC for “simple statements”
l Can have lower overhead than CRITICAL if HW atomics are leveraged (implementation-dependent)

l Use CRITICAL for larger expressions
l May involve an unseen implicit lock

OpenMP Syntax

l MASTER – only thread 0 executes a block
#pragma omp master CR
structured_block

l SINGLE – only one thread executes a block
#pragma omp single CR
structured_block

l No implied synchronization


OpenMP Syntax
l BARRIER
#pragma omp barrier CR

l Locks
l Locks are provided through omp.h library calls
–omp_init_lock()
–omp_destroy_lock()
–omp_test_lock()
–omp_set_lock()
–omp_unset_lock()

OpenMP Syntax

l FLUSH
#pragma omp flush CR

l Guarantees that threads' views of memory are consistent
l Why? Recall how OpenMP directives work…
l Code is generated from directives at compile time
  - Variables are not always declared as volatile
  - Using variables from registers instead of memory can seem like a consistency violation
l Synchronization often has an implicit flush
  - ATOMIC, CRITICAL
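
A classic sketch of pairwise synchronization with flush (the variable names are illustrative; a strictly conforming modern version would also make the flag accesses atomic):

#include <stdio.h>
#include <omp.h>

int data = 0;
int flag = 0;

int main() {
  #pragma omp parallel sections num_threads(2)
  {
    #pragma omp section
    {                                  /* producer */
      data = 42;
      #pragma omp flush(data)          /* make data visible before the flag */
      flag = 1;
      #pragma omp flush(flag)
    }
    #pragma omp section
    {                                  /* consumer */
      while (1) {
        #pragma omp flush(flag)        /* force flag to be re-read from memory */
        if (flag) break;
      }
      #pragma omp flush(data)          /* make sure data is re-read too */
      printf("data = %d\n", data);
    }
  }
  return 0;
}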


OpenMP Syntax

l Functions
omp_set_num_threads()
omp_get_num_threads()
omp_get_max_threads()
omp_get_num_procs()
omp_get_thread_num()
omp_set_dynamic()
omp_[init|destroy|test|set|unset]_lock()

Functions for the environment


l omp_set_dynamic(int)
l omp_set_num_threads(int)
l omp_get_num_threads()
l omp_get_num_procs()
l omp_get_thread_num()
l omp_set_nested(int)
l omp_in_parallel()
l omp_get_wtime()
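
A small sketch exercising several of these calls (the printed messages are illustrative):

#include <stdio.h>
#include <omp.h>

int main() {
  omp_set_dynamic(0);                          /* do not let the runtime shrink the team */
  omp_set_num_threads(omp_get_num_procs());    /* one thread per processor */

  double t0 = omp_get_wtime();
  #pragma omp parallel
  {
    if (omp_in_parallel() && omp_get_thread_num() == 0)
      printf("running with %d threads\n", omp_get_num_threads());
  }
  double t1 = omp_get_wtime();

  printf("region took %f seconds\n", t1 - t0);
  return 0;
}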
