
Introduction to OpenMP

COMP1680
OpenMP
• OpenMP stands for
• Short version: Open Multi-Processing
• Long version: Open specifications for Multi-Processing via collaborative work between
interested parties from the hardware and software industry, government and academia.
• An Application Program Interface (API) that may be used to explicitly direct
multi-threaded, shared memory parallelism
• Comprised of three primary API components:
• Compiler Directives
• Runtime Library Routines
• Environment Variables

OpenMP
• Portable:
• The API is specified for C/C++ and Fortran
• Most major desktop platforms have been implemented including Unix/Linux
platforms and Windows

• Standardized:
• Jointly defined and endorsed by a group of major computer hardware and
software vendors
• Was expected to become an ANSI standard, but never did.

OpenMP is not:
• Meant for distributed memory parallel systems (by itself)
• Necessarily implemented identically by all vendors
• Guaranteed to make the most efficient use of shared memory
• Required to check for data dependencies, data conflicts, race conditions, or deadlocks
• Required to check for code sequences that cause a program to be classified as non-
conforming
• Meant to cover compiler-generated automatic parallelization and directives to the
compiler to assist such parallelization
• Designed to guarantee that input or output to the same file is synchronous when
executed in parallel. The programmer is responsible for synchronizing input and
output.

Goals of OpenMP
Standardization:
• Provide a standard among a variety of shared memory architectures/platforms
Lean and Mean:
• Establish a simple and limited set of directives for programming shared memory
machines. Significant parallelism can be implemented by using just 3 or 4 directives

Goals of OpenMP
Ease of Use:
• Provide capability to incrementally parallelize a serial program, unlike message-
passing libraries which typically require an all or nothing approach
• Provide the capability to implement both coarse-grain and fine-grain parallelism
Portability:
• Supports Fortran (77, 90, and 95), C, and C++
• Public forum for API and membership

Programming Model
Shared Memory, Thread Based Parallelism:
• OpenMP is based upon the existence of multiple threads in the shared memory
programming paradigm.
• A shared memory process consists of multiple threads

Explicit Parallelism:
• OpenMP is an explicit (not automatic) programming model, offering the programmer
full control over parallelization

Programming Model
Fork - Join Model:
OpenMP uses the fork-join model of parallel execution

All OpenMP programs begin as a single process called the master thread
The master thread executes sequentially until the first parallel region construct is
encountered.

Programming Model
Fork - Join Model continued:

• FORK - the master thread then creates a team of parallel threads


• The statements in the program that are enclosed by the parallel region construct are
then executed in parallel among the various team threads
• JOIN - When the team threads complete the statements in the parallel region
construct, they synchronize and terminate, leaving only the master thread
Programming Model
Compiler Directive Based:

• Most OpenMP parallelism is specified through the use of compiler directives which are embedded in
C/C++ or Fortran source code

Nested Parallelism Support:

• The API provides for the placement of parallel constructs inside of other parallel constructs.

• Implementations may or may not support this feature

Dynamic Threads:

• The API provides for dynamically altering the number of threads which may be used to execute
different parallel regions

• Implementations may or may not support this feature


Programming Model
I/O:

• OpenMP specifies nothing about parallel I/O.

• This is particularly important if multiple threads attempt to write/read from the same file

• If every thread conducts I/O to a different file, the issues are not as significant

• It is entirely up to the programmer to ensure that I/O is conducted correctly within the
context of a multi-threaded program
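
• One common way to sidestep the problem is to give each thread its own file. A minimal sketch of that pattern (the filename scheme and message below are purely illustrative, not part of the OpenMP API):

#include <omp.h>
#include <stdio.h>

int main() {
  #pragma omp parallel
  {
    char fname[64];
    /* Build a per-thread filename so no two threads share a file handle */
    sprintf(fname, "out_thread_%d.txt", omp_get_thread_num());
    FILE *fp = fopen(fname, "w");
    if (fp != NULL) {
      fprintf(fp, "Hello from thread %d\n", omp_get_thread_num());
      fclose(fp);
    }
  }
  return 0;
}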
Programming Model
Memory Model: FLUSH Often?

• OpenMP provides a "relaxed-consistency" and "temporary" view of thread memory (in
their words)

• Threads can "cache" their data and are not required to maintain exact consistency
with real memory all of the time

• When it is critical that all threads view a shared variable identically, the programmer
is responsible for ensuring that the variable is FLUSHed by all threads as needed
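
• A minimal sketch of the flag-based hand-off this implies, assuming at least two threads are available (newer OpenMP revisions would typically use atomic operations for this, but flush illustrates the idea):

#include <omp.h>
#include <stdio.h>

int main() {
  int data = 0, flag = 0;              /* both shared by the team */
  #pragma omp parallel num_threads(2)
  {
    if (omp_get_thread_num() == 0) {
      data = 42;                       /* produce a value                   */
      #pragma omp flush(data, flag)    /* make data visible before the flag */
      flag = 1;
      #pragma omp flush(flag)          /* publish the flag                  */
    } else {
      while (1) {                      /* spin until the flag is seen */
        #pragma omp flush(flag)
        if (flag) break;
      }
      #pragma omp flush(data, flag)    /* refresh our view of data */
      printf("Consumer read %d\n", data);
    }
  }
  return 0;
}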
Example OpenMP Code Structure
C/C++

#include <omp.h>

main () {

  int var1, var2, var3;

  Serial code
  .
  .

  Beginning of parallel section. Fork a team of threads.
  Specify variable scoping

  #pragma omp parallel private(var1, var2) shared(var3)
  {

    Parallel section executed by all threads
    .
    .

    All threads join master thread and disband
  }

  Resume serial code
  .
  .

}
Compiling an OpenMP Program
• To activate the OpenMP extensions for C/C++, the compile-time flag -fopenmp must
be specified for gcc

• This enables the OpenMP directive #pragma omp

• The flag also arranges for automatic linking of the OpenMP runtime library
i.e. it resolves runtime functions such as omp_get_thread_num(), which returns the
ID (number) of the calling thread
Compiling an OpenMP Program
• To generate an executable from a c program file, for example a file called
omp_hello.c we use the following command in the shell

> gcc -fopenmp omp_hello.c -o omp_hello

• This will compile, link in the OpenMP library and generate a binary executable called
omp_hello
Running an OpenMP Program
• To execute an OpenMP program simply type in the executable name in the shell
command line, for example

./omp_hello
OpenMP Directive Format
#pragma omp directive-name [clause, ...] newline
#pragma omp        Required for all OpenMP C/C++ directives.
directive-name     Required. A valid OpenMP directive. Must appear after the pragma and before any clauses.
[clause, ...]      Optional. Clauses can be in any order, and repeated as necessary unless otherwise restricted.
newline            Required. Precedes the structured block which is enclosed by this directive.

Example:
#pragma omp parallel default(shared) private(beta,pi)
(be careful when commenting these out)
OpenMP Directive Rules
• Case sensitive

• Directives follow conventions of the C/C++ standards for compiler directives

• Only one directive-name may be specified per directive

• Each directive applies to at most one succeeding statement, which must be a structured
block

• Long directive lines can be "continued" on succeeding lines by escaping the newline
character with a backslash ("\") at the end of a directive line
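
• For example, a long directive might be continued across two lines like this (the clause choice, and the names i and sum, are purely illustrative):

#pragma omp parallel for default(shared) private(i) \
        schedule(static) reduction(+:sum)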
OpenMP Directive Scope
Why Is This Important?
• OpenMP specifies a number of scoping rules on how
directives may associate (bind) and nest within each other
• Illegal and/or incorrect programs may result if the OpenMP
binding and nesting rules are ignored
• Binding will be discussed after directives and clauses have
been covered
OpenMP Directives
• parallel Region Construct
• Work-Sharing Constructs: for, sections, single
• Combined Parallel Work-Sharing Constructs: parallel for, parallel sections
• task Construct
• Synchronization Constructs: master, critical, barrier, atomic, ordered
• threadprivate Directive
parallel Region Construct
Purpose:
• A parallel region is a block of code that will be executed by multiple threads
• This is the fundamental OpenMP parallel construct
Format: #pragma omp parallel [clause ...] newline
if (scalar_expression)
private (list)
shared (list)
default (shared | none)
firstprivate (list)
reduction (operator: list)
copyin (list)
num_threads (integer-expression)

structured_block
parallel Region Construct
Notes:
• When a thread reaches a PARALLEL directive, it creates a team of threads and becomes
the master of the team. The master is a member of that team and has thread number 0
within that team
• Starting from the beginning of this parallel region, the code is duplicated and all threads
will execute that code
• There is an implied barrier at the end of a parallel section
• Only the master thread continues execution past this point
• If any thread terminates within a parallel region, all threads in the team will terminate, and
the work done up until that point is undefined
parallel Region Construct
How Many Threads?
• The number of threads in a parallel region is determined by the following factors, in
order of precedence:
1. Evaluation of the if clause
2. Setting of the num_threads clause
3. Use of the omp_set_num_threads() library function
4. Setting of the OMP_NUM_THREADS environment variable
5. Implementation default - usually the number of CPUs on a node, though it
could be dynamic
• Threads are numbered from 0 (master thread) to N-1
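
• A minimal sketch showing two of these levels interacting: a num_threads clause overriding an earlier omp_set_num_threads() call (the actual team sizes can still differ if dynamic adjustment is enabled):

#include <omp.h>
#include <stdio.h>

int main() {
  omp_set_num_threads(8);               /* library routine requests 8 threads */

  #pragma omp parallel num_threads(2)   /* the clause takes precedence: 2 threads */
  {
    if (omp_get_thread_num() == 0)
      printf("Region 1 ran with %d threads\n", omp_get_num_threads());
  }

  #pragma omp parallel                  /* no clause: the earlier setting (8) applies */
  {
    if (omp_get_thread_num() == 0)
      printf("Region 2 ran with %d threads\n", omp_get_num_threads());
  }
  return 0;
}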
parallel Region Construct
Dynamic Threads:
• Use the omp_get_dynamic() library function to determine if dynamic
threads are enabled.
• If supported, the two methods available for enabling dynamic threads
are:
• The omp_set_dynamic() library routine
• Setting of the OMP_DYNAMIC environment variable to TRUE
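
• A minimal sketch of the library-routine side of this (whether the runtime actually varies the team size remains implementation dependent):

#include <omp.h>
#include <stdio.h>

int main() {
  printf("Dynamic threads initially %s\n",
         omp_get_dynamic() ? "enabled" : "disabled");

  omp_set_dynamic(1);                   /* allow the runtime to choose team sizes */

  #pragma omp parallel
  {
    if (omp_get_thread_num() == 0)
      printf("Runtime chose a team of %d threads\n", omp_get_num_threads());
  }
  return 0;
}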
parallel Region Construct
Nested Parallel Regions:
• Use the omp_get_nested() library function to determine if nested parallel regions are
enabled
• The two methods available for enabling nested parallel regions (if supported) are:
• The omp_set_nested() library routine
• Setting of the OMP_NESTED environment variable to TRUE
• If not supported, a parallel region nested within another parallel region results in
the creation of a new team, consisting of one thread, by default
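
• A minimal sketch, assuming the implementation supports more than one level of parallelism (note that recent OpenMP versions deprecate omp_set_nested() in favour of omp_set_max_active_levels()):

#include <omp.h>
#include <stdio.h>

int main() {
  omp_set_nested(1);                     /* request nested parallelism */

  #pragma omp parallel num_threads(2)    /* outer team of 2 threads */
  {
    int outer = omp_get_thread_num();

    #pragma omp parallel num_threads(2)  /* each outer thread forks an inner team */
    {
      printf("outer thread %d, inner thread %d\n",
             outer, omp_get_thread_num());
    }
  }
  return 0;
}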
parallel Region Construct
Restrictions:
• A parallel region must be a structured block that does not span multiple routines or code
files
• It is illegal to branch into or out of a parallel region
• Only a single if clause is permitted
• When the if clause is present, it must evaluate to non-zero in order for a team of
threads to be created. Otherwise, the region is executed serially by the master
thread
• Only a single num_threads clause is permitted
Note – other clauses will be covered later
parallel Region Construct Example

• Every thread executes all code enclosed in the parallel section
• OpenMP library routines are used to obtain thread identifiers and the total number of threads

#include <omp.h>
#include <stdio.h>
#define THREADS 4

main () {
  int tid, nthreads;
  omp_set_num_threads(THREADS);

  /* Fork a team of threads with each thread having a private tid variable */
  #pragma omp parallel private(tid)
  {
    /* Obtain and print thread id */
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);

    /* Only master thread does this */
    if (tid == 0)
    {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  } /* All threads join master thread and terminate */
}
OpenMP Directives
• parallel Region Construct
• Work-Sharing Constructs: for, sections, single
• Combined Parallel Work-Sharing Constructs: parallel for, parallel sections
• task Construct
• Synchronization Constructs: master, critical, barrier, atomic, ordered
• threadprivate Directive
Work-Sharing Constructs
• A work-sharing construct divides the execution of the enclosed code region among
the members of the team that encounter it
• Work-sharing constructs do not launch new threads
• There is no implied barrier upon entry to a work-sharing construct, however there is
an implied barrier at the end of a work sharing construct
Work-Sharing Constructs
• for - shares iterations of a loop across the team of threads. Represents a type of "data parallelism". MOST COMMONLY USED
• sections - breaks work into separate, discrete sections. Each section of code is executed by a different thread. Can be used to implement a type of "functional parallelism"
• single - serializes a section of code
Work-Sharing Constructs
Restrictions:
• A work-sharing construct must be enclosed dynamically within a parallel region in
order for the directive to execute in parallel
• Work-sharing constructs must be encountered by all members of a team or none at all
• Successive work-sharing constructs must be encountered in the same order by all
members of a team
OpenMP Directives
• parallel Region Construct
• Work-Sharing Constructs: for, sections, single
• Combined Parallel Work-Sharing Constructs: parallel for, parallel sections
• task Construct
• Synchronization Constructs: master, critical, barrier, atomic, ordered
• threadprivate Directive
for Directive
Purpose:
• The for directive specifies that the iterations of the loop immediately following it must be
executed in parallel by the team
• This assumes a parallel region has already been initiated, otherwise it executes in serial
on a single processor

Format: #pragma omp for [clause ...] newline


schedule (type [,chunk])
ordered
private (list)
firstprivate (list)
lastprivate (list)
shared (list)
reduction (operator: list)
collapse (n)
nowait
for_loop
for Directive
Clauses:
nowait
• If specified, then threads do not synchronize at the end of the parallel loop
ordered
• Specifies that the iterations of the loop must be executed as they would be in a serial program
collapse
• Specifies how many loops in a nested loop should be collapsed into one large iteration space and
divided according to the schedule clause (next slide).
• The sequential execution of the iterations in all associated loops determines the order of the
iterations in the collapsed iteration space
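
• A minimal sketch of collapse on a doubly nested loop, using the combined parallel for form for brevity (the array size and values are illustrative):

#include <omp.h>
#include <stdio.h>

#define ROWS 4
#define COLS 6

int main() {
  double grid[ROWS][COLS];

  /* The two loops are merged into one 24-iteration space,
     which is then divided among the threads by the schedule */
  #pragma omp parallel for collapse(2) schedule(static)
  for (int i = 0; i < ROWS; i++)
    for (int j = 0; j < COLS; j++)
      grid[i][j] = i * COLS + j;

  printf("grid[%d][%d] = %.1f\n", ROWS - 1, COLS - 1, grid[ROWS - 1][COLS - 1]);
  return 0;
}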
for Directive
Clauses:
schedule
• Describes how iterations of the loop are divided among the threads in the team
• The default schedule is implementation dependent
• The different schedule types are static, dynamic, guided, runtime and
auto
• Described in next few slides with an illustration of how 4 threads might be used to
process a for loop with 21 iterations (numbered 0-20):
for Directive
static
• Loop iterations are divided into pieces of size chunk and then statically assigned to threads
• If chunk is not specified, the iterations are evenly (if possible) divided contiguously among the threads
e.g. chunk not specified:
  thread:      0      1      2      3
  iterations:  0-5    6-10   11-15  16-20

dynamic
• Loop iterations are divided into pieces of size chunk, and dynamically scheduled among the threads
• When a thread finishes one chunk, it is dynamically assigned another. The default chunk size is 1.
e.g. chunk=3 (one possible dynamic assignment):
  thread:      0            1           2            3
  iterations:  0-2, 18-20   3-5, 9-11   6-8, 15-17   12-14
for Directive
guided
• Similar to dynamic scheduling, but the chunk size starts off large and decreases to
better handle load imbalance between iterations
• The optional chunk parameter specifies the minimum size chunk to use
• By default the chunk size is approximately loop_count / number_of_threads
for Directive
runtime
• The scheduling decision is deferred until runtime by the environment variable
OMP_SCHEDULE
• It is illegal to specify a chunk size for this clause
e.g. setenv OMP_SCHEDULE dynamic
auto
• The scheduling decision is delegated to the compiler and/or runtime system
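
• A minimal sketch of schedule(runtime): the program is compiled once, and the schedule is chosen per run through OMP_SCHEDULE (e.g. setenv OMP_SCHEDULE "dynamic,2" in csh, or export OMP_SCHEDULE="dynamic,2" in bash):

#include <omp.h>
#include <stdio.h>
#define N 16

int main() {
  int a[N];

  /* Schedule type and chunk size are read from OMP_SCHEDULE at run time */
  #pragma omp parallel for schedule(runtime)
  for (int i = 0; i < N; i++) {
    a[i] = i;
    printf("iteration %2d handled by thread %d\n", i, omp_get_thread_num());
  }
  return 0;
}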
for Directive
Restrictions:
• The loop cannot be one without loop control (must know the number of iterations). The
loop iteration variable must be an integer and the loop control parameters must be the
same for all threads

• Program correctness must not depend upon which thread executes a particular iteration

• It is illegal to branch out of a loop associated with a for directive

• The chunk size must be specified as a loop invariant integer expression, as there is no
synchronization during its evaluation by different threads

• ordered, collapse and schedule clauses may appear once each


for Directive Example
Simple vector-add program
• Arrays a, b, c, and variable N will be shared by all threads
• Variable i will be private to each thread (this means each thread will have its own
unique copy)
• The iterations of the loop will be distributed dynamically in chunk sized pieces
• Threads will not synchronize upon completing their individual pieces of work (use of
nowait clause)
#include <omp.h>
#define CHUNKSIZE 100
#define N 1000

main ()
{
  int i, chunk;
  float a[N], b[N], c[N];

  /* Some initializations */
  for (i=0; i < N; i++) {
    a[i] = i * 1.0;
    b[i] = a[i];
  }
  chunk = CHUNKSIZE;

  #pragma omp parallel shared(a,b,c,chunk) private(i)
  {
    #pragma omp for schedule(dynamic,chunk) nowait
    for (i=0; i < N; i++)
      c[i] = a[i] + b[i];
  } /* end of parallel section */
}