Unit 4
Shared-memory parallel programming with OpenMP
Objectives
A very brief introduction to OpenMP
An application programming interface (API) based mostly on a set of compiler directives
Today's most widely used approach for shared-memory parallel programming
Basic OpenMP programming through examples
(More advanced OpenMP programming will be taught in later chapters.)
First things first
The applicable hardware context: shared memory
All processors can directly access all data in a shared memory; no need for explicit communication between the processors
OpenMP: a parallel programming standard for shared-memory parallel computers
A set of compiler directives (with additional clauses)
A small number of library functions
A few environment variables
Advantages of compiler directives
An OpenMP compiler directive is a "parallelization hint": an OpenMP-capable compiler uses it to automatically create parallel code, whereas a non-OpenMP compiler simply ignores it.
#pragma omp ...
It is possible to maintain the same code for both the serial implementation and the parallel OpenMP implementation.
Allows incremental parallelization, which simplifies the coding effort and makes debugging easier.
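A minimal sketch of this point (file name and messages are illustrative): the same source builds with or without OpenMP, because the directive and the standard _OPENMP macro are simply ignored or undefined in a serial build.

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void) {
  /* The directive is only a hint; a non-OpenMP compiler ignores it
     and the loop runs serially. */
  #pragma omp parallel for
  for (int i = 0; i < 4; i++) {
#ifdef _OPENMP
    printf("iteration %d run by thread %d\n", i, omp_get_thread_num());
#else
    printf("iteration %d (serial build)\n", i);
#endif
  }
  return 0;
}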
Threads in OpenMP
The central execution entities in an OpenMP program are threads (lightweight processes).
The OpenMP threads share a common address space and can mutually access data.
Spawning a thread is much less costly than forking a new process, because threads share everything except the instruction pointer, stack pointer and register state.
If desired, each thread can have some "private variables" (by means of the local stack pointer).
“Fork-join” model
In any OpenMP program, a single thread, the master thread, runs immediately after startup.
The master thread can spawn (also called fork) a number of additional threads when entering a so-called parallel region.
Inside a parallel region, the master thread and the spawned threads execute instruction streams concurrently.
Each thread has a unique ID.
Different threads work on different parts of the shared data or carry out different tasks.
OpenMP has compiler directives for dividing the work among the threads.
At the end of a parallel region, the threads are joined: all threads except the master thread are terminated.
There can be multiple parallel regions in an OpenMP program.
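A minimal sketch of the fork-join pattern (the printed messages are illustrative): serial code runs on the master thread alone, each parallel region forks a team, and execution becomes serial again after the join.

#include <stdio.h>
#include <omp.h>

int main(void) {
  printf("serial part: master thread only\n");

  #pragma omp parallel            /* fork: a team of threads is created */
  {
    printf("first parallel region, thread %d\n", omp_get_thread_num());
  }                               /* join: only the master thread continues */

  printf("serial part again\n");

  #pragma omp parallel            /* a program may contain several parallel regions */
  {
    printf("second parallel region, thread %d\n", omp_get_thread_num());
  }
  return 0;
}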
OpenMP’s parallel execution model
Hello world in OpenMP
#include <stdio.h>
#include <omp.h>
int main (int nargs, char **args)
{
  #pragma omp parallel
  {
    printf("Hello world!\n");
  }
  return 0;
}
Example of compilation: gcc -fopenmp hello_world.c
Parallel execution: ./a.out
What do you get?
Control the number of threads at runtime
Example on a Linux system:
gcc -fopenmp hello_world.c
export OMP_NUM_THREADS=6
./a.out
export OMP_NUM_THREADS=8
./a.out
OMP_NUM_THREADS is an environment variable (understood by
OpenMP)
It is also possible to hard-code the number of threads inside an
OpenMP program:
#pragma omp parallel num_threads(6)
{
// ......
}
But this approach is normally not recommended!
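If the thread count must be chosen inside the program, a somewhat more flexible alternative to the num_threads clause is the library function omp_set_num_threads(), as in the following sketch (the value 6 is just an example; it could also be computed or read in at runtime):

#include <stdio.h>
#include <omp.h>

int main(void) {
  int n = 6;                /* example value, could be determined at runtime */
  omp_set_num_threads(n);   /* affects subsequently created parallel regions */

  #pragma omp parallel
  {
    printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}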
Hello world in OpenMP (a bit more interesting example)
#include <stdio.h>
#include <omp.h>
int main (int nargs, char **args)
{
  printf("I'm the master thread, I'm alone.\n");
  #pragma omp parallel
  {
    int num_threads, thread_id;
    num_threads = omp_get_num_threads();
    thread_id = omp_get_thread_num();
    printf("Hello world! I'm thread No.%d out of %d threads.\n",
           thread_id, num_threads);
  }
  return 0;
}
“Manual” loop parallelization
Now, let’s try to “manually” parallelize a for-loop, that is, divide the
iterations evenly among the threads.
Example:
for (i=0; i<N; i++)
  a[i] = b[i] + c[i];
Important observation: Assuming that the arrays a, b and c do
not overlap (that is, no aliasing), then the iterations of this for-loop
are independent of each other, thus safe to be executed by multiple
threads concurrently.
“Manual” loop parallelization (2)
How to divide the iterations evenly among the threads?
Given num_threads as the total number of threads, one way to
divide N iterations for thread with ID thread_id is as follows:
int blen, bstart;
blen = N/num_threads;
if (thread_id < (N%num_threads)) {
  blen = blen + 1;
  bstart = blen * thread_id;
}
else {
  bstart = blen * thread_id + (N%num_threads);
}
Why is this a fair division?
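Worked example (illustrative numbers): for N = 10 and num_threads = 4, N%num_threads = 2, so threads 0 and 1 get blen = 3 (bstart = 0 and 3), while threads 2 and 3 get blen = 2 (bstart = 6 and 8). The chunks 0-2, 3-5, 6-7 and 8-9 cover all N iterations exactly once, and no thread gets more than one extra iteration.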
“Manual” loop parallelization (3)
OpenMP coding
#pragma omp parallel
{
  int num_threads, thread_id;
  int blen, bstart, bend, i;
  num_threads = omp_get_num_threads();
  thread_id = omp_get_thread_num();
  blen = N/num_threads;
  if (thread_id < (N%num_threads)) {
    blen = blen + 1;
    bstart = blen * thread_id;
  }
  else {
    bstart = blen * thread_id + (N%num_threads);
  }
  bend = bstart + blen;
  for (i=bstart; i<bend; i++)
    a[i] = b[i] + c[i];
} // end of parallel region
Data scoping
Any variables that existed before a parallel region still exist inside
the parallel region, and are by default shared between all threads.
Often it will be necessary for the threads to have some private
variables.
Each thread can either declare new local variables inside the parallel region; these variables are private "by birth".
Or, each thread can "privatize" some of the shared variables that already existed before the parallel region (using the private clause):
int blen, bstart, bend;
#pragma omp parallel private(blen, bstart, bend)
{
// ...
}
Each “privatized” variable has one (uninitialized) instance per
thread;
The private variables exist only until the end of the parallel region.
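A related clause, firstprivate, appears in a later example: like private, it gives every thread its own copy, but the copy is initialized with the value the variable had before the parallel region. A minimal sketch (variable names and values are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
  int a = 1, b = 1;

  #pragma omp parallel private(a) firstprivate(b)
  {
    /* a: private copy, uninitialized inside the region
       b: private copy, starts with the value 1 from outside */
    a = omp_get_thread_num();
    b = b + omp_get_thread_num();
    printf("thread %d: a=%d b=%d\n", omp_get_thread_num(), a, b);
  }

  /* neither clause copies anything back: the outer a and b are still 1 */
  printf("after the region: a=%d b=%d\n", a, b);
  return 0;
}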
Actually, parallelizing a for-loop is easy in OpenMP
Parallelizing for-loops is OpenMP’s main work-sharing
mechanism.
OpenMP has several built-in strategies for dividing the
iterations among the threads.
No need to manually calculate each thread’s loop
bounds
#pragma omp parallel
{
#pragma omp for
  for (i=0; i<N; i++)
    a[i] = b[i] + c[i];
} // end of parallel region
or simply
#pragma omp parallel for
for (i=0; i<N; i++)
  a[i] = b[i] + c[i];
More remarks
The loop cannot contain break, return or exit statements.
The continue statement is allowed.
The index update has to be an increment (or decrement) by a
fixed amount.
The loop index variable is automatically private, and changes
to it inside the loop are not allowed.
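A small sketch of a loop that obeys these rules (array size and contents are illustrative): the index is incremented by a fixed amount, continue is used, and the body never modifies the index.

#include <stdio.h>
#include <omp.h>

#define N 100

int main(void) {
  static double a[N];   /* zero-initialized */
  int i;

  #pragma omp parallel for
  for (i = 0; i < N; i += 2) {   /* fixed increment */
    if (i == 50) continue;       /* continue is allowed ... */
    /* ... but break, return or exit() would not be */
    a[i] = 1.0/(i + 1);
  }
  printf("a[2] = %f\n", a[2]);
  return 0;
}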
Numerical integration
How to numerically calculate ∫_{x0}^{x1} f(x) dx ?
Numerical integration for calculating π
Serial implementation:
int N = 1000000;   /* number of subintervals (example value) */
int i;
double w = 1.0/N, x, approx_pi;
double sum = 0.;
for (i=1; i<=N; i++) {
  x = w*(i-0.5);
  sum = sum + 4.0/(1.0+x*x);
}
approx_pi = w*sum;
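This is the midpoint rule applied to π = ∫_0^1 4/(1+x²) dx: with N subintervals of width w = 1/N and midpoints x_i = (i-0.5)·w, the sum w·Σ_{i=1}^{N} 4/(1+x_i²) approaches π as N grows.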
A naive OpenMP implementation
int N = 1000000, i;   /* example value for N */
double w = 1.0/N, x, approx_pi = 0.;
double sum = 0.;
#pragma omp parallel private(x) firstprivate(sum)
{
  #pragma omp for
  for (i=1; i<=N; i++) {
    x = w*(i-0.5);
    sum = sum + 4.0/(1.0+x*x);
  }
  #pragma omp critical
  {
    approx_pi = approx_pi + w*sum;
  }
} // end of the parallel region
OpenMP critical regions
Concurrent write accesses to a shared variable must be avoided by all means to circumvent race conditions.
An OpenMP critical code block is executed by one thread at a time. This is one way to avoid race conditions. (There are other ways; see the sketch below.)
The variable approx_pi in the above example is a shared variable to which all the threads will write. Thus, "protection" is provided by a critical code block.
Use of the critical directive will incur overhead.
Improper use of the critical directive may lead to deadlock.
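One of the "other ways" is the atomic directive: when the protected code is a single update of one scalar variable, #pragma omp atomic typically has lower overhead than a critical region. A self-contained sketch of the π example rewritten this way (the value of N is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
  const int N = 1000000;        /* example value */
  const double w = 1.0/N;
  double approx_pi = 0.;

  #pragma omp parallel
  {
    double sum = 0., x;         /* private "by birth" */
    int i;
    #pragma omp for
    for (i = 1; i <= N; i++) {
      x = w*(i-0.5);
      sum = sum + 4.0/(1.0+x*x);
    }
    #pragma omp atomic          /* protects the single scalar update */
    approx_pi += w*sum;
  }
  printf("approx_pi = %.12f\n", approx_pi);
  return 0;
}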
Use of OpenMP's reduction clause
Actually, using critical to prevent concurrent writes to the variable approx_pi is overkill. The reduction clause of OpenMP is designed for exactly this purpose:
int N = 1000000, i;   /* example value for N */
double sum = 0.;
double w = 1.0/N, x, approx_pi;
#pragma omp parallel for private(x) reduction(+:sum)
for (i=1; i<=N; i++) {
  x = w*(i-0.5);
  sum = sum + 4.0/(1.0+x*x);
}
approx_pi = w*sum;
Another example of using the reduction clause
#pragma omp parallel for reduction(+:s)
for (i=0; i<N; i++)
  s = s + a[i]*a[i];
Loop scheduling
#pragma omp parallel for
for (i=0; i<N; i++)
  a[i] = b[i] + c[i];
How exactly are the loop iterations divided among the threads?
Mapping of loop iterations to threads is configurable in
OpenMP.
The “secret” is the schedule clause:
#pragma omp parallel for schedule(static|dynamic|guided [,chunk])
Default scheduling is static (no need to specify), which divides the iterations into contiguous chunks of (roughly) equal size.
Other alternatives of scheduling: dynamic and guided
Examples of different schedulers
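A sketch of how the three strategies are requested (N, the chunk size 4 and the helper do_work() are illustrative; compile with -fopenmp and link the math library):

#include <stdio.h>
#include <math.h>
#include <omp.h>

#define N 1000                      /* example size */

/* placeholder for work whose cost varies from iteration to iteration */
static double do_work(int i) { return sqrt((double)i); }

int main(void) {
  static double a[N], b[N], c[N];
  int i;

  /* static (the default): contiguous chunks of roughly equal size */
  #pragma omp parallel for schedule(static)
  for (i = 0; i < N; i++) a[i] = b[i] + c[i];

  /* dynamic: chunks of 4 iterations handed out to threads on demand,
     useful when the cost per iteration varies a lot */
  #pragma omp parallel for schedule(dynamic, 4)
  for (i = 0; i < N; i++) a[i] = do_work(i);

  /* guided: like dynamic, but the chunk size shrinks as the loop progresses */
  #pragma omp parallel for schedule(guided)
  for (i = 0; i < N; i++) a[i] = do_work(i);

  printf("a[N-1] = %f\n", a[N-1]);
  return 0;
}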
Tasking
A task can be defined by OpenMP's task directive, containing the code to be executed.
When a thread encounters a task construct, it may execute it
right away or set up the appropriate data environment and
defer its execution. The task is then ready to be executed later
by any thread of the team.
An example of OpenMP tasks
#pragma omp parallel private(r, i)
{
  #pragma omp single
  {
    for (i=0; i<N; i++) {
      r = rand();   // a randomly generated number
      if (p[i] > r) {
        #pragma omp task
        {
          do_some_work (p[i]);
        }
      } // end of if-test
    } // end of for-loop
  } // end of the single directive
} // end of the parallel region
The actual number of calls to do_some_work is unknown, so
tasking is a natural choice for work division.
single and master
A "single" code block in OpenMP will be entered by one thread only, namely the thread that reaches the single directive first. All the others skip the code and wait at the end of the single block due to an implicit barrier.
A "master" code block is only entered by the master thread; all the other threads skip over it without waiting for the master thread to finish.
OpenMP has a separate "barrier" directive for explicit synchronization among the threads. (Use with care!)
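A minimal sketch contrasting the three directives (variable names and messages are illustrative): single ends with an implicit barrier, while master needs an explicit barrier if the other threads must wait for its result.

#include <stdio.h>
#include <omp.h>

int main(void) {
  int shared_value = 0;

  #pragma omp parallel
  {
    #pragma omp single        /* executed by one (arbitrary) thread */
    {
      shared_value = 42;
    }                         /* implicit barrier at the end of single */

    /* thanks to that barrier, every thread sees the value written above */
    printf("thread %d sees %d\n", omp_get_thread_num(), shared_value);

    #pragma omp barrier       /* wait until all threads have printed */

    #pragma omp master        /* executed by the master thread only, */
    {                         /* the others do NOT wait ...          */
      shared_value = 100;
    }
    #pragma omp barrier       /* ... unless an explicit barrier is added */

    printf("thread %d now sees %d\n", omp_get_thread_num(), shared_value);
  }
  return 0;
}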
Yet another example
int myid, numthreads;
#pragma omp parallel private(myid)
{
  myid = omp_get_thread_num();
  #pragma omp single
  {
    numthreads = omp_get_num_threads();
  }
  #pragma omp critical // not strictly necessary
  {
    printf("This is thread No.%d out of %d threads\n", myid, numthreads);
  }
} // end of the parallel region
Jacobi algorithm
Serial C implementation (slightly different from that of Chapter 3):
/* max() is assumed to be a macro defined elsewhere; fabs() is from <math.h> */
double maxdelta = 1.0, eps = 1.0e-14;
while (maxdelta > eps) {
  maxdelta = 0.;
  for (k=1; k<kmax-1; k++)
    for (i=1; i<imax-1; i++) {
      phi_new[k][i] = (phi[k-1][i]+phi[k][i-1]
                      +phi[k][i+1]+phi[k+1][i])*0.25;
      maxdelta = max(maxdelta, fabs(phi_new[k][i]-phi[k][i]));
    }
  /* pointer swapping */
  temp_ptr = phi_new;
  phi_new = phi;
  phi = temp_ptr;
}
OpenMP-parallel Jacobi algorithm
double maxdelta = 1.0, eps = 1.0e-14;
while (maxdelta > eps) {
  maxdelta = 0.;
  #pragma omp parallel for reduction(max: maxdelta) private(i)
  for (k=1; k<kmax-1; k++)
    for (i=1; i<imax-1; i++) {
      phi_new[k][i] = (phi[k-1][i]+phi[k][i-1]
                      +phi[k][i+1]+phi[k+1][i])*0.25;
      maxdelta = max(maxdelta, fabs(phi_new[k][i]-phi[k][i]));
    }
  /* pointer swapping (serial part, outside the parallel region) */
  temp_ptr = phi_new;
  phi_new = phi;
  phi = temp_ptr;
}
OpenMP-parallel Jacobi algorithm (version 2)
double maxdelta = 1.0, eps = 1.0e-14;
#pragma omp parallel
{
  while (maxdelta > eps) {
    #pragma omp barrier   /* all threads must test the old maxdelta before it is reset */
    #pragma omp single
    {
      maxdelta = 0.;
    }
    #pragma omp for reduction(max: maxdelta) private(i)
    for (k=1; k<kmax-1; k++)
      for (i=1; i<imax-1; i++) {
        phi_new[k][i] = (phi[k-1][i]+phi[k][i-1]
                        +phi[k][i+1]+phi[k+1][i])*0.25;
        maxdelta = max(maxdelta, fabs(phi_new[k][i]-phi[k][i]));
      }
    #pragma omp master
    {
      /* pointer swapping */
      temp_ptr = phi_new;
      phi_new = phi;
      phi = temp_ptr;
    }
    #pragma omp barrier   /* wait until the master thread has swapped the pointers */
  } // end of while loop
} // end of the parallel region
Challenge: Parallelizing 3D Gauss-Seidel algorithm
What if the iterations of a triple loop nest are not entirely independent? (There are loop-carried dependences.)
Example: 3D Gauss-Seidel algorithm (computational core)
for (k=1; k<kmax-1; k++)
  for (j=1; j<jmax-1; j++)
    for (i=1; i<imax-1; i++)
      phi[k][j][i] = (phi[k-1][j][i] + phi[k][j-1][i]
                     +phi[k][j][i-1] + phi[k][j][i+1]
                     +phi[k][j+1][i] + phi[k+1][j][i])/6.0;
We cannot just add #pragma omp parallel for before the k-indexed loop.
Note: The upper limits of k, j and i are different from those given
in Chapter 6 of the textbook.
Wavefront parallelization
Although not as simple as the Jacobi algorithm, it is still possible to
parallelize the Gauss-Seidel algorithm with OpenMP.
The key idea is to find a way of traversing the 3D lattice that
fulfills the dependency constraints imposed by the stencil update.
A wavefront travels in the k direction. The dimension along which to parallelize is j. Each of the threads, T0, T1, ..., T(t-1), is assigned a consecutive chunk of the j indices.
Wavefront parallelization (2)
Wavefront parallelization (3)
Important observations
The k index goes between 1 and kmax-2.
All the j indices 1, 2, ..., jmax-2 are divided evenly into consecutive chunks: J0, J1, ..., J(t-1) (one chunk per thread).
Total number of wavefronts: (kmax-2)+t-1, for computing through the entire 3D lattice
Wavefront W1 has only one block: (k=1, J0)
Wavefront W2 has two concurrent blocks: (k=1, J1) and (k=2, J0)
Wavefront W3 has three concurrent blocks: (k=1, J2), (k=2, J1) and (k=3, J0)
···
Wavefronts Wt, W(t+1), ..., W(kmax-2) each have t concurrent blocks
Wavefronts W(kmax-1), ..., W(kmax-2+t-1) have fewer and fewer concurrent blocks (the wind-down phase).
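Worked example (illustrative numbers): with kmax-2 = 5 values of k and t = 3 threads there are 5+3-1 = 7 wavefronts. W1 = {(1, J0)}, W2 = {(1, J1), (2, J0)}, W3 = {(1, J2), (2, J1), (3, J0)}, W4 = {(2, J2), (3, J1), (4, J0)}, W5 = {(3, J2), (4, J1), (5, J0)}, W6 = {(4, J2), (5, J1)}, W7 = {(5, J2)}. Only the wavefronts W3 to W5 keep all three threads busy.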
OpenMP wavefront parallelization
#pragma omp parallel private(k,j,i)
{
  int numthreads, threadID, jstart, jend, m;
  numthreads = omp_get_num_threads();
  threadID = omp_get_thread_num();
  jstart = ((jmax-2)*threadID)/numthreads + 1;
  jend = ((jmax-2)*(threadID+1))/numthreads;
  for (m=1; m<=kmax+numthreads-3; m++) { // loop over the wavefronts
    k = m - threadID;
    if (k>=1 && k<=kmax-2) {
      for (j=jstart; j<=jend; j++)
        for (i=1; i<imax-1; i++)
          phi[k][j][i] = (phi[k-1][j][i] + phi[k][j-1][i]
                         +phi[k][j][i-1] + phi[k][j][i+1]
                         +phi[k][j+1][i] + phi[k+1][j][i])/6.0;
    }
    #pragma omp barrier
  }
} // end of the parallel region