DS1822-Parallel Computing - Unit2
OpenMP was explicitly designed to allow programmers to incrementally parallelize existing serial
programs; this is virtually impossible with MPI and fairly difficult with Pthreads.
Program Structure:
#pragma
Pragmas (like all preprocessor directives) are, by default, one line in length, so if a pragma won't
fit on a single line, the newline needs to be "escaped", that is, preceded by a backslash.
The details of what follows the #pragma depend entirely on which extensions are being used.
Let’s take a look at a very simple example, a “hello, world” program that uses OpenMP.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Hello(void);   /* Thread function */

int main(int argc, char* argv[]) {
   /* Get number of threads from command line */
   int thread_count = strtol(argv[1], NULL, 10);

#  pragma omp parallel num_threads(thread_count)
   Hello();

   return 0;
}  /* main */

void Hello(void) {
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();

   printf("Hello from thread %d of %d\n", my_rank, thread_count);
}  /* Hello */
1. Compiling and running OpenMP programs:
To run the program, we specify the number of threads on the command line. For example, to
compile the program with gcc and run it with four threads, we might type

$ gcc -g -Wall -fopenmp -o omp_hello omp_hello.c
$ ./omp_hello 4
OpenMP Clauses:
OpenMP (Open Multi-Processing) is a widely-used API in parallel computing that supports
multi-platform shared-memory multiprocessing.
It includes directives for parallel programming in C, C++, and Fortran. OpenMP clauses
are used to control the behavior and attributes of parallel constructs.
Here's an overview of some key clauses:
Data Scope Clauses
private(var-list): Each thread gets its own copy of the variables in var-list, which are
uninitialized.
shared(var-list): Variables in var-list are shared among all threads.
firstprivate(var-list): Like private, but each thread's copy is initialized with the value of the
variable before the parallel region.
lastprivate(var-list): Like private, but the variable is updated with the value from the last
iteration of the loop.
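As a rough illustration of how these data-scope clauses behave, here is a minimal sketch (the variables x and y are illustrative, not from any particular program):

#include <stdio.h>

int main(void) {
   int i, x = 5, y = 0;

   /* Each thread's copy of x starts at 5 (firstprivate); y is
      private inside the loop, and lastprivate copies out the value
      from the sequentially last iteration (i == 9). */
#  pragma omp parallel for firstprivate(x) lastprivate(y)
   for (i = 0; i < 10; i++)
      y = x + i;

   printf("y = %d\n", y);   /* Prints y = 14 */
   return 0;
}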
Work-sharing Clauses
schedule(type[, chunk]): Specifies how iterations of a loop are divided among threads. Types
include static, dynamic, guided, and runtime.
ordered: Ensures that iterations are executed in the order in which they would have been executed
sequentially.
nowait: Threads do not synchronize at the end of the construct.
Synchronization Clauses
critical: A block of code that must be executed by only one thread at a time.
atomic: Ensures that a specific memory location is updated atomically.
barrier: All threads must reach this point before any can proceed.
flush: Ensures all threads have a consistent view of memory.
Reduction Clauses
reduction(op: var-list): Performs a reduction on variables in var-list using the specified operator
op. Common operators include +, *, &&, and ||.
Tasking Clauses
task: Defines a unit of work that can be executed independently.
taskwait: Waits for the completion of child tasks.
depend: Specifies dependencies between tasks.
These clauses allow for fine-grained control over parallel execution and data sharing, making
OpenMP a powerful tool for parallelizing code.
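As an illustration of the tasking constructs, here is a minimal sketch; the function do_work is a hypothetical stand-in for a unit of work:

#include <stdio.h>
#include <omp.h>

void do_work(int i) {   /* Hypothetical unit of work */
   printf("task %d executed by thread %d\n", i, omp_get_thread_num());
}

int main(void) {
   int i;
#  pragma omp parallel
#  pragma omp single
   {
      for (i = 0; i < 4; i++) {
         /* Each iteration becomes an independently schedulable task */
#        pragma omp task firstprivate(i)
         do_work(i);
      }
      /* Wait for the child tasks created above to complete */
#     pragma omp taskwait
   }
   return 0;
}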
Reduction Clause:
A reduction operator is a binary operation (such as addition or multiplication)
and a reduction is a computation that repeatedly applies the same reduction operator to a
sequence of operands in order to get a single result.
Furthermore, all of the intermediate results of the operation should be stored in the same
variable: the reduction variable. For example, if A is an array of n ints, the computation
int i, sum = 0;
for (i = 0; i < n; i++)
   sum += A[i];
is a reduction in which the reduction operator is addition.
In OpenMP we can specify that the result of such a computation should be stored in a reduction
variable.
To do this, a reduction clause can be added to a parallel directive.
global_result = 0.0;
#  pragma omp parallel num_threads(thread_count) \
      reduction(+: global_result)
   global_result += Local_trap(a, b, n);
The code specifies that global_result is a reduction variable and the plus sign ("+") indicates that
the reduction operator is addition.
OpenMP creates a private variable for each thread, and the run-time system stores each thread's
result in this private variable.
OpenMP also creates a critical section and the values stored in the private variables are added in
this critical section. Thus, the calls to Local_trap can take place in parallel.
Local_trap is a function that has no critical section.
Rather, each thread returns its part of the calculation, the final value of its my_result variable.
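A sketch of what such a Local_trap function might look like is given below. It assumes the trapezoidal rule is being used to estimate the integral of a function f (assumed to be defined elsewhere) over [a, b] with n trapezoids, that n is evenly divisible by thread_count, and that each thread handles a contiguous block of the interval:

double f(double x);   /* Integrand, assumed defined elsewhere */

double Local_trap(double a, double b, int n) {
   double h, x, my_result;
   double local_a, local_b;
   int i, local_n;
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();

   h = (b - a) / n;              /* Width of each trapezoid */
   local_n = n / thread_count;   /* Trapezoids for this thread */
   local_a = a + my_rank * local_n * h;
   local_b = local_a + local_n * h;
   my_result = (f(local_a) + f(local_b)) / 2.0;
   for (i = 1; i <= local_n - 1; i++) {
      x = local_a + i * h;
      my_result += f(x);
   }
   my_result = my_result * h;

   return my_result;   /* This thread's part of the integral */
}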
The syntax of the reduction clause is
reduction(<operator>: <variable list>)
1.atomic
Specifies that a memory location will be updated atomically.
#pragma omp atomic
expression
Parameters
expression
The statement that has the lvalue, whose memory location you want to protect against more
than one write.
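For example, a minimal sketch (the shared variable counter is illustrative):

#include <stdio.h>

int main(void) {
   int counter = 0;

#  pragma omp parallel num_threads(4)
   {
      /* The load-increment-store of counter is made indivisible,
         so no increments are lost. */
#     pragma omp atomic
      counter++;
   }

   printf("counter = %d\n", counter);   /* Always prints counter = 4 */
   return 0;
}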
2.barrier:
Synchronizes all threads in a team; all threads pause at the barrier, until all threads execute the
barrier.
The barrier directive supports no clauses.
#pragma omp barrier
3.critical:
Specifies that a block of code is executed by only one thread at a time.
#pragma omp critical [(name)]
{
code_block
}
4.flush:
Specifies that all threads have the same view of memory for all shared objects.
#pragma omp flush [(var)]
Parameters
var
(Optional) A comma-separated list of variables that represent objects you want to synchronize.
If var isn't specified, all memory is flushed.
5.for
Causes the work done in a for loop inside a parallel region to be divided among threads.
#pragma omp [parallel] for [clauses]
for_statement
Parameters
clauses
(Optional) Zero or more clauses, see the Remarks section.
for_statement
A for loop. Undefined behavior will result if user code in the for loop changes the index variable.
The for directive supports the following clauses:
private
firstprivate
lastprivate
reduction
ordered
schedule
nowait
If parallel is also specified, clauses can be any clause accepted by the parallel or for directives,
except nowait.
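For instance, a minimal sketch of a parallel for loop (the array a and its size are illustrative):

#include <stdio.h>

int main(void) {
   int i;
   double a[100];

   /* The 100 iterations are divided among the threads; the loop
      index i is automatically made private to each thread. */
#  pragma omp parallel for
   for (i = 0; i < 100; i++)
      a[i] = 2.0 * i;

   printf("a[99] = %f\n", a[99]);   /* Prints a[99] = 198.000000 */
   return 0;
}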
6.master:
Specifies that only the main thread should execute a section of the program.
#pragma omp master
{
code_block
}
7.ordered:
Specifies that code under a parallelized for loop should be executed like a sequential loop.
#pragma omp ordered
structured-block
The ordered directive must be within the dynamic extent of a for or parallel for construct with
an ordered clause.
8.parallel:
Defines a parallel region, which is code that will be executed by multiple threads in parallel.
#pragma omp parallel [clauses]
{
code_block
}
Parameters
clauses
(Optional) Zero or more clauses, see the Remarks section.
The parallel directive supports the following clauses:
if
private
firstprivate
default
shared
copyin
reduction
num_threads
parallel can also be used with the for and sections directives.
9.sections:
Identifies code sections to be divided among all threads.
#pragma omp [parallel] sections [clauses]
{
#pragma omp section
{
code_block
}
}
Parameters
clauses
(Optional) Zero or more clauses, see the Remarks section.
Remarks
The sections directive supports the following clauses:
private
firstprivate
lastprivate
reduction
nowait
If parallel is also specified, clauses can be any clause accepted by the parallel or sections directives,
except nowait.
10.single:
Lets you specify that a section of code should be executed on a single thread, not necessarily the
main thread.
#pragma omp single [clauses]
{
code_block
}
Parameters
clauses
(Optional) Zero or more clauses, see the Remarks section.
Remarks
The single directive supports the following clauses:
private
firstprivate
copyprivate
nowait
11.threadprivate:
Specifies that a variable is private to a thread.
#pragma omp threadprivate(var)
Parameters
var
A comma-separated list of variables that you want to make private to a thread.
Loop Scheduling:
In OpenMP, assigning iterations to threads is called scheduling, and the schedule clause can be
used to assign iterations in either a parallel for or a for directive.
o The static schedule type: The iterations can be assigned to the threads before the loop is
executed.
o The dynamic or guided schedule types: The iterations are assigned to the threads while
the loop is executing, so after a thread completes its current set of iterations, it can
request more from the run-time system.
o The runtime schedule type: The schedule is determined at run-time.
The chunksize is a positive integer.
In OpenMP, a chunk of iterations is a block of iterations that would be executed consecutively in
the serial loop.
The number of iterations in the block is the chunksize.
Only static, dynamic, and guided schedules can have a chunksize.
This determines the details of the schedule, but its exact interpretation depends on the type.
For a static schedule, the system assigns chunks of chunksize iterations to each thread in a round-
robin fashion.
As an example, suppose we have 12 iterations, 0, 1, ..., 11, and three threads.
Then if schedule(static,1) is used in the parallel for or for directive, we’ve already seen that the
iterations will be assigned as
Thread 0 : 0, 3, 6, 9
Thread 1 : 1, 4, 7, 10
Thread 2 : 2, 5, 8, 11
If schedule(static,2) is used, then the iterations will be assigned as
Thread 0 : 0, 1, 6, 7
Thread 1 : 2, 3, 8, 9
Thread 2 : 4, 5, 10, 11
Thus the clause schedule(static, total_iterations/thread_count) is more or less equivalent to the
default schedule used by most implementations of OpenMP.
The chunksize can be omitted.
If it is omitted, the chunksize is approximately total_iterations/thread_count.
In a dynamic schedule, the iterations are also broken up into chunks of chunksize consecutive
iterations.
Each thread executes a chunk, and when a thread finishes a chunk, it requests another one from
the run-time system.
This continues until all the iterations are completed.
The chunksize can be omitted.
When it is omitted, a chunksize of 1 is used.
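The following minimal sketch contrasts the static and dynamic schedule types (the printf body is purely illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
   int i;

   /* static,2: chunks of 2 consecutive iterations are assigned
      round-robin before the loop begins executing. */
#  pragma omp parallel for num_threads(3) schedule(static, 2)
   for (i = 0; i < 12; i++)
      printf("static : iteration %2d on thread %d\n", i, omp_get_thread_num());

   /* dynamic,2: a thread requests a new chunk of 2 iterations from
      the run-time system each time it finishes its current chunk. */
#  pragma omp parallel for num_threads(3) schedule(dynamic, 2)
   for (i = 0; i < 12; i++)
      printf("dynamic: iteration %2d on thread %d\n", i, omp_get_thread_num());

   return 0;
}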
Environment variables are named values that can be accessed by a running program.
That is, they’re available in the program’s environment.
Some commonly used environment variables are PATH, HOME, and SHELL.
The PATH variable specifies which directories the shell should search when it's looking for an
executable. It's usually defined in both Unix and Windows.
The HOME variable specifies the location of the user’s home directory.
The SHELL variable specifies the location of the executable for the user’s shell. These are usually
defined in Unix systems. In both Unix-like systems (e.g., Linux and Mac OS X) and Windows,
environment variables can be examined and specified on the command line.
In Unix-like systems, you can use the shell’s command line.
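As a small sketch of how a running program can read an environment variable, here is a C program that looks up PATH with the standard library function getenv:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
   /* getenv returns the value of the named variable, or NULL if unset */
   char* path = getenv("PATH");

   if (path != NULL)
      printf("PATH = %s\n", path);
   else
      printf("PATH is not set\n");
   return 0;
}

OpenMP itself uses environment variables in the same way; for example, the runtime schedule type described above takes its schedule from the OMP_SCHEDULE environment variable.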
1. Queues
A queue is a list abstract data type in which new elements are inserted at the "rear" of the queue and
elements are removed from the “front” of the queue.
A queue can thus be viewed as an abstraction of a line of customers waiting to pay for their
groceries in a supermarket.
The elements of the list are the customers.
New customers go to the end or “rear” of the line, and the next customer to check out is the
customer standing at the “front” of the line.
When a new entry is added to the rear of a queue, we sometimes say that the entry has been
“enqueued,” and when an entry is removed from the front of a queue, we sometimes say that the
entry has been “dequeued.”
Queues occur frequently in computer science.
A queue is also a natural data structure to use in many multithreaded applications.
For example, suppose we have several “producer” threads and several “consumer” threads.
The producer threads might "produce" requests for data from a server (for example, current stock
prices), while the consumer threads might "consume" the requests by finding or generating the
requested data (the current stock prices).
The producer threads could enqueue the requests, and the consumer threads could dequeue
them.
In this example, the process wouldn’t be completed until the consumer threads had given the
requested data to the producer threads.
2. Message-passing:
Each thread alternates between sending messages to randomly chosen threads and checking its
own queue for messages to receive.
3. Sending messages:
Pseudocode for the Send_msg() function might look something like this:
mesg = random();
dest = random() % thread_count;
# pragma omp critical
Enqueue(queue, dest, my_rank, mesg);
4. Termination detection:
Each thread can use a boolean function Done() to determine whether it should terminate. Note
that many threads can update enqueued (any thread may send to a given queue), but the only
thread that will update dequeued is the owner of the queue.
Observe that one thread can update enqueued at the same time that another thread is using it to
compute queue_size.
To see this, let’s suppose thread q is computing queue_size. It will either get the old value of
enqueued or the new value.
It may therefore compute a queue_size of 0 or 1 when queue_size should actually be 1 or 2,
respectively, but in our program this will only cause a modest delay.
Thread q will try again later if queue_size is 0 when it should be 1, and it will execute the critical
section directive unnecessarily if queue_size is 1 when it should be 2.
A first attempt at the Done function might simply check whether the thread's queue is empty:

queue_size = enqueued - dequeued;
if (queue_size == 0)
   return TRUE;
else
   return FALSE;
If thread u executes this code, it’s entirely possible that some thread—call it thread v—will
send a message to thread u after u has computed queue_size = 0. Of course, after
thread u computes queue_size = 0, it will terminate and the message sent by thread v will
never be received.
However, in our program, after each thread has completed the for loop, it won’t send any new
messages.
Thus, if we add a counter done_sending, and each thread increments this after completing
its for loop, then we can implement Done as follows:

queue_size = enqueued - dequeued;
if (queue_size == 0 && done_sending == thread_count)
   return TRUE;
else
   return FALSE;
5. Startup:
When the program begins execution, a single thread, the master thread, will get command-line
arguments and allocate an array of message queues, one for each thread.
This array needs to be shared among the threads, since any thread can send to any other
thread, and hence any thread can enqueue a message in any of the queues.
A message queue will (at a minimum) store:
o a list of messages,
o a pointer or index to the rear of the queue,
o a pointer or index to the front of the queue,
o a count of messages enqueued, and
o a count of messages dequeued.
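A sketch of such a message queue in C might look like the following; the struct layout and names are illustrative assumptions, not a prescribed implementation (a linked list is one of several reasonable choices):

struct queue_node_s {
   int src;                      /* Rank of the sending thread */
   int mesg;                     /* The message itself */
   struct queue_node_s* next_p;  /* Next message in the list */
};

struct queue_s {
   int enqueued;                 /* Count of messages enqueued */
   int dequeued;                 /* Count of messages dequeued */
   struct queue_node_s* front_p; /* Next message to be dequeued */
   struct queue_node_s* rear_p;  /* Most recently enqueued message */
};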
After completing its sends, each thread increments done_sending before proceeding to its final
loop of receives.
Clearly, incrementing done_sending is a critical section, and we could protect it with
a critical directive.
However, OpenMP provides a potentially higher-performance directive: the atomic directive:
# pragma omp atomic
done_sending++;
Locks:
A lock consists of a data structure and functions that allow the programmer to explicitly enforce
mutual exclusion in a critical section.
The use of a lock can be roughly described by the following pseudocode:
/* Executed by one thread */
Initialize the lock data structure;
...
/* Executed by multiple threads */
Attempt to lock or set the lock data structure;
Critical section;
Unlock or unset the lock data structure;
...
/* Executed by one thread */
Destroy the lock data structure;
The lock data structure is shared among the threads that will execute the critical section.
One of the threads (e.g., the master thread) will initialize the lock, and when all the threads are
done using the lock, one of the threads should destroy it.
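OpenMP's run-time library provides simple locks that implement this pattern; a minimal sketch using omp_init_lock, omp_set_lock, omp_unset_lock, and omp_destroy_lock (the critical section body is omitted):

#include <omp.h>

omp_lock_t lock;   /* The lock data structure, shared by all threads */

int main(void) {
   omp_init_lock(&lock);        /* Initialize: executed by one thread */

#  pragma omp parallel num_threads(4)
   {
      omp_set_lock(&lock);      /* Blocks until the lock is acquired */
      /* Critical section: executed by one thread at a time */
      omp_unset_lock(&lock);    /* Release the lock */
   }

   omp_destroy_lock(&lock);     /* Destroy: executed by one thread */
   return 0;
}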
Cache Coherence:
CPU caches are managed by system hardware: programmers don’t have direct control over them.
This has several important consequences for shared-memory systems.
To understand these issues, suppose we have a shared-memory system with two cores, each of
which has its own private data cache.
As long as the two cores only read shared data, there is no problem.
For example, suppose that x is a shared variable that has been initialized to 2, y0 is private and
owned by core 0, and y1 and z1 are private and owned by core 1.
Now suppose the following statements are executed at the indicated times:

Time   Core 0                          Core 1
0      y0 = x;                         y1 = 3*x;
1      x = 7;                          Statement(s) not involving x
2      Statement(s) not involving x    z1 = 4*x;
Then the memory location for y0 will eventually get the value 2, and the memory location
for y1 will eventually get the value 6.
However, it’s not so clear what value z1 will get.
It might at first appear that since core 0 updates x to 7 before the assignment to z1, z1 will get the
value 4 × 7 = 28.
However, at time 0, x is in the cache of core 1.
So unless for some reason x is evicted from core 1's cache and then reloaded into core 1's cache,
it actually appears that the original value x = 2 may be used, and z1 will get the value 4 × 2 = 8.
False Sharing:
Suppose we repeatedly call a function f(i, j) and add the computed values into a vector y:

int i, j;
double y[m];
/* Assign y = 0 */
. . .
for (i = 0; i < m; i++)
   for (j = 0; j < n; j++)
      y[i] += f(i, j);

We can parallelize this by dividing the iterations in the outer loop among the cores.
If we have core_count cores, we might assign the first m/core_count iterations to the first core,
the next m/core_count iterations to the second core, and so on.
Now suppose our shared-memory system has two cores, m = 8, doubles are eight bytes, cache
lines are 64 bytes, and y[0] is stored at the beginning of a cache line.
A cache line can store eight doubles, and y takes one full cache line.
What happens when core 0 and core 1 simultaneously execute their codes?
Since all of y is stored in a single cache line, each time one of the cores executes the statement y[i]
+= f(i,j), the line will be invalidated, and the next time the other core tries to execute this
statement it will have to fetch the updated line from memory!
So if n is large, we would expect that a large percentage of the assignments y[i] += f(i,j) will
access main memory, despite the fact that core 0 and core 1 never access each other's
elements of y.
This is called false sharing, because the system is behaving as if the elements of y were being
shared by the cores.
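One common remedy is to have each thread accumulate into a private temporary so that the shared vector is written as rarely as possible; a minimal sketch, assuming m, n, y, and f are defined as in the example above:

/* Accumulate into a thread-private temporary so each element of
   y is written once, rather than n times; the cores invalidate
   each other's cache line far less often. */
#  pragma omp parallel for private(j)
   for (i = 0; i < m; i++) {
      double tmp = 0.0;   /* Private to the executing thread */
      for (j = 0; j < n; j++)
         tmp += f(i, j);
      y[i] = tmp;         /* Single write to the shared array */
   }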