DS1822-Parallel Computing - Unit2
OpenMP was explicitly designed to allow programmers to incrementally parallelize existing serial
programs; this is virtually impossible with MPI and fairly difficult with Pthreads.
Program Structure:
#pragma
Pragmas (like all preprocessor directives) are, by default, one line in length, so if a pragma won't
fit on a single line, the newline needs to be "escaped", that is, preceded by a backslash.
The details of what follows the #pragma depend entirely on which extensions are being used.
Let’s take a look at a very simple example, a “hello, world” program that uses OpenMP.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Hello(void);   /* Thread function */

int main(int argc, char* argv[]) {
   /* Get number of threads from command line */
   int thread_count = strtol(argv[1], NULL, 10);

#  pragma omp parallel num_threads(thread_count)
   Hello();

   return 0;
}  /* main */

void Hello(void) {
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();

   printf("Hello from thread %d of %d\n", my_rank, thread_count);
}  /* Hello */
1. Compiling and running OpenMP programs:
To run the program, we specify the number of threads on the command line. For example, to
compile the program with gcc and run it with four threads, we might type

$ gcc -g -Wall -fopenmp -o omp_hello omp_hello.c
$ ./omp_hello 4
OpenMP Clauses:
OpenMP (Open Multi-Processing) is a widely-used API in parallel computing that supports
multi-platform shared-memory multiprocessing.
It includes directives for parallel programming in C, C++, and Fortran. OpenMP clauses
are used to control the behavior and attributes of parallel constructs.
Here's an overview of some key clauses:
Data Scope Clauses
private(var-list): Each thread gets its own copy of the variables in var-list, which are
uninitialized.
shared(var-list): Variables in var-list are shared among all threads.
firstprivate(var-list): Like private, but each thread's copy is initialized with the value of the
variable before the parallel region.
lastprivate(var-list): Like private, but the variable is updated with the value from the last
iteration of the loop.
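As a rough illustration of how these data-scope clauses behave, here is a minimal sketch (the variables x and y are illustrative, not from any particular program):

#include <stdio.h>

int main(void) {
   int i, x = 5, y = 0;

   /* Each thread's copy of x starts at 5 (firstprivate); y is
      private inside the loop, and lastprivate copies out the value
      from the sequentially last iteration (i == 9). */
#  pragma omp parallel for firstprivate(x) lastprivate(y)
   for (i = 0; i < 10; i++)
      y = x + i;

   printf("y = %d\n", y);   /* Prints y = 14 */
   return 0;
}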
Work-sharing Clauses
schedule(type[, chunk]): Specifies how iterations of a loop are divided among threads. Types
include static, dynamic, guided, and runtime.
ordered: Ensures that iterations are executed in the order in which they would have been executed
sequentially.
nowait: Threads do not synchronize at the end of the construct.
Synchronization Clauses
critical: A block of code that must be executed by only one thread at a time.
atomic: Ensures that a specific memory location is updated atomically.
barrier: All threads must reach this point before any can proceed.
flush: Ensures all threads have a consistent view of memory.
Reduction Clauses
reduction(op: var-list): Performs a reduction on variables in var-list using the specified operator
op. Common operators include +, *, &&, and ||.
Tasking Clauses
task: Defines a unit of work that can be executed independently.
taskwait: Waits for the completion of child tasks.
depend: Specifies dependencies between tasks.
These clauses allow for fine-grained control over parallel execution and data sharing, making
OpenMP a powerful tool for parallelizing code.
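As an illustration of the tasking constructs, here is a minimal sketch; the function do_work is a hypothetical stand-in for a unit of work:

#include <stdio.h>
#include <omp.h>

void do_work(int i) {   /* Hypothetical unit of work */
   printf("task %d executed by thread %d\n", i, omp_get_thread_num());
}

int main(void) {
   int i;
#  pragma omp parallel
#  pragma omp single
   {
      for (i = 0; i < 4; i++) {
         /* Each iteration becomes an independently schedulable task */
#        pragma omp task firstprivate(i)
         do_work(i);
      }
      /* Wait for the child tasks created above to complete */
#     pragma omp taskwait
   }
   return 0;
}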
Reduction Clause:
A reduction operator is a binary operation (such as addition or multiplication)
and a reduction is a computation that repeatedly applies the same reduction operator to a
sequence of operands in order to get a single result.
Furthermore, all of the intermediate results of the operation should be stored in the same
variable: the reduction variable. For example, if A is an array of n ints, the computation
int i, sum = 0;
for (i = 0; i < n; i++)
   sum += A[i];
is a reduction in which the reduction operator is addition.
In OpenMP we can specify that the result of such a computation should be stored in a reduction
variable.
To do this, a reduction clause can be added to a parallel directive.
global_result = 0.0;
#  pragma omp parallel num_threads(thread_count) \
      reduction(+: global_result)
   global_result += Local_trap(a, b, n);
The code specifies that global_result is a reduction variable and the plus sign ("+") indicates that
the reduction operator is addition.
OpenMP creates a private variable for each thread, and the run-time system stores each thread's
result in this private variable.
OpenMP also creates a critical section and the values stored in the private variables are added in
this critical section. Thus, the calls to Local_trap can take place in parallel.
Local_trap is a function that has no critical section.
Rather, each thread returns its part of the calculation, the final value of its my_result variable.
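A sketch of what such a Local_trap function might look like is given below. It assumes the trapezoidal rule is being used to estimate the integral of a function f (assumed to be defined elsewhere) over [a, b] with n trapezoids, that n is evenly divisible by thread_count, and that each thread handles a contiguous block of the interval:

double f(double x);   /* Integrand, assumed defined elsewhere */

double Local_trap(double a, double b, int n) {
   double h, x, my_result;
   double local_a, local_b;
   int i, local_n;
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();

   h = (b - a) / n;              /* Width of each trapezoid */
   local_n = n / thread_count;   /* Trapezoids for this thread */
   local_a = a + my_rank * local_n * h;
   local_b = local_a + local_n * h;
   my_result = (f(local_a) + f(local_b)) / 2.0;
   for (i = 1; i <= local_n - 1; i++) {
      x = local_a + i * h;
      my_result += f(x);
   }
   my_result = my_result * h;

   return my_result;   /* This thread's part of the integral */
}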
The syntax of the reduction clause is
reduction(<operator>: <variable list>)
1.atomic
Specifies that a memory location will be updated atomically.
#pragma omp atomic
expression
Parameters
expression
The statement that has the lvalue, whose memory location you want to protect against more
than one write.
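For example, a minimal sketch (the shared variable counter is illustrative):

#include <stdio.h>

int main(void) {
   int counter = 0;

#  pragma omp parallel num_threads(4)
   {
      /* The load-increment-store of counter is made indivisible,
         so no increments are lost. */
#     pragma omp atomic
      counter++;
   }

   printf("counter = %d\n", counter);   /* Always prints counter = 4 */
   return 0;
}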
2.barrier:
Synchronizes all threads in a team; all threads pause at the barrier, until all threads execute the
barrier.
The barrier directive supports no clauses.
#pragma omp barrier
3.critical:
Specifies that a block of code is executed by only one thread at a time.
#pragma omp critical [(name)]
{
code_block
}
4.flush:
Specifies that all threads have the same view of memory for all shared objects.
#pragma omp flush [(var)]
Parameters
var
(Optional) A comma-separated list of variables that represent objects you want to synchronize.
If var isn't specified, all memory is flushed.
5.for
Causes the work done in a for loop inside a parallel region to be divided among threads.
#pragma omp [parallel] for [clauses]
for_statement
Parameters
clauses
(Optional) Zero or more clauses, see the Remarks section.
for_statement
A for loop. Undefined behavior will result if user code in the for loop changes the index variable.
The for directive supports the following clauses:
private
firstprivate
lastprivate
reduction
ordered
schedule
nowait
If parallel is also specified, clauses can be any clause accepted by the parallel or for directives,
except nowait.
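For instance, a minimal sketch of a parallel for loop (the array a and its size are illustrative):

#include <stdio.h>

int main(void) {
   int i;
   double a[100];

   /* The 100 iterations are divided among the threads; the loop
      index i is automatically made private to each thread. */
#  pragma omp parallel for
   for (i = 0; i < 100; i++)
      a[i] = 2.0 * i;

   printf("a[99] = %f\n", a[99]);   /* Prints a[99] = 198.000000 */
   return 0;
}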
6.master:
Specifies that only the main thread should execute a section of the program.
#pragma omp master
{
code_block
}
7.ordered:
Specifies that code under a parallelized for loop should be executed like a sequential loop.
#pragma omp ordered
structured-block
The ordered directive must be within the dynamic extent of a for or parallel for construct with
an ordered clause.
8.parallel:
Defines a parallel region, which is code that will be executed by multiple threads in parallel.
#pragma omp parallel [clauses]
{
code_block
}
Parameters
clauses
(Optional) Zero or more clauses, see the Remarks section.
The parallel directive supports the following clauses:
if
private
firstprivate
default
shared
copyin
reduction
num_threads
parallel can also be used with the for and sections directives.
9.sections:
Identifies code sections to be divided among all threads.
#pragma omp [parallel] sections [clauses]
{
#pragma omp section
{
code_block
}
}
Parameters
clauses
(Optional) Zero or more clauses, see the Remarks section.
Remarks
The sections directive supports the following clauses:
private
firstprivate
lastprivate
reduction
nowait
If parallel is also specified, clauses can be any clause accepted by the parallel or sections directives,
except nowait.
10.single:
Lets you specify that a section of code should be executed on a single thread, not necessarily the
main thread.
#pragma omp single [clauses]
{
code_block
}
Parameters
clauses
(Optional) Zero or more clauses, see the Remarks section.
Remarks
The single directive supports the following clauses:
private
firstprivate
copyprivate
nowait
11.threadprivate:
Specifies that a variable is private to a thread.
#pragma omp threadprivate(var)
Parameters
var
A comma-separated list of variables that you want to make private to a thread.
Loop Scheduling:
In OpenMP, assigning iterations to threads is called scheduling, and the schedule clause can be
used to assign iterations in either a parallel for or a for directive.
o The static schedule type: The iterations can be assigned to the threads before the loop is
executed.
o The dynamic or guided schedule types: The iterations are assigned to the threads while
the loop is executing, so after a thread completes its current set of iterations, it can
request more from the run-time system.
o The runtime schedule type: The schedule is determined at run-time.
The chunksize is a positive integer.
In OpenMP, a chunk of iterations is a block of iterations that would be executed consecutively in
the serial loop.
The number of iterations in the block is the chunksize.
Only static, dynamic, and guided schedules can have a chunksize.
This determines the details of the schedule, but its exact interpretation depends on the type.
For a static schedule, the system assigns chunks of chunksize iterations to each thread in a round-
robin fashion.
As an example, suppose we have 12 iterations, 0, 1, ..., 11, and three threads.
Then if schedule(static,1) is used in the parallel for or for directive, we’ve already seen that the
iterations will be assigned as
Thread 0 : 0, 3, 6, 9
Thread 1 : 1, 4, 7, 10
Thread 2 : 2, 5, 8, 11
If schedule(static,2) is used, then the iterations will be assigned as
Thread 0 : 0, 1, 6, 7
Thread 1 : 2, 3, 8, 9
Thread 2 : 4, 5, 10, 11
Thus the clause schedule(static, total_iterations/thread_count) is more or less equivalent to the
default schedule used by most implementations of OpenMP.
The chunksize can be omitted.
If it is omitted, the chunksize is approximately total_iterations/thread_count.
In a dynamic schedule, the iterations are also broken up into chunks of chunksize consecutive
iterations.
Each thread executes a chunk, and when a thread finishes a chunk, it requests another one from
the run-time system.
This continues until all the iterations are completed.
The chunksize can be omitted.
When it is omitted, a chunksize of 1 is used.
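The following minimal sketch contrasts the static and dynamic schedule types (the printf body is purely illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
   int i;

   /* static,2: chunks of 2 consecutive iterations are assigned
      round-robin before the loop begins executing. */
#  pragma omp parallel for num_threads(3) schedule(static, 2)
   for (i = 0; i < 12; i++)
      printf("static : iteration %2d on thread %d\n", i, omp_get_thread_num());

   /* dynamic,2: a thread requests a new chunk of 2 iterations from
      the run-time system each time it finishes its current chunk. */
#  pragma omp parallel for num_threads(3) schedule(dynamic, 2)
   for (i = 0; i < 12; i++)
      printf("dynamic: iteration %2d on thread %d\n", i, omp_get_thread_num());

   return 0;
}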
Environment variables are named values that can be accessed by a running program.
That is, they’re available in the program’s environment.
Some commonly used environment variables are PATH, HOME, and SHELL.
The PATH variable specifies which directories the shell should search when it's looking for an
executable. It's usually defined in both Unix and Windows.
The HOME variable specifies the location of the user’s home directory.
The SHELL variable specifies the location of the executable for the user’s shell. These are usually
defined in Unix systems. In both Unix-like systems (e.g., Linux and Mac OS X) and Windows,
environment variables can be examined and specified on the command line.
In Unix-like systems, you can use the shell’s command line.
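As a small sketch of how a running program can read an environment variable, here is a C program that looks up PATH with the standard library function getenv:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
   /* getenv returns the value of the named variable, or NULL if unset */
   char* path = getenv("PATH");

   if (path != NULL)
      printf("PATH = %s\n", path);
   else
      printf("PATH is not set\n");
   return 0;
}

OpenMP itself uses environment variables in the same way; for example, the runtime schedule type described above takes its schedule from the OMP_SCHEDULE environment variable.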
1. Queues
A queue is a list abstract data type in which new elements are inserted at the "rear" of the queue and
elements are removed from the “front” of the queue.
A queue can thus be viewed as an abstraction of a line of customers waiting to pay for their
groceries in a supermarket.
The elements of the list are the customers.
New customers go to the end or “rear” of the line, and the next customer to check out is the
customer standing at the “front” of the line.
When a new entry is added to the rear of a queue, we sometimes say that the entry has been
“enqueued,” and when an entry is removed from the front of a queue, we sometimes say that the
entry has been “dequeued.”
Queues occur frequently in computer science.
A queue is also a natural data structure to use in many multithreaded applications.
For example, suppose we have several “producer” threads and several “consumer” threads.
The producer threads might "produce" requests for data from a server (for example, current stock
prices), while the consumer threads might "consume" the requests by finding or generating the
requested data (the current stock prices).
The producer threads could enqueue the requests, and the consumer threads could dequeue
them.
In this example, the process wouldn’t be completed until the consumer threads had given the
requested data to the producer threads.
2. Message-passing:
Each thread alternates between sending messages to randomly chosen threads and checking its
own queue for messages to receive.
3. Sending messages:
Pseudocode for the Send_msg() function might look something like this:
mesg = random();
dest = random() % thread_count;
# pragma omp critical
Enqueue(queue, dest, my_rank, mesg);
4. Termination detection:
Each thread can use a boolean function Done() to determine whether it should terminate. Note
that many threads can update enqueued (any thread may send to a given queue), but the only
thread that will update dequeued is the owner of the queue.
Observe that one thread can update enqueued at the same time that another thread is using it to
compute queue_size.
To see this, let’s suppose thread q is computing queue_size. It will either get the old value of
enqueued or the new value.
It may therefore compute a queue_size of 0 or 1 when queue_size should actually be 1 or 2,
respectively, but in our program this will only cause a modest delay.
Thread q will try again later if queue_size is 0 when it should be 1, and it will execute the critical
section directive unnecessarily if queue_size is 1 when it should be 2.
A first attempt at the Done function might simply check whether the thread's queue is empty:

queue_size = enqueued - dequeued;
if (queue_size == 0)
   return TRUE;
else
   return FALSE;
If thread u executes this code, it’s entirely possible that some thread—call it thread v—will
send a message to thread u after u has computed queue_size = 0. Of course, after
thread u computes queue_size = 0, it will terminate and the message sent by thread v will
never be received.
However, in our program, after each thread has completed the for loop, it won’t send any new
messages.
Thus, if we add a counter done_sending, and each thread increments this after completing
its for loop, then we can implement Done as follows:

queue_size = enqueued - dequeued;
if (queue_size == 0 && done_sending == thread_count)
   return TRUE;
else
   return FALSE;
5. Startup:
When the program begins execution, a single thread, the master thread, will get command-line
arguments and allocate an array of message queues, one for each thread.
This array needs to be shared among the threads, since any thread can send to any other
thread, and hence any thread can enqueue a message in any of the queues.
A message queue will (at a minimum) store:
o a list of messages,
o a pointer or index to the rear of the queue,
o a pointer or index to the front of the queue,
o a count of messages enqueued, and
o a count of messages dequeued.
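A sketch of such a message queue in C might look like the following; the struct layout and names are illustrative assumptions, not a prescribed implementation (a linked list is one of several reasonable choices):

struct queue_node_s {
   int src;                      /* Rank of the sending thread */
   int mesg;                     /* The message itself */
   struct queue_node_s* next_p;  /* Next message in the list */
};

struct queue_s {
   int enqueued;                 /* Count of messages enqueued */
   int dequeued;                 /* Count of messages dequeued */
   struct queue_node_s* front_p; /* Next message to be dequeued */
   struct queue_node_s* rear_p;  /* Most recently enqueued message */
};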
After completing its sends, each thread increments done_sending before proceeding to its final
loop of receives.
Clearly, incrementing done_sending is a critical section, and we could protect it with
a critical directive.
However, OpenMP provides a potentially higher-performance directive: the atomic directive:
# pragma omp atomic
done_sending++;
Locks:
A lock consists of a data structure and functions that allow the programmer to explicitly enforce
mutual exclusion in a critical section.
The use of a lock can be roughly described by the following pseudocode:
/* Executed by one thread */
Initialize the lock data structure;
...
/* Executed by multiple threads */
Attempt to lock or set the lock data structure;
Critical section;
Unlock or unset the lock data structure;
...
/* Executed by one thread */
Destroy the lock data structure;
The lock data structure is shared among the threads that will execute the critical section.
One of the threads (e.g., the master thread) will initialize the lock, and when all the threads are
done using the lock, one of the threads should destroy it.
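OpenMP's run-time library provides simple locks that implement this pattern; a minimal sketch using omp_init_lock, omp_set_lock, omp_unset_lock, and omp_destroy_lock (the critical section body is omitted):

#include <omp.h>

omp_lock_t lock;   /* The lock data structure, shared by all threads */

int main(void) {
   omp_init_lock(&lock);        /* Initialize: executed by one thread */

#  pragma omp parallel num_threads(4)
   {
      omp_set_lock(&lock);      /* Blocks until the lock is acquired */
      /* Critical section: executed by one thread at a time */
      omp_unset_lock(&lock);    /* Release the lock */
   }

   omp_destroy_lock(&lock);     /* Destroy: executed by one thread */
   return 0;
}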
Cache Coherence:
CPU caches are managed by system hardware: programmers don’t have direct control over them.
This has several important consequences for shared-memory systems.
To understand these issues, suppose we have a shared-memory system with two cores, each of
which has its own private data cache.
As long as the two cores only read shared data, there is no problem.
For example, suppose that x is a shared variable that has been initialized to 2, y0 is private and
owned by core 0, and y1 and z1 are private and owned by core 1.
Now suppose the following statements are executed at the indicated times:

Time   Core 0                          Core 1
0      y0 = x;                         y1 = 3*x;
1      x = 7;                          Statement(s) not involving x
2      Statement(s) not involving x    z1 = 4*x;
Then the memory location for y0 will eventually get the value 2, and the memory location
for y1 will eventually get the value 6.
However, it’s not so clear what value z1 will get.
It might at first appear that since core 0 updates x to 7 before the assignment to z1, z1 will get the
value 4 × 7 = 28.
However, at time 0, x is in the cache of core 1.
So unless for some reason x is evicted from core 1's cache and then reloaded into core 1's cache,
it actually appears that the original value x = 2 may be used, and z1 will get the value 4 × 2 = 8.
False Sharing:
Suppose we repeatedly call a function f(i, j) and add the computed values into a vector y:

int i, j;
double y[m];
/* Assign y = 0 */
. . .
for (i = 0; i < m; i++)
   for (j = 0; j < n; j++)
      y[i] += f(i, j);

We can parallelize this by dividing the iterations in the outer loop among the cores.
If we have core_count cores, we might assign the first m/core_count iterations to the first core,
the next m/core_count iterations to the second core, and so on.
Now suppose our shared-memory system has two cores, m = 8, doubles are eight bytes, cache
lines are 64 bytes, and y[0] is stored at the beginning of a cache line.
A cache line can store eight doubles, and y takes one full cache line.
What happens when core 0 and core 1 simultaneously execute their codes?
Since all of y is stored in a single cache line, each time one of the cores executes the statement y[i]
+= f(i,j), the line will be invalidated, and the next time the other core tries to execute this
statement it will have to fetch the updated line from memory!
So if n is large, we would expect that a large percentage of the assignments y[i] += f(i,j) will
access main memory, despite the fact that core 0 and core 1 never access each other's
elements of y.
This is called false sharing, because the system is behaving as if the elements of y were being
shared by the cores.
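One common remedy is to have each thread accumulate into a private temporary so that the shared vector is written as rarely as possible; a minimal sketch, assuming m, n, y, and f are defined as in the example above:

/* Accumulate into a thread-private temporary so each element of
   y is written once, rather than n times; the cores invalidate
   each other's cache line far less often. */
#  pragma omp parallel for private(j)
   for (i = 0; i < m; i++) {
      double tmp = 0.0;   /* Private to the executing thread */
      for (j = 0; j < n; j++)
         tmp += f(i, j);
      y[i] = tmp;         /* Single write to the shared array */
   }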