Parallelism
MPI (Message Passing Interface) is the standard API for writing message-passing parallel programs
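As a minimal illustration (not part of the original slides), an MPI program in C that simply reports each process's rank and the total process count could look like this:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?          */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes in total? */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut the MPI runtime down    */
    return 0;
}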
Steps for Writing a Parallel Program
If you are starting with an existing serial program,
debug the serial code completely
Identify which parts of the program can be
executed concurrently:
Requires a thorough understanding of the algorithm
Exploit any parallelism which may exist
May require restructuring of the program and/or
algorithm. May require an entirely new algorithm.
Decompose the program:
Functional Parallelism
Data Parallelism
Combination of both
Steps for Writing a Parallel Program
Code development
Code may be influenced/determined by
machine architecture
Choose a programming paradigm
Determine communication
Add code to accomplish process control and
communications
Compile, Test, Debug
Optimization
Measure Performance
Locate Problem Areas
Improve them
Program Decomposition
There are three methods for decomposing a problem into smaller tasks to be performed in parallel: Functional Decomposition, Domain Decomposition, or a combination of both
Functional Decomposition (Functional
Parallelism)
[Figure: the problem divided into separate functional tasks, each assigned to a different process]
Coarse-grain Parallelism
Typified by long computations consisting
of large numbers of instructions between
communication synchronization points
High computation to communication ratio
Implies more opportunity for performance
increase
Harder to load balance efficiently
Imagine that the computational workload is 10 kg of material:
Sand = fine-grain
Cinder blocks = coarse-grain
Which is easier to distribute?
Coarse-grain Parallelism
[Figure: long stretches of computation separated by communication/synchronization points]
Considerations: Data Dependency
A data dependency exists when there are multiple uses of the same storage location
Types of data dependencies
Flow Dependent: Process 2 uses a variable
computed by Process 1. Process 1 must store/send
the variable before Process 2 fetches
Output Dependent: Process 1 and Process 2 both
compute the same variable and Process 2's value
must be stored/sent after Process 1's
Control Dependent: Process 2's execution depends
upon a conditional statement in Process 1. Process 1
must complete before a decision can be made about
executing Process 2
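To make the flow-dependency case concrete, here is a small illustrative C program (my addition, not from the slides): each loop iteration reads a value written by the previous iteration, so the iterations cannot simply be handed to different processes without communicating that value.

#include <stdio.h>

#define N 8

int main(void)
{
    double a[N] = {1.0};   /* a[0] = 1.0, remaining elements are 0.0 */
    double b[N];

    for (int i = 0; i < N; i++)
        b[i] = 1.0;

    /* Flow dependency: iteration i reads a[i-1], which is written by
     * iteration i-1, so the iterations must run in order (or the value
     * must be communicated between the processes that own them).       */
    for (int i = 1; i < N; i++)
        a[i] = a[i-1] + b[i];

    printf("a[%d] = %.1f\n", N - 1, a[N - 1]);
    return 0;
}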
Considerations: Data Dependency
How to handle data dependencies?
Distributed memory:
Communicate required data at synchronization points
Shared memory:
Synchronize read/write operations between processes
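As one possible sketch of the shared-memory case (my addition, using POSIX threads; the variable and function names are made up), a mutex serializes the read/write operations on a shared counter:

#include <stdio.h>
#include <pthread.h>

static long counter = 0;                        /* shared storage location   */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);              /* synchronize the update    */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)                                  /* compile with -pthread     */
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);         /* always 200000 with the lock */
    return 0;
}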
Considerations: Communication
Patterns and Bandwidth
For some problems, increasing the number of
processors will:
Decrease the execution time attributable to
computation
But also, increase the execution time attributable to
communication
Communication patterns also affect the
computation to communication ratio.
For example, gather-scatter communications
between a single processor and N other
processors will be impacted more by an increase
in latency than N processors communicating only
with nearest neighbors
In the gather-scatter case, all processes must also wait until every process has reached the collection point
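For reference, a gather of one value from every process to a single root looks roughly like this in MPI (an illustrative sketch; the value being gathered is a placeholder). The root cannot continue until every process has contributed, which is the waiting behavior described above.

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    double my_value, *all_values = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    my_value = (double)rank;                        /* stand-in for a local result */

    if (rank == 0)                                  /* only the root needs the     */
        all_values = malloc(size * sizeof(double)); /* full receive buffer         */

    /* Every process sends one double to rank 0; rank 0 cannot proceed
     * until contributions from all processes have arrived.              */
    MPI_Gather(&my_value, 1, MPI_DOUBLE,
               all_values, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        free(all_values);
    MPI_Finalize();
    return 0;
}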
Considerations: I/O Operations
I/O operations are generally regarded as
inhibitors to parallelism
In an environment where all processors
see the same file space, write operations
will result in file overwriting
Read operations will be affected by the
fileserver's ability to handle multiple read
requests at the same time
I/O which must be conducted over the
network (non-local) can cause severe
bottlenecks
Considerations: I/O Operations
Some alternatives:
Reduce overall I/O as much as possible
Confine I/O to specific serial portions of the job
For example, process 0 could read an input file and then communicate the required data to the other processes. Likewise, process 1 could perform the write operations after receiving the required data from all other processes.
Create unique filenames for each process's input/output file(s)
For distributed memory systems with shared file space, perform I/O in local, non-shared file space
For example, each processor may have /tmp file space which can be used. This is usually much more efficient than performing I/O over the network to one's home directory
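A sketch of the "process 0 reads, then communicates" alternative in C with MPI (the file name, data size, and format are hypothetical):

#include <stdio.h>
#include <mpi.h>

#define NVALS 100

int main(int argc, char **argv)
{
    int rank;
    double data[NVALS] = {0.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Only one process touches the file system; the file name and
         * format here are made up for illustration.                     */
        FILE *fp = fopen("input.dat", "r");
        if (fp != NULL) {
            for (int i = 0; i < NVALS; i++)
                if (fscanf(fp, "%lf", &data[i]) != 1)
                    break;
            fclose(fp);
        }
    }

    /* Everyone else receives the data over the network instead of
     * issuing many competing read requests to the file server.          */
    MPI_Bcast(data, NVALS, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}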
Considerations: Fault Tolerance and
Restarting
Checkpointing: periodically save intermediate state/results so a failed job does not have to start over
Restarting: resume the computation from the most recent checkpoint
Considerations: Deadlock
Deadlock describes a condition where two or
more processes are waiting for an event or
communication from one of the other processes.
The simplest example is demonstrated by two
processes which are both programmed to
read/receive from the other before
writing/sending.
Process 1               Process 2
X = 1                   Y = 10
Recv(Process 2, Y)      Recv(Process 1, X)
Send(Process 2, X)      Send(Process 1, Y)
Z = X + Y               Z = X + Y
…                       …
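Expressed in MPI (my illustration, not part of the slides), two processes that both block in a receive first will hang forever; one common remedy is MPI_Sendrecv, which pairs the send and the receive so no ordering problem arises:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, other, x, y;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* run with exactly 2 processes */
    other = 1 - rank;
    x = (rank == 0) ? 1 : 10;

    /* Deadlock version (do NOT do this): both processes block in
     * MPI_Recv and neither ever reaches its MPI_Send.
     *
     *   MPI_Recv(&y, 1, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
     *   MPI_Send(&x, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
     */

    /* Safe version: MPI_Sendrecv performs the exchange without relying
     * on a fixed send/receive ordering, so it cannot deadlock.          */
    MPI_Sendrecv(&x, 1, MPI_INT, other, 0,
                 &y, 1, MPI_INT, other, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("process %d: z = %d\n", rank, x + y);
    MPI_Finalize();
    return 0;
}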
Considerations: Debugging
Debugging parallel programs is significantly more challenging than debugging serial programs
Begin debugging as soon as program development starts
Use a modular approach to program
development
Pay as close attention to communication
details as to computation details
Essentials of Loop Parallelism
In many problems, a loop construct forms the main computational component of the code. Loops are a prime target for parallelizing and vectorizing, since a program often spends much of its time in them. When it can be done, parallelizing these sections of code can have dramatic benefits.
A step-wise refinement procedure for developing the parallel algorithms will be employed: an initial solution for each problem will be presented and then improved by considering performance issues
Essentials of Loop Parallelism
Pseudo-code will be used to describe the
solutions. The solutions will address the following
issues:
identification of parallelism
program decomposition
load balancing (static vs. dynamic)
task granularity in the case of dynamic load
balancing
communication patterns - overlapping
communication and computation
Note the difference in approach between message passing and data parallel programming: message passing explicitly parallelizes the loops, whereas data parallel replaces the loops by working on entire arrays in parallel
Example: PI Calculation (Serial)
Problem is:
Computationally intensive
Minimal communication
The value of PI can be calculated in a number of ways,
many of which are easily parallelized
Consider the following method of approximating PI
Inscribe a circle in a square
Randomly generate points in the square
Determine the number of points in the square that are
also in the circle
Let r be the number of points in the circle divided by the
number of points in the square
PI ~ 4 r
Note that the more points generated, the better the
approximation
Example: PI Calculation (Serial)
[Figure: a circle of radius r inscribed in a square of side 2r; the ratio of the circle's area to the square's area is π/4, which is why PI ≈ 4 × (points in circle) / (points in square)]
Example: PI Calculation (Serial)
Serial pseudo code for this procedure:
npoints = 10000
circle_count = 0
do j = 1, npoints
    generate 2 random numbers between 0 and 1
    xcoordinate = random1
    ycoordinate = random2
    if (xcoordinate, ycoordinate) inside circle
        then circle_count = circle_count + 1
end do
PI = 4.0*circle_count/npoints
Note that most of the time in running this program would be spent executing the loop
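A runnable C version of this serial procedure might look as follows (a sketch; the point count, the seed, and the use of a unit square with a quarter circle are my own choices, but the area ratio is still π/4):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long npoints = 1000000;
    long circle_count = 0;

    srand(12345);                              /* fixed seed for repeatability  */
    for (long j = 0; j < npoints; j++) {
        double x = (double)rand() / RAND_MAX;  /* random point in the unit square */
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)              /* inside the quarter circle     */
            circle_count++;
    }

    printf("PI is approximately %f\n", 4.0 * circle_count / npoints);
    return 0;
}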
Example: PI Calculation (Parallel)
Parallel strategy: break the loop into portions
which can be executed by the processors.
For the task of approximating PI:
each processor executes its portion of the loop a
number of times
each processor can do its work without requiring
any information from the other processors (there
are no data dependencies). This situation is known
as Embarrassingly Parallel
Use the SPMD (Single Program, Multiple Data) model –
one process acts as master and collects the results
Example: PI Calculation (Parallel)
Message passing pseudo code:
npoints = 10000
circle_count = 0
p = number of processors
num = npoints/p
do j = 1, num
    generate 2 random numbers between 0 and 1
    xcoordinate = random1; ycoordinate = random2
    if (xcoordinate, ycoordinate) inside circle
        then circle_count = circle_count + 1
end do
if I am master
    receive from workers their circle_counts
    compute PI (use master and workers calculations)
else if I am worker
    send to master circle_count
endif
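A runnable C/MPI version of this strategy might look like the following sketch; it uses an MPI_Reduce collective in place of the explicit worker-to-master sends, which is a common simplification:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const long npoints = 1000000;
    long local_count = 0, circle_count = 0;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    srand(12345 + rank);                       /* a different stream per process */
    long num = npoints / size;                 /* my share of the loop           */
    for (long j = 0; j < num; j++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            local_count++;
    }

    /* The reduction plays the role of the workers sending their counts
     * to the master: rank 0 receives the sum of all local counts.       */
    MPI_Reduce(&local_count, &circle_count, 1, MPI_LONG, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("PI is approximately %f\n",
               4.0 * circle_count / (double)(num * size));

    MPI_Finalize();
    return 0;
}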
Example: PI Calculation (Parallel)
Data parallel solution:
The data parallel solution processes entire arrays at the same time.
No looping is used.
Arrays automatically distributed to processors.
All message passing is done behind the
scenes. In data parallel, one node, a sort of
master, usually holds all scalar values. The
SUM function does a reduction and leaves the
value in a scalar variable.
A temporary array, COUNTER, with the same
size as RANDOM is created for the sum
operation
Example: PI Calculation (Parallel)
Data parallel pseudo code:
fill RANDOM with 2 random numbers between 0 and 1
set COUNTER to 1 wherever the corresponding point in RANDOM is inside the circle, else 0
circle_count = SUM(COUNTER)
PI = 4.0*circle_count/npoints
Message Passing Solution:
With the Fortran storage scheme, perform a block-cyclic distribution of the array.
Implement as an SPMD model.
The master process initializes the array, sends info to the worker processes and receives results.
Each worker process receives info, performs its share of the computation and sends results to the master.
Example:
Array Elements Calculation (Parallel)
Message passing pseudo code:
find out if I am master or worker
if I am master
    initialize the array
    send each worker info on the part of the array it owns
    send each worker its portion of the initial array
    receive results from each worker
else if I am worker
    receive from master info on the part of the array I own
    receive from master my portion of the initial array
    calculate my portion of the array
    send master my results
endif
[Figure: simple heat equation 5-point stencil — U(x,y) is updated from its neighbors U(x-1,y), U(x+1,y), U(x,y-1), and U(x,y+1); interior elements of a process's subarray depend only on local data, while border elements also need data owned by neighboring processes]
Example:
Simple Heat Equation (Parallel)
First pseudo code solution:
find out if I am master or worker
if I am master
    initialize array
    send each worker starting info
    do until all workers have converged
        gather from all workers convergence data
        broadcast to all workers convergence signal
    end do
    receive results from each worker
else if I am worker
    receive from master starting info
    do until all workers have converged
        update time
        send neighbors my border info
        receive from neighbors their border info
        update my portion of solution array
        determine if my solution has converged
        send master convergence data
        receive from master convergence signal
    end do
    send master results
endif
Example:
Simple Heat Equation (Parallel)
Data Parallel
Loops are not used. The entire array is processed in
parallel.
The DISTRIBUTE statements lay out the data in parallel.
A SHIFT is used to access array elements shifted by one position up or down along a dimension, i.e. each element's neighbors.
DISTRIBUTE u1(block, cyclic)
DISTRIBUTE u2(block, cyclic)
u2 = u1 + cx * (SHIFT(u1, 1, dim 1) + SHIFT(u1, -1, dim 1) - 2.*u1)
        + cy * (SHIFT(u1, 1, dim 2) + SHIFT(u1, -1, dim 2) - 2.*u1)
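For comparison, the same update written as an explicit serial C loop over the interior of the grid could look like this (the grid dimensions are assumptions; cx and cy are the constants from the expression above):

#define NX 100
#define NY 100

/* One explicit time step of the 2-D heat equation update:
 * u2 = u1 + cx*(east + west - 2*u1) + cy*(north + south - 2*u1).
 * Only interior points are updated; the border rows/columns would be
 * set by boundary conditions or, in the parallel case, filled with
 * data received from neighboring processes.                           */
void heat_step(double u1[NX][NY], double u2[NX][NY], double cx, double cy)
{
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            u2[i][j] = u1[i][j]
                     + cx * (u1[i+1][j] + u1[i-1][j] - 2.0 * u1[i][j])
                     + cy * (u1[i][j+1] + u1[i][j-1] - 2.0 * u1[i][j]);
}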
Example: Simple Heat Equation
(Overlapping Communication and
Computation)
Previous examples used blocking communications, which wait for the communication to complete before proceeding.
Computing times can often be reduced by using
non-blocking communication.
Work can be performed while communication is in
progress.
In the heat equation problem, neighbor processes
communicated border data, then each process
updated its portion of the array.
Each process could update the interior of its part
of the solution array while the communication of
border data is occurring, and update its border
after communication has completed.
Example: Simple Heat Equation
(Overlapping Communication and
Computation)
Second pseudo code:
find out if I am master or worker
if I am master
    initialize array
    send each worker starting info
    do until solution converged
        gather from all workers convergence data
        broadcast to all workers convergence signal
    end do
    receive results from each worker
else if I am worker
    receive from master starting info
    do until solution converged
        update time
        non-blocking send neighbors my border info
        non-blocking receive neighbors border info
        update interior of my portion of solution array
        wait for non-blocking communication to complete
        update border of my portion of solution array
        determine if my solution has converged
        send master convergence data
        receive from master convergence signal
    end do
    send master results
endif
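A C/MPI sketch of the worker's exchange-and-overlap step might look like the following; a 1-D decomposition into row strips with ghost rows is assumed, and NLOCAL, NY, left, and right are illustrative names rather than anything from the slides:

#include <mpi.h>

#define NLOCAL 100   /* rows owned by this process (assumption) */
#define NY     100   /* columns in the grid (assumption)        */

/* One worker iteration: start the border exchange, update the interior
 * while the messages are in flight, then finish the border rows.        */
void exchange_and_update(double u1[NLOCAL + 2][NY], double u2[NLOCAL + 2][NY],
                         double cx, double cy, int left, int right)
{
    MPI_Request reqs[4];

    /* Non-blocking exchange of ghost rows with the two neighbors.       */
    MPI_Irecv(u1[0],          NY, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(u1[NLOCAL + 1], NY, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(u1[1],          NY, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(u1[NLOCAL],     NY, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

    /* Update interior rows, which do not need the neighbors' data.      */
    for (int i = 2; i < NLOCAL; i++)
        for (int j = 1; j < NY - 1; j++)
            u2[i][j] = u1[i][j]
                     + cx * (u1[i+1][j] + u1[i-1][j] - 2.0 * u1[i][j])
                     + cy * (u1[i][j+1] + u1[i][j-1] - 2.0 * u1[i][j]);

    /* Wait for the border exchange, then update the two border rows.    */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    for (int j = 1; j < NY - 1; j++) {
        u2[1][j]      = u1[1][j]
                      + cx * (u1[2][j] + u1[0][j] - 2.0 * u1[1][j])
                      + cy * (u1[1][j+1] + u1[1][j-1] - 2.0 * u1[1][j]);
        u2[NLOCAL][j] = u1[NLOCAL][j]
                      + cx * (u1[NLOCAL+1][j] + u1[NLOCAL-1][j] - 2.0 * u1[NLOCAL][j])
                      + cy * (u1[NLOCAL][j+1] + u1[NLOCAL][j-1] - 2.0 * u1[NLOCAL][j]);
    }
}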
END