Parallelism

Why Do We Need Parallelism?


 Faster, of course
 Finish the work earlier
 Same work in less time
 Do more work
 More work in the same time
How to Parallelize an Application?
 Break down the computational part
into small pieces
 Assign the small jobs to the parallel
running processes
 May become complicated when the small pieces of work depend on one another
Easy Case: Parameter Set
 You are running experiments to support
your claims and/or better understand a
problem
 Experiment here means an application whose results you are interested in, run with different input parameters
 The pieces of computation are the same
program with different parameters
 Each piece is independent from each other
Parameter Set using Scripts
 Your experiment should be able to run in
batch
 Read all parameters (and other inputs) from
the command line and files
 Write all output to a file (whose name you can
specify as an input)
 Use ssh to start the experiment in many
machines
 If there is no common file system, use scp to
stage the inputs and collect the results
 Use nice to lower the priority of your runs on shared machines (a small launcher sketch follows)
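For illustration, a minimal launcher sketch in Python. It assumes passwordless ssh, a shared file system (so no scp staging is needed), and a hypothetical batch-friendly ./experiment program that takes a parameter and an output file name on its command line:

#!/usr/bin/env python3
# Minimal parameter-sweep launcher (sketch). Assumes passwordless ssh,
# a shared file system, and a hypothetical ./experiment that reads its
# parameter from the command line and writes to the named output file.
import subprocess

hosts = ["node01", "node02", "node03"]          # hypothetical machine names
params = [0.1, 0.2, 0.5, 1.0, 2.0, 5.0]

procs = []
for i, p in enumerate(params):
    host = hosts[i % len(hosts)]                # simple round-robin distribution
    cmd = f"cd ~/experiments && nice ./experiment {p} results_{i}.txt"
    # ssh returns when the remote command finishes; keep the handles and wait later
    procs.append(subprocess.Popen(["ssh", host, cmd]))

for proc in procs:
    proc.wait()                                 # collect exit codes
print("all runs finished")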
Parameter Set via TDG Cluster
 A simple script that uses ssh to start
experiments in many machines will save
you a lot of time
 However, it is possible to do better by
carefully considering resource selection,
work distribution, input staging, output
collection, and the like
 That is, scheduling can really help in this scenario, and a batch scheduler such as PBS can handle it for you
Hard Case: Dependent Pieces of
Computation

 If you are running one huge simulation
 the pieces of computation are not independent anymore
 The processes that form the application will have to communicate these dependencies
Hard Case: Dependent Pieces of
Computation

 Think about how to break the application apart into parallel-running processes
 Consider carefully whether parallelizing your application is really worth it
 Parallelize it only if your application really takes too long to run and is going to be used many times
Programming Alternatives
 Shared Memory
 Does not scale that well
 Message Passing
 Sockets
 too low-level
 Usually parallel applications are not client-server
 MPI (Message Passing Interface) is the standard API to do this
Steps for Writing Parallel Program
 If you are starting with an existing serial program,
debug the serial code completely
 Identify which parts of the program can be
executed concurrently:
 Requires a thorough understanding of the algorithm
 Exploit any parallelism which may exist
 May require restructuring of the program and/or
algorithm. May require an entirely new algorithm.
 Decompose the program:
 Functional Parallelism
 Data Parallelism
 Combination of both
Steps for Writing Parallel Program
 Code development
 Code may be influenced/determined by
machine architecture
 Choose a programming paradigm
 Determine communication
 Add code to accomplish process control and
communications
 Compile, Test, Debug
 Optimization
 Measure Performance
 Locate Problem Areas
 Improve them
Program Decomposition
 There are three methods for
decomposing a problem into smaller
processes to be performed in
parallel: Functional Decomposition,
Domain Decomposition, or a
combination of both
Functional Decomposition (Functional
Parallelism)

 Decomposing the problem into different processes which can be distributed to multiple processors for simultaneous execution
 Good to use when there is no static structure or fixed number of calculations to be performed
Functional Decomposition (Functional
Parallelism)

[Diagram: the problem is divided into different functions, assigned to Machine 1, Machine 2, Machine 3 and Machine 4]

Domain Decomposition (Data
Parallelism)
 Partitioning the problem's data domain
and distributing portions to multiple
processors for simultaneous execution
 Good to use for problems where:
 data is static (factoring and solving large
matrix or finite difference calculations)
 dynamic data structure tied to single entity
where entity can be subset (large multi-body
problems)
 domain is fixed but computation within various
regions of the domain is dynamic (fluid vortices
models)
Domain Decomposition (Data
Parallelism)

[Diagram: the problem's data domain is partitioned and the pieces are assigned to Machine 1 through Machine 4]

Other Decomposition Methods –
One Dimensional Data Distribution
 Block Distribution
 Cyclic Distribution
Other Decomposition Methods –
Two Dimensional Data Distribution
 Block-Block Distribution
 Block-Cyclic Distribution
 Cyclic-Block Distribution
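As an illustration, a small Python sketch of which processor owns which index of a length-n array under the one-dimensional block and cyclic distributions; the two-dimensional block-block, block-cyclic and cyclic-block schemes simply apply one of these rules per dimension:

# Sketch: which of p processors owns index i of a length-n array
# under a one-dimensional block vs. cyclic distribution.

def block_owner(i, n, p):
    # contiguous chunks of (roughly) n/p elements per processor
    chunk = (n + p - 1) // p
    return i // chunk

def cyclic_owner(i, n, p):
    # elements dealt out round-robin, like cards
    return i % p

n, p = 16, 4
print("block :", [block_owner(i, n, p) for i in range(n)])
# -> [0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3]
print("cyclic:", [cyclic_owner(i, n, p) for i in range(n)])
# -> [0,1,2,3, 0,1,2,3, 0,1,2,3, 0,1,2,3]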
Programming
 Understanding the inter-processor
communications of your program is essential
 Message Passing communication is programmed
explicitly. The programmer must understand and
code the communication
 Data Parallel compilers and run-time systems do all communications behind the scenes. The programmer need not understand the underlying communications. On the other hand, to get good performance from your code, you should design your algorithm with the best communication pattern possible
Considerations: Amdahl's Law
 It states that potential program speedup is defined by the fraction of code (f) which can be parallelized:

    $speedup = \dfrac{1}{1 - f}$

 If none of the code can be parallelized, f = 0 and the speedup = 1 (no speedup). If all of the code is parallelized, f = 1 and the speedup is infinite (in theory)
Considerations: Amdahl's Law
 Introducing the number of processors performing the parallel fraction of work, the relationship can be modeled by the equation

    $speedup = \dfrac{1}{\dfrac{P}{N} + S}$

where:
 P: parallel fraction
 N: number of processors
 S: serial fraction
Considerations: Amdahl's Law
 It is obvious that there are limits to the
scalability of parallelism. For example, at
P = .50, .90 and .99 (50%, 90% and 99%
of the code is parallelizable)
Speedup
N       P = 0.50   P = 0.90   P = 0.99
10      1.82       5.26       9.17
100     1.98       9.17       50.25
1000    1.998      9.91       90.99
10000   1.9998     9.991      99.02
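The table above can be reproduced with a few lines of Python, using speedup = 1 / (P/N + S) with S = 1 - P:

# Reproduce the Amdahl's law speedup table: speedup = 1 / (P/N + S), S = 1 - P
def amdahl(P, N):
    return 1.0 / (P / N + (1.0 - P))

print("     N   P=0.50   P=0.90   P=0.99")
for N in (10, 100, 1000, 10000):
    print(f"{N:6d}" + "".join(f"{amdahl(P, N):9.3f}" for P in (0.50, 0.90, 0.99)))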
Considerations: Amdahl's Law
 Problems which increase the
percentage of parallel time with
their size are more "scalable" than
problems with a fixed percentage of
parallel time
Considerations: Load Balancing
 Load balancing refers to the ways to distribute processes so as to ensure the most time-efficient parallel execution
 If processes are not distributed in a balanced way, some processes finish early and sit idle while others are still busy
 Performance can be increased if work can be more evenly distributed
 For example, if there are many tasks of varying sizes, it may be more efficient to maintain a pool of tasks and hand them to processors as each finishes
 Consider a heterogeneous environment where there
are machines of widely varying power and user load
versus a homogeneous environment with identical
processors running one job per processor
Considerations: Granularity
 In order to coordinate between different
processors working on the same problem, some
form of communication between them is required
 The ratio between computation and
communication is known as granularity
 The most efficient granularity is dependent on the
algorithm and the hardware environment in which
it runs
 In most cases overhead associated with
communications and synchronization is high
relative to execution speed so it is advantageous
to have coarse granularity
Fine-grain Parallelism
 All processes execute a small number of
instructions between communication cycles
 Facilitates load balancing
 Low computation to communication ratio
 Implies high communication overhead and less
opportunity for performance enhancement
 If granularity is too fine it is possible that the
overhead required for communications and
synchronization between processes takes longer
than the computation
Fine-grain Parallelism

[Diagram: several parallel processes alternating short computation phases with frequent communication phases]
Coarse-grain Parallelism
 Typified by long computations consisting
of large numbers of instructions between
communication synchronization points
 High computation to communication ratio
 Implies more opportunity for performance
increase
 Harder to load balance efficiently
 Imagine that the computation workload is 10 kg of material:
 Sand = fine-grain
 Cinder blocks = coarse-grain
 Which is easier to distribute?
Coarse-grain Parallelism
[Diagram: several parallel processes performing long computation phases with infrequent communication phases]
Considerations: Data Dependency
 Data dependency exists when there is multiple
use of the same storage location
 Types of data dependencies
 Flow Dependent: Process 2 uses a variable
computed by Process 1. Process 1 must store/send
the variable before Process 2 fetches
 Output Dependent: Process 1 and Process 2 both
compute the same variable and Process 2's value
must be stored/sent after Process 1's
 Control Dependent: Process 2's execution depends
upon a conditional statement in Process 1. Process 1
must complete before a decision can be made about
executing Process 2
Considerations: Data Dependency
 How to handle data dependencies?
 Distributed memory
 Communicate required data at synchronization points
 Shared memory
 Synchronize read/write operations between processes
Considerations: Communication
Patterns and Bandwidth
 For some problems, increasing the number of
processors will:
 Decrease the execution time attributable to
computation
 But also, increase the execution time attributable to
communication
 Communication patterns also affect the
computation to communication ratio.
 For example, gather-scatter communications
between a single processor and N other
processors will be impacted more by an increase
in latency than N processors communicating only
with nearest neighbors
 In such gather-scatter patterns, every processor has to wait until all of them have reached a certain point
Considerations: I/O Operation
 I/O operations are generally regarded as
inhibitors to parallelism
 In an environment where all processors
see the same file space, write operations
will result in file overwriting
 Read operations will be affected by the
fileserver's ability to handle multiple read
requests at the same time
 I/O which must be conducted over the
network (non-local) can cause severe
bottlenecks
Considerations: I/O Operation
 Some alternatives:
 Reduce overall I/O as much as possible
 Confine I/O to specific serial portions of the job
 For example, process 0 could read an input file and then communicate required data to other processes. Likewise, process 1 could perform the write operation after receiving required data from all other processes
 Create unique filenames for each process's input/output file(s)
 For distributed memory systems with shared file space, perform I/O in local, non-shared file space
 For example, each processor may have /tmp filespace which can be used. This is usually much more efficient than performing I/O over the network to one's home directory
Considerations: Fault Tolerance and
Restarting

 In parallel programming, it is usually the programmer's responsibility to handle events such as:
 machine failures
 task failures
 checkpointing
 restarting
Considerations: Deadlock
 Deadlock describes a condition where two or
more processes are waiting for an event or
communication from one of the other processes.
 The simplest example is demonstrated by two
processes which are both programmed to
read/receive from the other before
writing/sending.

Process 1                Process 2

X = 1                    Y = 10
Recv (Process 2, Y)      Recv (Process 1, X)
Send (Process 2, X)      Send (Process 1, Y)
Z = X + Y                Z = X + Y
…                        …
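Both processes block in their Recv and neither ever reaches its Send. As a concrete illustration (mpi4py is an assumption here; the slides only require some message passing API), the exchange can be made deadlock-free with a combined send-receive, so neither rank waits in a blocking receive for a send that never starts:

# Sketch with mpi4py (assumed): the deadlock above happens because both ranks
# call a blocking receive first. A combined send-receive avoids the circular wait.
# Run with: mpirun -np 2 python exchange.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
other = 1 - rank               # the partner process (assumes exactly 2 ranks)

mine = 1 if rank == 0 else 10  # X on rank 0, Y on rank 1

# sendrecv posts the send and the receive together, so neither rank
# sits in a blocking recv waiting for a send that never starts.
theirs = comm.sendrecv(mine, dest=other, source=other)

z = mine + theirs
print(f"rank {rank}: Z = {z}")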
Considerations: Debugging
 Debugging parallel programs is
significantly more of a challenge than
debugging serial programs
 Debug the program as soon as development starts
 Use a modular approach to program
development
 Pay as close attention to communication
details as to computation details
Essentials of Loop Parallelism
 Problems in which a loop construct forms the main computational component of the code are a main target for parallelizing and vectorizing. A program often spends much of its time in loops. When it can be done, parallelizing these sections of code can have dramatic benefits.
 A step-wise refinement procedure for developing the parallel algorithms will be employed. An initial solution for each problem will be presented and improved by considering performance issues
Essentials of Loop Parallelism
 Pseudo-code will be used to describe the
solutions. The solutions will address the following
issues:
 identification of parallelism
 program decomposition
 load balancing (static vs. dynamic)
 task granularity in the case of dynamic load
balancing
 communication patterns - overlapping
communication and computation
 Note the difference in approaches between message passing and data parallel programming: message passing explicitly parallelizes the loops, whereas data parallel replaces loops by working on entire arrays in parallel
Example: π Calculation (Serial)
 Problem is:
 Computationally intensive
 Minimal communication
 The value of PI can be calculated in a number of ways,
many of which are easily parallelized
 Consider the following method of approximating PI
 Inscribe a circle in a square
 Randomly generate points in the square
 Determine the number of points in the square that are
also in the circle
 Let r be the number of points in the circle divided by the
number of points in the square
 PI ~ 4 r
 Note that the more points generated, the better the
approximation
Example: π Calculation (Serial)

    $A_{square} = (2r)^2 = 4r^2$
    $A_{circle} = \pi r^2$
    $\dfrac{A_{circle}}{A_{square}} = \dfrac{\pi}{4} \;\Rightarrow\; \pi = 4\,\dfrac{A_{circle}}{A_{square}}$

[Figure: a circle of radius r inscribed in a square of side 2r]
Example: π Calculation (Serial)
 Serial pseudo code for this procedure:
 npoints = 10000
 circle_count = 0
 do j = 1,npoints
 generate 2 random numbers between 0 and 1
 xcoordinate = random1
 ycoordinate = random2
 if (xcoordinate, ycoordinate) inside circle
 then circle_count = circle_count + 1
 end do
 PI = 4.0*circle_count/npoints
 Note that most of the time in running this program would be spent executing the loop
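A direct, runnable translation of the serial pseudocode (Python is assumed; the slides do not prescribe a language). Drawing points in the unit square and testing x² + y² ≤ 1 restricts the picture to one quadrant of the inscribed circle, which gives the same ratio π/4:

# Serial Monte Carlo estimate of PI (Python assumed)
import random

npoints = 10000
circle_count = 0
for _ in range(npoints):
    x = random.random()          # random x coordinate in [0, 1)
    y = random.random()          # random y coordinate in [0, 1)
    if x * x + y * y <= 1.0:     # inside the quarter circle of radius 1
        circle_count += 1

pi_estimate = 4.0 * circle_count / npoints
print(pi_estimate)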
Example: π Calculation (Parallel)
 Parallel strategy: break the loop into portions
which can be executed by the processors.
 For the task of approximating PI:
 each processor executes its portion of the loop a
number of times
 each processor can do its work without requiring
any information from the other processors (there
are no data dependencies). This situation is known
as Embarrassingly Parallel
 Use the SPMD (Single Program, Multiple Data) model – one process acts as master and collects the results
Example: π Calculation (Parallel)
 Message passing pseudo code:
 npoints = 10000
 circle_count = 0
 p = number of processors
 num = npoints/p

 find out if I am master or worker

 do j = 1,num
 generate 2 random numbers between 0 and 1
 xcoordinate = random1; ycoordinate = random2
 if (xcoordinate, ycoordinate) inside circle
 then circle_count = circle_count + 1
 end do

 if I am master
 receive from workers their circle_counts
 compute PI (use master and workers calculations)
 else if I am worker
 send to master circle_count
 endif
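A sketch of this pseudocode using mpi4py (an assumed choice of MPI binding). The explicit worker-to-master sends are expressed here with a single reduction, which gathers and sums all local circle_counts on the master:

# Parallel Monte Carlo estimate of PI with mpi4py (assumed library).
# Run with: mpirun -np 4 python pi_mpi.py
import random
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()           # am I master (rank 0) or a worker?
p = comm.Get_size()

npoints = 10000
num = npoints // p               # each process takes its share of the loop

random.seed(rank)                # different random stream per process
circle_count = 0
for _ in range(num):
    x, y = random.random(), random.random()
    if x * x + y * y <= 1.0:
        circle_count += 1

# The reduction replaces the explicit receive loop on the master:
# it sums every process's local count onto rank 0.
total = comm.reduce(circle_count, op=MPI.SUM, root=0)

if rank == 0:
    print("PI ~", 4.0 * total / (num * p))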
Example: π Calculation (Parallel)
 Data parallel solution:
 The data parallel solution processes entire arrays at the same time.
 No looping is used.
 Arrays automatically distributed to processors.
All message passing is done behind the
scenes. In data parallel, one node, a sort of
master, usually holds all scalar values. The
SUM function does a reduction and leaves the
value in a scalar variable.
 A temporary array, COUNTER, with the same
size as RANDOM is created for the sum
operation
Example: π Calculation (Parallel)
 Data parallel pseudo code:
 fill RANDOM with 2 random numbers between 0 and 1
 where (the values of RANDOM are inside the circle)
 COUNTER = 1
 else where
 COUNTER = 0
 end where
 circle_count = sum (COUNTER)
 PI = 4.0*circle_count/npoints
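As a single-node analogue of this style (not a distributed implementation), NumPy expresses the same loop-free, whole-array computation; in a true data parallel language the arrays would additionally be distributed across processors behind the scenes:

# Single-node analogue of the data parallel pseudocode using NumPy (assumed):
# whole arrays are processed at once and the explicit loop disappears.
import numpy as np

npoints = 10000
rng = np.random.default_rng()
x = rng.random(npoints)                          # the RANDOM array, x coordinates
y = rng.random(npoints)                          # the RANDOM array, y coordinates

counter = np.where(x * x + y * y <= 1.0, 1, 0)   # the "where / else where" construct
circle_count = counter.sum()                     # the SUM reduction

print("PI ~", 4.0 * circle_count / npoints)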
Example:
Array Elements Calculation (Serial)
 This example shows calculations on array elements that require very little communication.
 Elements of 2-dimensional array are calculated.
 The calculation of elements is independent of one another - leads to embarrassingly parallel situation.
 The problem should be computation intensive.
 Serial code could be of the form:
 do j = 1,n
 do i = 1,n
 a(i,j) = fcn(i,j)
 end do
 end do
 The serial program calculates one element at a time in
the specified order
Example:
Array Elements Calculation (Parallel)
 Message Passing
 Arrays are distributed so that each processor owns a
portion of an array.
 Independent calculation of array elements insures no
communication amongst processors is needed.
 Distribution scheme is chosen by other criteria, e.g. unit
stride through arrays.
 Desirable to have unit stride through arrays, then the
choice of a distribution scheme depends on the
programming language.
 Fortran: block cyclic distribution
 C: cyclic block distribution
 After the array is distributed, each processor executes the
portion of the loop corresponding to the data it owns.
 Notice only the loop variables are different from the serial
solution
Example:
Array Elements Calculation (Parallel)
 For example, with Fortran and a block cyclic distribution:
 do j = mystart, myend
 do i = 1,n
 a(i,j) = fcn(i,j)
 end do
 end do
 Message Passing Solution:
 With Fortran storage scheme, perform block cyclic distribution of array.
 Implement as SPMD model.
 Master process initializes array, sends info to worker processes and receives results.
 Worker process receives info, performs its share of computation and sends results to master.
Example:
Array Elements Calculation (Parallel)
 Message Passing Pseudo code:
 find out if I am master or worker
 if I am master
 initialize the array
 send each worker info on part of array it owns
 send each worker its portion of initial array
 receive from each worker results
 else if I am worker
 receive from master info on part of array I own
 receive from master my portion of initial array

 # calculate my portion of array


 do j = my first column,my last column
 do i = 1,n
 a(i,j) = fcn(i,j)
 end do
 end do
 send master results
 endif
Example:
Array Elements Calculation (Parallel)
 Data Parallel
 A trivial problem for a data parallel language.
 Data parallel languages often have compiler directives to do data distribution.
 Loops are replaced by a "for all elements" construct which performs the operation in parallel.
 Good example of ease in programming versus message passing.
 Pseudo code solution:
 DISTRIBUTE a (block, cyclic)
 for all elements (i,j)
 a(i,j) = fcn (i,j)
Example: Array Elements Calculation
(Dynamic Load Balancing)
 We've looked at problems that are static load balanced.
 each processor has fixed amount of work to do
 may be significant idle time for faster or more lightly loaded processors.
 This is usually not a major concern with dedicated usage, e.g. when running under a load leveler.
 If you have a load balancing problem, you can use a "dynamic load balancing" scheme. This solution is only available with message passing.
 Two processes are employed:
 Master Process:
 holds pool of tasks for worker processes to do
 sends worker a task when requested
 collects results from workers
 Worker Process: repeatedly does the following
 gets task from master process
 performs computation
 sends results to master
 Worker processes do not know before runtime which portion of array
they will handle or how many tasks they will perform.
 The fastest process will get more tasks to do.
Example: Array Elements Calculation
(Dynamic Load Balancing)
 Solution:
 Calculate an array element
 Worker process gets task from master, performs work, sends resul
ts to master, and gets next task
 Pseudo code solution:
 find out if I am master or worker
 if I am master
 do until no more jobs
 send to worker next job
 receive results from worker
 end do
 tell workers no more jobs
 else if I am worker
 do until no more jobs
 receive from master next job
 calculate array element: a(i,j) = fcn(i,j)
 send results to master
 end do
 endif
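A compact sketch of the task-pool scheme with mpi4py (assumed). The master primes each worker with one job, then answers every returned result with either the next job or a stop signal; fcn and the array size are illustrative:

# Master/worker task pool with mpi4py (assumed).
# Run with: mpirun -np 4 python pool.py (needs at least one worker rank)
import numpy as np
from mpi4py import MPI

def fcn(i, j):                      # hypothetical per-element computation
    return i * 100 + j

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
n = 8
STOP = (-1, -1)                     # sentinel meaning "no more jobs"

if rank == 0:                       # master: holds the pool of tasks
    a = np.zeros((n, n))
    jobs = [(i, j) for i in range(n) for j in range(n)]
    status = MPI.Status()
    for w in range(1, size):        # prime every worker with one job
        comm.send(jobs.pop(), dest=w)
    for _ in range(n * n):          # one reply per expected result
        (i, j), value = comm.recv(source=MPI.ANY_SOURCE, status=status)
        a[i, j] = value
        worker = status.Get_source()
        # hand the fastest workers more work; send STOP once the pool is empty
        comm.send(jobs.pop() if jobs else STOP, dest=worker)
    print(a)
else:                               # worker: repeatedly get a task, compute, reply
    while True:
        i, j = comm.recv(source=0)
        if (i, j) == STOP:
            break
        comm.send(((i, j), fcn(i, j)), dest=0)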
Example: Array Elements Calculation
(Dynamic Load Balancing)
 Static load balancing can result in significant idle time for
faster processors.
 Dynamic load balancing offers a potential solution - the faster
processors do more work.
 In the dynamic load balancing solution, the workers
calculated array elements, resulting in:
 optimal load balancing: all processors complete work at the same
time
 fine granularity: small unit of computation, master and worker
communicate after every element
 fine granularity may cause very high communications cost
 Alternate Parallel Solution:
 give processors more work - columns or rows rather than
elements
 more computation and less communication results in larger
granularity
 reduced communication may improve performance
Example: Simple Heat Equation (Serial)
 Most problems in parallel computing require communication
among the processors.
 Common problem requires communication with "neighbor"
processor.
 The heat equation describes the temperature change over
time, given initial temperature distribution and boundary
conditions.
 A finite differencing scheme is employed to solve the heat
equation numerically on a square region.
 The initial temperature is zero on the boundaries and high in
the middle.
 The boundary temperature is held at zero.
 For the fully explicit problem, a time stepping algorithm is
used. The elements of a 2-dimensional array represent the
temperature at points on the square
Example: Simple Heat Equation (Serial)

[Figure: initial temperature distribution on the square region – high in the middle, zero on the boundaries]

Example: Simple Heat Equation (Serial)

Five-point stencil: the update of $U_{x,y}$ uses its four neighbors $U_{x-1,y}$, $U_{x+1,y}$, $U_{x,y-1}$ and $U_{x,y+1}$:

    $U'_{x,y} = U_{x,y} + C_x\,(U_{x+1,y} + U_{x-1,y} - 2U_{x,y}) + C_y\,(U_{x,y+1} + U_{x,y-1} - 2U_{x,y})$

Example: Simple Heat Equation (Serial)
 The calculation of an element is dependent on neighbor element values.
 A serial program would contain code like
 do iy = 2, ny - 1
 do ix = 2, nx - 1
 u2(ix, iy) =
 u1(ix, iy)
 + cx * (u1(ix+1,iy) + u1(ix-1,iy) - 2.*u1(ix,iy))
 + cy * (u1(ix,iy+1) + u1(ix,iy-1) - 2.*u1(ix,iy))
 end do
 end do
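A runnable NumPy version of this update (NumPy, the grid size, the coefficients cx and cy, and the number of time steps are illustrative choices; the slides do not fix them):

# Explicit time stepping for the heat equation on a square grid (NumPy assumed).
import numpy as np

nx, ny, nsteps = 50, 50, 100
cx = cy = 0.1                          # illustrative coefficients, small enough for stability

u1 = np.zeros((nx, ny))
u1[nx // 4 : 3 * nx // 4, ny // 4 : 3 * ny // 4] = 100.0   # hot in the middle, zero boundary

for _ in range(nsteps):
    u2 = u1.copy()                     # boundaries stay at zero
    # same stencil as the do-loops, written with array slices over the interior
    u2[1:-1, 1:-1] = (u1[1:-1, 1:-1]
                      + cx * (u1[2:, 1:-1] + u1[:-2, 1:-1] - 2.0 * u1[1:-1, 1:-1])
                      + cy * (u1[1:-1, 2:] + u1[1:-1, :-2] - 2.0 * u1[1:-1, 1:-1]))
    u1 = u2

print(u1[nx // 2, ny // 2])            # temperature at the centre after nsteps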
Example:
Simple Heat Equation (Parallel)
 Arrays are distributed so that each processor owns a
portion of the arrays.
 Determine data dependencies
 interior elements belonging to a processor are
independent of other processors'
 border elements are dependent upon a neighbor
processor's data, communication is required.
 Message Passing
 First Parallel Solution:
 Fortran storage scheme, block cyclic distribution
 Implement as SPMD model
 Master process sends initial info to workers, checks for
convergence and collects results
 Worker process calculates solution, communicating as
necessary with neighbor processes
Example:
Simple Heat Equation (Parallel)

[Diagram: a processor's block of the array – interior elements need no communication; border elements require data from neighbor processes]
Example:
Simple Heat Equation (Parallel)
 First Pseudo code solution:
 find out if I am master or worker
 if I am master
 initialize array
 send each worker starting info
 do until all workers have converged
 gather from all workers convergence data
 broadcast to all workers convergence signal
 end do
 receive results from each worker
 else if I am worker
 receive from master starting info
 do until all workers have converged
 update time
 send neighbors my border info
 receive from neighbors their border info
 update my portion of solution array
 determine if my solution has converged
 send master convergence data
 receive from master convergence signal
 end do
 send master results
 endif
Example:
Simple Heat Equation (Parallel)
 Data Parallel
 Loops are not used. The entire array is processed in
parallel.
 The distribute statements lay out the data in parallel.
 A SHIFT is used to shift the array by one position (up or down) along a dimension, giving access to neighboring elements.
 DISTRIBUTE u1 (block, cyclic)
 DISTRIBUTE u2 (block, cyclic)
 u2 = u1 +
 cx * (SHIFT (u1, 1, dim 1) + SHIFT (u1, -1, dim 1) - 2.*u1) +
 cy * (SHIFT (u1, 1, dim 2) + SHIFT (u1, -1, dim 2) - 2.*u1)
Example: Simple Heat Equation
(Overlapping Communication and
Computation)
 Previous examples used blocking
communications, which waits for the
communication process to complete.
 Computing times can often be reduced by using
non-blocking communication.
 Work can be performed while communication is in
progress.
 In the heat equation problem, neighbor processes
communicated border data, then each process
updated its portion of the array.
 Each process could update the interior of its part
of the solution array while the communication of
border data is occurring, and update its border
after communication has completed.
Example: Simple Heat Equation
(Overlapping Communication and
Computation)
 Second Pseudo code:
 find out if I am master or worker
 if I am master
 initialize array
 send each worker starting info
 do until solution converged
 gather from all workers convergence data
 broadcast to all workers convergence signal
 end do
 receive results from each worker
 else if I am worker
 receive from master starting info
 do until solution converged
 update time
 non-blocking send neighbors my border info
 non-blocking receive neighbors border info
 update interior of my portion of solution array
 wait for non-blocking communication complete
 update border of my portion of solution array
 determine if my solution has converged
 send master convergence data
 receive from master convergence signal
 end do
 send master results
 endif
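A sketch of the overlapped worker loop using mpi4py (an assumed binding), for a one-dimensional column decomposition with one ghost column on each side. Non-blocking sends and receives are posted first, the interior (which needs no neighbor data) is updated while the messages are in flight, and the border columns are updated only after the wait completes. The master role and convergence test are omitted to keep the sketch short:

# Overlapping communication and computation with mpi4py (assumed).
# Each rank owns a block of columns plus one ghost column on each side.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
nx, local_ny = 50, 10
cx = cy = 0.1                                    # illustrative coefficients

u1 = np.zeros((nx, local_ny + 2))                # +2 ghost columns
u1[nx // 4 : 3 * nx // 4, 1:-1] = 100.0          # hot interior, zero boundary
left = rank - 1 if rank > 0 else MPI.PROC_NULL   # PROC_NULL: no neighbor on that side
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for _ in range(100):
    # post non-blocking sends of my border columns and receives of ghost data
    send_left = np.ascontiguousarray(u1[:, 1])
    send_right = np.ascontiguousarray(u1[:, -2])
    recv_left = np.empty(nx)
    recv_right = np.empty(nx)
    reqs = [comm.Isend(send_left, dest=left),
            comm.Isend(send_right, dest=right),
            comm.Irecv(recv_left, source=left),
            comm.Irecv(recv_right, source=right)]

    u2 = u1.copy()
    # update interior columns, which need no neighbor data, while messages move
    u2[1:-1, 2:-2] = (u1[1:-1, 2:-2]
                      + cx * (u1[2:, 2:-2] + u1[:-2, 2:-2] - 2.0 * u1[1:-1, 2:-2])
                      + cy * (u1[1:-1, 3:-1] + u1[1:-1, 1:-3] - 2.0 * u1[1:-1, 2:-2]))

    MPI.Request.Waitall(reqs)                    # communication must finish first
    if left != MPI.PROC_NULL:
        u1[:, 0] = recv_left                     # fill ghost columns with neighbor data
    if right != MPI.PROC_NULL:
        u1[:, -1] = recv_right
    # now update the two border columns that needed the ghost data
    for j in (1, local_ny):
        u2[1:-1, j] = (u1[1:-1, j]
                       + cx * (u1[2:, j] + u1[:-2, j] - 2.0 * u1[1:-1, j])
                       + cy * (u1[1:-1, j + 1] + u1[1:-1, j - 1] - 2.0 * u1[1:-1, j]))
    u1 = u2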
END
